    BRAMSHILL Speech Collection
    ===========================

    This file describes the organisation and format of the
    BRAMSHILL speech collection.

    The BRAMSHILL collection is a set of CD-ROMs containing
    digitised and transcribed recordings of free conversation.
    Each item is a recording of one half of a two-speaker
    conversation. Most of the recordings include a standard set
    of test sentences as well as the conversation.

1.  Directory Structure
    ===================

    Each CD-ROM has three top level directories.

    The DOC directory contains this and other text files
    describing the collection.

    The INDEX directory contains catalogue information.

    The SPEAKERS directory contains the data. Within SPEAKERS
    there is a separate sub-directory for each speaker. All the
    material for any given speaker is contained on a single
    CD-ROM.

                  <root>
                     |
                     |
   ====================================
   |                 |                |
   |                 |                |
  DOC              INDEX           SPEAKERS
                                      |
                                      |
                       =============================
                       |                           |
                       |                           |
                      Snnn   ....   ....   ....  Smmm
                       |
                       |
           =================================================
           |           |          |         |      |     |
           |           |          |         |      |     |
       Snnn1.DAT  Snnn1.TMT  Snnn2.DAT Snnn2.TMT  .... ....

2.  File Naming Convention
    ======================

    Each speaker in the collection has been allocated a four
    character identifier, consisting of the letter S and 3
    digits, for example S324.

    Each recording has been allocated a five character identifier
    consisting of the speaker's identifier plus an additional
    digit, for example S3242.

    Speech data files have the extension .DAT and the transcription
    files have the extension .TMT.

    An item is composed of one data file and one transcription
    file. For example an item might consist of the pair of files
    S3241.DAT and S3241.TMT.
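
    The naming rules above, combined with the directory layout
    in section 1, can be sketched in Python as follows. This is
    an illustrative sketch only: the function name item_paths and
    the "root" argument (wherever the CD-ROM happens to be
    mounted) are assumptions and are not part of the collection.

```python
import re
from pathlib import PurePosixPath

# "S" plus three digits identifies the speaker; one more digit
# identifies the recording.
ITEM_ID = re.compile(r"(S\d{3})(\d)")


def item_paths(item_id, root="."):
    """Return (data_path, transcription_path) for an item identifier.

    `root` is a hypothetical mount point for the CD-ROM.
    """
    match = ITEM_ID.fullmatch(item_id)
    if match is None:
        raise ValueError(f"not a valid item identifier: {item_id!r}")
    speaker_id = match.group(1)
    speaker_dir = PurePosixPath(root) / "SPEAKERS" / speaker_id
    return speaker_dir / f"{item_id}.DAT", speaker_dir / f"{item_id}.TMT"
```

    For example, item_paths("S3241", "/cdrom") yields the two
    paths /cdrom/SPEAKERS/S324/S3241.DAT and
    /cdrom/SPEAKERS/S324/S3241.TMT.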


3.  The Transcriptions
    ==================

    For transcription purposes, the recordings were divided into
    "utterances". These utterances are short sections of speech,
    typically a phrase or short sentence. The division into
    utterances was carried out by an automated process which
    attempted to place utterance boundaries in the pauses between
    words. Occasionally, however, in cases such as long runs of
    unbroken speech, the utterance boundaries might occur in
    inappropriate locations. This should be remembered when using
    the transcriptions.

    The transcription files contain the start time, length and
    corresponding text of each utterance. Time and length are given
    in 0.1 second units.

    A number of conventions were adopted in the transcriptions:

        The speech was transcribed verbatim. No attempt was made
        to correct grammar, fill in missing words etc.

        Proper dictionary words were used unless the
        pronunciation was radically different, in which case a
        new slang word was introduced. For example, "that're"
        would normally be transcribed as "that are", but the word
        "gonna" would be introduced rather than transcribing as
        "going to".

        Contractions such as "it's" for "it is" and "they're" for
        "they are" were transcribed, but in cases of doubt the
        preference was for the uncontracted version.

        Hesitation sounds were transcribed from a limited set of
        conventional representations including "uh" and
        "um". No attempt was made to distinguish between sounds
        such as "erm" and "hmm" or between "uh" and "ah".

        Conventional English punctuation was included where
        possible but much of the material is ungrammatical and
        the punctuation is no more than an aid to readability.
        Punctuation was limited to .,?!:; and ellipses (...).

        Spoken numbers were spelled out. Spoken letters were
        transcribed as single upper case letters. So a vehicle
        registration number might be transcribed as "D seven
        three six K N Y".

        Unclear sections were enclosed in double parentheses.
        Completely unintelligible passages were represented as a
        single space between double parentheses "(( ))".

        Where a speaker breaks off in mid word the part word was
        transcribed followed by a hyphen and a comma or other
        punctuation mark. For example "phot-," (for broken
        "photograph").

        Related letter/word sequences were transcribed in a style
        such as "T -shirt" or "O K -ing".

        Apart from the special cases described above, hyphens are
        not used.

        Proper names were capitalised, for example "... the Great
        North Road ...".
        
        If the speakers referred to text visible in the photographs
        and there was possible ambiguity, the text was transcribed
        in upper case. For example, "I can see CANDY FLOSS" refers
        to lettering, rather than the actual candy floss.

        Non-speech sounds were shown in square brackets. For
        example, "[cough]". Continuous sounds were represented by
        a start/end pair such as "[bell] ... [\bell]".

        Occasionally the transcribers recorded comments in
        braces. For example "{very loud}".

        Where it was possible to identify a change of discussion
        topic a double "at" marker "@@" was included. Most
        transcriptions include at least one "@@" between the
        standard test sentences and the conversational part of
        the recording.

        A few short sections of the recordings have been replaced
        by binary zeroes to protect the anonymity of the
        speakers. Such sections were transcribed with the single
        comment "{ZERO}".
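
    The bracketing conventions listed above lend themselves to
    simple pattern matching. The following Python sketch locates
    unclear sections, non-speech sounds, transcriber comments and
    topic markers in a line of transcription text. The names
    MARKUP and find_markup are illustrative assumptions, and the
    patterns assume that markers are never nested.

```python
import re

# One alternative per convention: ((unclear)), [sound] or [\sound],
# {comment} and the @@ topic marker.
MARKUP = re.compile(
    r"(?P<unclear>\(\(.*?\)\))"
    r"|(?P<sound>\[\\?[^\]]*\])"
    r"|(?P<comment>\{[^}]*\})"
    r"|(?P<topic>@@)"
)


def find_markup(text):
    """Return (kind, matched_text) pairs in order of appearance."""
    return [(m.lastgroup, m.group(0)) for m in MARKUP.finditer(text)]


if __name__ == "__main__":
    # Hypothetical transcription text, not taken from the collection.
    print(find_markup("[cough] I can see ((the)) CANDY FLOSS {very loud}"))
```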

    Every lexical word from the transcriptions is contained in
    the dictionary supplied in the INDEX directory. Contractions,
    part-words, slang words, hesitation sounds and non-speech
    sounds are all treated as words in their own right in the
    dictionary.

    A definitive list of the non-speech sounds such as "[cough]"
    can be obtained by searching the dictionary for words
    starting with "[" and ending with "]".
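
    The dictionary search just described might be carried out
    with a few lines of Python. This is a sketch, assuming the
    dictionary has one word per line; the function name
    non_speech_sounds and the sample entries are hypothetical.

```python
def non_speech_sounds(dictionary_lines):
    """Return the dictionary entries of the form "[...]", in file order."""
    words = (line.strip() for line in dictionary_lines)
    return [w for w in words if w.startswith("[") and w.endswith("]")]


if __name__ == "__main__":
    # Hypothetical dictionary entries used only for illustration.
    sample = ["[bell]\n", "[cough]\n", "a\n", "gonna\n", "phot-\n"]
    print(non_speech_sounds(sample))
```

    To run this against the real dictionary, pass an open file
    object for the dictionary file in the INDEX directory in
    place of the sample list.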

4.  Data File Formats
    =================

4.1 Speech Data Files
    =================

    The speech data was sampled at 10 kHz and stored as 16-bit
    2's complement integers, least significant byte first. Each
    recording is stored in a single file (extension .DAT). The
    first 1024 bytes of the file contain an ASCII header. The
    remainder of the file is the speech data. The data section of
    a speech file is an unstructured byte stream. There is no
    record or block structure.
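
    Given this layout, decoding the samples amounts to skipping
    the 1024 byte header and reading the rest as little-endian
    16-bit signed integers. A minimal Python sketch follows; the
    names HEADER_BYTES and decode_samples are assumptions made
    for illustration.

```python
import array
import sys

HEADER_BYTES = 1024  # fixed header size described above


def decode_samples(file_bytes):
    """Return the samples held in the bytes of a .DAT file."""
    body = file_bytes[HEADER_BYTES:]
    body = body[: len(body) - len(body) % 2]  # ignore a trailing odd byte
    samples = array.array("h")  # signed short
    if samples.itemsize != 2:
        raise RuntimeError("this sketch needs a 2-byte signed short type")
    samples.frombytes(body)
    if sys.byteorder == "big":
        # The file stores the least significant byte first.
        samples.byteswap()
    return samples
```

    An array is used rather than a list because a recording may
    contain several million samples.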

    The header is in the NIST SPHERE format, which is ASCII text
    based. An example header follows---

    NIST_1A
       1024
    database_id -s9 BRAMSHILL
    database_version -s3 1.0
    utterance_id -s5 S1231
    channel_count -i 1
    sample_count -i 6000000
    sample_rate -i 10000
    sample_min -i -28127
    sample_max -i 25763
    sample_n_bytes -i 2
    sample_byte_format -s2 01
    sample_sig_bits -i 16
    transcriber -s3 DWK
    end_head

    The first two lines are the standard header introduction.
    These are each eight characters long (including the "newline"
    terminator). The last line is the standard header terminator.
    The remainder of the header is padded to 1024 bytes with
    "newline" characters. This means that the header can be
    inspected easily by using a utility program such as "more".

    The body of the header consists of a set of "triples" each
    having the general form "name type value". In the BRAMSHILL
    database, only two type specifiers are used---

        -i  - Integer
        -sn - String, length n characters

    Each triple is terminated by a single "newline" character.

    The header fields are---

    database_id        Database name, always "BRAMSHILL"
    database_version   Database Version, for example "1.0"
    utterance_id       Item identifier, for example "S1231"
    channel_count      Always 1
    sample_count       Number of samples in the data file
    sample_rate        Sample rate, always 10000
    sample_min         Minimum sample value
    sample_max         Maximum sample value
    sample_n_bytes     Bytes per sample, always 2
    sample_byte_format Sample byte order, always "01", LSB first
    sample_sig_bits    Always 16
    transcriber        Code identifying the transcriber
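
    A header of this form can be read with very little code. The
    Python sketch below handles only the two type specifiers used
    in this collection and is not a general NIST SPHERE reader;
    the function name parse_header is an assumption.

```python
def parse_header(header_bytes):
    """Parse a header block into (header_size, fields).

    Only the "-i" and "-sn" type specifiers are handled.
    """
    lines = header_bytes.decode("ascii").split("\n")
    if lines[0] != "NIST_1A":
        raise ValueError("missing NIST_1A header introduction")
    header_size = int(lines[1])  # int() ignores the leading spaces
    fields = {}
    for line in lines[2:]:
        if line == "end_head":
            return header_size, fields
        name, type_spec, value = line.split(" ", 2)
        if type_spec == "-i":
            fields[name] = int(value)
        elif type_spec.startswith("-s"):
            length = int(type_spec[2:])
            fields[name] = value[:length]
        else:
            raise ValueError(f"unsupported type specifier: {type_spec!r}")
    raise ValueError("end_head terminator not found")
```

    Typical use would be to read the first 1024 bytes of a .DAT
    file in binary mode and pass them to parse_header, then look
    up values such as fields["sample_count"].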

4.2 Transcription Files
    ===================

    The transcription files (extension .TMT) are ASCII text files.
    The first line of each transcription file is a header
    identifying the transcribed item. The header line has the
    format---

           Transcription of BRAMSHILL item S1232

    Following the header line is a series of utterance
    transcriptions.

    Utterance transcriptions consist of two integers and the
    transcription text separated by single spaces. Each utterance
    transcription is terminated by a new line.

    The two integers define the start time (relative to the start
    of the file) and duration of the utterance in 0.1 second
    units.

    An example utterance transcription follows---

    3512 35 There is a clock in the right hand side of the picture.

    Each utterance transcription is contained on a single line,
    which minimises the parsing required for machine processing
    of transcription files. However, this means that a few
    utterances will be longer than a typical terminal line. The
    file TRFMT.C in the DOC directory contains the source code
    of an example C program which re-formats transcriptions with
    a defined right hand margin.
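
    As an illustration of how little parsing is needed, the
    following Python sketch reads a transcription file and
    converts utterance times to sample positions. It is separate
    from TRFMT.C and all names in it are assumptions. The
    conversion assumes that transcription times map directly onto
    sample positions in the matching .DAT file at 10000 samples
    per second, so that one time unit corresponds to 1000 samples.

```python
from typing import NamedTuple

SAMPLE_RATE = 10_000   # Hz, as stated for the speech data files
UNITS_PER_SECOND = 10  # times are given in 0.1 second units


class Utterance(NamedTuple):
    start: int     # start time in 0.1 second units
    duration: int  # length in 0.1 second units
    text: str

    def sample_range(self):
        """Return (first_sample, end_sample) as sample indices."""
        per_unit = SAMPLE_RATE // UNITS_PER_SECOND
        first = self.start * per_unit
        return first, first + self.duration * per_unit


def parse_transcription(lines):
    """Return (item_id, utterances) from the lines of a .TMT file."""
    lines = iter(lines)
    # The item identifier is the last word of the header line.
    item_id = next(lines).split()[-1]
    utterances = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        parts = line.split(" ", 2)
        text = parts[2] if len(parts) > 2 else ""
        utterances.append(Utterance(int(parts[0]), int(parts[1]), text))
    return item_id, utterances
```

    Under the same assumption, the byte offset of a sample in the
    data file is 1024 plus twice its sample index.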


5.  Index File Formats
    ==================

    The INDEX directory contains 4 files---

        ITEMS.IDX    - List of items in the collection
        SPEAKERS.IDX - List of speakers
        PAIRS.IDX    - Item pairing list
        DICT.TXT     - Transcription Dictionary

    These are all ASCII text files. They are intended for loading
    into a database package or for processing into a locally
    useful format. Therefore, they are simply structured and
    amenable to automatic processing, rather than being formatted
    for direct inspection.

5.1 Speaker List
    ============

    The speaker list, SPEAKERS.IDX, is an ASCII text file
    containing such information as is known about each speaker.
    During the original data collection, information was not
    recorded consistently, and in many cases it was not recorded
    at all. So with the exception of the speaker ID, all fields
    are regarded as optional.

    Each speaker is described by 9 lines of text, each holding a
    single attribute of the speaker, so the file as a whole
    contains a multiple of 9 lines. Every line except the first
    may be empty if the information is not available.

    The content of each line is listed below---

        1       Speaker Id, e.g. S321

        2       Sex, M or F

        3       Age (integer years)

        4       Height (integer cm)

        5       Weight (integer kg)

        6       Other observations. Sometimes
                collar size is given (in cm).

        7       Birth and domicile information.
                Not always available and not consistently
                recorded when it is. However, it is
                included because in some cases it may help
                with accent etc.

        8       Comments on appearance. Sometimes information
                about speaker's build is given.

        9       Comments on accent and voice. Neither consistent
                nor universal, but possibly useful.

    Where birth and domicile information is available (field 7) it
    is recorded as items in the form <region>:<duration>.
    The <duration> field may be an age range or a number of years,
    or it may be missing altogether. For example---

               London:0-7 Wales:6 Scotland

    indicates a subject who lived in London from ages 0 to 7, lived
    for 6 years (at unknown ages) in Wales and spent some
    unspecified time in Scotland.

    Accent information (field 9) was not recorded consistently. It
    is presented in the form <primary>/<qualifier>/<qualifier> with
    <qualifier>s in decreasing order of significance. For
    example---

               Lancashire/Wigan
               Yorkshire/Slight
               unusual
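
    The record layout above could be loaded with a short Python
    routine such as the sketch below. The field names are
    invented for illustration. It further assumes that domicile
    items are separated by spaces, that region names contain no
    spaces, and that the age, height and weight lines hold clean
    integers whenever they are not empty; given the inconsistent
    recording noted above, real data may need more tolerance.

```python
FIELDS = (
    "speaker_id", "sex", "age", "height_cm", "weight_kg",
    "observations", "domicile", "appearance", "accent",
)


def parse_domicile(field):
    """Split field 7 into (region, duration) pairs; duration may be None."""
    entries = []
    for item in field.split():
        region, _, duration = item.partition(":")
        entries.append((region, duration or None))
    return entries


def parse_speakers(lines):
    """Return one dict per speaker from the lines of SPEAKERS.IDX."""
    rows = [line.rstrip("\r\n") for line in lines]
    if len(rows) % len(FIELDS) != 0:
        raise ValueError("line count is not a multiple of 9")
    speakers = []
    for i in range(0, len(rows), len(FIELDS)):
        record = dict(zip(FIELDS, rows[i:i + len(FIELDS)]))
        for key in ("age", "height_cm", "weight_kg"):
            value = record[key].strip()
            record[key] = int(value) if value else None
        record["domicile"] = parse_domicile(record["domicile"])
        # Primary accent first, then qualifiers in decreasing significance.
        record["accent"] = [p for p in record["accent"].split("/") if p]
        speakers.append(record)
    return speakers
```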

5.2 Item List
    =========

    The item list, ITEMS.IDX, is an ASCII text file containing
    such information as is known about each item. Each item is
    described by 5 lines of text. Each line contains a single
    attribute of the item. The file as a whole contains a
    multiple of 5 lines. Lines 4 and 5 may be empty if the
    information is not available.

    The content of each line is listed below---

        1       Item Id, e.g. S3212

        2       Speaker Id, e.g. S321

        3       Number of the disk containing the item, integer

        4       Picture code A, B, C or R

        5       Comment (if any)

    The picture code describes the topic of conversation.
    Participants were given a set of photographs to discuss. The
    sets were---

       Set A - 8 pairs of photographs. 4 of a fairground scene,
       1 market scene, 1 high street, 1 at a swimming pool and
       1 railway station.

       Set B - 7 pairs of photographs. 2 fairground, 3 market,
       2 air show and 1 high street.

       Set C - 8 pairs of photographs. 3 fairground, 2 market,
       2 air show, 1 high street.

       Set R - 9 pairs of photographs. 3 fairground, 2 market,
       3 air show, 1 high street.
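
    The 5 line records of the item list can be read in the same
    simple way. The Python sketch below is illustrative only; the
    function name parse_items and the dictionary keys are
    assumptions, and the sample values in the usage note are
    invented.

```python
def parse_items(lines):
    """Return a dict keyed by item id from the lines of ITEMS.IDX."""
    rows = [line.rstrip("\r\n") for line in lines]
    if len(rows) % 5 != 0:
        raise ValueError("line count is not a multiple of 5")
    items = {}
    for i in range(0, len(rows), 5):
        item_id, speaker_id, disk, picture, comment = rows[i:i + 5]
        items[item_id] = {
            "speaker_id": speaker_id,
            "disk": int(disk),
            # Lines 4 and 5 may be empty.
            "picture_set": picture or None,
            "comment": comment or None,
        }
    return items
```

    For instance, a record made of the lines "S3212", "S321",
    "2", "A" and an empty comment line would give
    items["S3212"]["disk"] equal to 2 and a picture set of "A".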

5.3 Pairs List
    ==========

    The pairs file, PAIRS.IDX, is an ASCII text file.

    Each line in the file contains two item identifiers separated
    by a single space. The two items represent the two sides of
    the same conversation.

    Note that, during digitisation of the recordings, long
    silences were automatically removed. One consequence of this
    is that the two items comprising a conversation are not
    necessarily time aligned.

    There are some items for which the pairing could not be
    identified. Such items do not appear in PAIRS.IDX.
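
    One way to use the pairs file is to build a lookup from each
    item to the item holding the other side of the conversation.
    The following Python sketch shows the idea; the function name
    parse_pairs is an assumption and the identifiers in the usage
    note are invented.

```python
def parse_pairs(lines):
    """Return a dict mapping each item id to its conversation partner."""
    partner = {}
    for line in lines:
        ids = line.split()
        if not ids:
            continue  # skip blank lines
        first, second = ids
        partner[first] = second
        partner[second] = first
    return partner
```

    With a line such as "S3211 S4562", both partner["S3211"] and
    partner["S4562"] are defined. Items whose pairing could not
    be identified are simply absent, so partner.get(item_id)
    returns None for them.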


5.4 Dictionary
    ==========

    The transcription dictionary is the one used for spell
    checking during item transcription. All the item
    transcriptions can be "spell checked" against this
    dictionary without error.

    The dictionary includes the actual words from the
    transcriptions, including slang words, part words and the
    list of non-speech sounds. It also includes words from the
    transcription comments, so it is possible that there are
    words in the dictionary for which there is no spoken example
    in any item.

    The dictionary is a sorted ASCII text file with one word per
    line.