BRAMSHILL Speech Collection

This file describes the organisation and format of the BRAMSHILL speech collection.

The BRAMSHILL collection is a set of CD-ROMs containing digitised and transcribed recordings of free conversation. Each item is a recording of one half of a two-speaker conversation. Most of the recordings include a standard set of test sentences as well as the conversation.

1. Directory Structure

Each CD-ROM has three top level directories.

The DOC directory contains this and other text files describing the collection.

The INDEX directory contains catalogue information.

The SPEAKERS directory contains the data. Within SPEAKERS there is a separate sub-directory for each speaker. All the material for any given speaker is contained on a single CD-ROM.


                  
                     |
                     |
   ====================================
   |                 |                |
   |                 |                |
  DOC              INDEX           SPEAKERS
                                      |
                                      |
                       =============================
                       |                           |
                       |                           |
                      Snnn   ....   ....   ....  Smmm
                       |
                       |
           =================================================
           |           |          |         |      |     |
           |           |          |         |      |     |
       Snnn1.DAT  Snnn1.TMT  Snnn2.DAT Snnn2.TMT  .... ....

2. File Naming Convention

Each speaker in the collection has been allocated a four character identifier, consisting of the letter S and 3 digits, for example S324.

Each recording has been allocated a five character identifier consisting of the speaker's identifier plus an additional digit, for example S3242.

Speech data files have the extension .DAT and the transcription files have the extension .TMT.

An item is composed of one data file and one transcription file. For example an item might consist of the pair of files S3241.DAT and S3241.TMT.
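The naming convention and directory layout can be combined to locate the files for an item. The following Python fragment is a minimal sketch only; the function name `item_files` and the mount point `/cdrom` are illustrative assumptions, and the case of file names as seen by the operating system may differ from that shown here.

```python
import re
from pathlib import PurePosixPath

# Item identifier: the letter S, three speaker digits, one recording digit.
ITEM_ID = re.compile(r"S\d{4}")

def item_files(item_id, root="."):
    """Return (data_path, transcription_path) for an item such as "S3241".

    `root` is a hypothetical mount point for the CD-ROM.
    """
    if not ITEM_ID.fullmatch(item_id):
        raise ValueError(f"not a valid item identifier: {item_id!r}")
    speaker_id = item_id[:4]  # first four characters, e.g. "S324"
    speaker_dir = PurePosixPath(root) / "SPEAKERS" / speaker_id
    return speaker_dir / f"{item_id}.DAT", speaker_dir / f"{item_id}.TMT"

dat, tmt = item_files("S3241", root="/cdrom")
print(dat)  # /cdrom/SPEAKERS/S324/S3241.DAT
print(tmt)  # /cdrom/SPEAKERS/S324/S3241.TMT
```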

3. The Transcriptions

For transcription purposes, the recordings were divided into "utterances". These utterances are short sections of speech, typically a phrase or short sentence. The division into utterances was carried out by an automated process which attempted to place utterance boundaries in the pauses between words. Occasionally, however, in cases such as long runs of unbroken speech, the utterance boundaries might occur in inappropriate locations. This should be remembered when using the transcriptions.

The transcription files contain the start time, length and corresponding text of each utterance. Time and length are given in 0.1 second units.

A number of conventions were adopted in the transcriptions:

The speech was transcribed verbatim. No attempt was made to correct grammar, fill in missing words etc.
Proper dictionary words were used unless the pronunciation was radically different, in which case a new slang word was introduced. For example, "that're" would normally be transcribed as "that are", but the word "gonna" would be introduced rather than transcribing as "going to".
Contractions such as "it's" for "it is" and "they're" for "they are" were transcribed, but in cases of doubt the preference was for the uncontracted version.
Hesitation sounds were transcribed from a limited set of conventional representations including "uh" and "um". No attempt was made to distinguish between sounds such as "erm" and "hmm" or between "uh" and "ah".
Conventional English punctuation was included where possible but much of the material is ungrammatical and the punctuation is no more than an aid to readability. Punctuation was limited to .,?!:; and ellipses (...).
Spoken numbers were spelled out. Spoken letters were transcribed as single upper case letters. So a vehicle registration number might be transcribed as "D seven three six K N Y".
Unclear sections were enclosed in double parentheses. Completely unintelligible passages were represented as a single space between double parentheses "(( ))".
Where a speaker breaks off in mid word the part word was transcribed followed by a hyphen and a comma or other punctuation mark. For example "phot-," (for broken "photograph").
Related letter/word sequences were transcribed in a style such as "T -shirt" or "O K -ing".
Apart from the special cases described above, hyphens are not used.
Proper names were capitalised, for example "... the Great North Road ...".
If the speakers referred to text visible in the photographs and there was possible ambiguity, the text was transcribed in upper case. For example, "I can see CANDY FLOSS" refers to lettering, rather than the actual candy floss.
Non-speech sounds were shown in square brackets. For example, "[cough]". Continuous sounds were represented by a start/end pair such as "[bell] ... [\bell]".
Occasionally the transcribers recorded comments in braces. For example "{very loud}".
Where it was possible to identify a change of discussion topic, a double "at" marker "@@" was included. Most transcriptions include at least one "@@" between the standard test sentences and the conversational part of the recording.
A few short sections of the recordings have been replaced by binary zeroes to protect the anonymity of the speakers. Such sections were transcribed with the single comment "{ZERO}".

Every lexical word from the transcriptions is contained in the dictionary supplied in the INDEX directory. Contractions, part-words, slang words, hesitation sounds and non-speech sounds are all treated as words in their own right in the dictionary.

A definitive list of the non-verbal sounds such as "[cough]" can be obtained by searching the dictionary for words starting with "[" and ending with "]".
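The markup conventions listed above are regular enough to be picked out mechanically. The following Python fragment is a rough sketch, not the method used to produce the transcriptions; the pattern, the category names and the sample sentence are all illustrative assumptions. Punctuation is simply skipped.

```python
import re

# One named alternative per transcription convention.
TOKEN = re.compile(
    r"(?P<unclear>\(\(.*?\)\))"                     # ((unclear words)) or (( ))
    r"|(?P<sound>\[[^\]]*\])"                       # [cough], [bell], [\bell]
    r"|(?P<comment>\{[^}]*\})"                      # {very loud}, {ZERO}
    r"|(?P<topic>@@)"                               # change of topic
    r"|(?P<partword>[A-Za-z']+-(?=[.,?!:;]|\s|$))"  # broken word such as phot-,
    r"|(?P<word>-?[A-Za-z']+)"                      # ordinary words, also -shirt, -ing
)

def classify(text):
    """Return a list of (category, token) pairs for one transcription text."""
    return [(m.lastgroup, m.group()) for m in TOKEN.finditer(text)]

# Invented sample text that exercises several conventions.
sample = "I saw a phot-, um [cough] ((in the)) {very loud} @@ T -shirt"
for category, token in classify(sample):
    print(category, token)
```

Part-words are tested before ordinary words so that the trailing hyphen stays attached to the fragment.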

4. Data File Formats

4.1 Speech Data Files

The speech data was sampled at 10 kHz and stored as 16-bit 2's complement integers, least significant byte first. Each recording is stored in a single file (extension .DAT). The first 1024 bytes of the file contain an ASCII header. The remainder of the file is the speech data. The data section of a speech file is an unstructured byte stream. There is no record or block structure.
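Given this layout, the samples can be decoded by skipping the fixed-size header and unpacking the remainder as little-endian signed 16-bit integers. The following Python fragment is a minimal sketch; the function name is an assumption, and the input is a small synthetic byte string rather than a real recording.

```python
import struct

HEADER_BYTES = 1024      # fixed-size ASCII header
SAMPLE_RATE_HZ = 10_000  # 10 kHz sampling

def decode_samples(file_bytes):
    """Decode the data section of a .DAT file into a tuple of ints."""
    body = file_bytes[HEADER_BYTES:]
    count = len(body) // 2
    # "<" = least significant byte first, "h" = signed 16-bit integer
    return struct.unpack(f"<{count}h", body[: count * 2])

# Synthetic stand-in for a real file: blank header plus three samples.
fake = b"\n" * HEADER_BYTES + struct.pack("<3h", -28127, 0, 25763)
samples = decode_samples(fake)
print(samples)  # (-28127, 0, 25763)
```

The duration of a recording in seconds is the number of samples divided by the sample rate.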

The header is in the NIST SPHERE format, which is ASCII text based. An example header follows---

    NIST_1A
       1024
    database_id -s9 BRAMSHILL
    database_version -s3 1.0
    utterance_id -s5 S1231
    channel_count -i 1
    sample_count -i 6000000
    sample_rate -i 10000
    sample_min -i -28127
    sample_max -i 25763
    sample_n_bytes -i 2
    sample_byte_format -s2 01
    sample_sig_bits -i 16
    transcriber -s3 DWK
    end_head

The first two lines are the standard header introduction. These are each eight characters long (including the "newline" terminator). The last line is the standard header terminator. The remainder of the header is padded to 1024 bytes with "newline" characters. This means that the header can be inspected easily by using a utility program such as "more".

The body of the header consists of a set of "triples" each having the general form "name type value". In the BRAMSHILL database, only two type specifiers are used---

        -i  - Integer
        -sn - String, length n characters

Each triple is terminated by a single "newline" character.

The header fields are---


    database_id        Database name, always "BRAMSHILL"
    database_version   Database Version, for example "1.0"
    utterance_id       Item identifier, for example "S1231"
    channel_count      Always 1
    sample_count       Number of samples in the data file
    sample_rate        Sample rate, always 10000
    sample_min         Minimum sample value
    sample_max         Maximum sample value
    sample_n_bytes     Bytes per sample, always 2
    sample_byte_format Sample byte order, always "01", LSB first
    sample_sig_bits    Always 16
    transcriber        Code identifying the transcriber
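Because the header is a fixed-size block of "name type value" triples, it can be parsed with a few lines of code. The following Python fragment is a sketch under the assumptions stated in this section (two type specifiers only, "newline" padding); the function name is illustrative, and the header is rebuilt from the example above rather than read from a disk.

```python
def parse_header(raw):
    """Parse a 1024-byte header into a dict of field name -> value."""
    lines = raw.decode("ascii").split("\n")
    if lines[0] != "NIST_1A" or int(lines[1]) != 1024:
        raise ValueError("unexpected header introduction")
    fields = {}
    for line in lines[2:]:
        if line == "end_head":
            return fields  # padding after the terminator is ignored
        name, ftype, value = line.split(" ", 2)
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype.startswith("-s"):
            fields[name] = value[: int(ftype[2:])]  # string of n characters
        else:
            raise ValueError(f"unsupported type specifier: {ftype}")
    raise ValueError("header terminator not found")

# The example header, padded to 1024 bytes with newline characters.
example = (
    "NIST_1A\n   1024\n"
    "database_id -s9 BRAMSHILL\n"
    "utterance_id -s5 S1231\n"
    "sample_count -i 6000000\n"
    "sample_rate -i 10000\n"
    "sample_min -i -28127\n"
    "sample_byte_format -s2 01\n"
    "end_head\n"
).encode("ascii").ljust(1024, b"\n")

fields = parse_header(example)
print(fields["utterance_id"])                          # S1231
print(fields["sample_count"] / fields["sample_rate"])  # 600.0
```

In this example the recording is 600 seconds long (6000000 samples at 10000 samples per second).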

4.2 Transcription Files

The transcription files (extension .TMT) are ASCII text files. The first line of each transcription file is a header identifying the transcribed item. The header line has the format---


           Transcription of BRAMSHILL item S1232

Following the header line is a series of utterance transcriptions.

Utterance transcriptions consist of two integers and the transcription text separated by single spaces. Each utterance transcription is terminated by a new line.

The two integers define the start time (relative to the start of the file) and duration of the utterance in 0.1 second units.

An example utterance transcription follows---

3512 35 There is a clock in the right hand side of the picture.

Keeping each utterance transcription on a single line minimises the parsing required for machine processing of transcription files. However, it means that a few utterances will be longer than a typical terminal line. The file TRFMT.C in the DOC directory contains the source code of an example program in C which will re-format transcriptions with a defined right hand margin.
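A transcription file can be read with very little code. The following Python fragment is a sketch only and is unrelated to TRFMT.C; the names `Utterance`, `parse_tmt` and `sample_range` are illustrative. The conversion to sample positions assumes that times are measured from the first sample of the data section, so that one 0.1 second unit corresponds to 1000 samples at 10 kHz.

```python
from typing import NamedTuple

SAMPLES_PER_UNIT = 1000  # 0.1 s at 10 kHz (assumed alignment, see above)

class Utterance(NamedTuple):
    start: int   # start time in 0.1 second units
    length: int  # duration in 0.1 second units
    text: str

def parse_tmt(content):
    """Return (item_id, list of Utterance) for the text of a .TMT file."""
    lines = content.splitlines()
    item_id = lines[0].split()[-1]  # last word of the header line
    utterances = []
    for line in lines[1:]:
        if not line.strip():
            continue  # tolerate blank lines, as a precaution
        parts = line.split(" ", 2)
        text = parts[2] if len(parts) > 2 else ""
        utterances.append(Utterance(int(parts[0]), int(parts[1]), text))
    return item_id, utterances

def sample_range(utt):
    """Return (first_sample, end_sample) covering one utterance."""
    first = utt.start * SAMPLES_PER_UNIT
    return first, first + utt.length * SAMPLES_PER_UNIT

sample = (
    "Transcription of BRAMSHILL item S1232\n"
    "3512 35 There is a clock in the right hand side of the picture.\n"
)
item_id, utts = parse_tmt(sample)
print(item_id, utts[0].start / 10, utts[0].length / 10)  # S1232 351.2 3.5
print(sample_range(utts[0]))                             # (3512000, 3547000)
```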

5. Index File Formats

The INDEX directory contains 4 files---

ITEMS.IDX - List of items in the collection
SPEAKERS.IDX - List of speakers
PAIRS.IDX - Item pairing list
DICT.TXT - Transcription Dictionary

These are all ASCII text files. They are intended for loading into a database package or for processing into a locally useful format. Therefore, they are simply structured and amenable to automatic processing, rather than being formatted for direct inspection.

5.1 Speaker List

The speaker list, SPEAKERS.IDX, is an ASCII text file containing such information as is known about each speaker. During the original data collection, information was not recorded consistently, and in many cases it was not recorded at all. So with the exception of the speaker ID, all fields are regarded as optional.

Each speaker is described by 9 lines of text. Each line contains a single attribute of the speaker, so the file as a whole contains a multiple of 9 lines. Any line other than the first may be empty if the information is not available.

The content of each line is listed below---

        1       Speaker Id, e.g. S321

        2       Sex, M or F

        3       Age (integer years)

        4       Height (integer cm)

        5       Weight (integer kg)

        6       Other observations. Sometimes
                collar size is given (in cm).

        7       Birth and domicile information.
                Not always available and not consistently
                recorded when it is. However, it is
                included because in some cases it may help
                with accent etc.

        8       Comments on appearance. Sometimes information
                about speaker's build is given.

        9       Comments on accent and voice. Neither consistent
                nor universal, but possibly useful.

Where birth and domicile information is available (field 7), it is recorded as space-separated items in the form place:period. The period may be an age range or a number of years, or it may be missing altogether. For example---

London:0-7 Wales:6 Scotland

indicates a subject who lived in London from ages 0 to 7, lived for 6 years (at unknown ages) in Wales and spent some unspecified time in Scotland.

Accent information (field 9) was not recorded consistently. It is presented as one or more descriptors separated by "/" characters, with the descriptors in decreasing order of significance. For example---

Lancashire/Wigan
Yorkshire/Slight
unusual
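The 9-line record structure lends itself to simple batch processing. The following Python fragment is a sketch; the field names and the sample record are invented for illustration. It assumes that every line, including the last, ends with a newline, and that place names in field 7 contain no spaces.

```python
FIELDS = ("speaker_id", "sex", "age", "height_cm", "weight_kg",
          "other", "birth_domicile", "appearance", "accent")

def parse_speakers(content):
    """Split the text of SPEAKERS.IDX into one dict per speaker."""
    lines = content.splitlines()
    if len(lines) % 9:
        raise ValueError("expected a multiple of 9 lines")
    speakers = []
    for i in range(0, len(lines), 9):
        record = dict(zip(FIELDS, lines[i:i + 9]))
        for key in ("age", "height_cm", "weight_kg"):
            record[key] = int(record[key]) if record[key] else None
        speakers.append(record)
    return speakers

def parse_domicile(field):
    """Turn "London:0-7 Wales:6 Scotland" into (place, period) pairs."""
    pairs = []
    for item in field.split():
        place, _, period = item.partition(":")
        pairs.append((place, period or None))
    return pairs

def parse_accent(field):
    """Return accent descriptors, most significant first."""
    return field.split("/") if field else []

# Invented record: fields 4, 5, 6 and 8 are empty.
sample = "\n".join(["S321", "M", "34", "", "", "",
                    "London:0-7 Wales:6 Scotland", "", "Lancashire/Wigan"]) + "\n"
speaker = parse_speakers(sample)[0]
print(parse_domicile(speaker["birth_domicile"]))
# [('London', '0-7'), ('Wales', '6'), ('Scotland', None)]
print(parse_accent(speaker["accent"]))  # ['Lancashire', 'Wigan']
```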

5.2 Item List

The item list, ITEMS.IDX, is an ASCII text file containing such information as is known about each item. Each item is described by 5 lines of text. Each line contains a single attribute of the item. The file as a whole contains a multiple of 5 lines. Lines 4 and 5 may be empty if the information is not available.

The content of each line is listed below---


        1       Item Id, e.g. S3212

        2       Speaker Id, e.g. S321

        3       Number of the disk containing the item, integer

        4       Picture code A, B, C or R

        5       Comment (if any)

The picture code describes the topic of conversation. Participants were given a set of photographs to discuss. The sets were---

       Set A - 8 pairs of photographs. 4 of a fairground scene,
       1 market scene, 1 high street, 1 at a swimming pool and
       1 railway station.

       Set B - 7 pairs of photographs. 2 fairground, 3 market,
       2 air show and 1 high street.

       Set C - 8 pairs of photographs. 3 fairground, 2 market,
       2 air show, 1 high street.

       Set R - 9 pairs of photographs. 3 fairground, 2 market,
       3 air show, 1 high street.
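The item list can be loaded in the same record-at-a-time fashion. The following Python fragment is a sketch which groups items by speaker; the function names, key names and sample records are invented for illustration, and every line (including the last) is assumed to end with a newline.

```python
from collections import defaultdict

def parse_items(content):
    """Split the text of ITEMS.IDX into one dict per item."""
    lines = content.splitlines()
    if len(lines) % 5:
        raise ValueError("expected a multiple of 5 lines")
    items = []
    for i in range(0, len(lines), 5):
        item_id, speaker_id, disk, picture, comment = lines[i:i + 5]
        items.append({
            "item_id": item_id,
            "speaker_id": speaker_id,
            "disk": int(disk),
            "picture_set": picture or None,  # A, B, C or R when known
            "comment": comment or None,
        })
    return items

def items_by_speaker(items):
    """Map each speaker identifier to a list of item identifiers."""
    grouped = defaultdict(list)
    for item in items:
        grouped[item["speaker_id"]].append(item["item_id"])
    return dict(grouped)

# Two invented records; the second has no picture code.
sample = "\n".join(["S3211", "S321", "3", "A", "",
                    "S3212", "S321", "3", "", "short recording"]) + "\n"
items = parse_items(sample)
print(items_by_speaker(items))  # {'S321': ['S3211', 'S3212']}
```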

5.3 Pairs List

The pairs file, PAIRS.IDX, is an ASCII text file.

Each line in the file contains two item identifiers separated by a single space. The two items represent the two sides of the same conversation.

Note that, during digitisation of the recordings, long silences were automatically removed. One consequence of this is that the two items comprising a conversation are not necessarily time aligned.

There are some items for which the pairing could not be identified. Such items do not appear in PAIRS.IDX.
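A convenient way to use the pairs file is to build a lookup table that works in both directions. The following Python fragment is a sketch; the function name and the item identifiers in the sample are invented.

```python
def load_pairs(content):
    """Map each paired item identifier to the other side of its conversation."""
    partner = {}
    for line in content.splitlines():
        if not line.strip():
            continue  # tolerate blank lines, as a precaution
        first, second = line.split()
        partner[first] = second
        partner[second] = first
    return partner

sample = "S1231 S4562\nS3212 S7891\n"
partner = load_pairs(sample)
print(partner["S4562"])      # S1231
print(partner.get("S9991"))  # None, since this item has no known pairing
```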

5.4 Dictionary

The transcription dictionary is the one used for spell checking during item transcription. All the item transcriptions can be "spell checked" against this dictionary without error.

The dictionary includes the actual words from the transcriptions, including slang words, part words and the list of non-speech sounds. It also includes words from the transcription comments, so it is possible that there are words in the dictionary for which there is no spoken example in any item.

The dictionary is a sorted ASCII text file with one word per line.
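With one word per line, the dictionary is easily loaded and searched. The following Python fragment is a sketch showing how the non-speech sounds might be listed and how a set of tokens might be checked against the dictionary; the function names and the sample word list are invented, and no particular sort order or case convention is assumed.

```python
def load_dictionary(content):
    """Return the dictionary words from the text of DICT.TXT."""
    return [line for line in content.splitlines() if line]

def non_speech_sounds(words):
    """Return the entries that start with "[" and end with "]"."""
    return [w for w in words if w.startswith("[") and w.endswith("]")]

def unknown_words(tokens, words):
    """Return the tokens that do not appear in the dictionary."""
    known = set(words)
    return [t for t in tokens if t not in known]

# Invented extract standing in for the real dictionary.
sample = "[bell]\n[cough]\ngonna\nphot-\nthe\num\n"
words = load_dictionary(sample)
print(non_speech_sounds(words))                      # ['[bell]', '[cough]']
print(unknown_words(["um", "the", "zebra"], words))  # ['zebra']
```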