This disc contains the first portion of training data for the Macrophone Corpus of American English Telephone Speech. This portion consists of 800 calls, grouped into subdirectories of 100 calls each, under the "train" directory.
Also on this disc are the documentation files, the data base tables, and the NIST SPHERE software package that can be used to access the waveform data.
The waveform files are stored in compressed form. The SPHERE "w_decode" utility can be used to decompress the waveform data, restoring the original 8-bit mu-law sample format.
Please refer to the file "mcphondb.doc" in the "docs" directory of this disc for details on the organization of the data.
The Macrophone Corpus contains a total of 204,160 utterances from 5005 telephone calls. Each call provides between 1 and 44 utterances, and each utterance is stored in a separate waveform file, with a SPHERE header that describes the content of the file (see "finalrep.doc" in the "docs" directory for an explanation of the header contents). The 44 possible utterances in each call comprise 15 distinct response types (e.g. "yes/no", "digits", "natural number").
Each call in the corpus is represented by a directory, and each call directory contains only the usable utterance waveform files from that call. Many calls have fewer than 44 utterance files because the caller misspoke the response, there was significant noise or other interference with the response, or the caller hung up prematurely.
The calls are partitioned into sets for training, development testing, and evaluation testing. The training set contains 4005 calls, the development test set contains 500 calls, and the evaluation test data are split into 5 sets of 100 calls per set. Each of the evaluation sets has been encrypted using a distinct encryption key. Researchers who wish to conduct formal evaluations of speech recognition systems using one or more of the evaluation test sets should contact the Linguistic Data Consortium to obtain the appropriate encryption key(s).
The training and test sets are segregated by the first level of directory structure, into the "train" and "test" directories. The "test" directory is further subdivided into the "devtst" and "eval_1" ... "eval_5" directories.
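As a quick consistency check on the figures above, the partition directories and their call counts can be tabulated as in the following sketch. The name PARTITIONS and the use of "/" to join directory levels are assumptions made for this example only; they are not part of the corpus.

```python
# Hypothetical table of partition directories and call counts, using the
# figures quoted in this file.
PARTITIONS = {"train": 4005, "test/devtst": 500}

# The evaluation test data are split into five sets of 100 calls each.
PARTITIONS.update({f"test/eval_{i}": 100 for i in range(1, 6)})

# The per-partition counts should add up to the corpus total.
total_calls = sum(PARTITIONS.values())
print(total_calls)  # 5005
```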
A typical path for a training waveform file looks like this:
train/00/tr0002m/yn02.wav
where:
"train" is the top-level partition;
"00" represents a sub-grouping into 100 calls (these are the
first two digits of the 4-digit sequence number in the name
of the call directory);
"tr0002m" is the name of the call directory;
"yn02.wav" is the name of the waveform file.
The leading characters of the call directory name encode the partition to which the call has been assigned ("tr", "dt", "et1" ... "et5"). The last character of the call name indicates the gender/age grouping of the caller, as follows:
"b" = boy (under 16 years old)
"g" = girl (under 16 years old)
"c" = child (under 16 years old, gender not known)
"m" = male (16 or older)
"f" = female (16 or older)
"a" = gender unknown, age is 16+ or unknown
The 4-digit sequence number in the call directory name was arbitrarily assigned during the production process. The sequence begins at "0000" for the training partition and for each of the test partitions.
The individual waveform file names encode the response type of the utterance and the sequence number of the prompt that elicited the utterance. The file-name strings and the associated response types are shown in the following table:
Prompt #   Filename     Response type
1          "yn01"       yes/no
2          "yn02"       yes/no
3          "yn03"       yes/no
4          "yn04"       yes/no
5          "natnum05"   natural_number
6          "date06"     date
7          "time07"     time
8          "date08"     date
9          "place09"    place
10         "digits10"   digits
11         "appwrd11"   application_word
12         "ATIS12"     ATIS
13         "digits13"   digits
14         "place14"    place
15         "spword15"   spelled_word
16         "natnum16"   natural_number
17         "nmatad17"   name_at_address
18         "natnum18"   natural_number
19         "TIMIT19"    TIMIT
20         "time20"     time
21         "appwrd21"   application_word
22         "WSJ22"      WSJ
23         "dlramt23"   dollar_amount
24         "nmatag24"   name_at_agency
25         "dlramt25"   dollar_amount
26         "date26"     date
27         "TIMIT27"    TIMIT
28         "appwrd28"   application_word
29         "dlramt29"   dollar_amount
30         "appwrd30"   application_word
31         "digits31"   digits
32         "TIMIT32"    TIMIT
33         "nmatad33"   name_at_address
34         "place34"    place
35         "fract35"    fraction
36         "dlramt36"   dollar_amount
37         "appwrd37"   application_word
38         "spword38"   spelled_word
39         "nmatad39"   name_at_address
40         "ATIS40"     ATIS
41         "natnum41"   natural_number
42         "WSJ42"      WSJ
43         "appwrd43"   application_word
44         "yn44"       yes/no
In accordance with this list, a given call directory will contain, for example, at most five files whose names begin with "yn".
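The file-name convention can be decoded along the lines of the sketch below. The names RESPONSE_TYPES, WAV_NAME and decode_wav_name are hypothetical. The code assumes a ".wav" extension preceded by a response-type string and a two-digit prompt number; stems are compared in lower case because the letter case of file names as seen on a mounted disc may differ from the table.

```python
import re

# Response-type strings from the table, keyed in lower case.
RESPONSE_TYPES = {
    "yn": "yes/no",
    "natnum": "natural_number",
    "date": "date",
    "time": "time",
    "place": "place",
    "digits": "digits",
    "appwrd": "application_word",
    "atis": "ATIS",
    "spword": "spelled_word",
    "nmatad": "name_at_address",
    "timit": "TIMIT",
    "wsj": "WSJ",
    "dlramt": "dollar_amount",
    "nmatag": "name_at_agency",
    "fract": "fraction",
}

# Assumed layout: letters, two-digit prompt number, ".wav" extension.
WAV_NAME = re.compile(r"([A-Za-z]+)(\d{2})\.wav", re.IGNORECASE)


def decode_wav_name(filename):
    """Return (response_type, prompt_number) for a waveform file name."""
    match = WAV_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"unrecognized waveform file name: {filename!r}")
    stem, prompt = match.groups()
    try:
        response_type = RESPONSE_TYPES[stem.lower()]
    except KeyError:
        raise ValueError(f"unknown response type in {filename!r}") from None
    return response_type, int(prompt)


print(decode_wav_name("yn02.wav"))  # ('yes/no', 2)
```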
ADDITIONAL MATERIALS
In addition to the waveform data, each Macrophone disc contains a complete set of documentation (in the "docs" directory), a copy of the NIST SPHERE software package for use in uncompressing and accessing the waveform data (in the "sphere" directory), and a set of data base tables that provides systematic information about the calls, the callers, and the utterances (in the "tables" directory). There is a "readme" file in each of these directories, which users should consult for additional information.
For convenience, the contents of these three supplemental directories are replicated on each disc in the Macrophone corpus.