Last modified November 30, 1995.

Top-level Readme file for Macrophone disc 1

This disc contains the first portion of training data for the Macrophone Corpus of American English Telephone Speech. This portion consists of 800 calls, grouped into subdirectories of 100 calls each, under the "train" directory.

Also on this disc are the documentation files, the data base tables, and the NIST SPHERE software package that can be used to access the waveform data.

The waveform files are stored in compressed form. The SPHERE "w_decode" utility can be used to uncompress the waveform data, which will revert to the original 8-bit mu-law sample format.

Please refer to the file "mcphondb.doc" in the "docs" directory of this disc for details on the organization of the data.


Further information is available from the following links or the overview file below:
File MCPHONDB.DOC:

Overall data base structure.

The Macrophone Corpus contains a total of 204,160 utterances from 5005 telephone calls. Each call provides between 1 and 44 utterances, and each utterance is stored in a separate waveform file, with a SPHERE header that describes the content of the file (see "finalrep.doc" in the "docs" directory for an explanation of the header contents). The 44 possible utterances in each call comprise 15 distinct response types (e.g. "yes/no", "digits", "natural number", etc).

Each call in the corpus is represented by a directory, and each call directory contains only the usable utterance waveform files from that call. Many calls have fewer than 44 utterance files because the caller misspoke the response, there was significant noise or other interference with the response, or the caller hung up prematurely.

The calls are partitioned into sets for training, development testing, and evaluation testing. The training set contains 4005 calls, the development test set contains 500 calls, and the evaluation test data are split into 5 sets of 100 calls per set. Each of the evaluation sets has been encrypted using a distinct encryption key. Researchers who wish to conduct formal evaluations of speech recognition systems using one or more of the evaluation test sets should contact the Linguistic Data Consortium to obtain the appropriate encryption key(s).

The training and test sets are segregated by the first level of directory structure, into the "train" and "test" directories. The "test" directory is further subdivided into "devtst", "eval_1", ... "eval_5" directories.
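
As a quick consistency check on the partition sizes given above, the following sketch tallies the calls per partition directory and compares the sum with the 5005 calls stated for the whole corpus. The dictionary name PARTITION_CALLS and the use of paths relative to the corpus root are assumptions made for illustration only.

```python
# Calls per partition, keyed by directory path relative to the corpus root.
# The counts come from the text above; the layout of the keys is illustrative.
PARTITION_CALLS = {"train": 4005, "test/devtst": 500}

# Five evaluation sets of 100 calls each: test/eval_1 ... test/eval_5.
PARTITION_CALLS.update({f"test/eval_{n}": 100 for n in range(1, 6)})

total_calls = sum(PARTITION_CALLS.values())

# The overview states that the corpus holds 5005 calls in total.
assert total_calls == 5005
print(total_calls)
```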

A typical path for a training waveform file looks like this:

train/00/tr0002m/yn02.wav

where:
"train" is the top-level partition;
"00" represents a sub-grouping into 100 calls (these are the first two digits of the 4-digit sequence number in the name of the call directory);
"tr0002m" is the name of the call directory;
"yn02.wav" is the name of the waveform file.

The leading characters of the call directory name encode the partition to which the call has been assigned ("tr", "dt", "et1" ... "et5"). The last character of the call directory name indicates the gender/age grouping of the caller, as follows:

	"b" = boy (under 16 years old)
	"g" = girl (under 16 years old)
	"c" = child (under 16 years old, gender not known)
	"m" = male (16 or older)
	"f" = female (16 or older)
	"a" = gender unknown, age is 16+ or unknown

The 4-digit sequence number in the call directory name was arbitrarily assigned during the production process. The sequence begins at "0000" for the training partition and for each of the test partitions.
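
The naming conventions described above can be sketched in Python as follows. This is a minimal illustration, not part of the corpus software: the function names parse_call_dir and parse_training_path are hypothetical, and the pattern assumes that a call directory name consists of exactly the partition code, the 4-digit sequence number, and the caller-group letter, with nothing in between.

```python
import re
from pathlib import PurePosixPath

# Caller groupings, copied from the list above.
CALLER_GROUPS = {
    "b": "boy (under 16 years old)",
    "g": "girl (under 16 years old)",
    "c": "child (under 16 years old, gender not known)",
    "m": "male (16 or older)",
    "f": "female (16 or older)",
    "a": "gender unknown, age is 16+ or unknown",
}

# Assumption: partition code, 4-digit sequence number, group letter,
# with no separators (as in "tr0002m").
CALL_DIR_RE = re.compile(r"(tr|dt|et[1-5])(\d{4})([bgcmfa])")


def parse_call_dir(name):
    """Decode a call directory name into its three fields."""
    match = CALL_DIR_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized call directory name: {name!r}")
    code, sequence, group = match.groups()
    return {
        "partition_code": code,
        "sequence": sequence,
        "caller_group": CALLER_GROUPS[group],
    }


def parse_training_path(path):
    """Split a training path of the form train/NN/call_dir/file.wav."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected 4 path components, got {len(parts)}")
    partition, subgroup, call_dir, wav_name = parts
    call = parse_call_dir(call_dir)
    # The sub-grouping is the first two digits of the sequence number.
    if subgroup != call["sequence"][:2]:
        raise ValueError(f"sub-group {subgroup!r} does not match {call_dir!r}")
    return {"partition": partition, "subgroup": subgroup,
            "call_dir": call_dir, "wav_name": wav_name, **call}


info = parse_training_path("train/00/tr0002m/yn02.wav")
print(info["sequence"], info["caller_group"])
```

Keeping the sequence number as a string preserves its leading zeros, which is what the sub-grouping directory name is compared against.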

The individual waveform file names encode the response type of the utterance and the sequence number of the prompt that elicited the utterance. The file-name strings and the associated response types are shown in the following table:

Prompt #  Filename    Response type
       1  "yn01"      yes/no
       2  "yn02"      yes/no
       3  "yn03"      yes/no
       4  "yn04"      yes/no
       5  "natnum05"  natural_number
       6  "date06"    date
       7  "time07"    time
       8  "date08"    date
       9  "place09"   place
      10  "digits10"  digits
      11  "appwrd11"  application_word
      12  "ATIS12"    ATIS
      13  "digits13"  digits
      14  "place14"   place
      15  "spword15"  spelled_word
      16  "natnum16"  natural_number
      17  "nmatad17"  name_at_address
      18  "natnum18"  natural_number
      19  "TIMIT19"   TIMIT
      20  "time20"    time
      21  "appwrd21"  application_word
      22  "WSJ22"     WSJ
      23  "dlramt23"  dollar_amount
      24  "nmatag24"  name_at_agency
      25  "dlramt25"  dollar_amount
      26  "date26"    date
      27  "TIMIT27"   TIMIT
      28  "appwrd28"  application_word
      29  "dlramt29"  dollar_amount
      30  "appwrd30"  application_word
      31  "digits31"  digits
      32  "TIMIT32"   TIMIT
      33  "nmatad33"  name_at_address
      34  "place34"   place
      35  "fract35"   fraction
      36  "dlramt36"  dollar_amount
      37  "appwrd37"  application_word
      38  "spword38"  spelled_word
      39  "nmatad39"  name_at_address
      40  "ATIS40"    ATIS
      41  "natnum41"  natural_number
      42  "WSJ42"     WSJ
      43  "appwrd43"  application_word
      44  "yn44"      yes/no

In accordance with this list, a given call directory will contain, for example, at most five files whose names begin with "yn".
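
The file-name convention in the table can be decoded with a short Python sketch such as the one below. The names RESPONSE_TYPES and parse_wav_name are hypothetical, and the sketch assumes that every waveform file name is a response-type prefix followed by the two-digit prompt number and the ".wav" extension. Prefixes are compared without regard to case, which is a convenience choice rather than something the documentation specifies.

```python
import re

# File-name prefix (lowercased) -> response type, taken from the table above.
RESPONSE_TYPES = {
    "yn": "yes/no",
    "natnum": "natural_number",
    "date": "date",
    "time": "time",
    "place": "place",
    "digits": "digits",
    "appwrd": "application_word",
    "atis": "ATIS",
    "spword": "spelled_word",
    "nmatad": "name_at_address",
    "timit": "TIMIT",
    "wsj": "WSJ",
    "dlramt": "dollar_amount",
    "nmatag": "name_at_agency",
    "fract": "fraction",
}

# Assumption: alphabetic prefix, two-digit prompt number, ".wav" extension.
WAV_NAME_RE = re.compile(r"([A-Za-z]+)(\d{2})\.wav")


def parse_wav_name(name):
    """Return (response_type, prompt_number) for a waveform file name."""
    match = WAV_NAME_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized waveform file name: {name!r}")
    prefix, prompt = match.groups()
    try:
        response_type = RESPONSE_TYPES[prefix.lower()]
    except KeyError:
        raise ValueError(f"unknown response-type prefix: {prefix!r}") from None
    return response_type, int(prompt)


print(parse_wav_name("yn02.wav"))
print(parse_wav_name("TIMIT19.wav"))
```

The mapping has 15 entries, one for each distinct response type in the corpus.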

ADDITIONAL MATERIALS

In addition to the waveform data, each Macrophone disc contains a complete set of documentation (in the "docs" directory), a copy of the NIST SPHERE software package for use in uncompressing and accessing the waveform data (in the "sphere" directory), and a set of data base tables that provides systematic information about the calls, the callers, and the utterances (in the "tables" directory). There is a "readme" file in each of these directories, which users should consult for additional information.

For convenience, the contents of these three supplemental directories are replicated on each disc in the Macrophone corpus.