OVERALL DATA BASE STRUCTURE

The Macrophone Corpus contains a total of 204,160 utterances from 5005
telephone calls.  Each call provides between 1 and 44 utterances, and
each utterance is stored in a separate waveform file, with a SPHERE
header that describes the content of the file (see "finalrep.doc" in
the "docs" directory for an explanation of the header contents).  The
44 possible utterances in each call comprise 15 distinct response
types (e.g. "yes/no", "digits", "natural number", etc.).

Each call in the corpus is represented by a directory, and each call
directory contains only the useable utterance waveform files from that
call.  Many calls have fewer than 44 utterance files because the
caller misspoke the response, there was significant noise or other
interference with the response, or the caller hung up prematurely.

The calls are partitioned into sets for training, development testing,
and evaluation testing.  The training set contains 4005 calls, the
development test set contains 500 calls, and the evaluation test data
are split into 5 sets of 100 calls per set.  Each of the evaluation
sets has been encrypted using a distinct encryption key.  Researchers
who wish to conduct formal evaluations of speech recognition systems
using one or more of the evaluation test sets should contact the
Linguistic Data Consortium to obtain the appropriate encryption
key(s).
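
The partition sizes given above can be cross-checked against the
corpus total of 5005 calls.  The following is only an illustrative
sketch: the dictionary name is hypothetical, and its keys simply
mirror the partition directory names described below.

```python
# Call counts per partition, as stated in this document.
# PARTITION_CALLS is an illustrative name, not part of the corpus tables.
PARTITION_CALLS = {"train": 4005, "devtst": 500}

# Five evaluation sets of 100 calls each.
PARTITION_CALLS.update({f"eval_{i}": 100 for i in range(1, 6)})

total_calls = sum(PARTITION_CALLS.values())
print(total_calls)  # 5005
```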

The training and test sets are segregated by the first level of
directory structure, into the "train" and "test" directories.  The "test"
directory is further subdivided into "devtst", "eval_1", ... "eval_5"
directories.

A typical path for a training waveform file looks like this:

	train/00/tr0002m/yn02.wav

where:
	"train" is the top-level partition; 
	"00" represents a sub-grouping into 100 calls (these are the
		first two digits of the 4-digit sequence number in
		the name of the call directory);
	"tr0002m" is the name of the call directory;
	"yn02.wav" is the name of the waveform file.
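
As a minimal sketch of the layout just described, a training path
can be split into its four components as follows.  The function name
and the returned dictionary keys are hypothetical, chosen here for
illustration; only the four-level structure comes from this document.

```python
from pathlib import PurePosixPath


def split_training_path(path):
    """Split a path such as 'train/00/tr0002m/yn02.wav' into its parts."""
    parts = PurePosixPath(path).parts
    if len(parts) != 4:
        raise ValueError(f"expected 4 path components, got {len(parts)}: {path!r}")
    partition, group, call_dir, waveform = parts
    return {
        "partition": partition,  # top-level partition, e.g. "train"
        "group": group,          # sub-grouping of 100 calls, e.g. "00"
        "call_dir": call_dir,    # call directory name, e.g. "tr0002m"
        "waveform": waveform,    # waveform file name, e.g. "yn02.wav"
    }


info = split_training_path("train/00/tr0002m/yn02.wav")
print(info["call_dir"])
```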

The leading characters of the call directory name encode the partition
to which the call has been assigned ("tr", "dt", "et1" ... "et5").
The last character of the call name indicates the gender/age grouping
of the caller, as follows:

	"b" = boy (under 16 years old)
	"g" = girl (under 16 years old)
	"c" = child (under 16 years old, gender not known)
	"m" = male (16 or older)
	"f" = female (16 or older)
	"a" = gender unknown, age is 16+ or unknown

The 4-digit sequence number in the call directory name was arbitrarily
assigned during the production process.  The sequence begins at "0000"
for the training partition and for each of the test partitions.
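
The naming scheme above can be decoded mechanically.  The sketch
below assumes that a call directory name consists of the partition
tag, then the 4-digit sequence number, then the one-letter gender/age
code, with nothing in between; this is confirmed here only for the
training example "tr0002m" and is an assumption for the "dt" and
"et1" ... "et5" partitions.  The function and table names are
hypothetical.

```python
import re

# Gender/age codes as listed in this document.
AGE_GENDER = {
    "b": "boy (under 16 years old)",
    "g": "girl (under 16 years old)",
    "c": "child (under 16 years old, gender not known)",
    "m": "male (16 or older)",
    "f": "female (16 or older)",
    "a": "gender unknown, age is 16+ or unknown",
}

# Assumed layout: partition tag + 4-digit sequence number + one-letter code.
CALL_DIR_RE = re.compile(r"(tr|dt|et[1-5])(\d{4})([bgcmfa])")


def parse_call_dir(name):
    """Return (partition tag, sequence number, caller grouping)."""
    match = CALL_DIR_RE.fullmatch(name)
    if match is None:
        raise ValueError(f"unrecognized call directory name: {name!r}")
    tag, seq, code = match.groups()
    return tag, seq, AGE_GENDER[code]


print(parse_call_dir("tr0002m"))
```

The sequence number is kept as a string so that leading zeros, and
therefore the two-digit sub-grouping, are preserved.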

The individual waveform file names encode the response type of the
utterance and the sequence number of the prompt that elicited the
utterance.  The file-name strings and the associated response types
are shown in the following table:

   Prompt #  Filename     Response type
	  1  "yn01",      yes/no
	  2  "yn02",      yes/no
	  3  "yn03",      yes/no
	  4  "yn04",      yes/no
	  5  "natnum05",  natural_number
	  6  "date06",    date
	  7  "time07",    time
	  8  "date08",    date
	  9  "place09",   place
	 10  "digits10",  digits
	 11  "appwrd11",  application_word
	 12  "ATIS12",    ATIS
	 13  "digits13",  digits
	 14  "place14",   place
	 15  "spword15",  spelled_word
	 16  "natnum16",  natural_number
	 17  "nmatad17",  name_at_address
	 18  "natnum18",  natural_number
	 19  "TIMIT19",   TIMIT
	 20  "time20",    time
	 21  "appwrd21",  application_word
	 22  "WSJ22",     WSJ
	 23  "dlramt23",  dollar_amount
	 24  "nmatag24",  name_at_agency
	 25  "dlramt25",  dollar_amount
	 26  "date26",    date
	 27  "TIMIT27",   TIMIT
	 28  "appwrd28",  application_word
	 29  "dlramt29",  dollar_amount
	 30  "appwrd30",  application_word
	 31  "digits31",  digits
	 32  "TIMIT32",   TIMIT
	 33  "nmatad33",  name_at_address
	 34  "place34",   place
	 35  "fract35",   fraction
	 36  "dlramt36",  dollar_amount
	 37  "appwrd37",  application_word
	 38  "spword38",  spelled_word
	 39  "nmatad39",  name_at_address
	 40  "ATIS40",    ATIS
	 41  "natnum41",  natural_number
	 42  "WSJ42",     WSJ
	 43  "appwrd43",  application_word
	 44  "yn44",      yes/no

In accordance with this list, a given call directory will contain, for
example, at most five files whose names begin with "yn".
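
A waveform file name can be mapped back to its prompt number and
response type using the table above.  The following is a sketch
only: the function and dictionary names are hypothetical, and it
assumes that file names appear on disc exactly as written in the
table (same letter case, two-digit prompt number, ".wav" extension).

```python
import re

# File-name prefixes and response types, taken from the table above.
RESPONSE_TYPES = {
    "yn": "yes/no",
    "natnum": "natural_number",
    "date": "date",
    "time": "time",
    "place": "place",
    "digits": "digits",
    "appwrd": "application_word",
    "ATIS": "ATIS",
    "spword": "spelled_word",
    "nmatad": "name_at_address",
    "TIMIT": "TIMIT",
    "WSJ": "WSJ",
    "dlramt": "dollar_amount",
    "nmatag": "name_at_agency",
    "fract": "fraction",
}

# Assumed form: alphabetic prefix + two-digit prompt number + ".wav".
WAV_RE = re.compile(r"([A-Za-z]+)(\d{2})\.wav")


def parse_waveform_name(filename):
    """Return (prompt number, response type) for a waveform file name."""
    match = WAV_RE.fullmatch(filename)
    if match is None or match.group(1) not in RESPONSE_TYPES:
        raise ValueError(f"unrecognized waveform file name: {filename!r}")
    prefix, prompt = match.groups()
    return int(prompt), RESPONSE_TYPES[prefix]


print(parse_waveform_name("yn02.wav"))
```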


			 ADDITIONAL MATERIALS

In addition to the waveform data, each Macrophone disc contains a
complete set of documentation (in the "docs" directory), a copy of the
NIST SPHERE software package for use in uncompressing and accessing
the waveform data (in the "sphere" directory), and a set of data base
tables that provides systematic information about the calls, the
callers, and the utterances (in the "tables" directory).  There is a
"readme" file in each of these directories, which users should consult
for additional information.

For convenience, the contents of these three supplemental directories
are replicated on each disc in the Macrophone corpus.