Release Notes for the ATIS-2 MADCOW Speech Corpus
NIST Discs 12-1.1, 12-2.1, 12-3.1 and 12-4.1

David Graff
Linguistic Data Consortium
October, 1993


General remarks:
----------------

Those people who have had access to earlier releases of the ATIS-2
corpus will notice that there has been some re-arrangement of the
directory structures and naming conventions in this release.  These
changes in organization have been imposed to bring the corpus into
alignment with the current conventions for continued ATIS data
collection, and to provide a reasonably coherent overall structure for
this data set.  The structural changes included the following:

 - switching from 2 to 3 characters for all speaker designations
 - switching from 3 to 2 characters for all session-utterance ids
 - grouping all speaker directories together by collection site (in
   the training partition) or test date (in the test partition)

Every attempt has been made to ensure that the changes in speaker and
utterance identifications were propagated into the contents of the
files themselves: the waveform file headers and the session log files
have had all references to speaker-ids and utterance-ids updated
accordingly.  In the case of data collected by AT&T, in which the old
2-character speaker-ids have had an initial "z" added as the first
character (e.g. "0h" -> "z0h"), all file names and all references to
file names have also been changed accordingly.  (For all other
collection sites, the 2- to 3-character speaker-id conversion involved
adding "0" as the third character, and removing a leading "0" from the
utterance-id, with the result that file name changes were not
required.)


Transcription and annotation data:
----------------------------------

The files on disc 12-1.1, under directory "atis2/text", represent the
latest available release of text data for the ATIS-2 corpus.  This set
of transcriptions and annotations incorporates error corrections and
updates that were made as recently as September 30, 1993.
In addition, the word occurrence histogram (stored in
"doc/lexicon.doc") reflects all transcript corrections made up to that
date.

Those familiar with earlier releases of this corpus may have noticed
that some text files had been given incorrect file names: there was
sometimes an "s" in the eighth character position, where there should
have been an "x".  These files have been renamed appropriately in
order to coincide with the conventions for file naming.


Speech data:
------------

Four of the six data collection sites provided speech recordings drawn
simultaneously from two microphones, a head-mounted Sennheiser and a
desk-mounted Crown.  In all cases where only one microphone was used,
that microphone was the head-mounted Sennheiser.  The table below
summarizes the quantity and locations of single- and two-channel
recording sessions:

    Site /        # of 1-Ch.    # of 2-Ch.    Total
    Test-date     Sessions      Sessions      Sessions

    ATT              161             0           161
    BBN              255             0           255
    CMU              103            75           178
    MIT              227            94           321
    NIST               0             7             7
    SRI               52           123           175
    Feb-92            60            62           122
    Nov-92            56            61           117

    Total            914           422          1336

Note that the two channels of a given recording are stored in separate
waveform files, and the microphone is identified both in the file name
and the file's SPHERE header (please refer to files "dir_spec.doc" and
"wav_spec.doc").

In comparing the contents of the waveform directories (under
"atis2/spon") against those of the text directories (under
"atis2/text"), it will be noticed that there are 101 utterances (in 76
recording sessions) for which we have text files without the
associated waveform files.  In most of these cases, the text files
themselves are essentially empty, serving as "place-holders" to
indicate that an "utterance" did in fact occur (and was registered in
the log file for the session), but no speech was recorded.  Many of
these "unrecorded utterances" may be attributed to problems or
confusion in the speaker's use of a "press-to-speak" recording
mechanism at the recording console (e.g.
the speaker triggered a recording by accident, and said nothing).

However, there are two cases in which the speech data for known
utterances has been lost.  These are listed below (i.e. the following
are utterances for which valid text data exist, but the waveform data
had been irrevocably corrupted or lost at some point prior to
publication):

    Part.  Site  Spkr  Sesn  Utt.#  Missing file name

    train  bbn   e70    2     7     atis2/spon/train/bbn/e70/2/e70072ss.wav
    train  cmu   ik0    3     1     atis2/spon/train/cmu/ik0/3/ik0013ss.wav

(Both of these cases are from single-channel recording sessions.)


Participation in ATIS-2 Evaluations
-----------------------------------

It should be noted that this publication lacks some of the
documentation and materials that would be required for users to
replicate the MADCOW Benchmark Tests as administered by NIST.  Those
users who are interested in conducting an evaluation of their own
speech recognition or natural language processing systems in
accordance with NIST testing protocols should contact NIST directly
for further information.
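

The two entries in the missing-file table under "Speech data" suggest
how a waveform path is assembled from the partition, site, speaker,
session and utterance number.  The Python sketch below reproduces
those two paths.  It is only an illustration: the function name
"expected_wav_path", the zero-padded two-digit utterance field, and
the fixed "ss" suffix are assumptions inferred from the two listed
file names, not taken from the corpus documentation; "dir_spec.doc"
remains the authoritative description of the naming conventions.

```python
from pathlib import PurePosixPath


def expected_wav_path(partition, site, speaker, session, utterance, suffix="ss"):
    """Build a waveform path in the pattern of the two listed missing files.

    Assumed layout (inferred from the two examples only):
      atis2/spon/<partition>/<site>/<speaker>/<session>/<file name>
    where the file name is the 3-character speaker-id, a 2-digit
    utterance number, the session id, and a 2-character suffix.
    """
    # Hypothetical pattern: speaker + zero-padded utterance + session + suffix
    name = f"{speaker}{utterance:02d}{session}{suffix}.wav"
    return str(
        PurePosixPath("atis2/spon", partition, site, speaker, str(session), name)
    )


if __name__ == "__main__":
    # The two utterances listed above as having lost waveform data.
    print(expected_wav_path("train", "bbn", "e70", 2, 7))
    print(expected_wav_path("train", "cmu", "ik0", 3, 1))
```

A helper of this kind could be paired with a listing of the
"atis2/text" and "atis2/spon" directories to look for text files that
have no associated waveform file, provided the assumed pattern is
first checked against "dir_spec.doc".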