README for the ATIS-2 MADCOW Speech Corpus

October, 1993

The ATIS-2 Corpus is a 4-CD-ROM set containing recordings of spontaneous speech from 453 speakers, collected at six different research laboratories around the United States. The first disc of the set (the one containing this README file) provides all the documentation and text data associated with the recordings, while the remaining three discs contain all and only the recorded speech data.

You will notice that each disc (except this one) contains a file in its root directory called "12_n_1.dir" (where "n" is the disc number). This file is simply a sorted listing of all the directories and files on the disc, and can be used to quickly check the disc's contents -- you can keep all these ".dir" files on line to quickly determine which disc to mount for retrieving a given file, speaker, or site.
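As a rough illustration of keeping the ".dir" files on line, the Python sketch below searches a local directory of copied listings for a file, speaker, or site name and reports which disc numbers mention it. It assumes each ".dir" file is plain text with one entry per line; the function names and the directory name in the usage note are hypothetical and not part of the corpus.

```python
from pathlib import Path


def load_listings(listing_dir):
    """Read every '12_n_1.dir' file in listing_dir, keyed by disc number n.

    Assumption: each listing is plain text with one entry per line.
    """
    listings = {}
    for path in sorted(Path(listing_dir).glob("12_*_1.dir")):
        disc = path.name.split("_")[1]  # the "n" in 12_n_1.dir
        listings[disc] = path.read_text(errors="replace").splitlines()
    return listings


def find_discs(listings, query):
    """Return the sorted disc numbers whose listing mentions query.

    The match is a case-insensitive substring test on each line.
    """
    query = query.lower()
    return sorted(
        disc
        for disc, lines in listings.items()
        if any(query in line.lower() for line in lines)
    )
```

For example, after copying the listings into a local directory named "listings", calling find_discs(load_listings("listings"), name) with a speaker or site name returns the disc numbers to mount.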

The recorded speech is organized with one utterance per waveform file, and all utterances from a single recording session are grouped in a session directory. All sessions by a given speaker are likewise grouped into a single speaker directory. Speakers are grouped according to recording site (or test date).

The transcription and annotation data are organized in a manner parallel to the speech data. There is a single log file for each session, plus separate transcription and annotation files for each utterance in the session, all of which are in a single session directory. As with the speech data, sessions are grouped together into speaker directories, and speakers are grouped according to recording site. (All this material is in the directory "atis2/text" on this disc.)
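The grouping described above (site, then speaker, then session, then one file per utterance) can be sketched in Python as a nested index built from a list of file paths. This is only an illustration: it assumes the last four components of each path are site/speaker/session/file, and the exact directory and file naming should be taken from the format specifications in the documentation, not from this sketch. The function name is hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_session(paths):
    """Nest file names as site -> speaker -> session -> [files].

    Assumption: the last four components of each path are
    site/speaker/session/file; shorter paths are skipped.
    """
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for raw in paths:
        parts = PurePosixPath(raw).parts
        if len(parts) < 4:
            continue  # not deep enough to hold a session file
        site, speaker, session, name = parts[-4:]
        tree[site][speaker][session].append(name)
    return tree
```

Because the text data is organized in parallel with the speech data, the same function could be applied to either tree, and the two indexes compared session by session.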

The text data also includes the tables of airline travel information and associated materials used to build the relational data base, which was used to provide answers to the verbal requests of the speakers. (This material is in the "atis2/rdb3.3" directory on this disc.)

The documentation (contained in the "doc" directory on this disc) includes the specifications for the directory organization of the corpus, and for the internal format of the waveform files and various annotation files. It also provides information on the speakers and on the techniques for data collection and evaluation, along with two papers (in PostScript format) that describe the ATIS project in greater detail; the bibliographic references for these two papers are given below:

Hirschman, L., et al., "Multi-Site Data Collection for a Spoken Language Corpus", Proc. DARPA Speech and Natural Language Workshop, Morgan Kaufmann, Arden House Conference Center, Harriman, NY, February 1992.

Hirschman, L., et al., "Multi-Site Data Collection and Evaluation in Spoken Language Understanding", Proc. ARPA Workshop on Human Language Technology, Morgan Kaufmann, Merrill Lynch Conference Center, Plainsboro, NJ, March 1993.

(These papers are stored in the "doc" directory as "hirsch92.ps" and "hirsch93.ps", respectively. Being PostScript files, they can be output directly to a PostScript-capable printer to produce hard copy. If you do not have access to such a printer or to the sources cited above, you may contact the Linguistic Data Consortium and request paper copies by mail.) We did not include the PostScript files on this page.

A brief guide to the other documentation files is given below:

	File name	Content
	---------	-------
	*_spec.doc 	Format specifications for files & directories
	*_inst.doc 	Instructions given at each recording site
	pofi*.doc	Information on "principles of interpretation"
	sro2lsn.*	Scripts for reformatting transcription files
	clas_sum.doc	Summary of utterance classifications by site
	lexicon.doc	Histogram of word occurrences in speech data
	min_max.doc	Description/definition of "reference answers"
	spkrasgn.txt	List of speaker designations used by sites
	spkrinfo.log	Data base of information about speakers
	subj_ques.doc	Questionnaires given to speakers at each site

	e2e_eval/*	Directory containing documentation and other
			materials pertaining to the "End-to-end"
			evaluation, which was conducted at 4 sites.

The materials and data in this corpus were prepared for publication by David Graff at the Linguistic Data Consortium, with assistance from John Garofolo and Jon Fiscus at NIST.