Air Travel Information Service Phase III (ATIS3)
Speech and Natural Language Understanding Corpora

NIST Speech Discs 17-1.1 - 17-3.1

LDC Release, July, 1994

The Air Travel Information Service (ATIS) domain was selected as a
common research domain to facilitate the development and common
evaluation of speech understanding systems within the Advanced Research
Projects Agency Spoken Language Technology Program.

ATIS0: The collection of speech corpora to support research and
development in the ATIS domain began in 1989 with the collection of the
ATIS0 pilot corpus using a "wizard-based" system at Texas Instruments.
A relational version of the flight information for 10 cities/11
airports in the Official Airline Guide (OAG) was used as the knowledge
base. Initial ARPA/NIST Spoken Language System (SLS) benchmark tests
were conducted in 1990 using the pilot corpus. The ATIS0 training and
test data are available on NIST Speech Discs 5-1.1 - 5-6.1.

ATIS1: In 1990, a small amount of data identified as ATIS1 was
collected at SRI using the TI data collection paradigm, but it was
never released.

ATIS2: The effort expanded in 1991 with the collection and contribution
of ATIS corpora by all of the participating ARPA contractors: Bolt
Beranek and Newman (BBN), Carnegie Mellon University (CMU), MIT
Laboratory for Computer Science (MIT-LCS), SRI International, and AT&T
Bell Laboratories (a non-(D)ARPA volunteer participant) using fully or
semi-automated prototype ATIS systems. Prior to the data collection,
SRI modified the schema of the relational database to allow a greater
variety of queries. The multi-site corpora were collected and
transcribed at each of the above sites and then pooled at NIST to form
the ATIS2 corpus. The data collectors and the other sites participating
in evaluations using the data formed a committee known as the
Multi-site ATIS Data COllection Working (MADCOW) group to coordinate
the collection, annotation, and evaluation of the data.
ARPA/NIST Spoken Language System benchmark tests were conducted in 1991
and 1992 using a subset of the multi-site data. The ATIS2 training and
test data are available on NIST Speech Discs 12-1.1 - 12-4.1.

ATIS3: In 1992, the MADCOW group decided to expand the relational
database to incorporate flight information for a greater number of
cities while maintaining the previous database schema. To accomplish
this, SRI and Unisys acquired an up-to-date release of the OAG and
ported the data for 46 cities/52 airports from the OAG into the
established relational form. The 46 cities in the new database include
the cities in the previous database, two Canadian cities, and another
34 cities in the OAG with the largest metropolitan areas. In 1993-1994,
the ATIS3 46-city corpus was collected at BBN, CMU, MIT, SRI, and the
National Institute of Standards and Technology (NIST). As before, a
portion of the pooled data was set aside and used in the December 1993
ARPA Spoken Language System Benchmark Tests conducted by NIST. The
remainder was released as training data. All of the ATIS3 corpus has
been transcribed, and much of the training data has been annotated with
query classifications and reference answers at SRI.

This set of discs contains the complete ATIS3 training corpus and the
1993 ARPA/NIST ATIS evaluation test set. Disc 17-1.1 contains the
documentation, transcriptions, and annotations for the corpus. Discs
17-2.1 and 17-3.1 contain the corresponding waveforms in an identical
directory hierarchy. The training corpus consists of 7,388 queries,
3,512 of which are annotated with categorizations and reference
answers. The December 1993 test corpus consists of 967 queries, all of
which are fully annotated. This disc, 17-1.1, also contains all
available general on-line documentation for ATIS3.

A summary of the contents of each of the discs in this set is as
follows:

17-1.1:
    atis3/    Transcriptions, annotations, and documentation for the
              ATIS3 training and December 1993 test corpora.

    comp/     NIST comparator for scoring CAS-formatted answers output
              from ATIS NL/SLS systems against reference answers.

    rdb4_0/   ATIS 46-city/52-airport relational database.

    score/    NIST speech recognition scoring software. Includes
              dynamic programming string-alignment scoring code and
              statistical significance tests.

    sphere/   NIST SPeech HEader REsources toolkit. Provides
              command-line and programmer interface to NIST-headered
              speech waveform files. Also provides for automatic
              decompression of the Shorten-compressed waveform files
              on these discs.

17-2.1:
    atis3/    Waveforms for the ATIS3 training corpora collected at
              BBN, CMU, MIT, and SRI.

17-3.1:
    atis3/    Waveforms for the ATIS3 training corpora collected at
              NIST and for the December 1993 multi-site test corpora.

General information files named "readme.doc" have been included in the
high-level directories and throughout the documentation directory
("atis3/doc") on this disc, 17-1.1, and describe the contents of the
directories.

The following lists the chronology of the (D)ARPA/NIST ATIS Benchmark
Tests to date:

    June 1990     - Pilot Test (ATIS0), Reported at June 1990 Workshop
    November 1990 - Official Test (ATIS0), Reported at February 1991
                    Workshop
    October 1991  - Multi-Site Dry Run (ATIS2), Unpublished
    January 1992  - Multi-Site Test (ATIS2), Reported at March 1992
                    Workshop
    October 1992  - End-to-end Dry Run (ATIS2), Unpublished
    November 1992 - Multi-Site Test (ATIS2), Reported at March 1993
                    Workshop
    December 1993 - Multi-Site Test (ATIS3), Reported at March 1994
                    Workshop

The following papers contain a more detailed description of the ATIS
paradigm and corpora. PostScript copies of these papers have been
included in the "atis3/doc" directory of this disc for your
convenience.

    Hemphill, C.T., et al., "The ATIS Spoken Language Systems Pilot
    Corpus", Proc. DARPA Speech and Natural Language Workshop, Morgan
    Kaufmann Publishers, June 1990. (tiatis90.ps)

    Hirschman, L., et al., "Multi-Site Data Collection for a Spoken
    Language Corpus", Proc. DARPA Speech and Natural Language Workshop,
    Morgan Kaufmann Publishers, February 1992. (madcow92.ps)

    Hirschman, L., et al., "Multi-Site Data Collection and Evaluation
    in Spoken Language Understanding", Proc. ARPA Human Language
    Technology Workshop, Morgan Kaufmann Publishers, March 1993.
    (madcow93.ps)

    Dahl, D., et al., "Expanding the Scope of the ATIS Task: The ATIS-3
    Corpus", Proc. ARPA Human Language Technology Workshop, Morgan
    Kaufmann Publishers, March 1994. (madcow94.ps)

The collection of the ATIS3 corpus and the December 1993 ATIS benchmark
tests were sponsored by the Advanced Research Projects Agency Software
and Intelligent Systems Technology Office (ARPA-SISTO). The corpus was
annotated by SRI International and collated, documented, and produced
on CD-ROM by the National Institute of Standards and Technology under
the sponsorship of the Linguistic Data Consortium.