ATIS(0) - Spontaneous Speech Pilot Corpus.

DARPA Air Travel Information System (ATIS0)

Spontaneous Speech Pilot Corpus and Relational Database

NIST Disc CD5-1.1

This speech database is the first in a series of recordings of "natural speech", in the Air Travel Information System (ATIS) domain. Queries collected for these corpora are spoken, without scripts or other constraints, to ATIS, a computerized simulation of a database system that includes a simplified version of the Official Airline Guide (OAG)(CR). A human "wizard" simulating the speech recognizer of the future gives the impression of a speech-recognizing computer system. The sessions on this disc are those that were used in the DARPA-sponsored evaluation of spoken language systems that was reported in June 1990. For a more detailed description of this corpus, see the paper by Hemphill et al. included as documentation on this disc, in the file "atis0/doc/corpus.doc".

For each session, a problem (aka "scenario") was posed to a subject, such as "find the cheapest way to fly from Atlanta to Dallas by next Thursday". The subject's queries to the computer system and the computer system's responses were saved as data.

Directory Structure:

The directories and files under "/atis0" are structured as follows:

bbncomp/: Directory containing the source code for the BBN comparator
           software.
doc/: Online documentation;  This directory contains the following
      files:
     cas1.doc:  BNF definition of the Common Answer 
                Specification (CAS), in LaTeX form.
     class1.doc:  The ATIS utterance classification system.
     corpus.doc:  Reprint of the paper by Hemphill et al. from 
                  the proceedings of the June 1990 DARPA Speech 
                  and Natural Language Workshop.  This paper
                  describes the corpus and the ATIS collection
                  paradigm.
     interp1.doc: Principles for interpreting ATIS queries with
                  respect to the relational database.
     spkrinfo.doc: Table of speaker codes and their sex, age, and
                   dialect region.

june90_nl/: Directory containing information and data used to conduct the
            DARPA ATIS natural language benchmark test in June 1990.  See 
            "june90_nl/readme.doc" for more information.
nistcomp/: Directory containing the source code for the NIST comparator
           software.  See "nistcomp/readme.doc" for more information.
rdb1/: Directory containing the ATIS0 relational database 
       (version 1).
readme.doc: This file.
sphere/: Directory containing the NIST SPeech Header REsources (SPHERE)
         utilities (version 1.5) for manipulating the SPHERE-headered
         speech waveform files.  See "sphere/readme.doc" for more
         information.

spon/: Directory containing the spontaneous speech collected in the 
       "wizard" experiment and related transcriptions.
       The "spon" directory is divided into speaker subdirectories.
       The speaker directories are further divided into session
       subdirectories.  Each speaker-session directory contains
       all of the files pertaining to one ATIS session.
       See below for a description of ATIS file names and types.
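
The speaker/session layout described above can be walked programmatically. The following is a minimal Python sketch, assuming the disc contents are reachable on the local file system. The function name "files_by_session" is hypothetical, and the sketch makes no assumption about how the speaker and session directories themselves are named.

```python
from pathlib import Path


def files_by_session(spon_root):
    """Map (speaker_dir, session_dir) to the sorted file names of that session.

    Assumes the layout described in this file:
        spon/<speaker>/<session>/<files>
    """
    sessions = {}
    # Exactly three levels below the root: speaker, session, file.
    for path in Path(spon_root).glob("*/*/*"):
        if path.is_file():
            key = (path.parent.parent.name, path.parent.name)
            sessions.setdefault(key, []).append(path.name)
    return {key: sorted(names) for key, names in sessions.items()}
```

A call such as files_by_session("/cdrom/atis0/spon") (the mount point is an assumption) would return one entry per speaker-session directory.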

ATIS File Names and Types:

All files under the "spon" directory have the following format:

ATIS-FILE ::= <UTTERANCE-ID>.<TYPE>

    where,
    UTTERANCE-ID ::= <AA><BBB><C><D><E>

        where,
        AA ::= "01" to "zz" (speaker identification code)
        BBB ::= "000" to "zzz" (sentence text code)
        C ::= "1" to "z" (session code)
        D ::= (speaking mode code)
              "s" (for spontaneous productions) |
              "r" (for read versions of spontaneous productions) |
              "c" (for common read productions)
        E ::= (microphone code)
              "s" (for Sennheiser) |
              "c" (for Crown) |
              "x" (does not apply)

    TYPE ::= (file type)
             "cat" (class categorization) |
             "nli" (natural language input) |
             "ptx" (prompting text) |
             "ref" (reference CAS) |
             "snr" (SNOR transcription) |
             "sql" (reference SQL query) |
             "sro" (speech recognizer output) |
             "log" (session log) |
             "wav" (SPHERE-headered speech waveform) |
             "win" (wizard input to NL-Parse [to produce reference SQL query])

The waveform and transcription data for each utterance are located in separate files that share a common utterance ID. The session log file (.log) contains a transcription of an entire session and is the only file that carries the sentence text code "000". All file types except the speech waveform (.wav) files are ASCII text files.
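
The file name grammar above can be checked mechanically. The following is a minimal Python sketch, not part of the software on this disc. The names "AtisFileName" and "parse_atis_name" are hypothetical, and the sketch assumes that names are compared in lower case and that every code position uses only digits and lower-case letters.

```python
import re
from typing import NamedTuple

# Code tables copied from the grammar in this file.
SPEAKING_MODES = {
    "s": "spontaneous production",
    "r": "read version of spontaneous production",
    "c": "common read production",
}
MICROPHONES = {"s": "Sennheiser", "c": "Crown", "x": "does not apply"}
FILE_TYPES = {"cat", "nli", "ptx", "ref", "snr", "sql", "sro", "log", "wav", "win"}


class AtisFileName(NamedTuple):
    speaker: str     # AA
    sentence: str    # BBB ("000" marks the session log)
    session: str     # C
    mode: str        # D
    microphone: str  # E
    file_type: str   # TYPE


# <AA><BBB><C><D><E>.<TYPE>; the session code starts at "1", not "0".
_NAME = re.compile(r"^([0-9a-z]{2})([0-9a-z]{3})([1-9a-z])([src])([scx])\.([a-z]{3})$")


def parse_atis_name(name):
    """Split an ATIS file name into its fields, or raise ValueError."""
    match = _NAME.match(name.lower())  # assumption: case is not significant
    if match is None:
        raise ValueError("not an ATIS file name: %r" % name)
    fields = AtisFileName(*match.groups())
    if fields.file_type not in FILE_TYPES:
        raise ValueError("unknown ATIS file type: %r" % fields.file_type)
    return fields
```

For instance, the name "b00011ss.wav" (constructed here from the grammar for illustration, not taken from the disc listing) would split into speaker "b0", sentence "001", session "1", spontaneous mode, Sennheiser microphone, and type "wav".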

June 1990 DARPA Benchmark Test

Each of the sessions has been classified as either a "training" session or a "test" session. The training sessions are available to spoken language systems developers to use in any way they want. The test sessions were held back until just before the June 1990 evaluation, and more test data is being held for future evaluations. Here is the classification into training and test sets that was used for the June '90 workshop evaluation:

    Training Set Speaker i.d.'s: b0, b1, b2, b3, b4, b5, b6,
               b7, b8, b9, ba, bb, bc, be, bg, bh, bi, bj, bk,
               bl, bn, bo, bq, br, bs, bt, bu, bv, bx, by, bz.

    Test Set Speaker i.d.'s: bd, bf, bm, bp, bw.
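
The classification above can be encoded as a small lookup, for example to keep test speakers out of a training run. This is a minimal Python sketch; the function name "june90_split" is hypothetical, and the speaker codes are copied from the two lists above.

```python
TRAINING_SPEAKERS = frozenset(
    "b0 b1 b2 b3 b4 b5 b6 b7 b8 b9 ba bb bc be bg bh bi bj bk "
    "bl bn bo bq br bs bt bu bv bx by bz".split()
)
TEST_SPEAKERS = frozenset("bd bf bm bp bw".split())


def june90_split(speaker_id):
    """Return "training" or "test" for a speaker code from the lists above."""
    code = speaker_id.lower()
    if code in TRAINING_SPEAKERS:
        return "training"
    if code in TEST_SPEAKERS:
        return "test"
    raise ValueError("speaker not in the June 1990 lists: %r" % speaker_id)
```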

Each query was classified along several dimensions of difficulty: ambiguous, context-sensitive, etc. File "atis0/doc/class1.doc" is the document defining the classification scheme. A super-class, "class A", was defined as being just plain-vanilla queries that were not put into any of the marked classes, and for the June workshop, results were reported on them alone.

To reduce vagueness and ambiguity, a set of principles for interpreting and answering queries was agreed on; it is presented in the file "atis0/doc/interp1.doc". Answers were expressed in the Common Answer Specification (CAS) language, a definition of which is included in file "atis0/doc/cas1.doc". Sites could use the SNOR (Standard Normalized Orthographic Representation) transcriptions of the queries as input, so that only the NLP portion of their systems was tested. (One site also presented results starting from speech instead of orthography.) Scoring was done solely by comparing the CAS answers supplied by the sites with canonical CAS answers.

A lengthy description of the first version of the CAS answer formatting language can be found in the paper by Boisen et al. in the Proceedings of the [DARPA] Speech and Natural Language Workshop, October 1989, and as SLS Note #4, distributed by BBN. Several small changes were agreed on for the June '90 evaluation, and the resulting CAS format is specified in BNF form in "atis0/doc/cas1.doc" (as a LaTeX file). Two "comparator" programs that compare and score two CAS representations are included on this disc: one in Lisp, developed by BBN, and an experimental one in "C", developed by NIST. Several small changes are being made in the "C" code as a result of analyzing the June results, and corrected code will be released on a later disc when testing is complete and post-June amendments to the CAS have been incorporated.