Charles T. Hemphill, John J. Godfrey, George R. Doddington
Speech research has made tremendous progress in the past using the following paradigm:
The ATIS corpus provides an opportunity to develop and evaluate speech systems that understand spontaneous speech. This corpus differs from its predecessor, the Resource Management corpus (Price et al, 1988), in at least four significant ways.
The ATIS database consists of data obtained from the Official Airline Guide (OAG, 1990), organized under a relational schema. The database remained fixed throughout the pilot phase. It contains information about flights, fares, airlines, cities, airports, and ground services, and includes twenty-five supporting tables. The large majority of the questions posed by subjects can be answered from the database with a single relational query.
To collect the kind of English expected in a real working system, we simulate one. The subject, or ``travel planner,'' is in one room, with those running the simulation in another. The subject speaks requests over a microphone and receives both a transcription of the speech and the answer on a computer screen. A session lasts approximately one hour, including detailed preliminary instructions and an exit questionnaire.
Two ``wizards'' carry out the simulation: one transcribes the query while the other produces the answer. The transcriber interprets any verbal editing by the subject and removes dysfluencies in order to produce an orthographic transcription of what the subject intended to say. At the same time, the answerer uses a natural language-oriented command language to produce an SQL expression that elicits the correct answer for the subject. On-line utilities maintain a complete log of the session, including time stamps.
At the conclusion of the session, the utterances are sorted into categories to determine those utterances suitable for objective evaluation. Each utterance then receives three different transcriptions. First, a checked version of the transcription produced during the session provides an appropriate input string for evaluating text-based natural language systems. Second, a slightly expanded version of this serves as a prompt in collecting a read version of the spontaneously spoken sentences. Third, a more detailed orthographic transcription represents the speech actually uttered by the subject, appropriate for use in acoustic modeling.
2. Corpus Collection
About one session a day was conducted, using subjects recruited from within Texas Instruments. A typical session included approximately 20 minutes of introduction, 40 minutes of query time and 10 minutes for follow-up. Each session resulted in two speech files for each query and a complete log of the session.
2.1 Session Introduction
The subjects were given the following instructions, both orally and in writing:
Subjects were informed about the contents of the relational database in a one-page summary. The summary described the major database entities in fairly general terms to avoid influencing the vocabulary used during the session. To avoid some misconceptions in advance, subjects were told that the database did not contain information about hotels or rental cars.
The subject was next assigned a travel planning scenario, systematically chosen from a set of six scenarios designed to exercise various aspects of the database. For example, some scenarios focused on flight time constraints while others concentrated on fares. The scenarios did not specify particular times or cities in an effort to make the scenario more personal to the subject. The following example illustrates this:
Finally, subjects were given instructions regarding the operation of the system. The ``system'', from the subject's perspective, consisted of a 19-inch color monitor running the X Window System, and a head-mounted Sennheiser (HMD 410-6) microphone. A desk-mounted Crown (PCC-160 phase coherent cardioid) microphone was also used to record the speech. The ``office'' contained a SPARCstation CPU and disk to replicate office noise, and a wall map of the United States to help subjects solve their scenarios.
The monitor screen was divided into two regions: a large, scrollable window for system output and a smaller window for speech interaction. The system used a ``push-to-talk'' input mechanism, whereby speech collection occurred while a suitably marked mouse button was depressed. Subjects were given the opportunity to cancel an utterance for a period of time equal to the length of the utterance.
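The cancellation rule can be stated compactly. The following is a minimal sketch, assuming that the cancellation window opens when the mouse button is released; the text does not say when the window starts, and the function and parameter names are hypothetical.

```python
def can_cancel(press_time: float, release_time: float, now: float) -> bool:
    """Return True if the utterance may still be cancelled at time `now`.

    All times are in seconds on a common clock.
    """
    utterance_length = release_time - press_time
    # Assumption: the window starts at button release and lasts
    # exactly as long as the utterance itself.
    return release_time <= now <= release_time + utterance_length


# A 4-second utterance released at t=4 could be cancelled until t=8.
print(can_cancel(press_time=0.0, release_time=4.0, now=6.0))
```

Under this reading, longer utterances give the subject proportionally more time to reconsider.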
A single sentence was used for all subjects to illustrate the push-to-talk mechanism and interaction with the system:
2.2 Session Queries
After the introduction, subjects were given approximately 40 minutes to complete the task described in the scenario. If they finished early, subjects were instructed to select another scenario or to explore the capabilities of the system. After the 40 minutes, subjects were given the opportunity to continue, finally ending the session by saying ``all done''.
Once the actual session started, subjects cycled through thinking, querying, waiting, and writing. While the thinking portion of the session actually required the most time, the query portion required the most resources.
Several things happened at once as a given subject spoke a query. While speech from both the head-mounted and desk-mounted microphones was recorded, one wizard began to transcribe the speech and the other wizard began to answer the query. A playback capability could be used if needed by the transcription wizard. The answer wizard was constrained not to send the answer before the transcription wizard finished the transcription. Typically, the subject received the typed transcription a few seconds after speaking and the answer approximately 20 seconds later.
Each wizard had their own X Window terminal. The transcription wizard used a gnuemacs-based tool that checked the spelling of the transcription and sent the transcription to both the answer wizard and the subject. Despite the transcription wizard's best efforts, some transcription mistakes did reach the subject: occasionally words were omitted, inserted, or substituted (e.g., ``fight'' for ``flight'').
The answer wizard used a tool called NLParse (Hemphill et al, 1987) to form the answer to the subject's queries. This tool used a natural language-oriented command language to produce a set of tuples for the answer. NLParse provides a set of menus to help convey the limited coverage to the wizard. In practice, the answer wizard knew the coverage and used typing with escape completion to enter the appropriate NLParse command. NLParse provides several advantages as a wizard tool:
The answer wizard's terminal also included a gnuemacs-based utility that created a session log. This included the transcription, the NLParse input, the resulting SQL expression, and the set of tuples constituting the answer. The answer wizard sent only the set of tuples to the subject.
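One way to picture a session log entry is as a record with the four items listed above. This is only an illustrative sketch: the class name, field names, and sample values are hypothetical, and the actual log file layout is not described here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class LogEntry:
    """One logged query (hypothetical field names)."""
    timestamp: float          # seconds since session start
    transcription: str        # text typed by the transcription wizard
    nlparse_input: str        # command entered by the answer wizard
    sql: str                  # SQL expression generated from that command
    answer: List[Tuple]       # tuples returned by the database

    def subject_view(self) -> List[Tuple]:
        # Only the answer tuples were sent on to the subject.
        return self.answer


entry = LogEntry(
    timestamp=312.4,
    transcription="show me flights from dallas to boston",
    nlparse_input="<NLParse command placeholder>",
    sql="<generated SQL placeholder>",
    answer=[("AA", 123), ("DL", 456)],
)
print(entry.subject_view())
```

Keeping the NLParse input and the SQL in the same record as the answer makes it possible to trace any reference answer back to the command that produced it.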
2.3 The ATIS Database
The ATIS database was designed to model as much of a real-world resource as possible. In particular, we tried to model the printed OAG in a straightforward manner. With this approach, we could rely on travel data expertise from Official Airline Guides, Incorporated. We also used the data directly from the OAG and did not invent any data - something that is difficult to accomplish in a realistic manner. Additionally, the printed OAG was available to all sites and provided a form of documentation for the database.
The relational schema was designed to help answer queries in an intuitive manner, with no attempt to maximize the speech collected (e.g., by supplying narrow tables as answers). Toward this end, entities were represented with simple sets or lists in the most direct way.
2.4 Session Follow-Up
After the query phase of the session, subjects were given a brief questionnaire to let us know what they thought of the system. This consisted of the following ten questions with possible answers of ``yes'', ``maybe/sometimes'', ``no'' or ``no opinion'':
3. Corpus Processing
After data collection, a rather elaborate series of processing steps was required before the subject's utterances actually became part of the corpus. A session resulted in a set of speech files and a session log that formed the raw materials for the corpus.
To facilitate use of the corpus, three transcriptions were provided with each query. A more detailed transcription document specifies the details of these, with the rationale explained below.
Not all queries were equally suited for evaluating spoken language systems. Accordingly, each query received a classification to help circumscribe the types of queries desired for training and testing. The classifications themselves were determined through a committee and defined several dimensions:
An interpretation document was defined, which specifies the details of how to interpret an ATIS query, both for the answer wizard and for the SLS sites. For example, for consistency it was ruled that a flight serving a snack would be considered as a flight with a meal. The document provides a mapping of concepts expressed in English to concepts encoded in the relational database. The NLParse commands reflect these conventions and were included in the corpus to facilitate maintenance since it was usually easier to determine the correctness of the reference answer by looking at the NLParse command rather than the resulting SQL expression. In the event of an erroneous answer, correction occurs by simply amending the NLParse command.
3.4 Reference SQL
The pilot corpus includes the ANSI-standard SQL expression that produced the reference answer, which is the ``final word'' on the interpretation of a subject's query. It also provides some degree of database independence. For example, as long as the relational schema remains fixed, we can add new cities to the database, rerun the SQL against the database, and produce a new corpus that includes the new cities. This works as long as the evaluation criteria exclude context-dependent queries.
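The database independence described here can be illustrated with a small sketch. It assumes a toy two-table schema in an in-memory SQLite database; the table names, column names, and data are hypothetical and are not the ATIS schema.

```python
import sqlite3

# Hypothetical miniature schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE city (city_code TEXT PRIMARY KEY, city_name TEXT);
    CREATE TABLE flight (
        flight_id INTEGER PRIMARY KEY,
        from_city TEXT REFERENCES city(city_code),
        to_city   TEXT REFERENCES city(city_code)
    );
    INSERT INTO city VALUES ('DFW', 'Dallas'), ('BOS', 'Boston');
    INSERT INTO flight VALUES (1, 'DFW', 'BOS');
""")

# Stand-in for a stored reference SQL expression.
REFERENCE_SQL = """
    SELECT c.city_name
    FROM flight AS f JOIN city AS c ON c.city_code = f.to_city
    WHERE f.from_city = 'DFW'
    ORDER BY c.city_name
"""


def reference_answer(connection):
    """Re-evaluate the stored SQL against the current database contents."""
    return connection.execute(REFERENCE_SQL).fetchall()


before = reference_answer(conn)

# Add a new city and flight; the schema and the stored SQL are unchanged.
conn.execute("INSERT INTO city VALUES ('DEN', 'Denver')")
conn.execute("INSERT INTO flight VALUES (2, 'DFW', 'DEN')")
after = reference_answer(conn)

print(before)
print(after)
```

Because the stored SQL refers only to the schema, the second evaluation picks up the new city without any change to the query itself.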
3.5 Reference Answer
The reference answer consists of the set of tuples resulting from the evaluation of the reference SQL with respect to the official ATIS database. This is actually redundant, but makes scoring easier for most sites. The tuples are formatted according to the Common Answer Specification (CAS) format (Boisen et al, 1989). This format amounts to representing the answer in Lisp syntax to aid in automatic scoring.
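The general idea of rendering tuples in Lisp syntax can be sketched as follows. This is not the CAS definition: the quoting and number conventions below are assumptions made for illustration, and the function names are hypothetical.

```python
def atom(value):
    """Render one field: strings are double-quoted, other values printed bare."""
    if isinstance(value, str):
        escaped = value.replace("\\", "\\\\").replace('"', '\\"')
        return '"' + escaped + '"'
    return str(value)


def to_lisp(rows):
    """Render a list of tuples as a parenthesized list of lists."""
    inner = " ".join("(" + " ".join(atom(v) for v in row) + ")" for row in rows)
    return "(" + inner + ")"


print(to_lisp([("AA", 123), ("DL", 456)]))
```

A uniform parenthesized form of this kind lets a scoring program read a system's answer and the reference answer with the same parser before comparing them.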
3.6 Corpus Files
All of the items mentioned above were formatted into files and shipped to the National Institute of Standards and Technology (NIST). NIST then distributed the corpus to interested sites. A file format document exists to help sites install the data.
Forty-one sessions containing 1041 utterances were collected over 8 weeks, nine of which were designated as training material by NIST. Sessions averaged 25.4 queries each. Table 1 describes the utterance statistics for each Pilot Distribution (PD).
PD     Weeks  Sessions  Utt   Utt/Sess
1      2      9         234   26.0
2      2      10        245   24.5
3      2      10        236   23.6
4      1      7         197   28.1
5      1      5         129   25.8
total  8      41        1041  25.4

Table 1: Session Utterance Statistics

Table 2 describes the time statistics for each PD. Each session consisted of approximately 40 minutes of query time with an average rate of 39.1 queries per hour. The average time between queries of 1.5 minutes included subject thinking time, and about 22 seconds for the wizard to send the answer to the subject after the transcription.
PD     Min   Ave   Min/Utt  Sec/Ans  Utt/Hr
1      355   39.4  1.5      23.5     39.6
2      354   35.4  1.4      21.2     41.5
3      391   39.1  1.7      24.2     36.2
4      302   43.1  1.5      19.6     39.1
5      196   39.1  1.5      21.6     39.5
total  1598  39.0  1.5      22.1     39.1

Table 2: Session Time Statistics

The average utterance length (in words) varied according to the transcription: 10.2 for NL_input, 11.7 for SR_output (expanded lexical items and dysfluencies), and 11.3 for NL_SNOR (expanded lexical items). Eighteen percent of the utterances contained some form of dysfluency.
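The totals in Tables 1 and 2 can be cross-checked with a few lines of arithmetic. The sketch below uses the per-PD session, utterance, and minute counts from the tables; the variable names are ours.

```python
# Per-PD counts copied from Tables 1 and 2 (PD 1 through 5).
sessions = [9, 10, 10, 7, 5]
utterances = [234, 245, 236, 197, 129]
minutes = [355, 354, 391, 302, 196]

total_sessions = sum(sessions)
total_utterances = sum(utterances)
total_minutes = sum(minutes)

utt_per_session = total_utterances / total_sessions
min_per_session = total_minutes / total_sessions
min_per_utt = total_minutes / total_utterances
utt_per_hour = 60.0 * total_utterances / total_minutes

print(f"sessions={total_sessions} utterances={total_utterances} minutes={total_minutes}")
print(f"utt/session={utt_per_session:.1f} min/session={min_per_session:.1f}")
print(f"min/utt={min_per_utt:.1f} utt/hour={utt_per_hour:.1f}")
```

The computed totals agree with the bottom rows of both tables.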
Of the 1041 utterances collected, 740 were judged evaluable according to the June 1990 criteria: not classified as context-dependent, ambiguous, ill-formed, unanswerable, or noncooperative. These results are shown in Table 3, broken down according to PD. The table also shows that if we relax these criteria to exclude only ambiguous and unanswerable utterances, the yield would increase from 71% to 80%.
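The overall yield figures follow directly from the counts in Table 3. A minimal sketch of the calculation, with variable and function names of our own choosing:

```python
# Per-PD counts copied from Table 3 (PD 1 through 5).
utterances = [234, 245, 236, 197, 129]
june_unevaluable = [88, 73, 47, 58, 35]       # June 1990 criteria
relaxed_unevaluable = [73, 52, 32, 27, 19]    # ambiguous or unanswerable only


def yield_percent(total, unevaluable):
    """Percentage of utterances that remain evaluable."""
    return 100.0 * (total - unevaluable) / total


total = sum(utterances)
june_evaluable = total - sum(june_unevaluable)
june_yield = yield_percent(total, sum(june_unevaluable))
relaxed_yield = yield_percent(total, sum(relaxed_unevaluable))

print(f"evaluable={june_evaluable} june={june_yield:.0f}% relaxed={relaxed_yield:.0f}%")
```

Relaxing the criteria moves utterances out of the unevaluable set, which is why the second percentage is higher.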
PD     Utt   J-unevl  %J-evl  relax  %evl
1      234   88       62      73     68
2      245   73       70      52     79
3      236   47       80      32     86
4      197   58       70      27     86
5      129   35       73      19     85
total  1041  301      71      203    80

Table 3: Session Yield of Evaluable Utterances

Subjects generally enjoyed the sessions, as reflected in Table 4 (the tally includes two subjects not included in the corpus). The answers to questions were typically not provided quickly enough, as might be expected in a simulation. Some subjects defined an acceptable response time as under 5 seconds. Of the subjects who thought a human was interpreting the questions, some knew in advance, some misinterpreted the question (``Did the system BEHAVE as if a human was interpreting your questions?''), and some were tipped off by the amazing ability of the system to recognize speech in the face of gross dysfluencies.
Q   Yes  Maybe/Sometimes  No  No Opinion
1   27   16               0   0
2   32   10               1   0
3   31   9                2   0
4   2    19               22  0
5   29   10               4   0
6   26   15               1   0
7   24   4                4   7
8   40   1                1   1
9   26   13               3   1
10  8    7                22  5

Table 4: Answers to the Questionnaire

Subjects also supplied general comments. Some subjects felt uncomfortable with computers or the system:
The ATIS SLS pilot corpus has proved that objective evaluation of spoken language systems is both possible and beneficial. The pilot corpus has also served to clarify many points in the data collection procedure. In this effort, we have learned that a spontaneous speech corpus is more expensive to collect than a read speech one, but provides an opportunity to evaluate spoken language systems under realistic conditions. Above all, we hope that this corpus and its successors further research in spoken language systems.
This work was supported by the Defense Advanced Research Projects Agency and monitored by the Naval Space and Warfare Systems Command under Contract No. N00039-85-C-0338. The views and conclusions expressed here do not represent the official policies, either expressed or implied, of the Defense Advanced Research Projects Agency or the United States Government.
We gratefully acknowledge the publishers of the Official Airline Guide for travel data and consulting help. We thank the subjects for their participation, Jane McDaniel for her invaluable assistance in all phases of the corpus collection, and the many members of the various committees for their expert advice.
Boisen, Sean, Lance A. Ramshaw, Damaris Ayuso, and Madeleine Bates, ``A Proposal for SLS evaluation,'' in Proceedings of the DARPA Speech and Natural Language Workshop, October 1989.
Hemphill, Charles T., Inderjeet Mani, and Steven L. Bossie, ``A Combined Free-Form and Menu-Mode Natural Language Interface,'' Abridged Proceedings of the Second International Conference on Human-Computer Interaction, Honolulu, Hawaii, 1987.
Official Airline Guides, Official Airline Guide, North American Edition with Fares, Oakbrook, Illinois, Volume 16, No. 7, January 1, 1990.
Price, P.J., W.M. Fisher, J. Bernstein, and D.S. Pallett, ``The DARPA 1000-Word Resource Management Database for Continuous Speech Recognition,'' Proceedings of ICASSP, 1988.