                        ARPA SLS Multi-site ATIS3 Data
                     December 1994 Benchmark Test Material
                        Test Overview and Procedures
                                 March, 1995


* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *
*                                                                         *
* If you intend to implement the protocols for the ARPA December '94 ATIS *
* Benchmark Tests, please read this document in its entirety before       *
* proceeding and do not examine the included transcriptions, annotations, *
* session logs, or documentation unless such examination is specifically  *
* permitted in the guidelines for the test(s) being run. Index files      *
* have been included which specify the exact data to be used for each     *
* test. To avoid testing on erroneous data, please refer to these files   *
* when running the tests.                                                 *
*                                                                         *
* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *


Contents
--------

1.0  Overview
2.0  Subdirectories
3.0  Online Documentation
4.0  Test Set Indices
5.0  December 1994 ATIS Test Overview
     5.1  Test Data Distribution
     5.2  Test Protocols
     5.3  Initial Scoring
     5.4  Adjudication
     5.5  Final Scoring
6.0  Test Scoring
     6.1  Scoring ATIS SPREC Tests
          6.1.1  Preparation of Hypothesized Transcripts
          6.1.2  Scoring SPREC Results
     6.2  Scoring ATIS NL and SLS Tests
          6.2.1  Preparation of CAS NL/SLS Output and Hypothesis Answers
          6.2.2  Scoring NL/SLS Results


1.0 Overview
-------------

This directory contains the waveform and textual data and documentation
necessary to implement the December 1994 suite of ARPA ATIS benchmark tests:

   Speech Recognition, Sennheiser mic. waveforms (SPREC-S)
   Speech Recognition, Crown mic. waveforms (SPREC-C)
   Natural Language (NL)
   Spoken Language System, Sennheiser mic. waveforms (SLS-S)
   Spoken Language System, Crown mic. waveforms (SLS-C)

The test data consists of 981 utterances from 131 subject-scenarios spoken
by 24 subjects. The data was collected at 5 sites (BBN, CMU, MIT, SRI,
NIST), each contributing approximately 200 utterances, so the collection
sites are roughly evenly represented.

The data is organized using conventional MADCOW ATIS directory and file
structures. The following filetypes are included on the discs:

   .log - session log
   .sro - detailed transcription
   .lsn - lexical SNOR transcription (used in scoring SPREC tests and as
          input for NL tests)
   .wav - utterance waveform (*s.wav - Sennheiser mic., *c.wav - Crown
          mic.; used as input for SPREC and SLS tests)
   .cat - query categorization (used in determining scorable queries)
   .win - wizard input to NL-Parse
   .sql - SQL generated by NL-Parse for minimal answer
   .sq2 - SQL generated by NL-Parse for maximal answer
   .ref - minimal CAS reference answer (used in scoring NL and SLS tests)
   .rf2 - maximal CAS reference answer (used in scoring NL and SLS tests)

NOTE: IF YOU INTEND TO REPLICATE THE CONDITIONS OF THE ORIGINAL DECEMBER
1994 ARPA ATIS TESTS, THE .lsn FILES ARE TO BE USED AS INPUT FOR NATURAL
LANGUAGE TESTS AND THE .wav FILES ARE TO BE USED AS INPUT FOR SPEECH
RECOGNITION AND SPOKEN LANGUAGE SYSTEM TESTS. ALL OTHER FILETYPES ARE
INCLUDED FOR POST-TEST DIAGNOSTICS AND SCORING ONLY AND SHOULD *NOT* BE
CONSULTED UNTIL TESTING IS COMPLETE.

See Sections 4.0 and 5.0 for specifics on implementing the tests.
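As a quick cross-check before testing, the sizes of the index files
described in Sections 3.0 and 4.0 can be compared against the counts above
(a sketch only; run it from the root of this disc, and it assumes each
index holds one path per line with no header or comment lines):

   wc -l atis3/sp_tst/dec94/nl.ndx      # expect 981 (NL test inputs)
   wc -l atis3/sp_tst/dec94/senn.ndx    # expect 981 (Sennheiser waveforms)
   wc -l atis3/sp_tst/dec94/crown.ndx   # expect 505 (Crown waveforms)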
2.0 Subdirectories
-------------------

The "atis3/sp_tst/dec94" directory contains the following subdirectories:

   initial/ - directory containing the pre-adjudicated test waveforms and
              transcriptions used as input for the SPREC, NL, and SLS
              tests.
   final/   - directory containing the post-adjudicated transcriptions,
              query categorizations, and reference answers for scoring
              SPREC, NL, and SLS output.

3.0 Online Documentation
-------------------------

The following files are included in the "atis3/sp_tst/dec94" directory:

   crown.ndx   - index file containing the list of 505 December 1994 ATIS
                 Crown microphone recordings for use in implementing the
                 secondary-microphone SPREC and SLS tests
   crt_dirs.sh - UNIX shell script to create the ATIS directory structure
                 from an index file
   dates.txt   - list of scenario-sessions and the dates on which they were
                 recorded. This information is to be used during the test
                 to establish the system date for each scenario-session.
   nl.ndx      - index file containing the list of 981 December 1994 ATIS
                 .lsn transcriptions for use in implementing the NL test
   senn.ndx    - index file containing the list of 981 December 1994 ATIS
                 Sennheiser microphone recordings for use in implementing
                 the primary-microphone SPREC and SLS tests

See the directory "atis3/doc" for speaker information and for general ATIS3
documentation.

4.0 Test Set Indices
---------------------

If you intend to run an NL test, you should use only the .lsn transcription
files under the "initial" directory as input to your system. To ensure that
the correct files are used, refer to the list of files in the index file
"nl.ndx".

To implement the SPREC Sennheiser or SLS Sennheiser tests, refer to the
index file "senn.ndx", which contains the path/file spec for each
Sennheiser microphone waveform file. To implement the SPREC Crown or SLS
Crown tests, refer to the index file "crown.ndx", which contains the
path/file spec for each Crown microphone waveform file. Note that only a
subset (505 utterances) of the test set was recorded with Crown microphones
in addition to Sennheiser microphones.

Note that the Sennheiser microphone index contains five fewer waveform
files than there are .sro files because of empty utterances; to avoid
confusion, the corresponding empty .lsn files have been removed as well.
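For example, a primary-microphone (Sennheiser) SPREC or SLS test harness
might process exactly the files named in "senn.ndx" (a minimal sketch;
"my_recognizer" is an invented stand-in for your own system's command-line
interface, and index entries are assumed to be paths relative to the root
of this disc):

   CDROM=/cdrom                       # example mount point for this disc
   while read wav; do
       my_recognizer "$CDROM/$wav"    # hypothetical per-utterance run
   done < "$CDROM/atis3/sp_tst/dec94/senn.ndx"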
5.0 December 1994 ATIS Test Overview
-------------------------------------

The following is an overview of the procedures used in implementing the
December 1994 ARPA ATIS tests.

5.1 Test Data Distribution
---------------------------

The test material was distributed on a single recordable CD-ROM (T9-1.1)
to the sites participating in the tests on November 30, 1994. This revised
disc contains the same test material, but with adjudicated transcriptions,
categorizations, and reference answers.

5.2 Test Protocols
-------------------

The tests were conducted according to established MADCOW protocols. As of
this publication, the test protocols have not yet been put into writing.

5.3 Initial Scoring
--------------------

Results of the tests were due at NIST on Friday, December 9, 1994. This
allowed the test sites 10 days to process the test corpora and package the
results for scoring at NIST. Test sites were permitted to run each test
only once, and results received after the December 9 deadline were to be
marked as "late". The results of the initial scoring run by NIST were to be
made available to the test sites on Wednesday, December 14, 1994. The
transcriptions and annotations used in the preliminary scoring are not
included on this disc.

5.4 Adjudication
-----------------

During the period between Wednesday, December 14 and Wednesday, December
21, 1994, sites were permitted to "contest" the transcriptions, reference
answers, and categorizations used in scoring their SPREC, NL, and SLS
output. NIST/SRI "adjudicators" considered each request and decided whether
or not to make the suggested modifications to the transcriptions and/or
annotations. When transcriptions were deemed to be ambiguous, alternative
transcriptions were employed; thus, some utterances have multiple "correct"
transcriptions.

5.5 Final Scoring
------------------

After the adjudication was complete and the final revisions had been made
to the transcriptions and annotations, a final "official" scoring run was
made on all the test results on Friday, January 13, 1995. The official
results of the December 1994 ARPA ATIS tests are to be published in the
Proceedings of the ARPA Spoken Language Technology Workshop, January 22-25,
1995. The transcriptions and annotations used in the final scoring are
located in the "final" directory on this disc.

6.0 Test Scoring
-----------------

This section describes the process used by NIST in scoring the December
1994 ATIS Natural Language (NL), Spoken Language System (SLS), and Speech
Recognition (SPREC) tests.

Sections 6.1.1 and 6.1.2 provide instructions on running a SPREC test. The
SPREC test is scored using the NIST speech recognition scoring package
supplied on this disc under the "/score" directory. Please install the
scoring package by following the instructions in the file
"/score/readme.doc". For a complete description of the NIST scoring package
and its use, see the file "/score/doc/score.rdm" on this disc.

Sections 6.2.1 and 6.2.2 provide instructions on running NL and SLS tests.
The NL and SLS tests are scored using the NIST CAS answer comparator
(comp4.exe) supplied in the "/comp" directory on this disc. Please install
the comparator according to the instructions in the documentation file
"/comp/readme.doc".

6.1 Scoring ATIS SPREC Tests
-----------------------------

This section describes the procedure NIST used to score SPREC results of
the December 1994 ARPA ATIS tests.

6.1.1 Preparation of Hypothesized Transcripts
----------------------------------------------

In order for the NIST scoring software to properly score the output
produced by SPeech ReCognition (SPREC) systems, the system-generated
hypothesized transcripts must be formatted according to the Lexical
Standard Normal Orthographic Representation (LSN) format. The reference LSN
transcriptions in ATIS are derived by filtering the detailed Speech
Recognizer Output (SRO) transcriptions to retain only the lexical
information required in scoring simple speech recognition output. The LSN
format can be understood by looking at the SRO specifications in
"/atis3/doc/sro_spec.doc" and performing the following simplifications
(from the "sro2lsn" program in "/score/bin"):

    1) remove edit cues and leave the remaining words
    2) add spaces before and after alternation markers
    3) delete the helpful interpretation marks
    4) delete non-lexical acoustic events in square brackets
    5) remove angle brackets from verbally deleted words
    6) remove the stars from mispronounced words
    7) delete false-start words ending (or beginning) with a hyphen
    8) replace any empty alternations with @
    9) collapse runs of spaces; delete initial and final spaces
   10) convert everything to uppercase

The recognized transcriptions must be put into the LSN format to be scored
properly against the reference transcriptions. Prior to scoring, the SPREC
transcriptions must be concatenated into a single file with one utterance
transcription per line and the utterance ID in parentheses at the end of
the line. For example:

   SHOW ME THE FLIGHTS FROM BOSTON TO DENVER (ZZZ011SS)
   WHAT IS THE FARE FOR FLIGHT ONE TWO THREE (ZZZ021SS)
   WHAT MEALS ARE SERVED ON THAT FLIGHT (ZZZ031SS)
   .
   .
   .
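If your recognizer writes one hypothesis file per utterance, the
concatenated scoring file can be assembled with a few lines of shell (a
sketch only; it assumes an invented convention of per-utterance files named
<utterance-id>.hyp, each holding a single line of LSN text):

   # Tag each one-line hypothesis with its utterance ID and concatenate.
   for f in hyp_out/*.hyp; do
       id=$(basename "$f" .hyp | tr '[:lower:]' '[:upper:]')
       printf '%s (%s)\n' "$(cat "$f")" "$id"
   done > all_hyps.lsn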
6.1.2 Scoring SPREC Results
----------------------------

In order to simplify scoring SPREC output, a UNIX shell script has been
created which performs all of the necessary scoring-package housekeeping
tasks. The script, "wgscore", is located in the directory "/score/bin" on
this disc. The script takes as input the concatenated SPREC hypothesis
transcription file and can be run with various options. When wgscore is
run, it creates a directory, named after the concatenated SPREC hypothesis
transcription file, which contains the hyp-ref alignments and various
summaries. See the manual page for "wgscore" in
"/score/doc/man/man1/wgscore.1" for instructions on its use.

6.2 Scoring ATIS NL and SLS Tests
----------------------------------

This section describes the procedure NIST used to score NL and SLS results
of the December 1994 ARPA ATIS tests.

6.2.1 Preparation of CAS NL/SLS Output and Hypothesis Answers
--------------------------------------------------------------

To run a Natural Language (NL) or full Spoken Language System (SLS) test,
first create an index file containing the full path and file specs of the
files to be processed (.lsn files for NL, .wav files for SLS). Index files
have been created for the December 1994 ARPA tests ("nl.ndx" [for NL] and
"senn.ndx" and "crown.ndx" [for SLS], in this directory).

Next, create an output directory on magnetic disk for your system output
(HYP_DIR) and duplicate the <site>/<session> paths in the index file under
this directory. A UNIX shell script, "crt_dirs.sh", has been provided in
this directory to aid in this step. The syntax for the script is:

   crt_dirs.sh <index_file> <HYP_DIR>

Next, process the files specified in the index through your NL or SLS
system and create a file for each answer under the appropriate
<site>/<session> directory. Be sure to use the system dates for each
scenario as specified in the file "dates.txt" in this directory. For your
output files, use the same basenames as the input files, but assign a
unique extension to the answer files you generate, such as ".nl",
".sls-senn", ".sls-crown", etc.

In order to be scored using the NIST comparator, your output must be
formatted according to the ARPA Common Answer Specification (CAS). The CAS
document is located in the file "cas_spec.doc" in the "atis3/doc" directory
on this disc.

An example output directory from a nonexistent NL system has been created
under "example/", which contains sample ".nl" system output files in the
proper directory and file structure for scoring.
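Putting Section 6.2.1 together for the NL test, a complete run might look
like the following sketch (all names are illustrative: "my_nl_system"
stands in for your own system, the "crt_dirs.sh" argument order is assumed
from its description above, and index entries are assumed to be paths
relative to the disc root that contain an "initial/" component):

   CDROM=/cdrom                          # example mount point for this disc
   NDX=$CDROM/atis3/sp_tst/dec94/nl.ndx
   HYP_DIR=/disk/hyp_dir                 # example output directory (HYP_DIR)

   # Replicate the <site>/<session> directories under HYP_DIR.
   sh "$CDROM/atis3/sp_tst/dec94/crt_dirs.sh" "$NDX" "$HYP_DIR"

   while read lsn; do
       rel=${lsn#*initial/}              # <site>/<session>/<utterance>.lsn
       my_nl_system "$CDROM/$lsn" > "$HYP_DIR/${rel%.lsn}.nl"   # CAS output
   done < "$NDX"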
6.2.2 Scoring NL/SLS Results
-----------------------------

You can score your results in one step using the UNIX shell script
"scor_cas.sh", located in the "/comp" directory on this disc. Note that the
NIST comparator, "comp4.exe", which is also located in the "/comp"
directory, must be installed before running "scor_cas.sh". See the file
"readme.doc" under "/comp" for installation instructions.

The syntax for "scor_cas.sh" is:

   scor_cas.sh <REF_DIR> <HYP_DIR> <HYP_EXT> <COMP_DIR>

where,

   REF_DIR  is the path of the directory containing the <site>/<session>
            hierarchy of CAS reference answers.
   HYP_DIR  is the path of the directory containing the <site>/<session>
            hierarchy of your CAS hypothesis answers.
   HYP_EXT  is the name of the extension you have given to your CAS
            hypothesis answer files (.nl, .sls-senn, etc.). Warning: make
            sure that this extension is unique, since all files with this
            extension will be scored.
   COMP_DIR is the path for the NIST comparator, "comp4.exe".

Example:

   scor_cas.sh <CDROM>/atis3/sp_tst/dec94/final <YOUR_DISK>/hyp_dir \
               nl <CDROM>/comp/comp4.exe

   (the above should all be on one line)

where,

   CDROM     is the path for your CD-ROM drive where this disc is located.
   YOUR_DISK is the path for your local magnetic disk drive where your
             output and the comp4.exe executable are stored (your output
             and comp4.exe can be stored on different disks).

Upon completion, the script will generate two files in the current
directory:

   class-a.<HYP_EXT>.score - scores and summary for Class-A queries
   class-d.<HYP_EXT>.score - scores and summary for Class-D queries

scor_cas.sh performs several steps, creates several intermediate files in
the current directory, and employs the NIST comparator to actually score
the results. If you would like to experiment with the comparator directly,
see the file "/comp/readme.doc" for a detailed description of the
comparator and its use. An example execution is included below. The options
used, "-ncfs" and "-pd3", are required to duplicate the settings used in
the December 1994 scoring.

   comp4.exe -ncfs -pd3
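As a further illustration of "scor_cas.sh", scoring Sennheiser-microphone
SLS results would use the same syntax with a different hypothesis extension
(a sketch; the paths are illustrative and the ".sls-senn" extension follows
the suggestion in Section 6.2.1):

   scor_cas.sh /cdrom/atis3/sp_tst/dec94/final /disk/hyp_dir \
               sls-senn /cdrom/comp/comp4.exe
   # writes class-a.sls-senn.score and class-d.sls-senn.score
   # in the current directory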