                               November 1992
            ARPA Continuous Speech Recognition Benchmark Tests
                         Corpora and Instructions

            NIST Speech Discs 11-13.1, 11-14.1, and 11-15.1
                       Public Release, June 1994


* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * *
*                                                                         *
* If you intend to implement the protocols for the November '92 ARPA     *
* CSR Benchmark Tests, please read the file, "csrnov92.doc", in the      *
* top-level directory of NIST Speech Disc 11-13.1 in its entirety        *
* before proceeding, and do not examine the included transcriptions,     *
* calibration recordings, adaptation recordings, or documentation        *
* unless such examination is specifically permitted in the guidelines    *
* for the test(s) being run.                                              *
*                                                                         *
* Index files have been included which specify the exact data to be      *
* used for each test.  To avoid testing on erroneous data, please        *
* refer to these files when running the tests.                            *
*                                                                         *
* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * *


Contents
--------

1.0  Introduction
2.0  November 1992 CSR Test Specifications
3.0  CD-ROM Organization
     3.1  Directory Structure
     3.2  Test Corpora Summary
     3.3  Filenaming Formats
     3.4  Data Types
          3.4.1  Waveforms (.wv?)
          3.4.2  Detailed Orthographic Transcriptions (.dot)
          3.4.3  Lexical SNOR Transcriptions (.lsn)
          3.4.4  Prompting Texts (.ptx)
     3.5  Indices
4.0  Baseline Training and Language Model Data
5.0  Test Scoring
     5.1  Preparation of Hypothesized Transcripts
     5.2  Scoring Results
     5.3  System Descriptions


1.0 Introduction
-----------------

This 3-disc set contains the test material and documentation for the
November 1992 ARPA Continuous Speech Recognition Wall Street Journal
(CSR-WSJ) Benchmark Tests.  This material is to be used in conjunction
with the WSJ0 training and development test material (NIST Speech Discs
11-1.1 - 11-12.1), which is available separately.  The first disc, Disc
11-13.1, contains the documentation, instructions, and software for
implementing the tests.  The remaining two discs, Discs 11-14.1 and
11-15.1, contain the waveforms, prompts, and transcriptions which
comprise the test material.

The Continuous Speech Recognition Wall Street Journal Phase I (CSR-WSJ0)
Corpus was designed by the (D)ARPA CSR Corpus Coordinating Committee
(CCCC) in 1990-1991 and was collected at the Massachusetts Institute of
Technology Laboratory for Computer Science (MIT-LCS), SRI International,
and Texas Instruments (TI) in late 1991.  A "Dry Run" test was
implemented on a set of corpora similar to this collection at the end of
1991, and preliminary results were reported at the February 1992 (D)ARPA
Speech and Natural Language Workshop.  An official CSR test was then
conducted in November 1992 on this test material, and results were
reported at the January 1993 (D)ARPA Spoken Language Systems Technology
Workshop and at the March 1993 ARPA Human Language Technology Workshop.

The test material on these discs supports read 5,000-word and
20,000-word WSJ-vocabulary CSR tests as well as tests using spontaneous
dictation.  Each set of test material is partitioned into utterances
with and without verbal punctuation, and each utterance was recorded
with two microphones: a Sennheiser close-talking microphone (channel 1)
and a secondary microphone of varying type (channel 2).

To minimize the storage requirements for the corpora, the waveforms have
been compressed using the SPHERE-embedded "Shorten" lossless compression
algorithm, which was developed at Cambridge University.  The use of
"Shorten" has approximately halved the storage requirements for the
corpora.  The NIST SPeech HEader REsources (SPHERE) software with
embedded Shorten compression is included in the "sphere" directory at
the top level of Disc 11-13.1 and can be used to decompress and
manipulate the waveform files.
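Each waveform file begins with a plain-ASCII SPHERE header (the field
inventory appears in Section 3.4.1) followed by the sample data.  The
following Python sketch -- illustrative only, and not part of this
release -- shows one way the header can be read.  Note that it does not
decompress the Shorten-compressed samples; the supplied SPHERE tools
should be used for that:

   # sphere_header.py -- illustrative reader for the plain-ASCII NIST
   # SPHERE header (see Section 3.4.1 for the field inventory).

   def read_sphere_header(path):
       """Return the SPHERE header fields of `path` as a dict."""
       fields = {}
       with open(path, "rb") as f:
           if f.readline().strip() != b"NIST_1A":   # format magic
               raise ValueError("not a NIST SPHERE file")
           f.readline()                             # header size, e.g. "   1024"
           for raw in f:
               line = raw.decode("ascii", "replace").strip()
               if not line:
                   continue
               if line == "end_head":               # end of header fields
                   break
               name, ftype, value = line.split(None, 2)
               if ftype == "-i":                    # integer field
                   fields[name] = int(value)
               elif ftype == "-r":                  # real field
                   fields[name] = float(value)
               else:                                # -s<N>: string field
                   fields[name] = value
       return fields

   # Example (hypothetical local copy of a test waveform):
   # hdr = read_sphere_header("440c0201.wv1")
   # print(hdr["speaker_id"], hdr["sample_rate"], hdr["sample_coding"])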
Disc 1 in the set (NIST Speech Disc 11-13.1) contains documentation for
the corpora on the other two discs as well as the MIT Lincoln Laboratory
WSJ '87-89 language models, a collation of the speech waveform file
headers together with a program to search them, and indices for each
suggested test.  The NIST speech recognition scoring package and the
SPHERE toolkit have also been included in the top-level directory of
Disc 11-13.1.  General information files named "readme.doc" have been
included in most high-level directories on Disc 11-13.1 and describe the
contents of those directories.

The collection and publication of the test corpora and the
implementation of the November '92 ARPA CSR Benchmark Tests have been
sponsored by the Advanced Research Projects Agency Software and
Intelligent Systems Technology Office (ARPA-SISTO) and the Linguistic
Data Consortium (LDC).  The corpus was designed by the ARPA CCCC.  MIT
Lincoln Laboratory developed the text selection tools and the WSJ '87-89
language models.  The corpus was collected at MIT-LCS, SRI, and TI and
produced on CD-ROM by the National Institute of Standards and Technology
(NIST), and the November '92 ARPA CSR Benchmark Tests were administered
by NIST.


2.0 November 1992 CSR Test Specifications
------------------------------------------

The following is a copy of the final specifications for the November
1992 ARPA CSR benchmark tests as posted in an email message from Francis
Kubala, the CCCC chair, on September 10, 1992.  Information not
pertinent to the data on these discs has been removed.

-------------------------
PART I -- CONTROLLED TEST
-------------------------

[A] TWO TEST BASELINES.
-----------------------

* Required Common Baseline:

  5K vocabulary test data from the pilot corpus, closed vocabulary
  bigram language model supplied by Lincoln.

* Conditionally Required Baseline for 20K Results:
  (required only when optional 20K results are presented)

  20K vocabulary test data from the pilot corpus, open vocabulary bigram
  language model supplied by Lincoln.

-- Common Conditions for the Baselines:

  Sennheiser mic, 8 SI speakers, 320 utterances, no verbalized
  punctuation, static SI test (utterance order or speaker ID not used).

[B] CONTROLLED TRAINING FOR THE BASELINE TEST.
----------------------------------------------

* Sites are free to choose either SI-84, SI-12, or SI-3 data from the
  WSJ pilot corpus -- 12 hours of speech, 7200 utterances per dataset.

[C] SCORING.
------------

* Standard word-based alignment algorithm used for scoring.

* Separate performance numbers for each of the two test baselines.

[D] ENCOURAGED TESTS.
---------------------

The majority of the tests to be performed by a number of sites will fall
in this area.  Common test sets and, in some cases, common problem
definitions are strongly encouraged.  Below are some of the more salient
conditions, with the first three being strongly recommended by the SLS
Coordinating Committee:

* SPONTANEOUS SPEECH -- Test on 8 SI speakers, 40 NVP utts per speaker,
  Sennheiser mic, 20K open vocabulary.

* ALTERNATE MICROPHONE -- A well-formed alternate-microphone experiment
  requires results from 4 comparable conditions: test-on-same-mic and
  test-on-alt-mic, both with and without a proposed adaptation
  algorithm.  The recommended test is on the 8 SI speakers, read speech,
  40 NVP utts per speaker, Sennheiser and alternate mics, 5K closed
  vocabulary.
* OPEN VOCABULARIES -- Test on 8 SI speakers, read speech, 40 NVP utts
  per speaker, Sennheiser mic, either 5K or 20K vocabulary.

* ADAPTATION -- The four recommended adaptation scenarios are
  differentiated by the side information available to the system.  Here
  is a summary of the 4 adaptation scenarios specifying the side
  information that will be made available to each:

  1. Incremental, Unsupervised: speaker ID, utterance chronology, speech
     from the portion of the test session already recognized available
     for examination.

  2. Incremental, Supervised: speaker ID, utterance chronology, speech
     and correct answer from the portion of the test session already
     recognized available for examination.

  3. Rapid Enrollment: speaker ID, sample of the speaker's speech from a
     session other than the test available for examination.

  4. Transcription: speaker ID, entire test session available for
     examination.

  The primary performance number for each adaptation condition
  (including the incremental modes in particular) is the usual average
  word error computed over the entire test set for a given condition.

* OTHER CONDITIONS that can measure specific CSR capabilities are also
  strongly encouraged.  Participating sites can design their own
  conditions to showcase specific technologies.


3.0 CD-ROM Organization
------------------------

The test corpora for the November 1992 ARPA CSR Benchmark Tests are
contained on two CD-ROMs (Discs 11-14.1 and 11-15.1).  The documentation
for the tests is contained on Disc 11-13.1.

The test material on Discs 11-14.1 and 11-15.1 is composed of read CSR
Wall Street Journal Pilot (WSJ0) data collected at MIT, SRI, and TI and
spontaneous/read dictation data collected at SRI.  The data is formatted
according to the standard CSR conventions and uses the same file and
directory formats as the WSJ0 training data on Discs 11-1.1 - 11-12.1.
However, the prompts and transcriptions are included alongside the test
waveforms on the two discs, unlike the WSJ0 training data, in which the
texts are collated on a single disc separate from the waveforms.

In addition to the online documentation for the tests, Disc 11-13.1
contains software packages useful in processing the speech corpora and
tabulating speech recognition scores.  The top-level directory of Disc
11-13.1 contains the following major subdirectories:

   hgrep/   Utility to search a collated SPHERE header contents file.

   wsj0/    Test corpora documentation.

   score/   NIST speech recognition scoring software.  Includes dynamic
            string-alignment scoring code and statistical significance
            tests.

   sphere/  NIST SPeech HEader REsources toolkit.  Provides command-line
            and programmer interfaces to NIST-headered speech waveform
            files.  Also provides for automatic decompression of
            Shorten-compressed WSJ0 waveform files.

General information files named "readme.doc" have been included in each
of the high-level directories and throughout the documentation directory
("wsj0/doc") on Disc 11-13.1 and describe the contents of those
directories.

Three text files are included in the root directory of each of the two
data discs and contain descriptors for the contents of the disc.  The
file "<disc-id>.dir" contains a list of all directories and files on the
disc.  The files "discinfo.txt" and "<disc-id>.txt" both contain a
high-level description of the corpora on the disc.  The static filename
"discinfo.txt" is used across all discs, while the variable filename
determined by the disc ID is unique to each disc; this allows
flexibility in using the information.  The following is an example of
the contents of one of these sets of files (filenames "discinfo.txt" and
"11_14_1.txt"):

   disc_id: 11_14_1
   data_types: si_et_05:8:1302, si_et_20:8:1312, si_et_ad:8:640, si_et_jd:8:1280
   channels: 1,2

The first field, "disc_id", identifies the disc number.  The second
field, "data_types", contains comma-separated entries for the subcorpora
(directories) on the disc, with each entry identifying the subcorpus,
its number of speakers, and its number of waveform files.  The third
field, "channels", contains a comma-separated list of the channels
contained on the disc; its values are "1" (Sennheiser) and/or "2" (other
mic.) for these corpora.
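As a minimal sketch of how this small description file might be parsed
-- the function name is hypothetical, and the field names follow the
example above:

   # parse_discinfo.py -- illustrative parser for "discinfo.txt".

   def parse_discinfo(text):
       """Parse discinfo.txt contents into a small dict."""
       info = {}
       for line in text.splitlines():
           if ":" not in line:
               continue
           key, value = line.split(":", 1)        # split on first colon only
           info[key.strip()] = value.strip()
       # "data_types" entries look like "si_et_05:8:1302"
       # (subcorpus, speaker count, waveform-file count).
       subcorpora = []
       for entry in info.get("data_types", "").split(","):
           name, n_spkrs, n_files = entry.strip().split(":")
           subcorpora.append((name, int(n_spkrs), int(n_files)))
       info["data_types"] = subcorpora
       info["channels"] = [c.strip() for c in
                           info.get("channels", "").split(",") if c.strip()]
       return info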
3.1 Directory Structure
------------------------

The following depicts the directory structure of the corpora on Discs
11-14.1 and 11-15.1:

   top level:      wsj0/      (Phase 1 (pilot) corpus)

   2nd level:
                / si_et_05/   (Speaker-Independent, 5K Vocabulary,
                |              Read WSJ test data)
        Disc    | si_et_20/   (Speaker-Independent, 20K Vocabulary,
        11-14.1 |              Read WSJ test data)
                | si_et_ad/   (Speaker-Independent, Adaptation Utterances)
                \ si_et_jd/   (Speaker-Independent, Spontaneous Journalist
                               Dictation test data)

                / sd_et_05/   (Speaker-Dependent, 5K Vocabulary,
        Disc    |              Read WSJ test data)
        11-15.1 | sd_et_20/   (Speaker-Dependent, 20K Vocabulary,
                |              Read WSJ test data)
                \ si_et_jr/   (Speaker-Independent, Read version of
                               spontaneous from "si_et_jd" on Disc 11-14.1)

   speaker level:  <SSS>/     (speaker ID, where SSS = "001" to "zzz",
                               base 36)

   data level:     (corpora files, see below for format and types)


3.2 Test Corpora Summary
-------------------------

Disc     Directory   Utts  Files  Data set(s)
-------  ---------   ----  -----  ---------------------------------------
11-14.1  si_et_05/    330    660  SI 5K read NVP   (8 spkrs X ~40 utts) Senn/2nd
                      321    642  SI 5K read VP    (8 spkrs X ~40 utts) Senn/2nd
                      ---   ----
                      651   1302  (total in directory)

         si_et_20/    333    666  SI 20K read NVP  (8 spkrs X ~40 utts) Senn/2nd
                      323    646  SI 20K read VP   (8 spkrs X ~40 utts) Senn/2nd
                      ---   ----
                      656   1312  (total in directory)

         si_et_ad/    320    640  SI adapt. sents. (8 spkrs X ~40 utts) Senn/2nd
                                  Note: this is not test data

         si_et_jd/    320    640  SI spon. NVP     (8 spkrs X ~40 utts) Senn/2nd
                      320    640  SI spon. VP      (8 spkrs X ~40 utts) Senn/2nd
                      ---   ----
                      640   1280  (total in directory)

11-15.1  sd_et_05/    310    620  SD 5K read NVP   (12 spkrs X ~25 utts) Senn/2nd
                      300    600  SD 5K read VP    (12 spkrs X ~25 utts) Senn/2nd
                      ---   ----
                      610   1220  (total in directory)

         sd_et_20/    312    624  SD 20K read NVP  (12 spkrs X ~25 utts) Senn/2nd
                      308    616  SD 20K read VP   (12 spkrs X ~25 utts) Senn/2nd
                      ---   ----
                      620   1240  (total in directory)

         si_et_jr/    320    640  SI read/spon. NVP (8 spkrs X ~40 utts) Senn/2nd
                      320    640  SI read/spon. VP  (8 spkrs X ~40 utts) Senn/2nd
                      ---   ----
                      640   1280  (total in directory)

Note: The corpora for the November 1992 CSR tests were split roughly
evenly between utterances with and without verbal punctuation.  "VP"
indicates that the data contains verbal punctuation; "NVP" indicates
that the data does not contain verbal punctuation.  The verbal
punctuation mode is indicated by the "speech type" code in the filenames
(see below).


3.3 Filenaming Formats
-----------------------

The filenames and filetypes follow standard CSR WSJ conventions.  Data
types are differentiated by unique filename extensions.  All files
associated with the same utterance have the same basename.  All
filenames are unique across all WSJ corpora.  Speech waveform (.wv[1-2])
files are utterance-level files; prompt (.ptx) and transcription (.dot
and .lsn) files are session-level files and, therefore, contain texts
for multiple waveform files.

The filename format is as follows:

   <UTTERANCE-ID>.<XXX>

where,

   UTTERANCE-ID ::= <SSS><T><EE><UU>

   where,
      SSS ::= 001 | ... | zzz    (base-36 speaker ID)

      T   ::=                    (speech type code)
              c (Common read, no verbal punctuation)                 |
              s (Spontaneous, no/unspecified verbal punctuation)     |
              a (Adaptation read)                                    |
              r (Read version of spontaneous, no verbal punctuation) |
              o (cOmmon read, with verbal punctuation)               |
              p (sPontaneous, with verbal punctuation)               |
              e (rEad version of spontaneous, with verbal punctuation)

      EE  ::= 01 | ... | zz      (base-36 session ID)

      UU  ::= 01 | ... | zz      (base-36 within-session sequential
                                  speaker utterance code - always "00"
                                  for the .ptx, .dot, and .lsn
                                  session-level files)

   XXX ::=                       (data type)
      wv1  (channel 1 - Sennheiser waveform)
      wv2  (channel 2 - other mic waveform)
      ptx  (prompting text for read material)
      dot  (detailed orthographic transcription)
      lsn  (Lexical SNOR transcription derived from .dot)
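The utterance-ID convention above is mechanical enough to decode in a
few lines.  The following Python sketch -- illustrative only, with a
hypothetical function name -- splits an ID into its fields; the
speech-type descriptions mirror the table above:

   # utt_id.py -- illustrative decoder for the utterance-ID convention.

   SPEECH_TYPES = {
       "c": "common read, no verbal punctuation",
       "s": "spontaneous, no/unspecified verbal punctuation",
       "a": "adaptation read",
       "r": "read version of spontaneous, no verbal punctuation",
       "o": "common read, with verbal punctuation",
       "p": "spontaneous, with verbal punctuation",
       "e": "read version of spontaneous, with verbal punctuation",
   }

   def parse_utterance_id(utt_id):
       """Split an 8-character ID like '4oac0201' into its fields."""
       if len(utt_id) != 8 or utt_id[3] not in SPEECH_TYPES:
           raise ValueError("malformed utterance ID: %r" % utt_id)
       return {
           "speaker_id":   utt_id[0:3],           # base-36 speaker, e.g. "4oa"
           "speech_type":  SPEECH_TYPES[utt_id[3]],
           "session_id":   utt_id[4:6],           # base-36 session
           "utterance_no": utt_id[6:8],           # "00" for session-level files
       }

   # parse_utterance_id("4oac0201")
   # -> speaker "4oa", common read (NVP), session "02", utterance "01"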
3.4 Data Types
---------------

3.4.1 Waveforms (.wv?)
-----------------------

The waveforms are SPHERE-headered, digitized, and compressed using the
lossless Cambridge University "Shorten" algorithm under SPHERE.  Version
2.1 of SPHERE, which permits the waveform files to be decompressed
automatically as they are accessed, has been included on Disc 11-13.1;
see the files under the "/sphere" directory on that disc.  The filename
extension for the waveforms contains the characters "wv" followed by a
1-character code identifying the channel.

The headers contain the following fields/types:

Field                     Type     Description (probable defaults in ())
------------------------  -------  -------------------------------------
microphone                string   microphone description ("Sennheiser
                                   HMD410", "Crown PCC160", etc.)
recording_site            string   recording site ("MIT", "SRI", "TI")
database_id               string   database (corpus) identifier ("wsj0")
database_version          string   database (corpus) revision ("1.0")
recording_environment     string   text description of recording
                                   environment
speaker_session_number    string   2-char. base-36 session ID from
                                   filename
session_utterance_number  string   2-char. base-36 utterance number
                                   within session from the filename
prompt_id                 string   WSJ source sentence text ID - see the
                                   .ptx description below for format
                                   (only in read data)
utterance_id              string   utterance ID from filename, of the
                                   form SSSTEEUU as described in the
                                   filenaming section above
speaking_mode             string   speaking mode ("spontaneous",
                                   "read-common", "read-adaptation",
                                   etc.)
speaker_id                string   3-char. speaker ID from filename
sample_count              integer  number of samples in waveform
sample_min                integer  minimum sample value in waveform
sample_max                integer  maximum sample value in waveform
sample_checksum           integer  checksum obtained by adding all
                                   (uncompressed) samples into an
                                   unsigned 16-bit (short) integer,
                                   discarding overflow
recording_date            string   beginning-of-recording date stamp of
                                   the form DD-MMM-YYYY
recording_time            string   beginning-of-recording time stamp of
                                   the form HH:MM:SS.HH
channel_count             integer  number of channels in waveform ("1")
sample_rate               integer  waveform sampling rate ("16000")
sample_n_bytes            integer  number of bytes per sample ("2")
sample_byte_format        string   byte order (MSB/LSB -> "10",
                                   LSB/MSB -> "01")
sample_sig_bits           integer  number of significant bits in each
                                   sample ("16")
sample_coding             string   waveform encoding
                                   ("pcm,embedded-shorten-v1.09")
end_head
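As a worked example of the "sample_checksum" definition above -- sum all
(uncompressed) samples into an unsigned 16-bit quantity, discarding
overflow -- the following Python sketch computes the checksum over raw
16-bit PCM data.  It is illustrative only; the byte order must match the
file's "sample_byte_format" field, and the input length must be a
multiple of 2 bytes:

   # checksum.py -- illustrative computation of "sample_checksum".

   import struct

   def sample_checksum(pcm_bytes, little_endian=True):
       """Checksum of 16-bit PCM data, per the header-field definition."""
       fmt = "<h" if little_endian else ">h"     # per sample_byte_format
       total = 0
       for (sample,) in struct.iter_unpack(fmt, pcm_bytes):
           total = (total + sample) & 0xFFFF     # keep low 16 bits only
       return total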
3.4.2 Detailed Orthographic Transcriptions (.dot)
--------------------------------------------------

A detailed orthographic transcription (.dot) containing lexical and
non-lexical elements has been generated for each utterance.  The
specifications for the format of the detailed orthographic
transcriptions are located in the file "dot_spec.doc" under the
"/wsj0/doc" directory on Disc 11-13.1.  The transcriptions for all
utterances in a session are concatenated into a single file of the form
"<SSS><T><EE>00.dot", and each transcription includes a corresponding
utterance-ID code.  The format for a single utterance transcription
entry in this file is as follows:

   <TRANSCRIPTION> (<UTTERANCE-ID>)

example (new-line added for readability):

   Speculation in Tokyo was that the yen could rise because of the
   realignment (4oac0201)

There is one ".dot" file for each speaker-session.

3.4.3 Lexical SNOR Transcriptions (.lsn)
-----------------------------------------

The lexical Standard Normal Orthographic Representation (lexical SNOR)
(.lsn) transcriptions are word-level transcriptions derived from the
".dot" transcriptions with capitalization, non-speech markers, prosodic
markings, fragments, and "\" character escapes filtered out.  The .lsn
transcriptions have the same form as the .dot transcriptions and are
identified by a ".lsn" filename extension.

example (new-line added for readability):

   SPECULATION IN TOKYO WAS THAT THE YEN COULD RISE BECAUSE OF THE
   REALIGNMENT (4OAC0201)

There is one ".lsn" file for each speaker-session.

3.4.4 Prompting Texts (.ptx)
-----------------------------

The prompting texts for all read Wall Street Journal utterances in a
session, including the utterances' utterance-IDs and prompt IDs, are
concatenated into a single file of the form "<SSS><T><EE>00.ptx".  The
prompt ID is Doug Paul's Wall Street Journal sentence index.  The format
for this index is:

   <YEAR>.<FILE>.<ARTICLE-ID>.<PARAGRAPH>.<SENTENCE>

The format for a single prompting text entry in the .ptx file is as
follows:

   <PROMPTING-TEXT> (<UTTERANCE-ID> <PROMPT-ID>)

example (new-line added for readability):

   Speculation in Tokyo was that the yen could rise because of the
   realignment. (4oac0201 87.051.870113-0174.6.1)

The inclusion of both the utterance ID and the prompt ID allows the
utterance to be mapped back to its source sentence text and surrounding
paragraph.  There is one .ptx file for each read speaker-session.

3.5 Indices
------------

Index files have been built for each of the suggested test sets.  The
files are located in the "wsj0/doc/indices/test" directory on Disc
11-13.1 and are named so as to clearly indicate the tests they pertain
to.  Index files for baseline training conditions have also been
included in the directory "wsj0/doc/indices/train".

Each index file contains a header which describes its contents.  Header
lines are preceded by ";;".  Each line following the header indicates
the disc, path, and waveform file for an utterance in the test set
(e.g., "11_14_1:wsj0/si_et_05/440/440c0201").  Note that auxiliary files
such as adaptation utterances which are not part of the test set are not
included in the indices.

The .wv[1-2] extension has not been included in the test indices so that
the indices can be used for testing on either channel/microphone.
However, unlike the test corpora, the training corpora for the different
channels/microphones exist on different CD-ROMs, so an index has been
built for each channel for each training condition.
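Given the index line format just described, enumerating the waveform
files for a test is a one-pass scan.  The following Python sketch --
illustrative only, with a hypothetical function name -- skips the ";;"
header lines and appends the desired channel extension:

   # read_index.py -- illustrative reader for the test index files.
   # Lines beginning with ";;" are header comments; remaining lines
   # have the form "<disc>:<path>" with no waveform extension.

   def read_index(ndx_path, channel=1):
       """Yield (disc, waveform-path) pairs for the chosen channel."""
       with open(ndx_path) as f:
           for line in f:
               line = line.strip()
               if not line or line.startswith(";;"):
                   continue                       # skip header/comments
               disc, path = line.split(":", 1)
               yield disc, "%s.wv%d" % (path, channel)

   # e.g. yields ("11_14_1", "wsj0/si_et_05/440/440c0201.wv1")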
4.0 Baseline Training and Language Model Data
----------------------------------------------

Some of the tests in the CCCC test specifications in Section 2.0 call
for the use of "standard" baseline training sets.  These training sets
are drawn from the CSR WSJ0 pilot corpus training data on NIST Speech
Discs 11-1.1 - 11-12.1.  Indices have been developed for the following
baseline training sets and are located in the subdirectories under the
directory "wsj0/doc/indices/train" on Disc 11-13.1:

   wsj0/train:  Indices for the WSJ0 (~7,200-utterance) training sets

      tr_l_wv1.ndx   WSJ0 SD/SI-long term training, Sennheiser mic
      tr_l_wv2.ndx   WSJ0 SD/SI-long term training, Secondary mic
      tr_s_wv1.ndx   WSJ0 SI-short term training, Sennheiser mic
      tr_s_wv2.ndx   WSJ0 SI-short term training, Secondary mic
      tr_v_wv1.ndx   WSJ0 Longitudinal-SD/SI-very-long training,
                     Sennheiser mic
      tr_v_wv2.ndx   WSJ0 Longitudinal-SD/SI-very-long training,
                     Secondary mic

Some of the tests in the CCCC test specifications in Section 2.0 call
for the use of "standard" baseline bigram or trigram language models.
The baseline language models were developed by MIT Lincoln Laboratory
and are included in the directory "wsj0/doc/lng_modl/base_lm" on Disc
11-13.1.  The baseline language models are as follows:

   bcb05cnp.z   5K closed NVP bigram LM
   bcb05cvp.z   5K closed VP bigram LM
   bcb05onp.z   5K open NVP bigram LM
   bcb05ovp.z   5K open VP bigram LM
   bcb20cnp.z   20K closed NVP bigram LM
   bcb20cvp.z   20K closed VP bigram LM
   bcb20onp.z   20K open NVP bigram LM
   bcb20ovp.z   20K open VP bigram LM
   tb05cnp.z    5K closed NVP trigram LM
   tb20onp.z    20K open NVP trigram LM

To conserve disc space, the language model files have been compressed
using the standard UNIX "compress" utility and must be decompressed
before use.


5.0 Test Scoring
-----------------

This section describes the process used by NIST in scoring the November
1992 WSJ/CSR Benchmark Tests.  The information in this section can also
be used by those who wish to duplicate the scoring methodology.  For a
complete description of the NIST scoring package and its use, see the
file "score/doc/score.rdm" on Disc 11-13.1.

5.1 Preparation of Hypothesized Transcripts
--------------------------------------------

The system-generated hypothesized transcripts were formatted by the test
sites according to the Lexical SNOR (LSN) format used by the scoring
package.  See Section 3.4.3 for a description of the LSN format.
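Following the .lsn example in Section 3.4.3 (uppercase words followed by
the utterance ID in parentheses), a hypothesis file can be emitted as in
the Python sketch below.  This is a minimal sketch with a hypothetical
function name; the authoritative formatting requirements are in
"score/doc/score.rdm":

   # write_hyp.py -- illustrative emitter for an LSN-format hypothesis
   # file: one utterance per line, uppercase words, then "(utt-id)".

   def write_hypotheses(hyps, out_path):
       """hyps: iterable of (utterance_id, list-of-words) pairs."""
       with open(out_path, "w") as out:
           for utt_id, words in hyps:
               out.write("%s (%s)\n"
                         % (" ".join(w.upper() for w in words), utt_id))

   # write_hypotheses(
   #     [("4oac0201",
   #       "speculation in tokyo was that the yen could rise".split())],
   #     "site.hyp")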
5.2 Scoring Results
--------------------

In the November 1992 tests, the hypothesized transcriptions were scored
using the standard NIST scoring package.  The scoring package has been
included on Disc 11-13.1 in the top-level directory "score".  That
directory contains a "readme.doc" file with compilation and installation
instructions.

In order to score a hypothesis transcription against a reference
transcription using the NIST scoring software, an "alignment" file must
first be created.  The alignment file contains pairs of hypothesis and
reference strings which have been aligned using a dynamic programming
(DP) string alignment procedure.  (A toy illustration of this kind of
alignment appears at the end of this document.)  The format for using
the "align" utility is:

   <INSTALL_DIR>/bin/align -cfg <CONFIG_FILE> -hyp <HYP_FILE> -outfile <ALIGNMENTS>

where:

   INSTALL_DIR  is the pathname to the compiled "score" directory.
   CONFIG_FILE  contains a list of arguments to the scoring software,
                including the filespec for the reference transcription
                file, the lexicon file, and command-line switches.
   HYP_FILE     contains the LSN-formatted hypothesis transcriptions.
   ALIGNMENTS   contains the output alignments.

Example, where "score" is located under the current directory:

   ./score/bin/align -cfg ./score/lib/wsj.cfg -hyp site.hyp -outfile site.ali

The actual tabulation of the scores is generated using the above
alignment file as input.  The "score" program with the "-ovrall" switch
creates a by-speaker summary table of the error rates (insertions,
deletions, etc.).  The format for using the "score" utility is:

   <INSTALL_DIR>/bin/score -cfg <CONFIG_FILE> -align <ALIGNMENTS> <REPORT_OPTIONS>

where:

   INSTALL_DIR     is the pathname to the compiled "score" directory.
   CONFIG_FILE     contains a list of arguments to the scoring software,
                   including the filespec for the reference
                   transcription file, the lexicon file, and
                   command-line switches.
   ALIGNMENTS      contains the alignments produced by "align".
   REPORT_OPTIONS  are switches to generate different reports.

Example, using the example output from "align" above, where "score" is
located under the current directory:

   ./score/bin/score -cfg ./score/lib/wsj.cfg -align site.ali -ovrall

Note: The "score" program can produce several other reports of greater
or lesser detail using other command-line switches.  See the manual page
for "score" for a description of its other uses.

5.3 System Descriptions
------------------------

As part of the November 1992 CSR Tests, each test site was required to
generate a description of the systems used in each test according to a
prescribed format.  If you intend to publish results using this test
material, you should provide such a system description along with your
results.  The format for the system description is as follows:

                          SITE/SYSTEM NAME
                          TEST DESIGNATION

   1) TEST SYSTEM DESCRIPTION:
   2) ACOUSTIC TRAINING:
   3) GRAMMAR TRAINING:
   4) RECOGNITION LEXICON DESCRIPTION:
   5) NEW CONDITIONS FOR THIS EVALUATION:
   6) REFERENCES:
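As referenced in Section 5.2, the following Python sketch illustrates
the dynamic programming string alignment idea using unit costs for
substitutions, deletions, and insertions.  It is a toy only and is no
substitute for the NIST "align"/"score" tools, whose error weights and
reports differ:

   # toy_wer.py -- illustrative DP word alignment with unit costs.

   def word_errors(ref, hyp):
       """Return the minimum substitutions+deletions+insertions
       needed to turn the word list `ref` into the word list `hyp`."""
       d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
       for i in range(1, len(ref) + 1):
           d[i][0] = i                            # all deletions
       for j in range(1, len(hyp) + 1):
           d[0][j] = j                            # all insertions
       for i in range(1, len(ref) + 1):
           for j in range(1, len(hyp) + 1):
               sub = 0 if ref[i - 1] == hyp[j - 1] else 1
               d[i][j] = min(d[i - 1][j - 1] + sub,   # match/substitute
                             d[i - 1][j] + 1,         # delete
                             d[i][j - 1] + 1)         # insert
       return d[len(ref)][len(hyp)]

   # Word error rate for one utterance:
   # wer = word_errors(ref_words, hyp_words) / float(len(ref_words))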