                  November 1992 ARPA CSR Benchmark Tests
                         Corpora and Instructions

              NIST Speech Discs 11-13.1, 11-14.1, and 11-15.1

                        Public Release, June, 1994


* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *
*                                                                         *
* If you intend to implement the protocols for the November '92 ARPA CSR  *
* Benchmark Tests, please read the file, "csrnov92.doc", in the top-level *
* directory of NIST Speech Disc 11-13.1 in its entirety before proceeding *
* and do not examine the included transcriptions, calibration recordings, *
* adaptation recordings, or documentation unless such examination is      *
* specifically permitted in the guidelines for the test(s) being run.     *
* Index files have been included which specify the exact data to be used  *
* for each test. To avoid testing on erroneous data, please refer to      *
* these files when running the tests.                                     *
*                                                                         *
* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *


The test corpora and documentation for the November 1992 ARPA CSR
Benchmark Tests are contained on 3 CD-ROMs: NIST speech discs 11-13.1,
11-14.1, and 11-15.1. Disc 11-13.1 contains the documentation and
language models for the tests. The test corpora and transcriptions are
contained on discs 11-14.1 and 11-15.1.

The top-level directory of Disc 11-13.1 contains the following files and
subdirectories:

  11_13_1.dir    - File containing a listing of all the files and
                   directories on this disc.
  11_13_1.txt    - File containing a summary of the corpora on this disc.
  csrnov92.doc   - File containing an overview of the tests and corpora
                   used to implement the November '92 ARPA CSR Benchmark
                   Tests. This file also describes the directory and
                   file structure of the discs.
  discinfo.dir   - Same contents as file "11_13_1.txt".
  hgrep/         - Utility to search a collated SPHERE header contents
                   file.
  readme.doc     - This file.
  score/         - NIST speech recognition scoring software. Includes
                   dynamic programming string-alignment scoring code and
                   statistical significance tests.
  sphere/        - NIST SPeech HEader REsources toolkit. Provides
                   command-line and programmer interface to NIST-headered
                   speech waveform files. Also provides for automatic
                   decompression of the Shorten-compressed waveform
                   files on these discs.
  wsj0/doc/      - Directory containing documentation, language models,
                   and indices for implementing the November 1992 ARPA
                   CSR tests.

Disc 11-14.1 contains the following files and test corpora
subdirectories:

  11_14_1.dir    - File containing a listing of all the files and
                   directories on this disc.
  11_14_1.txt    - File containing a summary of the corpora on this disc.
  discinfo.dir   - Same contents as file "11_14_1.txt".
  readme.doc     - This file.
  wsj0/si_et_05/ - Speaker-Independent Read 5K Vocabulary Verbal
                   Punctuation and No Verbal Punctuation Test Data
  wsj0/si_et_20/ - Speaker-Independent Read 20K Vocabulary Verbal
                   Punctuation and No Verbal Punctuation Test Data
  wsj0/si_et_ad/ - Speaker-Independent Test Speakers, Read Adaptation
                   Sentences
  wsj0/si_et_jd/ - Speaker-Independent Spontaneous Verbal Punctuation and
                   No Verbal Punctuation Journalist Dictation

Disc 11-15.1 contains the following files and test corpora
subdirectories:

  11_15_1.dir    - File containing a listing of all the files and
                   directories on this disc.
  11_15_1.txt    - File containing a summary of the corpora on this disc.
  discinfo.dir   - Same contents as file "11_15_1.txt".
  readme.doc     - This file.
  wsj0/sd_et_05/ - Speaker-Dependent Read 5K Vocabulary Verbal
                   Punctuation and No Verbal Punctuation Test Data
  wsj0/sd_et_20/ - Speaker-Dependent Read 20K Vocabulary Verbal
                   Punctuation and No Verbal Punctuation Test Data
  wsj0/si_et_jr/ - Speaker-Independent Read version of Spontaneous Verbal
                   Punctuation and No Verbal Punctuation Journalist
                   Dictation

With the exception of the primary documentation file, "csrnov92.doc",
located in the top-level directory of disc 11-13.1, the documentation for
the tests and corpora is located under the "wsj0/doc" directory on disc
11-13.1.

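The disc layout listed above can be restated as a small lookup table, for
example when a script has to decide which disc to mount for a given test
set. The Python sketch below is illustrative only: the names DISC_CONTENTS
and disc_for are hypothetical and are not part of any software on these
discs. Only the directory names are taken from the listings above.

```python
# Hypothetical lookup table that restates the directory listings above.
# Nothing in this block is shipped on the discs.
DISC_CONTENTS = {
    "11-13.1": ["hgrep/", "score/", "sphere/", "wsj0/doc/"],
    "11-14.1": ["wsj0/si_et_05/", "wsj0/si_et_20/",
                "wsj0/si_et_ad/", "wsj0/si_et_jd/"],
    "11-15.1": ["wsj0/sd_et_05/", "wsj0/sd_et_20/", "wsj0/si_et_jr/"],
}


def disc_for(subdir):
    """Return the disc number that holds `subdir`, or None if unlisted."""
    # Accept the name with or without a trailing slash.
    wanted = subdir.rstrip("/") + "/"
    for disc, subdirs in DISC_CONTENTS.items():
        if wanted in subdirs:
            return disc
    return None


if __name__ == "__main__":
    for name in ("wsj0/si_et_20", "wsj0/sd_et_05/", "wsj0/doc"):
        print(name, "is on disc", disc_for(name))
```
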
In addition to online documentation, Disc 11-13.1 contains software
packages useful in processing the speech corpora and tabulating speech
recognition scores.

General information files named "readme.doc" have been included in the
high-level directories and throughout the documentation directory
("wsj0/doc") on Disc 11-13.1 and describe the contents of the
directories.
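
The "score/" package is described above as including dynamic programming
string-alignment scoring code. As a rough illustration of that general
technique, and not a reproduction of the NIST software, the Python sketch
below computes the minimum number of word-level edits between a reference
and a hypothesis transcription. The function names word_errors and
word_error_rate, and the unit cost assigned to each substitution,
deletion, and insertion, are assumptions made for this example; the NIST
scorer may weight, normalize, or report errors differently.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions, and insertions
    needed to turn `reference` into `hypothesis` (unit costs assumed)."""
    ref = reference.split()
    hyp = hypothesis.split()
    # prev[j] holds the cost of aligning the reference words processed
    # so far with the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        # Aligning i reference words with no hypothesis words: i deletions.
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing
            insertion = curr[j - 1] + 1  # extra hypothesis word
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word errors divided by the number of reference words."""
    n_ref = len(reference.split())
    if n_ref == 0:
        raise ValueError("reference transcription is empty")
    return word_errors(reference, hypothesis) / n_ref


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the bat sat down"))
```

Only two rows of the alignment table are kept at a time, which is enough
to obtain the error count. Recovering the alignment itself would require
storing the full table and tracing back through it.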