NIST Speech Discs 13-32.1, 13-33.1
Public Release, May, 1994
If you intend to implement the protocols for the November '93 ARPA CSR Benchmark Tests, please read the file "csrnov93.doc", in the top-level directory of NIST Speech Disc 13-32.1, in its entirety before proceeding. Do not examine the included transcriptions, calibration recordings, adaptation recordings, or documentation unless such examination is specifically permitted in the guidelines for the test(s) being run. Index files have been included which specify the exact data to be used for each test. To avoid testing on erroneous data, please refer to these files when running the tests.
Disc 13-32.1 contains the documentation for the tests and the test corpora for the Spoke 9 tests. With the exception of the primary documentation file, "csrnov93.doc", located in this directory, the documentation for the tests and corpora is located under the "wsj1/doc" directory on Disc 13-32.1. In addition to online documentation, Disc 13-32.1 contains software packages useful in processing the speech corpora and tabulating speech recognition scores.
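As a rough illustration of what processing the NIST-headered waveform files involves, the sketch below reads the plain-text header at the start of a SPHERE file. It is not taken from the software on the disc. It assumes the commonly documented header layout: a "NIST_1A" line, a line giving the header size in bytes, then "name type value" lines terminated by "end_head". The function name parse_sphere_header and the sample field values are hypothetical, and the sketch does not decompress Shorten-compressed waveform data.

```python
import io


def parse_sphere_header(stream):
    """Parse the text header of a NIST SPHERE file from a seekable binary stream.

    Assumed layout (not taken from this disc's documentation): line 1 is
    "NIST_1A", line 2 is the header size in bytes, and the remaining lines are
    "name type value" triples that end at "end_head".
    """
    magic = stream.readline().decode("ascii").strip()
    if magic != "NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(stream.readline().decode("ascii").strip())

    # Re-read the whole header block now that its size is known.
    stream.seek(0)
    raw = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in raw.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) < 3 or line.startswith(";"):
            continue  # skip blank, comment, or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)      # integer field
        elif ftype == "-r":
            fields[name] = float(value)    # real-valued field
        else:
            fields[name] = value           # string field, e.g. "-s3"
    return fields


# Hypothetical header, padded to the declared size, followed by dummy samples.
example = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 16000\n"
    b"speaker_id -s3 4k0\n"
    b"end_head\n"
).ljust(1024, b" ") + b"\x00\x01"

header = parse_sphere_header(io.BytesIO(example))
print(header["sample_rate"], header["speaker_id"])
```

For real work with these discs, the SPHERE toolkit supplied on the disc is the intended interface; the sketch only shows why a header can be inspected without touching the waveform samples.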
The top-level directory of Disc 13-32.1 contains the following files and subdirectories:
13_32_1.dir     File containing a listing of all the files and
                directories on this disc.

13_32_1.txt     File containing a summary of the corpora on this disc.

csrnov93.doc    File containing an overview of the tests and corpora
                to implement the November '93 ARPA CSR Hub and Spoke
                Evaluation Tests. This file also describes the
                directory and file structure of the discs.

discinfo.dir    Same contents as file "13_32_1.txt".

hgrep/          Utility to search a collated SPHERE header contents
                file.

readme.doc      This file.

score/          NIST speech recognition scoring software. Includes
                dynamic programming string-alignment scoring code and
                statistical significance tests.

sphere/         NIST SPeech HEader REsources toolkit. Provides
                command-line and programmer interface to NIST-headered
                speech waveform files. Also provides for automatic
                decompression of the Shorten-compressed waveform files
                on these discs.

tranfilt/       Directory containing a UNIX shell script used to
                perform the post-adjudication transcription filter
                process.

wsj1/           Test corpora for Spoke 9 and documentation.

General information files named "readme.doc" have been included in the high-level directories and throughout the documentation directory ("wsj1/doc") on Disc 13-32.1. They describe the contents of the directories in which they appear.
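The "score/" entry above mentions dynamic programming string-alignment scoring. The following is a minimal sketch of that general technique (a word-level minimum-edit alignment with unit costs), not the NIST scoring software itself, whose costs, tie-breaking, and output format may differ. The function names align_counts and word_error_rate are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for one minimum-edit
    alignment of the hypothesis word list against the reference word list."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # Each cell holds (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j].
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)   # only deletions
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)   # only insertions

    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                best = (t, s, d, n)              # correct word, no cost
            else:
                best = (t + 1, s + 1, d, n)      # substitution
            t, s, d, n = table[i - 1][j]
            best = min(best, (t + 1, s, d + 1, n))   # deletion
            t, s, d, n = table[i][j - 1]
            best = min(best, (t + 1, s, d, n + 1))   # insertion
            table[i][j] = best
    return table[-1][-1][1:]


def word_error_rate(ref, hyp):
    """Word error rate: (subs + dels + ins) / number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(ref, hyp)
    return (subs + dels + ins) / len(ref)


reference = "a b c d".split()
hypothesis = "a x c".split()
print(align_counts(reference, hypothesis))     # (1, 1, 0)
print(word_error_rate(reference, hypothesis))  # 0.5
```

When several alignments share the same minimum cost, the split between substitutions, deletions, and insertions depends on tie-breaking, so only the total error count is guaranteed to agree across implementations.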
data_sum.doc    file containing summary of the data collected for each
                WSJ1 data type in the tests as well as a by-CD-ROM
                listing of each data type;

dot_spec.doc    file containing the WSJ1 detailed orthographic
                transcription specifications;

evl_spok/       directory containing documentation on the collection
                of the test data; includes a collated speaker
                information file, "spkrinfo.txt";

h1newart.ndx    listing of waveforms where article boundaries occur in
                the H1 test corpora;

indices/        directory containing indices of corpora to be used for
                each test and training condition;

lng_modl/       directory containing MIT Lincoln Lab's '87-89 WSJ
                language model, source texts, vocabularies, and text
                selection tools;

nov93_h1/       directory containing the output from, and scored
                results for, the LIMSI November '93 H1-C1 system;

nov93_h2/       directory containing the output from, and scored
                results for, the Cambridge University-HTK November '93
                H2-C1 system;

readme.doc      this file;

s1_oov.lex      listing of out-of-vocabulary lexemes in the Spoke 1
                test corpora with regard to the CSR-WSJ 20K Open
                vocabulary;

s1newart.ndx    listing of waveforms where article boundaries occur in
                the Spoke 1 test corpora;

si_et_93.hdr    collated SPHERE header contents file for all November
                1993 ARPA CSR test corpora;