NIST Speech Discs 11-13.1, 11-14.1, and 11-15.1
Public Release, June, 1994
* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * * * *
*  If you intend to implement the protocols for the November '92 ARPA CSR     *
*  Benchmark Tests, please read the file, "csrnov92.doc", in the top-level    *
*  directory of NIST Speech Disc 11-13.1 in its entirety before proceeding    *
*  and do not examine the included transcriptions, calibration recordings,    *
*  adaptation recordings, or documentation unless such examination is         *
*  specifically permitted in the guidelines for the test(s) being run.        *
*  Index files have been included which specify the exact data to be used     *
*  for each test. To avoid testing on erroneous data, please refer to         *
*  these files when running the tests.                                        *
* * * * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *

The ARPA Continuous Speech Recognition Pilot Corpus (WSJ0) is a 12-disc set containing high-fidelity speech recordings of 123 speakers reading excerpts from the Wall Street Journal; the data were recorded at SRI, TI and MIT.
Each utterance was recorded on two channels: a high-quality "primary" microphone (a head-mounted, noise-cancelling Sennheiser HMD410), and an additional (desk-mounted Crown or other) microphone. These two channels have been stored in separate files and organized in a parallel structure in the CD-ROM directories; six of the discs (11-1 through 11-6) contain the primary channel, and the other six contain the secondary channel. Header information in each waveform file provides details on the recording source and conditions.
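
The header mentioned above is plain ASCII text at the start of each waveform file. As a rough illustration only (Python, not part of the distributed software), the sketch below reads such a header. It assumes the usual SPHERE layout: a "NIST_1A" line, a line giving the header size in bytes, then "name -type value" lines ended by "end_head". The field names and values in SAMPLE are made up for the example, and the sketch does not touch the compressed sample data that follows the header.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header found at the start of a NIST SPHERE file."""
    lines = raw.split(b"\n")
    if len(lines) < 2 or lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())  # total header length in bytes
    text = raw[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore anything that is not "name -type value"
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": a string of N characters
            fields[name] = value
    return fields


# Illustrative header only; the values are invented for this example.
SAMPLE = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 16000\n"
    b"speaker_id -s3 xyz\n"
    b"end_head\n"
).ljust(1024, b" ")

if __name__ == "__main__":
    for key, value in parse_sphere_header(SAMPLE).items():
        print(key, value)
```

For real work, the SPHERE toolkit supplied with the discs remains the supported way to read headers and decompress the waveforms.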
In addition to the waveform data, the discs contain complete orthographic transcriptions of the speech data, and complete bigram language models for the Wall Street Journal text data from which the prompting text was taken. These materials are provided on disc 11-4 (as part of the primary microphone disc set) and also on disc 11-10 (as part of the secondary microphone disc set). Disc 11-4 also contains the complete text of the WSJ articles from which the utterance prompts and language models were derived; this set of text data originally appeared on the ACL/DCI cdrom.
The root directory of each cdrom contains a "dir" text file,
which lists all the file names contained on that disc; there is also
a "discinfo.txt" file (and a duplicate "11_
The waveform data are stored in compressed format, using the
"shorten" compression technique developed by Tony Robinson at
Cambridge University. Source code for decompressing the data and
manipulating the NIST SPHERE headers is available for free via
anonymous ftp. (If you do not have access to ftp data transfer,
please contact the LDC and we will supply the source code by whatever
means necessary.) To get the source code by ftp, use the following
sequence of commands ("%" is the operating system prompt; "ftp>" is
the prompt given by the ftp program):
The test corpora and documentation for the November 1992 ARPA CSR
Benchmark Tests are contained on 3 CD-ROMs: NIST speech discs 11-13.1,
11-14.1, and 11-15.1. Disc 11-13.1 contains the documentation and
language models for the tests. The test corpora and transcriptions are
contained on discs 11-14.1 and 11-15.1.
The top-level directory of Disc 11-13.1 contains the following files
and subdirectories:
General information files named "readme.doc" have been included in
the high-level directories and throughout the documentation directory
("wsj0/doc") on Disc 11-13.1 and describe the contents of the
directories.
This directory contains documentation for the November 1992 ARPA
CSR benchmark tests. The following subdirectories/files are included:
This file contains the speaker information for the CSR-I Training Data:
% ftp jaguar.ncsl.nist.gov
Name: anonymous
Password:
After the tar file contents have been extracted, you can easily find
and follow the directions for compiling the software. Please contact
the LDC if you have any difficulty with the software or data.
The on-line documentation for the test data.
11_13_1.dir  - File containing a listing of all the files and
               directories on this disc.
11_13_1.txt  - File containing a summary of the corpora on this disc.
csrnov92.doc - File containing an overview of the tests and corpora
               used to implement the November '92 ARPA CSR Benchmark
               Tests. This file also describes the directory and file
               structure of the discs.
discinfo.dir - Same contents as file "11_13_1.txt".
hgrep/       - Utility to search a collated SPHERE header contents file.
readme.doc   - This file.
score/       - NIST speech recognition scoring software. Includes
               dynamic programming string-alignment scoring code and
               statistical significance tests.
sphere/      - NIST SPeech HEader REsources toolkit. Provides
               command-line and programmer interfaces to NIST-headered
               speech waveform files. Also provides for automatic
               decompression of the Shorten-compressed waveform files
               on these discs.
wsj0/doc/    - Directory containing documentation, language models, and
               indices for implementing the November 1992 ARPA CSR tests.
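
To give a feel for what "dynamic programming string-alignment scoring" means, here is a minimal sketch in Python. It is not the NIST scoring code in "score/": it assumes a unit cost for every substitution, deletion, and insertion, and it does none of the text normalization or significance testing that the NIST package provides. The function names are hypothetical.

```python
from typing import List, Tuple


def align_counts(ref: List[str], hyp: List[str]) -> Tuple[int, int, int]:
    """Return (substitutions, deletions, insertions) for one
    minimum-cost alignment of a hypothesis against a reference."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # cost[i][j] = (total errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    cost = [[(0, 0, 0, 0)] * cols for _ in range(rows)]
    for i in range(1, rows):
        cost[i][0] = (i, 0, i, 0)  # all reference words deleted
    for j in range(1, cols):
        cost[0][j] = (j, 0, 0, j)  # all hypothesis words inserted

    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = cost[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diagonal = (t, s, d, n)          # correct word
            else:
                diagonal = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = cost[i - 1][j]
            deletion = (t + 1, s, d + 1, n)
            t, s, d, n = cost[i][j - 1]
            insertion = (t + 1, s, d, n + 1)
            # Tuples compare on total errors first, so min() keeps
            # a lowest-cost path.
            cost[i][j] = min(diagonal, deletion, insertion)

    _, subs, dels, ins = cost[-1][-1]
    return subs, dels, ins


def word_error_rate(ref: List[str], hyp: List[str]) -> float:
    """Errors divided by the number of reference words."""
    if not ref:
        raise ValueError("reference must contain at least one word")
    subs, dels, ins = align_counts(ref, hyp)
    return (subs + dels + ins) / len(ref)


if __name__ == "__main__":
    reference = "the cat sat on the mat".split()
    hypothesis = "the cat sat mat".split()
    print(align_counts(reference, hypothesis))
    print(word_error_rate(reference, hypothesis))
```

Results reported for the benchmark tests should come from the NIST software itself, since its alignment costs and scoring conventions may differ from this sketch.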
Disc 11-14.1 contains the following files and test corpora subdirectories:
11_14_1.dir    - File containing a listing of all the files and
                 directories on this disc.
11_14_1.txt    - File containing a summary of the corpora on this disc.
discinfo.dir   - Same contents as file "11_14_1.txt".
readme.doc     - This file.
wsj0/si_et_05/ - Speaker-Independent Read 5K Vocabulary Verbal Punctuation
                 and No Verbal Punctuation Test Data
wsj0/si_et_20/ - Speaker-Independent Read 20K Vocabulary Verbal Punctuation
                 and No Verbal Punctuation Test Data
wsj0/si_et_ad/ - Speaker-Independent Test Speakers, Read Adaptation Sentences
wsj0/si_et_jd/ - Speaker-Independent Spontaneous Verbal Punctuation and
                 No Verbal Punctuation Journalist Dictation
Disc 11-15.1 contains the following files and test corpora subdirectories:
11_15_1.dir    - File containing a listing of all the files and
                 directories on this disc.
11_15_1.txt    - File containing a summary of the corpora on this disc.
discinfo.dir   - Same contents as file "11_15_1.txt".
readme.doc     - This file.
wsj0/sd_et_05/ - Speaker-Dependent Read 5K Vocabulary Verbal Punctuation
                 and No Verbal Punctuation Test Data
wsj0/sd_et_20/ - Speaker-Dependent Read 20K Vocabulary Verbal Punctuation
                 and No Verbal Punctuation Test Data
wsj0/si_et_jr/ - Speaker-Independent Read version of Spontaneous Verbal
                 Punctuation and No Verbal Punctuation Journalist Dictation
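
As a convenience, the test-set directories listed for discs 11-14.1 and 11-15.1 can be turned into a small lookup table. The sketch below is only an illustration in Python; the function name is hypothetical, the example path below the test-set directory is invented, and the index files remain the authority on which data belong to each test.

```python
# Test-set directory -> disc, as listed in this readme.
TEST_SET_DISCS = {
    "si_et_05": "11-14.1",
    "si_et_20": "11-14.1",
    "si_et_ad": "11-14.1",
    "si_et_jd": "11-14.1",
    "sd_et_05": "11-15.1",
    "sd_et_20": "11-15.1",
    "si_et_jr": "11-15.1",
}


def disc_for_path(path: str) -> str:
    """Return the disc that holds a path such as 'wsj0/si_et_05/...'."""
    parts = path.strip("/").lower().split("/")
    if len(parts) < 2 or parts[0] != "wsj0":
        raise ValueError("expected a path starting with 'wsj0/': " + path)
    try:
        return TEST_SET_DISCS[parts[1]]
    except KeyError:
        raise ValueError("unknown test-set directory: " + parts[1]) from None


if __name__ == "__main__":
    # The part of the path after the test-set directory is made up.
    print(disc_for_path("wsj0/si_et_05/example/utterance.wav"))
```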
With the exception of the primary documentation file, "csrnov92.doc",
located in the top-level directory of disc 11-13.1, the documentation
for the tests and corpora is located under the "wsj0/doc" directory
on disc 11-13.1. In addition to online documentation, Disc 11-13.1
contains software packages useful in processing the speech corpora and
tabulating speech recognition scores.
wsj0/doc/indices/readme.doc:
csrnov92.hdr  Collated SPHERE header contents file for all November
              1992 ARPA CSR test corpora.
dot_spec.doc  File containing the WSJ0 detailed orthographic transcription
              specifications.
indices/      Directory containing index files for standard training
              and test sets.
lng_modl/     Directory containing MIT Lincoln Lab's '87-89 WSJ language
              model, source texts, vocabularies, and text selection tools.
readme.doc    This file.
spkrinfo.txt  Test corpora speaker information table.
wsj0-spkr-info.txt:
train_spkrinfo.txt