The ARPA Continuous Speech Recognition Pilot Corpus

November 1992 ARPA CSR Benchmark Tests Corpora and Instructions

NIST Speech Discs 11-13.1, 11-14.1, and 11-15.1

Public Release, June, 1994


* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *
*                                                                         *
* If you intend to implement the protocols for the November '92 ARPA CSR  *
* Benchmark Tests, please read the file, "csrnov92.doc", in the top-level *
* directory of NIST Speech Disc 11-13.1 in its entirety before proceeding *
* and do not examine the included transcriptions, calibration recordings, *
* adaptation recordings, or documentation unless such examination is      *
* specifically permitted in the guidelines for the test(s) being run.     *
* Index files have been included which specify the exact data to be used  *
* for each test.  To avoid testing on erroneous data, please refer to     *
* these files when running the tests.                                     *
*                                                                         *
* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *

The ARPA Continuous Speech Recognition Pilot Corpus (WSJ0) is a 12-disc set containing high-fidelity speech recordings of 123 speakers reading excerpts from the Wall Street Journal; the data were recorded at SRI, TI and MIT.

Each utterance was recorded on two channels: a high-quality "primary" microphone (a head-mounted, noise-cancelling Sennheiser HMD410), and an additional (desk-mounted Crown or other) microphone. These two channels have been stored in separate files and organized in a parallel structure in the CD-ROM directories; six of the discs (11-1 through 11-6) contain the primary channel, and the other six contain the secondary channel. Header information in each waveform file provides details on the recording source and conditions.
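
As a rough illustration of how the header information mentioned above might be read without the SPHERE toolkit, the following Python sketch parses the text header at the start of a waveform file. It assumes the usual NIST SPHERE layout (a "NIST_1A" line, a line giving the header size in bytes, then "name -type value" lines up to "end_head"). The sample header and its field values are made up for the example and are not taken from the discs.

```python
def parse_sphere_header(raw):
    """Parse the ASCII header at the start of a NIST SPHERE file.

    Returns (header_size, fields). Assumes the conventional layout:
    "NIST_1A", a header-size line, "name -type value" lines, "end_head".
    """
    first_lines = raw[:16].split(b"\n")
    if len(first_lines) < 2 or first_lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(first_lines[1].strip())

    text = raw[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # blank or malformed line: skip it
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # "-sN": string of N characters
            fields[name] = value
    return header_size, fields


# Illustrative header only; real files pad the header to header_size bytes.
sample = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 16000\n"
    b"channel_count -i 1\n"
    b"microphone -s10 Sennheiser\n"
    b"end_head\n"
)
size, fields = parse_sphere_header(sample)
print(size, fields)
```

This reads only the header; the sample data that follow it on these discs are still compressed and need the decompression software described below.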

In addition to the waveform data, the discs contain complete orthographic transcriptions of the speech data, and complete bigram language models for the Wall Street Journal text data from which the prompting text was taken. These materials are provided on disc 11-4 (as part of the primary microphone disc set) and also on disc 11-10 (as part of the secondary microphone disc set). Disc 11-4 also contains the complete text of the WSJ articles from which the utterance prompts and language models were derived; this set of text data originally appeared on the ACL/DCI cdrom.

The root directory of each cdrom contains a "dir" text file, which lists all the file names contained on that disc; there is also a "discinfo.txt" file (and a duplicate "11__1.txt" file) which summarizes the type of speech data on the disc, the number of speakers, and the number of utterances (waveform files).
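
One way to cross-check the utterance count reported in "discinfo.txt" is to count the waveform file names in the "dir" listing. The sketch below is only an illustration: it assumes the listing is plain text with file names as whitespace-separated tokens, and that waveform files end in ".wv1" or ".wv2". Both are assumptions that should be checked against an actual disc, and the sample paths are invented.

```python
from collections import Counter


def count_waveforms(listing_text, extensions=(".wv1", ".wv2")):
    """Count file names in a directory listing by waveform extension.

    The extensions are an assumption about the naming scheme; pass a
    different tuple if the discs use other suffixes.
    """
    counts = Counter()
    for token in listing_text.split():
        name = token.lower()
        for ext in extensions:
            if name.endswith(ext):
                counts[ext] += 1
    return counts


# Invented listing for demonstration purposes.
sample_listing = """\
wsj0/example/spk1/utt0001.wv1
wsj0/example/spk1/utt0002.wv1
wsj0/example/spk1/utt0001.wv2
discinfo.txt
"""
print(count_waveforms(sample_listing))
```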

The waveform data are stored in compressed format, using the "shorten" compression technique developed by Tony Robinson at Cambridge University. Source code for decompressing the data and manipulating the NIST SPHERE headers is available for free via anonymous ftp. (If you do not have access to ftp data transfer, please contact the LDC and we will supply the source code by whatever means necessary.) To get the source code by ftp, use the following sequence of commands ("%" is the operating system prompt; "ftp>" is the prompt given by the ftp program):

	% ftp jaguar.ncsl.nist.gov
	Name:  anonymous
	Password:  
	ftp> binary
	ftp> cd pub
	ftp> get sphere_2.0_Beta2.tar.Z
	ftp> bye
	% uncompress sphere_2.0_Beta2.tar.Z
	% tar xf sphere_2.0_Beta2.tar

After the tar file contents have been extracted, follow the included directions for compiling the software. Please contact the LDC if you have any difficulty with the software or data.


The on-line documentation for the test data.

The test corpora and documentation for the November 1992 ARPA CSR Benchmark Tests are contained on three CD-ROMs: NIST speech discs 11-13.1, 11-14.1, and 11-15.1. Disc 11-13.1 contains the documentation and language models for the tests. The test corpora and transcriptions are contained on discs 11-14.1 and 11-15.1.

The top-level directory of Disc 11-13.1 contains the following files and subdirectories:

   11_13_1.dir - File containing a listing of all the files and
                 directories on this disc.

   11_13_1.txt - File containing a summary of the corpora on this disc.

  csrnov92.doc - File containing an overview of the tests and corpora
                 used to implement the November '92 ARPA CSR Benchmark
                 Tests.  This file also describes the directory and file
                 structure of the discs.

  discinfo.dir - Same contents as file "11_13_1.txt".

        hgrep/ - Utility to search a collated SPHERE header contents file.

    readme.doc - This file.

        score/ - NIST speech recognition scoring software.  Includes 
                 dynamic programming string-alignment scoring code and
                 statistical significance tests.

       sphere/ - NIST SPeech HEader REsources toolkit.  Provides a
                 command-line and programmer interface to NIST-headered
                 speech waveform files.  Also provides for automatic
                 decompression of the Shorten-compressed waveform files on
                 these discs.

     wsj0/doc/ - Directory containing documentation, language models, and
                 indices for implementing the November 1992 ARPA CSR tests.
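
To give a feel for what dynamic-programming string-alignment scoring involves, here is a minimal Python sketch that aligns a reference word sequence with a recognizer hypothesis and reports a word error rate. It is not the NIST scoring code in "score/": it assumes unit costs for substitutions, deletions, and insertions, and the NIST software may use different weights, tie-breaking, and text normalization. The example sentences are invented.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-cost
    alignment of two word lists, using unit costs for every error type."""
    rows, cols = len(ref) + 1, len(hyp) + 1
    # table[i][j] = (total_errors, subs, dels, ins) for ref[:i] vs hyp[:j]
    table = [[None] * cols for _ in range(rows)]
    table[0][0] = (0, 0, 0, 0)
    for i in range(1, rows):
        table[i][0] = (i, 0, i, 0)      # all reference words deleted
    for j in range(1, cols):
        table[0][j] = (j, 0, 0, j)      # all hypothesis words inserted
    for i in range(1, rows):
        for j in range(1, cols):
            t, s, d, n = table[i - 1][j - 1]
            if ref[i - 1] == hyp[j - 1]:
                diag = (t, s, d, n)             # match, no cost
            else:
                diag = (t + 1, s + 1, d, n)     # substitution
            t, s, d, n = table[i - 1][j]
            dele = (t + 1, s, d + 1, n)         # deletion
            t, s, d, n = table[i][j - 1]
            ins = (t + 1, s, d, n + 1)          # insertion
            # Lowest total wins; ties fall back to tuple ordering.
            table[i][j] = min(diag, dele, ins)
    return table[-1][-1][1:]


def word_error_rate(ref, hyp):
    """Errors divided by the number of reference words."""
    if not ref:
        raise ValueError("reference must not be empty")
    return sum(align_counts(ref, hyp)) / len(ref)


ref = "the stock market fell sharply today".split()
hyp = "the stock markets fell today again".split()
print(align_counts(ref, hyp), word_error_rate(ref, hyp))
```

The total error count is well defined under unit costs, but the split into substitutions, deletions, and insertions can depend on how ties are broken, which is one reason to use the supplied scoring software for reportable results.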
               
Disc 11-14.1 contains the following files and test corpora subdirectories:

   11_14_1.dir - File containing a listing of all the files and
                 directories on this disc.

   11_14_1.txt - File containing a summary of the corpora on this disc.

  discinfo.dir - Same contents as file "11_14_1.txt".

    readme.doc - This file.

wsj0/si_et_05/ - Speaker-Independent Read 5K Vocabulary Verbal Punctuation and 
                 No Verbal Punctuation Test Data

wsj0/si_et_20/ - Speaker-Independent Read 20K Vocabulary Verbal Punctuation 
                 and No Verbal Punctuation Test Data

wsj0/si_et_ad/ - Speaker-Independent Test Speakers, Read Adaptation Sentences

wsj0/si_et_jd/ - Speaker-Independent Spontaneous Verbal Punctuation and 
                 No Verbal Punctuation Journalist Dictation.

Disc 11-15.1 contains the following files and test corpora subdirectories:

   11_15_1.dir - File containing a listing of all the files and
                 directories on this disc.

   11_15_1.txt - File containing a summary of the corpora on this disc.

  discinfo.dir - Same contents as file "11_15_1.txt".

    readme.doc - This file.

wsj0/sd_et_05/ - Speaker-Dependent Read 5K Vocabulary Verbal Punctuation and 
                 No Verbal Punctuation Test Data

wsj0/sd_et_20/ - Speaker-Dependent Read 20K Vocabulary Verbal Punctuation and 
                 No Verbal Punctuation Test Data

wsj0/si_et_jr/ - Speaker-Independent Read version of Spontaneous Verbal 
                 Punctuation and No Verbal Punctuation Journalist Dictation

With the exception of the primary documentation file, "csrnov92.doc", located in the top-level directory of disc 11-13.1, the documentation for the tests and corpora is located under the "wsj0/doc" directory on disc 11-13.1. In addition to online documentation, Disc 11-13.1 contains software packages useful in processing the speech corpora and tabulating speech recognition scores.

General information files named "readme.doc" have been included in the high-level directories and throughout the documentation directory ("wsj0/doc") on Disc 11-13.1 and describe the contents of the directories.
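
For scripts that need to know which disc holds a given test set, the directory listings above can be collected into a small lookup table. The Python sketch below does this using only the directory names and descriptions given in this file; the function name and the example path are hypothetical, and the layout below each test-set directory is not assumed.

```python
# Test-set directory -> (disc, short description), as listed in this file.
TEST_SETS = {
    "si_et_05": ("11-14.1", "Speaker-Independent Read 5K Vocabulary"),
    "si_et_20": ("11-14.1", "Speaker-Independent Read 20K Vocabulary"),
    "si_et_ad": ("11-14.1", "Speaker-Independent Read Adaptation Sentences"),
    "si_et_jd": ("11-14.1", "Speaker-Independent Spontaneous Journalist Dictation"),
    "sd_et_05": ("11-15.1", "Speaker-Dependent Read 5K Vocabulary"),
    "sd_et_20": ("11-15.1", "Speaker-Dependent Read 20K Vocabulary"),
    "si_et_jr": ("11-15.1", "Speaker-Independent Read Journalist Dictation"),
}


def locate_test_set(path):
    """Return (disc, description) for a path of the form wsj0/<set>/..."""
    parts = path.strip("/").split("/")
    if len(parts) < 2 or parts[0] != "wsj0":
        raise ValueError("expected a path starting with wsj0/<set>")
    try:
        return TEST_SETS[parts[1]]
    except KeyError:
        raise ValueError("unknown test set: " + parts[1])


# Hypothetical path, for illustration only.
print(locate_test_set("wsj0/si_et_05/example.wv1"))
```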


wsj0/doc/indices/readme.doc:

This directory contains documentation for the November 1992 ARPA CSR benchmark tests. The following subdirectories/files are included:

csrnov92.hdr   Collated SPHERE header contents file for all November
               1992 ARPA CSR test corpora.

dot_spec.doc   File containing the WSJ0 detailed orthographic transcription
               specifications.

indices/       Directory containing index files for standard training 
               and test sets.

lng_modl/      Directory containing MIT Lincoln Lab's '87-89 WSJ language
               model, source texts, vocabularies, and text selection tools.
 
readme.doc     This file.

spkrinfo.txt   Test corpora speaker information table.
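
Because the index files in "indices/" define exactly which data belong to each test, a test script would normally read its file list from them rather than scanning directories. The Python sketch below shows one possible reader. The file format it expects (comment lines beginning with ";" and entries of the form "disc:path") is an assumption that should be verified against the actual index files, and the sample entries are invented.

```python
def read_index(text):
    """Parse index-file text into a list of (disc, path) tuples.

    Assumed format (verify against the real files): lines starting with
    ";" are comments; other lines are "disc:path" or just "path".
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        disc, sep, path = line.partition(":")
        if not sep:
            # No disc prefix on this line: keep the whole line as the path.
            disc, path = None, line
        entries.append((disc, path.strip()))
    return entries


# Invented index content, for illustration only.
sample_index = """\
; hypothetical index for a test set
11_14_1:wsj0/si_et_05/spk1/utt0001.wv1
11_14_1:wsj0/si_et_05/spk1/utt0002.wv1

wsj0/si_et_05/spk2/utt0001.wv1
"""
for disc, path in read_index(sample_index):
    print(disc, path)
```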

wsj0-spkr-info.txt:

This file contains the speaker information for the CSR-I Training Data:

train_spkrinfo.txt