November 1992 ARPA Continuous Speech Recognition Benchmark Tests Corpora and Instructions

NIST Speech Discs 11-13.1, 11-14.1, and 11-15.1

Public Release, June 1994

* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *
*                                                                         *
* If you intend to implement the protocols for the November '92 ARPA CSR  *
* Benchmark Tests, please read the file, "csrnov92.doc", in the top-level *
* directory of NIST Speech Disc 11-13.1 in its entirety before proceeding *
* and do not examine the included transcriptions, calibration recordings, *
* adaptation recordings, or documentation unless such examination is      *
* specifically permitted in the guidelines for the test(s) being run.     *
* Index files have been included which specify the exact data to be used  *
* for each test.  To avoid testing on erroneous data, please refer to     *
* these files when running the tests.                                     *
*                                                                         *
* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *


  1. Introduction
  2. November 1992 CSR Test Specifications
  3. CD-ROM Organization
    1. Directory Structure
    2. Test Corpora Summary
    3. Filenaming Formats
    4. Data Types
      1. Waveforms (.wv?)
      2. Detailed Orthographic Transcriptions (.dot)
      3. Lexical SNOR Transcriptions (.lsn)
      4. Prompting Texts (.ptx)
    5. Indices
  4. Baseline Training and Language Model Data
  5. Test Scoring
    1. Preparation of Hypothesized Transcripts
    2. Scoring Results
    3. System Descriptions

1.0 Introduction

This 3-disc set contains the test material and documentation for the November 1992 ARPA Continuous Speech Recognition Wall Street Journal (CSR-WSJ) Benchmark Tests. This material is to be used in conjunction with the WSJ0 training and development test material (NIST speech discs 11-1.1 - 11-12.1), which is available separately. The first disc, Disc 11-13.1, contains the documentation, instructions, and software for implementing the tests. The remaining 2 discs, Discs 11-14.1 and 11-15.1, contain the waveforms, prompts, and transcriptions which comprise the test material.

The Continuous Speech Recognition Wall Street Journal Phase I (CSR-WSJ0) Corpus was designed by the (D)ARPA CSR Corpus Coordinating Committee (CCCC) in 1990-1991 and was collected at the Massachusetts Institute of Technology Laboratory for Computer Science (MIT-LCS), SRI International, and Texas Instruments (TI) in late 1991. A "Dry Run" test was implemented on a set of corpora similar to this collection at the end of 1991 and preliminary results were reported at the February 1992 (D)ARPA Speech and Natural Language Workshop. An official CSR test was then conducted in November 1992 on this test material and results were reported at the January 1993 (D)ARPA Spoken Language Systems Technology Workshop and at the March 1993 ARPA Human Language Technology Workshop.

The test material on these discs supports read 5,000-word and 20,000-word WSJ vocabulary CSR tests as well as tests using spontaneous dictation. In addition, each set of test material is partitioned into utterances with and without verbal punctuation. Each utterance was recorded with two microphones: a Sennheiser close-talking microphone (channel 1) and a secondary microphone of varying type (channel 2).

To minimize the storage requirements for the corpora, the waveforms have been compressed using the SPHERE-embedded "Shorten" lossless compression algorithm which was developed at Cambridge University. The use of "Shorten" has approximately halved the storage requirements for the corpora. The NIST SPeech HEader REsources (SPHERE) software with embedded Shorten compression is included in the top-level directory of disc 11-13.1 in the "sphere" directory and can be used to decompress and manipulate the waveform files.

Disc 1 in the set (NIST speech disc 11-13.1) contains documentation for the corpora on the other two discs as well as the MIT Lincoln Laboratory WSJ '87-89 language models, a collation of the speech waveform file headers and a program to search them, and indices for each suggested test. The NIST speech recognition scoring package and SPHERE toolkit have been included as well in the top-level directory of Disc 11-13.1.

General information files named, "readme.doc", have been included in most high-level directories on Disc 11-13.1 and describe the contents of the directories.

The collection and publication of the test corpora and implementation of the November '92 ARPA CSR Benchmark Tests have been sponsored by the Advanced Research Projects Agency Software and Intelligent Systems Technology Office (ARPA-SISTO) and the Linguistic Data Consortium (LDC). The corpus was designed by the ARPA CCCC. MIT Lincoln Laboratory developed the text selection tools and the WSJ '87-89 language models. The corpus was collected at MIT-LCS, SRI, and TI and produced on CD-ROM by the National Institute of Standards and Technology (NIST). The November '92 ARPA CSR Benchmark Tests were administered by NIST.

2.0 November 1992 CSR Test Specifications

The following is a copy of the final specifications for the November 1992 ARPA CSR benchmark tests as posted in an email message from Francis Kubala, the CCCC chair, on September 10, 1992. Information not pertinent to the data on these discs has been removed.






The majority of the tests to be performed by a number of sites will fall in this area. Common test sets and, in some cases, common problem definitions are strongly encouraged. Below are some of the more salient conditions, with the first three being strongly recommended by the SLS Coordinating Committee:

3.0 CD-ROM Organization

The test corpora for the November 1992 ARPA CSR Benchmark Tests are contained on 2 CD-ROMs (Discs 11-14.1 and 11-15.1). The documentation for the tests is contained on Disc 11-13.1.

The test material on Discs 11-14.1 and 11-15.1 is composed of read CSR Wall Street Journal Pilot (WSJ0) data collected at MIT, SRI, and TI and spontaneous/read dictation data collected at SRI. The data is formatted according to the standard CSR conventions and uses the same file and directory formats as the WSJ0 training data on Discs 11-1.1 - 11-12.1. However, the prompts and transcriptions are included alongside the test waveforms on the two discs, unlike the WSJ0 training data, in which the texts are collated on a single disc separate from the waveforms.

In addition to the online documentation for the tests, Disc 11-13.1 contains software packages useful in processing the speech corpora and tabulating speech recognition scores. The top-level directory of Disc 11-13.1 contains the following major subdirectories:

         hgrep/ Utility to search a collated SPHERE header contents file.
          wsj0/ Test corpora documentation.
         score/ NIST speech recognition scoring software.  Includes 
                dynamic string-alignment scoring code and statistical 
                significance tests.
        sphere/ NIST SPeech HEader REsources toolkit.  Provides command-
                line and programmer interface to NIST-headered speech 
                waveform files.  Also provides for automatic decompression
                of Shorten-compressed WSJ0 waveform files.

General information files named "readme.doc" have been included in each of the high-level directories and throughout the documentation directory ("wsj0/doc") on Disc 11-13.1 and describe the contents of the directories.

Three text files are included in the root directory of each of the 2 discs and describe the contents of the disc. The file "<DISC-ID>.dir" contains a list of all directories and files on the disc. The files "discinfo.txt" and "<DISC-ID>.txt" both contain a high-level description of the corpora on the disc. The static filename "discinfo.txt" is the same across all discs, whereas the variable filename "<DISC-ID>.txt" is unique to each disc; this allows flexibility in using the information.

The following is an example of the contents of one of these sets of files (filenames - discinfo.txt and 11_14_1.txt):

disc_id: 11_14_1
data_types: si_et_05:8:1302, si_et_20:8:1312, si_et_ad:8:640, si_et_jd:8:1280
channels: 1,2

The first field, "disc_id", identifies the disc number. The second field, "data_types", contains entries for subcorpora (directories) separated by commas, with each entry identifying the subcorpus, the number of speakers, and the number of waveforms, separated by colons. The third field, "channel_ids", contains a comma-separated list of the channels contained on the disc. For these corpora, each channel value is "1" (Sennheiser) or "2" (other microphone).
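
The three fields described above can be read with a few lines of code. The following is a minimal sketch in Python, not part of the distributed software; the names "Subcorpus" and "parse_discinfo" are hypothetical, and the sketch assumes one "key: value" pair per line as in the example. Because the example shows the third field as "channels" while the description calls it "channel_ids", the sketch accepts either name.

```python
from dataclasses import dataclass


@dataclass
class Subcorpus:
    name: str       # subcorpus (directory) name, e.g. "si_et_05"
    speakers: int   # number of speakers
    waveforms: int  # number of waveform files


def parse_discinfo(text):
    """Parse the 'key: value' lines of a disc descriptor into a dict."""
    info = {}
    for line in text.splitlines():
        if ":" not in line:
            continue  # skip blank or malformed lines
        key, value = line.split(":", 1)
        info[key.strip()] = value.strip()

    # Expand each "name:speakers:waveforms" entry of data_types.
    subcorpora = []
    for entry in info.get("data_types", "").split(","):
        entry = entry.strip()
        if entry:
            name, speakers, waveforms = entry.split(":")
            subcorpora.append(Subcorpus(name, int(speakers), int(waveforms)))
    info["data_types"] = subcorpora

    # Assumption: the channel field may be named "channels" or "channel_ids".
    channels = info.get("channels", info.get("channel_ids", ""))
    info["channel_list"] = [c.strip() for c in channels.split(",") if c.strip()]
    return info
```

For the example above, parse_discinfo() would report four subcorpora with 8 speakers each and the channel list ["1", "2"].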

3.1 Directory Structure

The following depicts the directory structure of the corpora on Discs 11-14.1 and 11-15.1:

top level: wsj0/     (Phase 1 (pilot) corpus)

2nd level: 
  /       si_et_05/  (Speaker-Independent, 5K Vocabulary, Read WSJ test data)
Disc      si_et_20/  (Speaker-Independent, 20K Vocabulary, Read WSJ test data)
11-14.1   si_et_ad/  (Speaker-Independent, Adaptation Utterances)
  \       si_et_jd/  (Speaker-Independent, Spontaneous Journalist Dictation
                      test data)

  /       sd_et_05/  (Speaker-Dependent, 5K Vocabulary, Read WSJ test data)
Disc      sd_et_20/  (Speaker-Dependent, 20K Vocabulary, Read WSJ test data)
11-15.1   si_et_jr/  (Speaker-Independent, Read version of spontaneous from
  \                   "si_et_jd" on Disc 11-14.1).

speaker level:  <XXX>/  (speaker-ID, where XXX = "001" to "zzz", base 36)
data level:  <FILES>  (corpora files, see below for format and types)

3.2 Test Corpora Summary

Disc    Directory  Utts  Files Data set(s)
------  ---------  ----  ----- ----------------------------------------------
11-14.1 si_et_05/   330   660  SI 5K read NVP (8 spkrs X ~40 utts) Senn/2nd
                    321   642  SI 5K read VP (8 spkrs X ~40 utts) Senn/2nd
                    ---  ----
                    651  1302  (total in directory)

        si_et_20/   333   666  SI 20K read NVP (8 spkrs X ~40 utts) Senn/2nd
                    323   646  SI 20K read VP (8 spkrs X ~40 utts) Senn/2nd
                    ---  ----
                    656  1312  (total in directory)

        si_et_ad/   320   640  SI adapt. sents. (8 spkrs X ~40 utts) Senn/2nd
                               Note: this is not test data

        si_et_jd/   320   640  SI spon. NVP (8 spkrs X ~40 utts) Senn/2nd
                    320   640  SI spon. VP (8 spkrs X ~40 utts) Senn/2nd
                    ---  ----
                    640  1280  (total in directory)


11-15.1 sd_et_05/   310   620  SD 5K read NVP (12 spkrs X ~25 utts) Senn/2nd
                    300   600  SD 5K read VP (12 spkrs X ~25 utts) Senn/2nd
                    ---  ----
                    610  1220  (total in directory)

        sd_et_20/   312   624  SD 20K read NVP (12 spkrs X ~25 utts) Senn/2nd
                    308   616  SD 20K read VP (12 spkrs X ~25 utts) Senn/2nd
                    ---  ----
                    620  1240  (total in directory)

        si_et_jr/   320   640  SI read/spon. NVP (8 spkrs X ~40 utts) Senn/2nd
                    320   640  SI read/spon. VP (8 spkrs X ~40 utts) Senn/2nd
                    ---  ----
                    640  1280  (total in directory)

Note: The corpora for the November 1992 CSR tests were evenly split between including and not including verbal punctuation. "VP" indicates that the data contains verbal punctuation; "NVP" indicates that the data does not contain verbal punctuation. The verbal punctuation mode is indicated in the "speech type" code in the filenames (see below).

3.3 Filenaming Formats

The filenames and filetypes follow standard CSR WSJ conventions. Data types are differentiated by unique filename extensions. All files associated with the same utterance have the same basename. All filenames are unique across all WSJ corpora. Speech waveform (.wv[1-2]) files are utterance-level files, whereas prompt (.ptx) and transcription (.dot and .lsn) files are session-level files and therefore contain texts for multiple waveform files. The filename format is as follows:

     <SSS><T><EE><UU><XXX>

     where:

          SSS ::= 001 | ... | zzz (base-36 speaker ID)
          T ::= (speech type code)
                    c (Common read no verbal punctuation) |
                    s (Spontaneous no/unspecified verbal punctuation) |
                    a (Adaptation read) |
                    r (Read version of spontaneous no verbal punctuation) |
                    o (cOmmon read with verbal punctuation) |
                    p (sPontaneous with verbal punctuation) |
                    e (rEad version of spontaneous with verbal punctuation)

          EE ::= 01 | ... | zz (base-36 session ID)
          UU ::= 01 | ... | zz (base-36 within-session sequential speaker
                                utterance code - always "00" for .ptx, .dot 
                                and .lsn session-level files)

          XXX ::= (data type)

               .wv1 (channel 1 - Sennheiser waveform)
               .wv2 (channel 2 - Other mic waveform)

               .ptx (prompting text for read material)
               .dot (detailed orthographic transcription)
               .lsn (Lexical SNOR transcription derived from .dot)
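
The filename fields listed above can be separated mechanically. The following Python sketch is illustrative only and is not part of the distributed software; the names "CsrName" and "parse_csr_filename" are hypothetical. It assumes lowercase filenames consisting of the 8-character SSSTEEUU basename followed by one of the listed extensions, and it decodes the session and utterance codes as base-36 integers.

```python
import re
from typing import NamedTuple

# Speech type codes as listed in the filename format above.
SPEECH_TYPES = {
    "c": "common read, no verbal punctuation",
    "s": "spontaneous, no/unspecified verbal punctuation",
    "a": "adaptation read",
    "r": "read version of spontaneous, no verbal punctuation",
    "o": "common read, with verbal punctuation",
    "p": "spontaneous, with verbal punctuation",
    "e": "read version of spontaneous, with verbal punctuation",
}

FILENAME_RE = re.compile(
    r"^(?P<speaker>[0-9a-z]{3})(?P<type>[csaroep])"
    r"(?P<session>[0-9a-z]{2})(?P<utt>[0-9a-z]{2})"
    r"\.(?P<ext>wv1|wv2|ptx|dot|lsn)$"
)


class CsrName(NamedTuple):
    speaker: str      # 3-character base-36 speaker ID, kept as text
    speech_type: str  # one-letter speech type code
    session: int      # session ID decoded from base 36
    utterance: int    # utterance code decoded from base 36 (0 = session file)
    ext: str          # data type extension without the leading dot


def parse_csr_filename(name):
    """Split a filename such as '4oac0201.wv1' into its fields."""
    m = FILENAME_RE.match(name.lower())
    if m is None:
        raise ValueError(f"not a CSR WSJ filename: {name!r}")
    return CsrName(m["speaker"], m["type"],
                   int(m["session"], 36), int(m["utt"], 36), m["ext"])
```

For example, parse_csr_filename("4oac0201.wv1") yields speaker "4oa", speech type "c", session 2, utterance 1, and extension "wv1".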

3.4 Data Types

3.4.1 Waveforms (.wv?)

The waveforms are SPHERE-headered, digitized, and compressed using the lossless Cambridge University "Shorten" algorithm under SPHERE. Version 2.1 of SPHERE is included on Disc 11-13.1 and permits the waveform files to be decompressed automatically as they are accessed. See the files under the "/sphere" directory on Disc 11-13.1.

The filename extension for the waveforms contains the characters, "wv", followed by a 1-character code to identify the channel. The headers contain the following fields/types:

Field                    Type     Description - Probable defaults marked in ()
-----------------------  -------  ---------------------------------------------
microphone               string   microphone description ("Sennheiser HMD410",
                                  "Crown PCC160", etc.)
recording_site           string   recording site ("MIT","SRI","TI")
database_id              string   database (corpus) identifier ("wsj0")
database_version         string   database (corpus) revision ("1.0")
recording_environment    string   text description of recording environment
speaker_session_number   string   2-char. base-36 session ID from filename
session_utterance_number string   2-char. base-36 utterance number within 
                                  session from the filename
prompt_id                string   WSJ source sentence text ID - see .ptx
                                  description below for format (only in read
                                  data)
utterance_id             string   utterance ID from filename of the form
                                  SSSTEEUU as described in the filename
                                  section above.
speaking_mode            string   speaking mode ("spontaneous","read-common",
                                  "read-adaptation", etc.)  
speaker_id               string   3-char. speaker ID from filename
sample_count             integer  number of samples in waveform
sample_min               integer  minimum sample value in waveform
sample_max               integer  maximum sample value in waveform
sample_checksum          integer  checksum obtained by the addition of all
                                  (uncompressed) samples into an unsigned 
                                  16-bit (short) and discarding overflow.  
recording_date           string   beginning of recording date stamp of the
                                  form DD-MMM-YYYY.  
recording_time           string   beginning of recording time stamp of the
                                  form HH:MM:SS.HH.  
channel_count            integer  number of channels in waveform ("1")
sample_rate              integer  waveform sampling rate ("16000")
sample_n_bytes           integer  number of bytes per sample ("2")
sample_byte_format       string   byte order (MSB/LSB -> "10", LSB/MSB -> "01")
sample_sig_bits          integer  number of significant bits in each sample
sample_coding            string   waveform encoding ("pcm,embedded-shorten-v1.09")
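
The "sample_checksum" field can be recomputed from the decompressed samples as a consistency check. The following Python sketch follows the description in the table above (sum of all samples, kept to 16 bits); the function name "sample_checksum" is hypothetical, and the sketch assumes its input is raw, already-decompressed 2-byte PCM data with the header removed. It does not read SPHERE headers or undo Shorten compression; the SPHERE toolkit on Disc 11-13.1 is the intended tool for that.

```python
import struct


def sample_checksum(pcm_bytes, sample_byte_format="01"):
    """Add all 16-bit samples and keep the low 16 bits of the sum.

    sample_byte_format follows the header convention described above:
    "01" is LSB/MSB (little-endian), "10" is MSB/LSB (big-endian).
    """
    order = "<" if sample_byte_format == "01" else ">"
    count = len(pcm_bytes) // 2  # two bytes per sample
    samples = struct.unpack(f"{order}{count}h", pcm_bytes[: count * 2])
    # Masking with 0xFFFF discards overflow beyond an unsigned 16-bit value.
    return sum(samples) & 0xFFFF
```

A mismatch between this value and the header field would suggest that the file was read with the wrong byte order or was not fully decompressed.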

3.4.2 Detailed Orthographic Transcriptions (.dot)

A detailed orthographic transcription (.dot) containing lexical and non-lexical elements has been generated for each utterance. The specifications for the format of the detailed orthographic transcriptions are located in the file, "dot_spec.doc", under the "/wsj0/doc" directory on Disc 11-13.1.

The transcriptions for all utterances in a session are concatenated into a single file of the form, "<SSS><T><EE>00.dot", and each transcription includes a corresponding utterance-ID code. The format for a single utterance transcription entry in this file is as follows:

Speculation in Tokyo was that the yen could rise because of the
realignment (4oac0201)

(new-line added for readability)

There is one ".dot" file for each speaker-session.

3.4.3 Lexical SNOR Transcriptions (.lsn)

The lexical Standard Normal Orthographic Representation (lexical SNOR) (.lsn) transcriptions are word-level transcriptions derived from the ".dot" transcriptions with capitalization, non-speech markers, prosodic markings, fragments, and "\" character escapes filtered out.

The .lsn transcriptions are of the same form as the .dot transcriptions and are identified by a ".lsn" filename extension.



(new-line added for readability)

There is one ".lsn" file for each speaker-session.

3.4.4 Prompting Texts (.ptx)

The prompting texts for all read Wall Street Journal utterances in a session including the utterances' utterance-IDs and prompt IDs are concatenated into a single file of the form, "<SSS><T><EE>00.ptx". The prompt ID is Doug Paul's Wall Street Journal sentence index. The format for this index is:

The format for a single prompting text entry in the .ptx file is as follows:

Speculation in Tokyo was that the yen could rise because of the
realignment. (4oac0201 87.051.870113-0174.6.1)

(new-line added for readability)

The inclusion of both the utterance ID and prompt ID allows the utterance to be mapped back to its source sentence text and surrounding paragraph.

There is one .ptx file for each read speaker-session.
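
Entries in the session-level text files end with a parenthesized ID field: the utterance ID alone for transcriptions, or the utterance ID followed by the prompt ID for prompting texts. The following Python sketch splits such an entry; the name "parse_entry" is hypothetical, and the sketch assumes that each entry occupies a single line in the file, as the "new-line added for readability" notes above suggest.

```python
import re

# Text, then a final parenthesized field holding one or two IDs.
ENTRY_RE = re.compile(r"^(?P<text>.*\S)\s*\((?P<ids>[^()]*)\)\s*$")


def parse_entry(line):
    """Return (text, utterance_id, prompt_id) for one entry line.

    prompt_id is None when the entry carries only an utterance ID,
    as in the transcription example above.
    """
    m = ENTRY_RE.match(line)
    if m is None:
        raise ValueError(f"no trailing ID field: {line!r}")
    ids = m["ids"].split()
    if not ids:
        raise ValueError(f"empty ID field: {line!r}")
    prompt_id = ids[1] if len(ids) > 1 else None
    return m["text"], ids[0], prompt_id
```

Applied to the prompting text example above, this returns the sentence text, the utterance ID "4oac0201", and the prompt ID "87.051.870113-0174.6.1".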

3.5 Indices

Index files have been built for each of the suggested test sets. The files are located in the "wsj0/doc/indices/test" directory on Disc 11-13.1 and are named so as to clearly indicate the tests they pertain to. Index files for baseline training conditions have also been included in the directory, "wsj0/doc/indices/train".

Each index file contains a header which describes its contents. Header lines are preceded by ";;". Each line following the header indicates the disc, path, and waveform file for an utterance in the test set (e.g., "11_14_1:wsj0/si_et_05/440/440c0201"). Note that auxiliary files such as adaptation utterances which are not part of the test set are not included in the indices.

The .wv[1-2] extension has not been included in the test indices so that the indices can be used for testing on either channel/microphone. However, because the training corpora for the different channels/microphones exist on different CD-ROMs (unlike the test corpora), an index has been built for each channel for each training condition.
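
A test index can be turned into a list of waveform files for one channel as follows. This Python sketch is illustrative; the name "read_test_index" is hypothetical, and the sketch assumes entries of the form "disc:path" exactly as in the example above, with header lines beginning with ";;".

```python
def read_test_index(lines, channel=1):
    """Yield (disc_id, waveform_path) pairs from the lines of a test index.

    The channel number (1 or 2) selects the .wv1 or .wv2 extension, since
    the test indices omit the extension.
    """
    if channel not in (1, 2):
        raise ValueError("channel must be 1 or 2")
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";;"):
            continue  # skip blank lines and header lines
        disc, path = line.split(":", 1)
        yield disc.strip(), f"{path.strip()}.wv{channel}"
```

For the example entry above and channel 2, this yields the disc ID "11_14_1" and the path "wsj0/si_et_05/440/440c0201.wv2".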

4.0 Baseline Training and Language Model Data

Some of the tests in the CCCC test specifications in Section 2.0 call for the use of "standard" baseline training sets. These training sets are drawn from the CSR WSJ0 pilot corpus training data on NIST speech discs 11-1.1 - 11-12.1. Indices have been developed for the following baseline training sets and are located in the subdirectories under the directory, "wsj0/doc/indices/train", on Disc 11-13.1:

wsj0/train: Indices for the WSJ0 (~7,200-utterance) Sennheiser training sets

tr_l_wv1.ndx    WSJ0 SD/SI-long term training, Sennheiser mic
tr_l_wv2.ndx    WSJ0 SD/SI-long term training, Secondary mic
tr_s_wv1.ndx    WSJ0 SI-short term training, Sennheiser mic
tr_s_wv2.ndx    WSJ0 SI-short term training, Secondary mic
tr_v_wv1.ndx    WSJ0 Longitudinal-SD/SI-very-long training, Sennheiser mic
tr_v_wv2.ndx    WSJ0 Longitudinal-SD/SI-very-long training, Secondary mic

Some of the tests in the CCCC test specifications in Section 2.0 call for the use of "standard" baseline bigram or trigram language models. The baseline language models were developed by MIT Lincoln Laboratory and are included in the directory, "wsj0/doc/lng_modl/base_lm", on Disc 11-13.1. The baseline language models are as follows:

        bcb05cnp.z  5K closed NVP bigram LM
        bcb05cvp.z  5K closed VP bigram LM
        bcb05onp.z  5K open NVP bigram LM
        bcb05ovp.z  5K open VP bigram LM
        bcb20cnp.z  20K closed NVP bigram LM
        bcb20cvp.z  20K closed VP bigram LM
        bcb20onp.z  20K open NVP bigram LM
        bcb20ovp.z  20K open VP bigram LM
        tb05cnp.z   5K closed NVP trigram LM
        tb20onp.z   20K open NVP trigram LM

To conserve disc space, the language model files have been compressed using the standard UNIX "compress" utility and must be decompressed to be used.

5.0 Test Scoring

This section describes the process used by NIST in scoring the November 1992 WSJ/CSR Benchmark Tests. The information in this section can also be used by those who wish to duplicate the scoring methodology used.

For a complete description of the NIST scoring package and its use, see the file, "score/doc/score.rdm", on Disc 11-13.1.

5.1 Preparation of Hypothesized Transcripts

The system-generated hypothesized transcripts were formatted by the test sites according to the Lexical SNOR (LSN) format used by the scoring package. See Section 3.4.3 for a description of the LSN format.

5.3 Scoring Results

In the November 1992 tests, the hypothesized transcriptions were scored using the standard NIST scoring package. The scoring package has been included on Disc 11-13.1 in the top-level directory, "score". The directory contains a "readme.doc" file with compilation and installation instructions.

In order to score a hypothesis transcription against a reference transcription using the NIST scoring software, an "alignment" file must be created. The alignment file contains pairs of hypothesis and reference strings which have been aligned using a DP string alignment procedure. The format for using the "align" utility is:

        <INSTALL_DIR>/bin/align -cfg <CONFIG_FILE> -hyp <HYP_FILE>
                                -outfile <ALIGNMENTS>

                where:  INSTALL_DIR is the pathname to the compiled "score"
                        directory.

                        CONFIG_FILE contains a list of arguments to the
                        scoring software including the filespec for the
                        reference transcription file, lexicon file, and
                        command line switches.

                        HYP_FILE contains the LSN-formatted hypothesis
                        transcriptions.

                        ALIGNMENTS contains the output alignments.

Example where "score" is located under the current directory:

        ./score/bin/align -cfg ./score/lib/wsj.cfg -hyp site.hyp
                          -outfile site.ali

The actual tabulation of the scores is generated using the above alignment file as input. The "score" program with the "-ovrall" switch creates a by-speaker summary table of the error rates (insertions, deletions, etc.). The format for using the "score" utility is:

        <INSTALL_DIR>/bin/score -cfg <CONFIG_FILE> -align <ALIGNMENTS>
                                <REPORT_OPTIONS>

                where:  INSTALL_DIR is the pathname to the compiled "score"
                        directory.

                        CONFIG_FILE contains a list of arguments to the
                        scoring software including the filespec for the
                        reference transcription file, lexicon file, and
                        command line switches.

                        ALIGNMENTS contains the alignments output by
                        "align".

                        REPORT_OPTIONS are switches to generate different
                        reports.

Example using the example output from "align" above, where "score" is located under the current directory:

        ./score/bin/score -cfg ./score/lib/wsj.cfg -align site.ali -ovrall

Note: The "score" program can produce several other reports of greater or lesser detail using other command line switches. See the manual page for "score" for a description of its other uses.
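
To illustrate what a DP string alignment computes, the following Python sketch counts substitutions, deletions, and insertions between a reference and a hypothesis word string. It is a generic minimum-edit-distance alignment with unit costs, written for illustration only; it is not the NIST "align" or "score" code and may differ from the package in weighting, tie-breaking, and text handling. The names "align_counts" and "word_error_rate" are hypothetical.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for one utterance."""
    r, h = ref.split(), hyp.split()
    # Each cell holds (total_errors, subs, dels, ins) for r[:i] versus h[:j].
    prev = [(j, 0, 0, j) for j in range(len(h) + 1)]  # empty reference
    for i in range(1, len(r) + 1):
        cur = [(i, 0, i, 0)]  # empty hypothesis: all reference words deleted
        for j in range(1, len(h) + 1):
            e, s, d, n = prev[j - 1]
            if r[i - 1] == h[j - 1]:
                best = (e, s, d, n)            # correct word
            else:
                best = (e + 1, s + 1, d, n)    # substitution
            e, s, d, n = prev[j]
            best = min(best, (e + 1, s, d + 1, n))  # deletion
            e, s, d, n = cur[j - 1]
            best = min(best, (e + 1, s, d, n + 1))  # insertion
            cur.append(best)
        prev = cur
    _, subs, dels, ins = prev[-1]
    return subs, dels, ins


def word_error_rate(pairs):
    """Total errors divided by total reference words over (ref, hyp) pairs."""
    errors = words = 0
    for ref, hyp in pairs:
        errors += sum(align_counts(ref, hyp))
        words += len(ref.split())
    return errors / words
```

The inputs are assumed to be LSN-style word strings with the utterance ID already removed.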

5.4 System Descriptions

As part of the November 1992 CSR Tests, each test site was required to generate a description of the systems used in each test according to a prescribed format. If you intend to publish results using this test material, you should provide such a system description along with your results. The format for the system description is as follows: