Work in progress.

November 1993 ARPA Continuous Speech Recognition

Hub and Spoke Benchmark Tests Corpora and Instructions

NIST Speech Discs 13-32.1, 13-33.1

Public Release, May, 1994

* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *
*                                                                         *
* If you intend to implement the protocols for the ARPA November '93 CSR  *
* Benchmark Tests, please read this document in its entirety before       *
* proceeding and do not examine the included transcriptions, calibration  *
* recordings, adaptation recordings, or documentation unless such         *
* examination is specifically permitted in the guidelines for the test(s) *
* being run.  Index files have been included which specify the exact data *
* to be used for each test.  To avoid testing on erroneous data, please   *
* refer to these files when running the tests.                            *
*                                                                         *
* * * * * * * * * * * * * * * W A R N I N G * * * * * * * * * * * * * * * *

Contents

  1. Introduction
  2. Hub and Spoke Test Specifications
  3. CD-ROM Organization
    1. Directory Structure
    2. Filenaming Formats
    3. Data Types
      1. Waveforms (.wv?)
      2. Detailed Orthographic Transcriptions (.dot)
      3. Lexical SNOR Transcriptions (.lsn)
      4. Prompting Texts (.ptx)
      5. SPeech Quality Assurance Reports (.spq)
    4. Online Documentation
    5. Indices
  4. Hub and Spoke Test Data Specifications
  5. November 1993 CSR Test Overview
    1. Test Data Distribution
    2. Test Protocols
    3. Initial Scoring
    4. Adjudication
    5. Final Scoring
    6. LIMSI and CU-HTK November 1993 Sample Output and Scores
  6. Baseline Training and Language Model Data
  7. Test Scoring
    1. Preparation of Hypothesized Transcripts
    2. Transcription Prefiltering
    3. Scoring Results
    4. System Descriptions

1.0 Introduction

This 2-disc set contains the test material for the November 1993 ARPA Continuous Speech Recognition (CSR) Hub and Spoke Benchmark Tests. This material is to be used in conjunction with the WSJ1 training and development test material (NIST speech discs 13-1.1 - 13-31.1, and 13-34.1) which is available separately. These two discs contain the waveforms, prompts, transcriptions, software, and documentation required to implement the tests.

In early 1993, the ARPA CSR Corpus Coordinating Committee (CCCC) designed a "Hub and Spoke" test paradigm. The resulting suite of tests contained 2 general "hub" tests to assess the performance of large (5K vocabulary) and very large (64K vocabulary) speaker-independent continuous speech recognition systems and 9 "spoke" tests to assess the performance of systems designed to address specific areas of research in continuous speech recognition. Contrastive tests were also designed for each Hub and Spoke test. There are 25 contrastive tests for the 11 Hub and Spoke tests.

Speech data to support the similarly-designed development test and evaluation test suites was collected in mid-1993. The Hub and Spoke evaluation test suite speech corpora consist of approximately 7,500 waveforms (~11 hours of speech). To minimize the storage requirements for the corpora, the waveforms have been compressed using the SPHERE-embedded "Shorten" lossless compression algorithm, which was developed at Cambridge University. The use of "Shorten" has approximately halved the storage requirements for the corpora. The NIST SPeech HEader REsources (SPHERE) software with embedded Shorten compression is included in the "sphere" directory at the top level of disc 13-32.1 and can be used to decompress and manipulate the waveform files.

Disc 1 in the set (NIST speech disc 13-32.1) contains all of the available online documentation for the corpora on the two discs as well as the MIT Lincoln Laboratory WSJ '87-89 language models, a collation of the speech waveform file headers and a program to search them, and indices for the WSJ1 training corpora and for each Hub and Spoke test set. The NIST speech recognition scoring package and SPHERE toolkit have been included as well in the top-level directory of Disc 13-32.1.

General information files named "readme.doc" have been included in most high-level directories on Disc 13-32.1 and describe the contents of those directories.

The collection and publication of the test corpora and the implementation of the November '93 ARPA CSR Benchmark Tests have been sponsored by the Advanced Research Projects Agency Software and Intelligent Systems Technology Office (ARPA-SISTO) and the Linguistic Data Consortium (LDC). The Hub and Spoke Test paradigm and the form of the associated corpora were designed by the ARPA Continuous Speech Recognition Corpus Coordinating Committee (CCCC). MIT Lincoln Laboratory developed the text selection tools and the WSJ '87-89 language models. The corpora were collected at SRI International and produced on CD-ROM by the National Institute of Standards and Technology (NIST), and the November '93 ARPA CSR Benchmark Tests were administered by NIST.

2.0 Hub and Spoke Test Specifications

The ARPA CSR Corpus Coordinating Committee (CCCC) designed a "Hub and Spoke" test paradigm which consists of general "hub" core tests and optional "spoke" tests to probe specific areas of research interest and/or difficulty.

Two "hub" test sets were designed and speech data was collected for them:

  1. 64,000-word lexicon WSJ read baseline (Sennheiser mic)
  2. 5,000-word lexicon WSJ read baseline (Sennheiser mic)
Nine "spoke" test sets were designed and speech data was collected for them:
  1. Language model adaptation (Sennheiser mic)
  2. Domain-independence (Sennheiser mic)
  3. SI Recognition Outliers - non-native speakers (Sennheiser mic)
  4. Incremental speaker adaptation (Sennheiser mic)
  5. Microphone independence (Sennheiser + Second mic of unknown varying type)
  6. Known alternate microphone (Sennheiser + Audio Technica/telephone)
  7. Noisy environments (Sennheiser + Audio Technica/telephone)
  8. Calibrated noise sources (Sennheiser + Audio Technica)
  9. Spontaneous WSJ-style dictation (Sennheiser mic)
Each Hub and Spoke test contains a primary test and one or more contrastive tests. A set of test corpora exists for each primary and contrastive test. Indices for each test set have been created to indicate the location of the test data on disc. The indices are located in the "wsj1/doc/indices" directory on Disc 13-32.1.

Prior to the design of the November '93 Hub and Spoke Evaluation Test suite, the CCCC also designed a comparable Hub and Spoke Development Test suite (which is available with the training portion of WSJ1.) Note, however, that some tests were modified in creating the evaluation test suite. Therefore, there is not a one-to-one match in all tests between the development test suite and the evaluation test suite. In some tests, parameters were changed, and in some cases, contrastive tests were dropped or added.

The following is a copy of the CCCC Hub and Spoke Test Specifications for the November '93 ARPA CSR benchmark tests. It has been edited to remove details pertaining only to the November 1993 evaluation and has been annotated to indicate the appropriate test data index file for each test.


Specification for the 1993 CSR Evaluation -- Hub and Spoke Paradigm.

Rev 14: 10-21-93

MOTIVATION

This evaluation proposal attempts to accommodate research over a broad variety of important problems in CSR, to maintain a clear program-wide focus, and to extract as much information as possible from the results. It consists of a compact 'Hub' test, on which every participating site evaluates, and a variety of problem-specific 'Spoke' tests, which are run at the discretion of the sites ("different spokes for different folks").

Speaker Sets

Speakers are balanced for gender in each dataset below. In total, there will be 30 different speakers used for this evaluation -- 10 for S3 (SI Recognition Outliers), 10 for S9 (Spontaneous WSJ Dictation), and 10 for all the rest of the test and rapid enrollment data. These speaker sets are labeled A (test), B (outliers), and C (spontaneous) below. An additional set, labeled D, consists of speakers from the devtest dataset and is to be used for microphone adaptation in Spokes S6, S7, and S8.

Terminology

A 'session' implies that the speaker, microphone, and acoustic environment remain constant for a group of utterances.

A 'static SI' test does not offer session boundaries or utterance order as side information to the system, and therefore implies that the speaker, microphone, and environment may change from utterance to utterance. Functionally, it implies that each utterance must be recognized independently of all others, yielding the same answers for any utterance order (or the same expectation, in the case of a non-deterministic recognizer).

'Unsupervised incremental adaptation' means that the system is allowed to use any information it can extract from test data that it has already recognized. It implies that session boundaries and utterance order are known to the system as side information.

'Supervised incremental adaptation' means that the correct transcription is made available to the system after each utterance has been recognized.

'Transcription-mode adaptation' means that the speech from an entire test session is available to the system prior to recognition. It is a non-causal and unsupervised mode.

Training Data

Unless otherwise noted, there are no restrictions on acoustic or language model training data.

Privately acquired data may be used as long as it can be made available to the LDC in a form suitable for publication as a LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if ARPA elects to have it published. Delivery of the data to LDC may be done after the evaluation, as long as it is done in a timely fashion.

Use of Calibration Waveforms

Use of all calibration waveforms from the devtest is allowed. However, the calibration data for the evaluation test cannot be used.

Default Side Information

These are the defaults for side information given to the system: Speaking style is always known. General environment conditions (quiet or noisy) are always known. Microphone identity is known unless noted otherwise.

Speaker gender is always unknown. Specific environment conditions (room code) are always unknown. Session boundaries and utterance order are unknown unless noted otherwise.

This implies that static SI conditions are the default.

Finally, unless an exception is noted, supervised adaptation and transcription-mode adaptation are not allowed, except as optional contrasts in addition to the required runs.

Legend

Each test below consists of a primary condition and a variable number of required and optional contrast conditions. P0 indicates the primary test, CX indicates a contrastive one (X = 1,2,3...). (req) indicates a required condition, (opt) an optional one.
TEST SIZE indicates the total number of utterances that need to be run to complete the required portion of a test.
SIDE INFO, where present, indicates only the changes to the default side information conditions given above.
METRICS, where present, indicates that a measure other than just the standard overall word error rate is to be used.
SIG TESTS indicates the pair-wise comparisons to be used in the standard set of statistical significance tests performed by NIST.

Statistical Significance Tests

The standard set of significance tests can be performed across all systems for the controlled HX-C1 tests. This is the only formal comparison possible across all systems. The same tests can be made where appropriate, within each system, comparing each CX contrast test to the P0 primary test.

System Descriptions

Sites should publish a system description for the results published for each system run on any test.

Multiple Systems on a Single Test

In order to discourage the running of several systems on a single test to improve one's chances of scoring well, sites must designate one of them as the primary one if more than one system is run on a single test. The designation is to be made before looking at any results. Results must be reported for all systems run on any test.

Use of Prior Evaluation Data

Evaluation data from the past will not be used for any training, parameter optimization, or frequent development testing. Infrequent sampling of past evaluation test sets, to determine whether a result on development test data will generalize, is considered an acceptable use.

THE HUB

All sites are required to run one Hub test. Sites that can't handle the size of the H1 test may run on H2.


H1. Read WSJ Baseline.
----------------------
GOAL: improve basic SI performance on clean data.
DATA: 10 speakers * 20 utts = 200 utts
 64K-word read WSJ data, Sennheiser mic.
CONDITIONS: 
 P0: (opt) any grammar or acoustic training, session boundaries and utterance
           order given as side information. (INDEX h1_p0.ndx)
 C1: (req) Static SI test with standard 20K trigram open-vocab grammar and
           choice of either short-term or long-term speakers of both WSJ0 and
           WSJ1 (37.2K utts). (INDEX h1_c1.ndx)
 C2: (opt) Static SI test with standard 20K bigram open-vocab grammar and
           choice of either short-term or long-term speakers of both WSJ0 and
           WSJ1 (37.2K utts). (INDEX h1_c2.ndx)
SIDE INFO: session boundaries and utterance order are known for H1-P0 only.
SIG TESTS: P0:C1, C1:C2  and a pairwise comparison of all systems for H1-C1.
TEST SIZE: 200 utts


H2. 5K-Word Read WSJ Baseline
-----------------------------
GOAL: improve basic SI performance on clean data.
DATA: 10 speakers * 20 utts = 200 utts
 5K-word read WSJ data, Sennheiser mic.
CONDITIONS: 
 P0: (opt) any grammar or acoustic training, session boundaries and utterance
           order given as side information. (INDEX h2_p0.ndx)
 C1: (req) Static SI test with standard 5K bigram closed-vocab grammar and
           choice of either short-term or long-term speakers from WSJ0
           (7.2K utts). (INDEX h2_c1.ndx)
SIDE INFO: session boundaries and utterance order are known for H2-P0 only.
SIG TESTS: P0:C1 and a pairwise comparison of all systems for H2-C1.
TEST SIZE: 200 utts

THE SPOKES

For the 5K vocab test sets (Spokes S3-S8) it is assumed, but not required, that a 5K closed LM will be used. A de facto standard 5K closed bigram and trigram have been contributed by MIT Lincoln Laboratory for use by any participating site.

Spokes S1-S4 support problems in adaptation.

S1. Language Model Adaptation.
------------------------------
GOAL: evaluate an incremental supervised LM adaptation algorithm on a problem
 of sublanguage adaptation.
DATA: 4 A spkrs * 1-5 articles (~100 utts) = 400 utts
 Read unfiltered WSJ data from 1990 publications in TIPSTER corpus,
 Sennheiser mic, minimum of 20 sentences per article.
CONDITIONS:
 P0: (req) incremental supervised LM adaptation,
           closed vocabulary, any LM trained from 1987-89 WSJ0 texts 
           (INDEX s1_p0.ndx)
 C1: (req) S1-P0 system with LM adaptation disabled (INDEX s1_c1.ndx)
 C2: (opt) incremental unsupervised LM adaptation (INDEX s1_c2.ndx)
SIDE INFO: session boundaries and utterance order are known
SIG TESTS: P0:C1,  optionally P0:C2, C2:C1
 NOTE: the sign test and the Wilcoxon signed-rank test will not be done
TEST SIZE: 800 utts
METRICS: Partition the data into 4 equal parts distinguished by length
 of context (e.g. 0-5 sents, 6-10 sents, 11-20 sents, 20+ sents).
 For each part, report the standard measure and perplexity.


S2. Domain-Independence.
------------------------
GOAL: evaluate techniques for dealing with a newspaper domain different from 
 training.
DATA: 10 A spkrs * 1 article (~20 utts) = 200 utts 
 Sennheiser mic data from San Jose Mercury, minimum of 20 sentences per article
CONDITIONS:
 P0: (req) any grammar or acoustic training BUT no training whatsoever from
           the Mercury, nor any use of the knowledge of the paper's identity.
           (INDEX s2_p0.ndx)
 C1: (req) S2-P0 system on H1 data (INDEX s2_c1.ndx)
 C2: (req) H1-C1 system on S2 data (INDEX s2_c2.ndx)
SIDE INFO: session boundaries and utterance order are known
SIG TESTS: H1-P0:C1, P0:C2
TEST SIZE: 600 utts


S3. SI Recognition Outliers.
----------------------------
GOAL: evaluate a rapid enrollment speaker adaptation algorithm on difficult
 speakers.
DATA: 10 B spkrs * 40 utts = 400 utts (test)
      10 B spkrs * 40 utts = 400 utts (rapid enrollment from S3 speakers,
                                       used for S3-P0)
      10 A spkrs * 40 utts = 400 utts (rapid enrollment from Hub speakers,
                                       used for S3-C2)
 5K-word read WSJ data, Sennheiser mic, collected from non-native 
 speakers of American English (British, European, Asian dialects, etc.).
CONDITIONS: 
 P0: (req) rapid enrollment speaker adaptation (INDEX s3_p0.ndx)
 C1: (req) S3-P0 system with speaker adaptation disabled (INDEX s3_c1.ndx)
 C2: (req) S3-P0 system on H2 data (INDEX s3_c2.ndx)
 C3: (opt) incremental unsupervised adaptation (INDEX s3_c3.ndx)
SIDE INFO: speaker identity is known for P0, C1, and C2,
 session boundaries and utterance order are known for C3.
SIG TESTS: P0:C1, H2-P0:C2  optionally P0:C3, C1:C3
TEST SIZE: 1000 utts


S4. Incremental Speaker Adaptation.
-----------------------------------
GOAL: evaluate an incremental speaker adaptation algorithm.
DATA: 4 A spkrs * 100 utts = 400 utts (test)
      4 A spkrs * 40  utts = 160 utts (rapid enrollment from A speakers in S3)
 5K-word read WSJ data, Sennheiser mic.
CONDITIONS:
 P0: (req) incremental unsupervised speaker adaptation (INDEX s4_p0.ndx)
 C1: (req) S4-P0 system with speaker adaptation disabled (INDEX s4_c1.ndx)
 C2: (opt) incremental supervised adaptation (INDEX s4_c2.ndx)
 C3: (opt) rapid enrollment speaker adaptation (INDEX s4_c3.ndx)
SIDE INFO: for all conditions: session boundaries and utterance order are 
 known; additional for C2: correct transcription is known after the fact.
SIG TESTS: P0:C1  optionally P0:C2, P0:C3
 NOTE: the sign test and the Wilcoxon signed-rank test will not be done
TEST SIZE: 800 utts
METRICS: standard measure on each quarter of the data in sequence, 
 plus the ratio: total_runtime(S4-P0)/total_runtime(S4-C1).

Spokes S5-S8 support problems in channel and noise compensation.


S5. Microphone-Independence.
----------------------------
GOAL: evaluate an unsupervised channel compensation algorithm.
DATA: 10 A spkrs * 20 utts = 200 utts (2 channels, same speech as H2)
 5K-word read WSJ data, 10 different mics not in training or development test. 
 NOTE: No speech from the test microphones can be used.
CONDITIONS:
 P0: (req) unsupervised channel compensation enabled on wv2 data 
           (INDEX s5_p0.ndx)
 C1: (req) S5-P0 system with compensation disabled on wv2 data 
           (INDEX s5_c1.ndx)
 C2: (req) S5-P0 system on Sennheiser (wv1) data (INDEX s5_c2.ndx)
 C3: (opt) S5-C1 system on Sennheiser (wv1) data (INDEX s5_c3.ndx)
SIDE INFO: Microphone identities are not known
SIG TESTS: P0:C1, P0:C2, C1:C2,  optionally P0:C3, C1:C3, C2:C3
TEST SIZE: 600 utts


S6. Known Alternate Microphone.
-------------------------------
GOAL: evaluate a known microphone adaptation algorithm.
DATA: 10 A spkrs * 20 utts * 2 mics = 400 utts (test, 2 channels)
      10 D spkrs * 40 utts * 2 mics = 800 utts (mic-adaptation from devtest,
                                                2 channels)
 5K-word read WSJ data, from an Audio-Technica directional stand-mounted mic
 and telephone handset over external lines, plus stereo mic adaptation data.
 NOTE: the 800 stereo microphone adaptation utterances will come from the 
 devtest and are the only data from the target mics that are allowed.
CONDITIONS:
 P0: (req) supervised mic adaptation enabled on wv2 data (INDEX s6_p0.ndx)
 C1: (req) S6-P0 system with mic adaptation disabled on wv2 data 
           (INDEX s6_c1.ndx)
 C2: (req) S6-C1 system on Sennheiser (wv1) data (INDEX s6_c2.ndx)
SIDE INFO: Microphone identities are known.  Use of the stereo mic-adaptation
 data will be allowed for the S6-P0 condition only.
SIG TESTS: P0:C1, P0:C2, C1:C2
TEST SIZE: 1200 utts
METRICS: Separate error rates will be reported for each mic.


S7. Noisy Environments.
-----------------------
GOAL: evaluate a noise compensation algorithm with known alternate mic.
DATA: 10 A spkrs * 10 utts * 2 mics * 2 envs = 400 utts (test, 2 channels)
 5K-word read WSJ data, same 2 secondary mics as in S6, collected in two
 environments with a background A-weighted noise level of about 55-68 dB.
 NOTE: the 800 stereo microphone adaptation utterances will come from the 
 devtest and are the only data from the target mics that are allowed.
 The only data available for adaptation to the environment will be from 
 the S7 Spoke of the devtest data.
CONDITIONS:
 P0: (req) noise compensation enabled on wv2 data (INDEX s7_p0.ndx)
 C1: (req) S7-P0 system with compensation disabled on wv2 data 
     (INDEX s7_c1.ndx)
 C2: (req) S7-P0 system on Sennheiser (wv1) data (INDEX s7_c2.ndx)
SIDE INFO: Microphone identities are known.  Use of the stereo
 environment-adaptation data will be allowed for the S7-P0 condition only.
SIG TESTS: P0:C1, P0:C2, C1:C2
TEST SIZE: 1200 utts
METRICS: Separate error rates will be reported for each mic/environment pair.


S8. Calibrated Noise Sources.
-----------------------------
GOAL: evaluate a noise compensation algorithm with known alternate mic on 
 data corrupted with calibrated noise sources.
DATA: 10 A spkrs * 10 utts * 2 sources * 3 levels = 600 utts (test, 2 channels)
5K-word read WSJ data collected with competing recorded music or talk radio
 in the background at 0, 10, and 20 dB SNR, using the Audio-Technica
 directional stand-mounted mic from S6.
 NOTE: the 400 stereo microphone adaptation utterances will come from the 
 devtest and are the only data from the target mic that is allowed.
CONDITIONS:
 P0: (req) noise compensation enabled on wv2 data (INDEX s8_p0.ndx)
 C1: (req) S8-P0 system with compensation disabled on wv2 data 
      (INDEX s8_c1.ndx)
 C2: (req) S8-P0 system on Sennheiser (wv1) data (INDEX s8_c2.ndx)
 C3: (opt) S8-C1 system on Sennheiser (wv1) data (INDEX s8_c3.ndx)
SIDE INFO: 
SIG TESTS: P0:C1, P0:C2, C1:C2 and optionally P0:C3, C1:C3, C2:C3
TEST SIZE: 1800 utts
METRICS: Separate error rates will be reported for each source/level pair.


S9. Spontaneous WSJ Dictation.
------------------------------
GOAL: improve basic SI performance on spontaneous dictation-style speech.
DATA: 10 C speakers * 20 utts = 200 utts
 Spontaneous WSJ-like dictations (business news stories), Sennheiser mic.
CONDITIONS: 
 P0: (req) any grammar or acoustic training (INDEX s9_p0.ndx)
 C1: (req) S9-P0 system on H1 data (INDEX s9_c1.ndx)
 C2: (req) H1-C1 system on S9 data (INDEX s9_c2.ndx)
SIG TESTS: H1-P0:C1, P0:C2
TEST SIZE: 600 utts

3.0 CD-ROM Organization

The test corpora for the November 1993 ARPA CSR Hub and Spoke Evaluation Test suite are contained on 2 CD-ROMs, NIST speech discs 13-32.1 and 13-33.1. Disc 13-32.1 contains the documentation for the tests and the test corpora for the Spoke 9 tests. Disc 13-33.1 contains the test corpora for the Hub 1 and Hub 2 tests and for the Spoke 1 through Spoke 8 tests. The prompts and transcriptions are included alongside the test waveforms on the two discs, unlike the WSJ1 training data, in which the texts are collated on a single disc separate from the waveforms.

In addition to online documentation, Disc 13-32.1 contains software packages useful in processing the speech corpora and tabulating speech recognition scores. The top-level directory of Disc 13-32.1 contains the following major subdirectories:


         hgrep/  Utility to search a collated SPHERE header contents file.

          wsj1/  Test corpora and documentation.

         score/  NIST speech recognition scoring software.  Includes
                 dynamic string-alignment scoring code and statistical
                 significance tests.

        sphere/  NIST SPeech HEader REsources toolkit.  Provides command-
                 line and programmer interface to NIST-headered speech
                 waveform files.  Also provides for automatic decompression
                 of Shorten-compressed WSJ1 waveform files.

      tranfilt/  Directory containing a UNIX shell script used to perform
                 the post-adjudication transcription filter process.

General information files named "readme.doc" have been included in each of the high-level directories and throughout the documentation directory ("wsj1/doc") on Disc 13-32.1 and describe the contents of the directories.

Three text files are included in the root directory of each of the 2 discs and contain descriptors for the contents of the disc. The ".dir" file contains a list of all directories and files on the disc. The files "discinfo.txt" and "<DISC-ID>.txt" both contain a high-level description of the corpora on the disc. The static filename "discinfo.txt" is used across all discs, while the filename derived from the disc ID is unique to each disc; this allows flexibility in using the information.

The following is an example of the contents of one of these sets of files (filenames - discinfo.txt and 13_33_1.txt):

   discid: 13_33_1
   data_types: si_et_h1:10:223, si_et_h2:10:635, si_et_s1:4:428, \
   si_et_s2:10:214, si_et_s3:10:836, si_et_s4:4:418, si_et_s5:10:225, \
   si_et_s6:10:908, si_et_s7:10:984, si_et_s8:10:2020
   channel_ids: 1,2 
The first field, "discid", identifies the disc. The second field, "data_types", contains comma-separated entries for the subcorpora (directories), with each entry identifying the subcorpus, the number of speakers, and the number of waveforms. The third field, "channel_ids", contains a comma-separated list of the channels contained on the disc. This field normally has a value of "1" (Sennheiser) or "2" (other mic) for these corpora.
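
For illustration, the following is a minimal sketch (not part of the distribution) of how these descriptor files might be parsed; the handling of the backslash line continuations is an assumption based on the example above.

    def parse_discinfo(path):
        """Parse a discinfo.txt/<DISC-ID>.txt descriptor into a dictionary."""
        with open(path) as f:
            text = f.read().replace("\\\n", " ")   # join "\"-continued lines
        info = {}
        for line in text.splitlines():
            if ":" not in line:
                continue
            key, _, value = line.partition(":")    # split on the first ":" only
            info[key.strip()] = value.strip()
        data_types = {}
        for entry in info["data_types"].split(","):
            name, speakers, waveforms = entry.strip().split(":")
            data_types[name] = (int(speakers), int(waveforms))
        return {"discid": info["discid"],
                "data_types": data_types,
                "channel_ids": [c.strip() for c in info["channel_ids"].split(",")]}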

3.1 Directory Structure

The following depicts the directory structure of the corpora on the two discs:

top level: wsj1/      (Phase 2 corpus)

2nd level: doc/       (online documentation [disc 13-32.1 only])

  /        si_et_h1/wsj64k/ (Hub 1 - 64K vocabulary test data)
  | 
  |        si_et_h2/wsj5k/  (Hub 2 - 5K vocabulary test data)
  |
  |        si_et_s1/wsj/    (Spoke 1 - language model adaptation WSJ test data)
  |
  |        si_et_s2/sjm/    (Spoke 2 - domain-indep. San Jose Mercury test
  |			     data)
  |
  |        si_et_s3/non_nat/(Spoke 3 - non-native speakers test data)
  |
  |        si_et_s4/inc_adp/(Spoke 4 - incremental speaker adaptation test
 Disc                        data)
13-33.1
  |        si_et_s5/mic_ind/(Spoke 5 - microphone independence test data)
  |
  |        si_et_s6/        (Spoke 6 - known alternate mic test data:)
  |                 at_te/  (Audio Technica mic)
  |                 th_te/  (telephone handset)
  |
  |        si_et_s7/        (Spoke 7 - noisy environments test data:) 
  |                 at_e1/  (Audio Technica mic, noise environment 1)
  |                 at_e2/  (Audio Technica mic, noise environment 2)
  |                 th_e1/  (telephone handset, noise environment 1)
  |                 th_e2/  (telephone handset, noise environment 2)
  |
  |        si_et_s8/        (Spoke 8 - calibrated noise sources:)
  |                 mu_0/   (competing music, 0 dB. SNR)
  |                 mu_10/  (competing music, 10 dB. SNR)
  |                 mu_20/  (competing music, 20 dB. SNR)
  |                 tr_0/   (competing talk radio, 0 dB. SNR)
  |                 tr_10/  (competing talk radio, 10 dB. SNR)
  \                 tr_20/  (competing talk radio, 20 dB. SNR)

 Disc
13-32.1->  si_et_s9/journ/  (Spoke 9 - spontaneous WSJ-style dictation:)
speaker level: <XXX>/ (speaker-ID, where XXX = "001" to "zzz", base 36)

data level: <FILES> (corpora files, see below for format and types)

3.2 Filenaming Formats

The filenames and filetypes follow standard CSR WSJ1 conventions. Data types are differentiated by unique filename extensions. All files associated with the same utterance have the same basename. All filenames are unique across all WSJ corpora. Speech waveform (.wv1/.wv2) files are utterance-level files; prompt (.ptx), transcription (.dot and .lsn), and SPQA (.spq) files are session-level files and, therefore, contain texts for multiple waveform files. The filename format is as follows:

<UTTERANCE-ID>.<XXX>

where,

     UTTERANCE-ID ::= <SSS><T><EE><UU>

     where,

          SSS ::= 001 | ... | zzz (base-36 speaker ID)
          T ::= (speech type code)
                    c (Common read) |
                    s (Spontaneous) |
                    a (Adaptation read) |
                    x (calibration recording)
                    
          EE ::= 01 | ... | zz (base-36 session ID)
          UU ::= 01 | ... | zz (base-36 within-session sequential speaker
                                utterance code - always "00" for .ptx, .dot 
                                and .lsn session-level files)

          XXX ::= (data type)

               .wv1 (channel 1 - Sennheiser waveform)
               .wv2 (channel 2 - Other mic waveform)

               .ptx (prompting text for read material)
               .dot (detailed orthographic transcription)
               .lsn (Lexical SNOR transcription derived from .dot)
               .spq (output from SPeech Quality Assurance software)
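
For illustration, a minimal sketch (a hypothetical helper, not part of the distribution) that splits a filename of this form into its component fields:

    SPEECH_TYPES = {"c": "common read", "s": "spontaneous",
                    "a": "adaptation read", "x": "calibration recording"}

    def parse_filename(filename):
        """Break a WSJ1 filename such as "4oac0201.wv1" into its fields."""
        basename, _, extension = filename.partition(".")
        if len(basename) != 8:
            raise ValueError("expected an 8-character utterance ID: %r" % basename)
        return {
            "speaker_id": basename[0:3],               # SSS, base-36 speaker ID
            "speech_type": SPEECH_TYPES[basename[3]],  # T, speech type code
            "session_id": basename[4:6],               # EE, base-36 session ID
            "utterance_no": basename[6:8],             # UU, "00" for session-level files
            "data_type": extension,                    # wv1, wv2, ptx, dot, lsn, spq
        }

    # Example: parse_filename("4oac0201.wv1")["speaker_id"] == "4oa"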

3.3 Data Types

3.3.1 Waveforms (.wv?)

The waveforms are SPHERE-headered, digitized, and compressed using the lossless Cambridge University "Shorten" algorithm under SPHERE. Version 2.1 of SPHERE has been included on this disc set and permits the waveform files to be decompressed automatically as they are accessed. See the files under the "/sphere" directory on Disc 13-32.1.

The filename extension for the waveforms contains the characters, "wv", followed by a 1-character code to identify the channel. The headers contain the following fields/types:

Field                    Type     Description - Probable defaults marked in ()
-----------------------  -------  ---------------------------------------------
microphone 		 string   microphone description ("Sennheiser HMD410",
                                  "Crown PCC160", etc.)  
recording_site           string   recording site ("SRI")
database_id              string   database (corpus) identifier ("wsj1")
database_version         string   database (corpus) revision ("1.0")
recording_environment    string   text description of recording environment
speaker_session_number   string   2-char. base-36 session ID from filename
session_utterance_number string   2-char. base-36 utterance number within 
                                  session from the filename
prompt_id                string   WSJ source sentence text ID - see .ptx
                                  description below for format (only in read 
                                  data).
utterance_id             string   utterance ID from filename of the form
                                  SSSTEEUU as described in the filename
                                  section above.
speaking_mode            string   speaking mode ("spontaneous","read-common",
                                  "read-adaptation", etc.)  
speaker_id               string   3-char. speaker ID from filename
sample_count             integer  number of samples in waveform
sample_min               integer  minimum sample value in waveform
sample_max               integer  maximum sample value in waveform
sample_checksum          integer  checksum obtained by the addition of all
                                  (uncompressed) samples into an unsigned 
                                  16-bit (short) and discarding overflow.  
recording_date           string   beginning of recording date stamp of the
                                  form DD-MMM-YYYY.  
recording_time           string   beginning of recording time stamp of the
                                  form HH:MM:SS.HH.  
channel_count            integer  number of channels in waveform ("1")
sample_rate              integer  waveform sampling rate ("16000")
sample_n_bytes           integer  number of bytes per sample ("2")
sample_byte_format       string   byte order (MSB/LSB -> "10", LSB/MSB -> "01")
sample_sig_bits          integer  number of significant bits in each sample
                                  ("16")
sample_coding            string   waveform encoding ("pcm,embedded-shorten-v1.09")
end_head
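
The SPHERE header itself is plain ASCII, so the fields above can be inspected without decompressing the waveform. The following is a minimal sketch (not part of the distribution) that parses the header and recomputes the sample_checksum for an uncompressed 16-bit PCM file; Shorten-compressed files must first be decompressed with the included SPHERE tools, and the little-endian unpack format below is an assumption that should instead follow sample_byte_format.

    import struct

    def read_sphere_header(path):
        """Parse the ASCII NIST SPHERE header of `path` into a dict."""
        with open(path, "rb") as f:
            f.readline()                                       # "NIST_1A" magic line
            header_size = int(f.readline().decode().strip())   # header size in bytes, e.g. 1024
            f.seek(0)
            header = f.read(header_size).decode("ascii", errors="replace")
        fields = {}
        for line in header.splitlines()[2:]:
            parts = line.split(None, 2)
            if not parts or parts[0] == "end_head":
                break
            name, ftype, value = parts
            fields[name] = int(value) if ftype == "-i" else value
        return fields, header_size

    def sample_checksum(path, header_size, sample_count):
        """Add all 16-bit samples into an unsigned short, discarding overflow."""
        with open(path, "rb") as f:
            f.seek(header_size)
            data = f.read(2 * sample_count)
        samples = struct.unpack("<%dh" % sample_count, data)   # assumes LSB/MSB ("01") order
        return sum(samples) & 0xFFFF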

3.3.2 Detailed Orthographic Transcriptions (.dot)

A detailed orthographic transcription (.dot) containing lexical and non-lexical elements has been generated for each utterance. The specifications for the format of the detailed orthographic transcriptions are located in the file, "dot_spec.doc", under the "/wsj1/doc" directory on Disc 13-32.1.

The transcriptions for all utterances in a session are concatenated into a single file of the form, "<SSS><T><EE>00.dot" and each transcription includes a corresponding utterance-ID code. The format for a single utterance transcription entry in this table is as follows:

        <TRANSCRIPTION-TEXT> (<UTTERANCE-ID>)<NEW-LINE>
example:
Speculation in Tokyo was that the yen could rise because of the realignment (4oac0201)

(new-line added for readability)

There is one ".dot" file for each speaker-session.

The .dot transcriptions for the test corpora were corrected during an adjudication process which followed the November 1993 tests. Only the corrected transcriptions are included on these discs.
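
For illustration, a minimal sketch (a hypothetical helper, not part of the distribution) that splits a session-level .dot or .lsn file into (utterance-ID, transcription) pairs based on the line format above:

    import re

    _LINE = re.compile(r"^(?P<text>.*)\((?P<utt_id>[0-9a-z]{8})\)\s*$", re.IGNORECASE)

    def read_session_transcriptions(path):
        """Return [(utterance_id, transcription_text), ...] for one .dot/.lsn file."""
        pairs = []
        with open(path) as f:
            for line in f:
                m = _LINE.match(line.rstrip())
                if m:                                   # skip any non-matching lines
                    pairs.append((m.group("utt_id"), m.group("text").strip()))
        return pairs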

3.3.3 Lexical SNOR Transcriptions (.lsn)

The lexical Standard Normal Orthographic Representation (lexical SNOR) (.lsn) transcriptions are word-level transcriptions derived from the ".dot" transcriptions with capitalization, non-speech markers, prosodic markings, fragments, and "\" character escapes filtered out.

The .lsn transcriptions are of the same form as the .dot transcriptions and will be identified by a ".lsn" filename extension.

example:

SPECULATION IN TOKYO WAS THAT THE YEN COULD RISE BECAUSE OF THE REALIGNMENT (4OAC0201)

(new-line added for readability)

There is one ".lsn" file for each speaker-session.

The .dot (and derivative .lsn) transcriptions for the test corpora were corrected during an adjudication process which followed the November 1993 tests. Only the corrected transcriptions are included on these discs.

3.3.4 Prompting Texts (.ptx)

The prompting texts for all read Wall Street Journal utterances in a session including the utterances' utterance-IDs and prompt IDs are concatenated into a single file of the form, "<SSS><T><EE>00.ptx". The prompt ID is Doug Paul's Wall Street Journal sentence index. The format for this index is:

     <YEAR>.<FILE-NUMBER>.<ARTICLE-NUMBER>.<PARAGRAPH-NUMBER>.<SENTENCE-NUMBER>
The format for a single prompting text entry in the .ptx file is as follows:
     <PROMPTING-TEXT> (<UTTERANCE-ID> <PROMPT-ID>)
example:
Speculation in Tokyo was that the yen could rise because of the realignment. (4oac0201 87.051.870113-0174.6.1)

(new-line added for readability)

The inclusion of both the utterance ID and prompt ID allows the utterance to be mapped back to its source sentence text and surrounding paragraph.

There is one .ptx file for each read speaker-session.
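
For illustration, a minimal sketch (a hypothetical helper, not part of the distribution) that parses one .ptx entry and splits the prompt ID into the fields described above:

    import re

    _PTX = re.compile(r"^(?P<text>.*)\((?P<utt_id>\S+) (?P<prompt_id>\S+)\)\s*$")

    def parse_ptx_line(line):
        """Return the prompt text, utterance ID, and prompt-ID fields of one entry."""
        m = _PTX.match(line.rstrip())
        if m is None:
            return None
        year, file_no, article, paragraph, sentence = m.group("prompt_id").split(".")
        return {"text": m.group("text").strip(),
                "utterance_id": m.group("utt_id"),
                "year": year, "file": file_no, "article": article,
                "paragraph": paragraph, "sentence": sentence}

    # Example: parse_ptx_line('Speculation ... realignment. (4oac0201 87.051.870113-0174.6.1)')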

3.3.5 SPeech Quality Assurance Reports (.spq)

The data collectors at SRI screened all of the test data using the NIST SPeech Quality Assurance (SPQA) software. A SPQA report was generated for each speaker-session and is included in files of the form, "<SSS><T><EE>00.spq".

The SPQA software scans digitized speech waveforms for signal defects and anomalies. The version used, SPQA 2.2, scanned the test waveform files and determined the peak speech power, mean noise power, signal-to-noise ratio (SNR), and ratio of speech duration to total recording duration. The software also checked for DC bias, clipping, and 60-Hz EM-interference hum.
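
For illustration only, the following sketch is NOT the SPQA algorithm; it merely estimates a signal-to-noise figure in the spirit of the measurements described above, using arbitrary frame and percentile choices, for a sequence of uncompressed 16-bit PCM sample values.

    import math

    def estimate_snr(samples, frame_len=320):          # 20 ms frames at 16 kHz
        """Rough SNR estimate from high- vs. low-percentile frame energies (dB)."""
        energies = []
        for i in range(0, len(samples) - frame_len + 1, frame_len):
            frame = samples[i:i + frame_len]
            energies.append(sum(s * s for s in frame) / frame_len + 1e-9)
        energies.sort()
        noise = energies[int(0.10 * len(energies))]    # quiet frames ~ noise floor
        speech = energies[int(0.95 * len(energies))]   # loud frames ~ peak speech
        return 10.0 * math.log10(speech / noise)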

3.4 Online Documentation

In addition to prompts and transcriptions, Disc 13-32.1 contains online documentation for the test corpora. The documentation is located under the "wsj1/doc" directory and consists of Hub and Spoke evaluation test indices, data collection information, a summary of the CD-ROM distribution, directories of each CD-ROM, specifications for the transcription format, collated waveform headers for the test corpora, source texts, vocabularies, and language models for the read WSJ material.

PLEASE NOTE: IF YOU INTEND TO COMPLY WITH THE RULES OF THE NOVEMBER 1993 ARPA CSR HUB AND SPOKE EVALUATION TESTS IN IMPLEMENTING TESTS ON THIS MATERIAL AT YOUR SITE, THE ONLINE DOCUMENTATION SHOULD NOT BE EXAMINED PRIOR TO RUNNING TESTS UNLESS IT IS SPECIFICALLY PERMITTED BY THE TEST GUIDELINES UNDER SECTION 2.0.

3.5 Indices

Index files have been built for each of the 36 hub and spoke tests to indicate the corpora to be used in each test. The files are located in the "wsj1/doc/indices" directory on Disc 13-32.1 and are named so as to clearly indicate the tests they pertain to (e.g., "h1_p0.ndx"). Index files for required baseline training conditions have also been included. Each index file contains a header which describes the test/training set. Header lines are all preceded by ";;". Each line following the header indicates the disc, path, and waveform file for an utterance in the test set (e.g.,
"13_33_1:wsj1/si_et_h1/wsj64k/4oa/4oac0201.wv1"). Note that auxiliary files such as calibration recordings and adaptation utterances which are not part of the test set are not included in the indices.

SPECIAL NOTE: Each test set corpora directory contains slightly more data than is specified in the CCCC Hub and Spoke Test specifications in Sections 2.0 and 4.0. This extra data was collected so as to avoid truncating WSJ or SJM news articles. To minimize the burden of processing the already bulky Spoke 8 data, the index file for each of the Spoke 8 tests has been edited to include exactly 600 utterances (as specified in the CCCC Hub and Spoke Test specs) rather than the full 710 utterances which have been collected and are included on disc 13-33.1. Only these 600 utterances should be tested on.

4.0 Hub and Spoke Test Data Specifications

The following is a copy of the CCCC specifications for the collection of the corpora to support the November 1993 ARPA CSR Hub and Spoke Tests. It is included to provide the details of the test data collection structure for those interested, and is not required reading for those who intend to implement the tests.


Specification of Evaluation Test Data in Support of the Hub and Spoke Paradigm for the November 1993 ARPA-sponsored CSR Evaluation.
Rev 4: 9-7-93

Introduction

This document specifies the contents and dimensions for each of the 2 Hub and 9 Spoke tests that have been approved for the 1993 CSR evaluation.

Here are the overall numbers for this proposal (including the Hub and S9).

TOTAL EVALTEST DATA PROPOSED:
        5000 wavs        3400 utts        test
        1200 wavs        1200 utts        rapid enrollment
        ---------        ---------
        6200 wavs        4600 utts        grand totals for the evaltest

Compare to:
        6760 wavs        4360 utts        grand totals for the devtest
So the total number of waveforms required is fewer than the total required for the 8-Spoke devtest (560 fewer waveforms), but the number of utterances is larger (240 more utts).

The changes in waveforms, relative to the devtest, are listed here:
+ 200 test for H1
+ 400 rapid enrollment for S3
- 160 rapid enrollment for S4
- 1600 stereo mic adaptation for S6
+ 200 test for S9
+ 400 rapid enrollment for S9
- 560 total reduction
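
The totals above can be checked mechanically; a small sketch:

    # Check the waveform deltas listed above against the devtest/evaltest totals.
    deltas = {"H1 test": +200, "S3 rapid enrollment": +400,
              "S4 rapid enrollment": -160, "S6 stereo mic adaptation": -1600,
              "S9 test": +200, "S9 rapid enrollment": +400}
    assert sum(deltas.values()) == -560
    assert 6760 + sum(deltas.values()) == 6200   # devtest wavs -> evaltest wavs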

Note that all procedures defined for the devtest will hold for the evaltest as well including those defined for Spoke 8 (Calibrated Noise Sources). In addition, Spoke 8 will require calibration waveforms from the Sennheiser and A-weighted sound level measurements at the subject's position.

Also note that there will be a total of 30 speakers required for the evaltest. Ten of these are to be journalist speakers for the spontaneous spoke, S9. For completeness, rapid enrollment data for these speakers has been added.

Finally, note that the 10 alternate mics for Spoke S5 need to be different than any used previously for training or development test.

GENERAL REMARKS

There are two tests in the Hub differentiated by vocabulary, that evaluate basic SI recognition technology. Additionally, there are nine specific problem areas supported by the Spoke tests.

Test Datasets

In each dataset below, the speakers should be balanced for gender.

In total, there should be 30 different speakers used to satisfy the requirements of the entire spec -- 10 non-native speakers for S3 (SI Recognition Outliers), 10 for S9 (Spontaneous WSJ Dictation) and 10 for all the rest of the eval test and rapid enrollment data. These speaker sets are labeled A (devtest), B (non-native), and C (spontaneous) below.

Note that H1 and S5 share the wv1 channel data, S3 and S4 share rapid enrollment data, and the mic-adaptation data for S6 will come from the existing devtest data. A table of all data segments required in this spec can be made by extracting all lines containing the string 'DATA:'. This should yield totals of 5000 test wavs (3400 utts) and 1200 rapid enrollment wavs (1200 utts).

Read Text Sources

The 5K-word devtest data specified here is to be generated from the existing text pools constructed for the WSJ0 pilot corpus. The 64K-word devtest data specified here is to be generated from the existing text pools constructed for the WSJ0 pilot that were used to support the 20K standard grammar conditions. Prompting texts should be sampled randomly from the appropriate text pool for each speaker/condition combination.

Microphone Documentation

All microphones used should be documented according to the guidelines approved by the CCCC and Data-Quality Committee. These guidelines are reproduced here: For each microphone used in an ARPA-sponsored common corpus, there should be the following documentation:

  1. A block diagram of all physical connections from the microphone to the A/D. Include preamps, mixers, and wiring diagrams if applicable.
  2. A description of mic positioning with respect to the subject, the computer, and the surrounding environment.
  3. Photographs of the physical setup including subject and immediate surroundings.
  4. A copy of the instructions given to the subject that deal with mic handling and placement and a note on the way compliance was monitored.
  5. A copy of the vendor-supplied spec-sheet.
Environment Documentation

Each environment used should be documented in the following three ways:

  1. A prose description of the environment including rough dimensions, surface materials, and noise sources within.
  2. An A-weighted sound level measurement made over 20 seconds at the position and orientation of the (secondary) microphone with the subject in position. This measurement should be made at the beginning of every session of data collection. A session is a contiguous block of data collected without a subject break or change of physical setup.
  3. A corresponding 20 second file of ambient noise from the same time period as the A-weighted sound level measurement.
Note the environment calibration waveforms will not be distributed before the test. They are included here for documentation purposes only.

Differences Between Eval and Dev Test Data

Every speaker, room, and microphone must be physically different between these two test corpora.

The only exception to this is the 'sortation' lab environment specified in S7, which will be used again. The position of the subject and the mix of machinery used should be varied to be different than that used in the devtest.

THE HUB

H1. Read WSJ Baseline.

Read 64K-word WSJ data, Sennheiser mic.
H1 DATA: 10 A spkrs * 20 utts = 200 test wavs (200 utts)

H2. 5K-Word Read WSJ Baseline

Read 5K-word WSJ data, Sennheiser mic. This data is from the wv1 channel in Spoke S5.

THE SPOKES

S1. Language Model Adaptation.

Read unfiltered 1990 WSJ data (in TIPSTER corpus), Sennheiser mic. Whole articles selected at random, minimum of 20 sentences per article. Texts should come from the set-aside portion of the devtest pool at LDC.
S1 DATA: 4 A spkrs * 1-5 articles (~100 utts) = 400 test wavs (400 utts)

S2. Domain-Independence.

Read unfiltered data from the San Jose Mercury, Sennheiser mic. Whole articles selected at random, minimum of 20 sentences per article. Texts should come from the set-aside portion of the devtest pool at LDC.
S2 DATA: 10 A spkrs * 1 article (~20 utts) = 200 test wavs (200 utts)

S3. SI Recognition Outliers.

5K-word read WSJ data, Sennheiser mic, collected from non-native speakers of American English (British, European, Asian dialects, etc.). Native speakers with very marked dialects are also OK. The rapid adapt data for the A speakers is to satisfy condition S3-C2. It will also be used for S4.
S3 DATA: 10 B spkrs * 40 utts = 400 test wavs (400 utts)
S3 DATA: 10 B spkrs * 40 utts = 400 rapid enrollment wavs (400 utts)
S3 DATA: 10 A spkrs * 40 utts = 400 rapid enrollment wavs (400 utts)

S4. Incremental Speaker Adaptation.

5K-word read WSJ data, Sennheiser mic. Rapid adaptation data for these speakers will come from S3.
S4 DATA: 4 A spkrs * 100 utts = 400 test wavs (400 utts)

S5. Microphone-Independence.

5K-word read WSJ data, from up to 10 different mics that are not in training or in any previous development test data. Stereo Sennheiser test data will also be collected for contrastive tests. The wv1 channel data from this Spoke will be used in the Hub test, H2.
S5 DATA: 10 A spkrs * 20 utts * 2 chans = 400 test wavs (200 utts)

S6. Known Alternate Microphone.

5K-word read WSJ data, from 2 secondary mics -- the Audio-Technica UniPoint AT853a stand-mounted mic and the AT&T 712 speaker phone handset mic. The Audio-Technica mic should sit between the console and the keyboard and project about 8 inches forward from the front surface of the console toward the subject, so that the tip of the mic is about 8-18 inches from the subjects' mouth. The telephone handset data is to be collected over external phone lines. Stereo Sennheiser test data will also be collected for contrastive tests. The stereo mic adaptation data for this Spoke will come from the devtest.
S6 DATA: 10 A spkrs * 20 utts * 2 mics * 2 chans = 800 test wavs (400 utts)

S7. Noisy Environments.

5K-word read WSJ data, same 2 secondary mics as in S6, collected in two environments with a background A-weighted sound level of about 55-68 dB. One of these environments will be the 'sortation' lab that was used for the devtest, but here the subject location and machinery mix will be changed. The other environment will be new and at the lower end of the range. Stereo Sennheiser data will be collected for comparative tests.
S7 DATA: 10 A spkrs * 10 utts * 2 mics * 2 envs * 2 chans = 800 test wavs (400 utts)

S8. Calibrated Noise Sources.

5K-word read WSJ data collected with competing recorded music or talk radio playing in the background at 3 SNR levels (0, 10 and 20 dB) recorded through the Audio-Technica stand-mounted mic specified in S6. The noise source should be positioned to the side of the subject and mic at a comfortable (arm's reach) distance on the workstation desktop. The environment should be typical quiet offices. Noise levels are to be set by determining rough A-weighted sound levels through the A-T mic necessary to yield approximate desired SNR levels. The calibration procedure should be identical to that used for the devtest. This procedure is documented in the file, noise-calibration.procedure, that was delivered along with the devtest data. In addition, A-weighted sound level of the noise source should be measured at the subject's ear closest to the noise source. Each speaker will produce 30 utts, 10 each at 0, 10 and 20 dB SNR. Stereo Sennheiser channel will also be collected for contrastive tests. SNR measurements should be included for both channels.
S8 DATA: 10 A spkrs * 10 utts * 2 sources * 3 levels * 2 chans = 1200 test wavs (600 utts)

S9. Spontaneous WSJ Dictation.

Spontaneous WSJ-like dictations (business news stories) from journalists, Sennheiser mic. If it appears that procuring 10 journalist subjects will not be possible within 30 days, then each of the first subjects should collect 40 utts in case the total number of subjects falls short of 10.
S9 DATA: 10 C spkrs * ~20 utts per dictation = 200 test wavs (200 utts)
S9 DATA: 10 C spkrs * 40 utts = 400 rapid enrollment wavs (400 utts)


5.0 November 1993 CSR Test Overview

This section provides background for those who intend to duplicate the test and scoring conditions for the November 1993 tests. This section begins by describing how the November 1993 ARPA CSR Hub and Spoke Benchmark Tests were conducted. It then covers how the November 1993 scoring protocols were implemented, and concludes with sample output and scored results from two of the participants in the November 1993 tests.

5.1 Test Data Distribution

The test material was distributed to the sites participating in the tests on 2 recordable CD-ROMs on November 1, 1993. Documentation similar to that which is included on this disc was distributed on the recordable CD-ROMs and via anonymous ftp and email.

5.2 Test Protocols

The tests were conducted according to the protocols specified in the ARPA CCCC document: "Specification for the 1993 CSR Evaluation -- Hub and Spoke Paradigm", Rev 14: 10-21-93, which was originally distributed in email and is included in this document in Section 2.0.

5.3 Initial Scoring

Results (recognition output) for the Hub Primary tests (H1, H2 (P0)) and Spoke Primary tests (S1 - S9 (P0)) were due at NIST on November 22, 1993. This allowed the test sites 3 weeks to process the test corpora and package the results for scoring at NIST. The sites were permitted to run each test only once, and results received after the 3-week deadline were marked as "late". The results of the initial scoring run by NIST were made available to the sites on December 8, 1993.

The balance of the test results, for the contrastive tests (H1, H2, S1 - S9 (C*)), was due at NIST on December 13.

5.4 Adjudication

During the 1-month period between December 8, 1993 and January 7, 1994, sites were permitted to "contest" the transcriptions used in scoring their recognition output. Specially-designed bug report forms were used to submit requests for "adjudication" to NIST. The adjudication requests generally asked for transcription modifications to correct transcription errors or proposed alternative transcriptions.

NIST "adjudicators" considered each request and made decisions on whether or not to modify the transcription(s) in question. Of the transcriptions which were revised, most were the result of judgements by the adjudicators that the transcriptions contained words which could have multiple orthographic representations or which were lexically ambiguous. In many of these cases, both the original transcription and an alternative transcription were permitted. This was implemented by mapping alternate word forms to a single form in both the transcriptions and the recognized strings. The remaining revisions were the result of corrections to simple transcription errors.

5.5 Final Scoring

After the adjudication was completed and the final revisions were made to the transcriptions, a final "official" scoring run was made on all the test results and the final scores were reported to the test participants via email/ftp on January 18, 1994.

The official results of the November 1993 ARPA CSR Hub and Spoke Benchmark Tests were published in the proceedings of the ARPA Human Language Technology Workshop, March 8-11, 1994.

5.6 LIMSI and CU-HTK November 1993 Sample Output and Scores

In order to provide sample input/output for calibrating the NIST scoring software and a point of comparison for those who are new to the ARPA/NIST CSR tests, LIMSI and Cambridge University have consented to the inclusion of their test output and scored results for one of the hub tests in this disc set.

The output for the LIMSI Hub-1, Contrast-1 (WSJ 20K open vocabulary, trigram language model, 37.2K WSJ1 training utterances) is located in the directory, "wsj1/doc/nov93_h1" on Disc 13-32.1.

The output for the Cambridge University/HTK Hub-2, Contrast-1 (WSJ 5K closed vocabulary, bigram language model, 7,200 WSJ0 training utterances) is located in the directory, "wsj1/doc/nov93_h2" on Disc 13-32.1.

6.0 Baseline Training and Language Model Data

Some of the tests in the CCCC Hub and Spoke test specifications in Section 2.0 call for the use of "standard" baseline training sets. These training sets are drawn from two major collections of corpora: the CSR pilot corpus, WSJ0 (NIST speech discs 11-1.1 - 11-12.1), and the CSR Phase II corpus, WSJ1 (NIST speech discs 13-1.1 - 13-34.1). Indices have been developed for the following baseline training sets and are located in the subdirectories under the directory, "wsj1/doc/indices", on Disc 13-32.1:

wsj0/train: Indices for the WSJ0 (~7,200-utterance) Sennheiser training sets

tr_l_wv1.ndx WSJ0 SD/SI-long term training, Sennheiser mic
tr_s_wv1.ndx WSJ0 SI-short term training, Sennheiser mic

wsj1/train: Indices for the WSJ1 (~30,000-utterance) Sennheiser training sets

tr_l_wv1.ndx WSJ1 SI-long term training, Sennheiser mic
tr_s_wv1.ndx WSJ1 SI-short term training, Sennheiser mic

Some of the tests in the CCCC Hub and Spoke test specifications in Section 2.0 call for the use of "standard" baseline bigram or trigram language models. The baseline language models were developed by MIT Lincoln Laboratory and are included in the directory, "wsj1/doc/lng_modl/base_lm", on Disc 13-32.1. The baseline language models are as follows:

bcb05cnp.z 5K closed NVP bigram LM
bcb05onp.z 5K open NVP bigram LM
bcb20cnp.z 20K closed NVP bigram LM
bcb20onp.z 20K open NVP bigram LM
tb05cnp.z 5K closed NVP trigram LM
tb20onp.z 20K open NVP trigram LM
To conserve disc space, the language model files have been compressed using the standard UNIX "compress" utility and must be decompressed to be used.

Note: VP language models and 5K open and 20K closed trigram language models are not used in these tests. All VP bigram language models are included in the "base_lm" directory for completeness. 5K open and 20K closed trigram language models were not available for inclusion.

7.0 Test Scoring

This section describes the process used by NIST in scoring the November 1993 WSJ/CSR Hub and Spoke tests. The information in this section can also be used by those who wish to duplicate the scoring methodology used.

New to the CSR tests in 1993 was an official adjudication period and the allowance of some alternative transcriptions in scoring the results. To implement the limited alternatives (see Section 5.4), multiple hypothesis and reference transcriptions were mapped to a single representation in a "prefiltering" step which occurred before the actual scoring. Therefore, no modifications were made to the scoring software itself.

To implement the prefiltering process, the hypothesized transcriptions generated by the recognition systems were first mapped to single representations according to rules defined by the adjudicators. The prefiltered hypothesis transcripts were then aligned and scored using the NIST scoring package.

For a complete description of the NIST scoring package and its use, see the file, "score/doc/score.rdm", on Disc 13-32.1.

7.1 Preparation of Hypothesized Transcripts

The system-generated hypothesized transcripts were formatted by the test sites according to the Lexical SNOR (LSN) format used by the scoring package. See Section 3.3.3 for a description of the LSN format.

7.2 Transcription Prefiltering

A utility to apply the mapping rules described in Section 7.0 has been included on Disc 13-32.1 in the top-level directory, "tranfilt". The directory contains a "readme.doc" file with compilation and installation instructions.

In the November 1993 tests, the hypothesized transcripts were prefiltered using the Bourne Shell script, "nov93flt.sh". A copy of this script has been included in the "tranfilt" directory above. The script operates as a simple UNIX filter that reads the hypothesis transcriptions from "stdin" and writes the filtered transcriptions to "stdout". The format for using the utility is as follows:

<INSTALL_DIR>/tranfilt/nov93flt.sh < system.hyp > system.filt.hyp

where: INSTALL_DIR is the pathname of the compiled 'tranfilt' directory.

Example using the sample LIMSI data in the "wsj1/doc/nov93_h1" directory, where "tranfilt" is located under the current directory:
./tranfilt/nov93flt.sh < 13_32.1:/wsj1/doc/nov93_h1/limsi.hyp > limsi_filt.hyp

7.3 Scoring Results

In the November 1993 tests, the filtered hypothesized transcriptions were scored using the standard NIST scoring package described in Section 7.0. The scoring package has been included on Disc 13-32.1 in the top-level directory, "score". The directory contains a "readme.doc" file with compilation and installation instructions.

In order to score a hypothesis transcription against a reference transcription using the NIST scoring software, an "alignment" file must be created. The alignment file contains pairs of hypothesis and reference strings which have been aligned using a DP string alignment procedure. The format for using the "align" utility is:

<INSTALL_DIR>/bin/align -cfg <CONFIG_FILE> -hyp <HYP_FILE> -outfile <ALIGNMENTS>

where:
INSTALL_DIR is the pathname to the compiled "score" directory.

CONFIG_FILE contains a list of arguments to the scoring software including the filespec for the reference transcription file, lexicon file, and command line switches.

HYP_FILE contains the LSN-formatted hypothesis transcriptions.

ALIGNMENTS contains the output alignments.

Example using the example output from "tranfilt" in Section 7.2, where "score" is located under the current directory:

./score/bin/align -cfg ./score/lib/wsj.cfg -hyp limsi_filt.hyp -outfile limsi.ali

The actual tabulation of the scores is generated using the above alignment file as input. The "score" program with the "-ovrall" switch creates a by-speaker summary table of the error rates (insertions, deletions, etc.). The format for using the "score" utility is:

<INSTALL_DIR>/bin/score -cfg <CONFIG_FILE> -align <ALIGNMENTS> <REPORT_OPTIONS>

where:
INSTALL_DIR is the pathname to the compiled "score" directory.

CONFIG_FILE contains a list of arguments to the scoring software including the filespec for the reference transcription file, lexicon file, and command line switches.

ALIGNMENTS contains the output alignments.

REPORT_OPTIONS are switches to generate different reports.

Example using the example output from "align" above, where "score" is located under the current directory:

./score/bin/score -cfg ./score/lib/wsj.cfg -align limsi.ali -ovrall

Note: The "score" program can produce several other reports of greater or lesser detail using other command line switches. See the manual page for "score" for description of it's other uses.

7.4 System Descriptions

As part of the November 1993 CSR Tests, each test site was required to generate a description of the systems used in each hub and spoke test according to a prescribed format. If you intend to publish results using this test material, you should provide such a system description along with your results. The format for the system description is as follows:

	SITE/SYSTEM NAME
	HUB OR SPOKE TEST DESIGNATION

1) PRIMARY TEST SYSTEM DESCRIPTION:

2) ACOUSTIC TRAINING: 

3) GRAMMAR TRAINING:

4) RECOGNITION LEXICON DESCRIPTION:

5) DIFFERENCES FOR EACH CONTRASTIVE TEST:

6) NEW CONDITIONS FOR THIS EVALUATION:

7) REFERENCES:

See the sample results for the LIMSI or CU-HTK systems in the files, "/wsj1/doc/nov93_h[12]/*.txt", on Disc 13-32.1 for examples of completed system descriptions.