Hub and Spoke Benchmark Tests Corpora and Instructions
NIST Speech Discs 13-32.1, 13-33.1
Public Release, May, 1994
* * * * * * * * * * * * * * *  W A R N I N G  * * * * * * * * * * * * * * * *
*
*  If you intend to implement the protocols for the ARPA November '93 CSR
*  Benchmark Tests, please read this document in its entirety before
*  proceeding and do not examine the included transcriptions, calibration
*  recordings, adaptation recordings, or documentation unless such
*  examination is specifically permitted in the guidelines for the test(s)
*  being run.  Index files have been included which specify the exact data
*  to be used for each test.  To avoid testing on erroneous data, please
*  refer to these files when running the tests.
*
* * * * * * * * * * * * * * *  W A R N I N G  * * * * * * * * * * * * * * * *
This 2-disc set contains the test material for the November 1993 ARPA Continuous Speech Recognition (CSR) Hub and Spoke Benchmark Tests. This material is to be used in conjunction with the WSJ1 training and development test material (NIST speech discs 13-1.1 - 13-31.1, and 13-34.1) which is available separately. These two discs contain the waveforms, prompts, transcriptions, software, and documentation required to implement the tests.
In early 1993, the ARPA CSR Corpus Coordinating Committee (CCCC) designed a "Hub and Spoke" test paradigm. The resulting suite of tests contained 2 general "hub" tests to assess the performance of large (5K vocabulary) and very large (64K vocabulary) speaker-independent continuous speech recognition systems and 9 "spoke" tests to assess the performance of systems designed to address specific areas of research in continuous speech recognition. Contrastive tests were also designed for each Hub and Spoke test. There are 25 contrastive tests for the 11 Hub and Spoke tests.
Speech data to support the similarly-designed development test and evaluation test suites was collected in mid-1993. The Hub and Spoke evaluation test suite speech corpora consists of approximately 7,500 waveforms (~11 hours of speech). To minimize the storage requirements for the corpora, the waveforms have been compressed using the SPHERE-embedded "Shorten" lossless compression algorithm which was developed at Cambridge University. The use of "Shorten" has approximately halved the storage requirements for the corpora. The NIST SPeech HEader REsources (SPHERE) software with embedded Shorten compression is included in the top-level directory of disc 13-32.1 in the "sphere" directory and can be used to decompress and manipulate the waveform files.
Disc 1 in the set (NIST speech disc 13-32.1) contains all of the available online documentation for the corpora on the two discs as well as the MIT Lincoln Laboratory WSJ '87-89 language models, a collation of the speech waveform file headers and a program to search them, and indices for the WSJ1 training corpora and for each Hub and Spoke test set. The NIST speech recognition scoring package and SPHERE toolkit have been included as well in the top-level directory of Disc 13-32.1.
General information files named, "readme.doc", have been included in most high-level directories on Disc 13-32.1 and describe the contents of the directories.
The collection and publication of the test corpora and implementation of the November '93 ARPA CSR Benchmark Tests have been sponsored by the Advanced Research Projects Agency Software and Intelligent Systems Technology Office (ARPA-SISTO) and the Linguistic Data Consortium (LDC). The Hub and Spoke Test paradigm and the form of the associated corpora were designed by the ARPA Continuous Speech Recognition Corpus Coordinating Committee (CCCC). MIT Lincoln Laboratory developed the text selection tools and the WSJ '87-89 language models. The corpora were collected at SRI International and produced on CD-ROM by the National Institute of Standards and Technology (NIST), and the November '93 ARPA CSR Benchmark Tests were administered by NIST.
The ARPA CSR Corpus Coordinating Committee (CCCC) designed a "Hub and Spoke" test paradigm which consists of general "hub" core tests and optional "spoke" tests to probe specific areas of research interest and/or difficulty.
Two "hub" test sets were designed and speech data was collected for them: a 64K-word read WSJ test (H1) and a 5K-word read WSJ test (H2).
Prior to the design of the November '93 Hub and Spoke Evaluation Test suite, the CCCC also designed a comparable Hub and Spoke Development Test suite (which is available with the training portion of WSJ1.) Note, however, that some tests were modified in creating the evaluation test suite. Therefore, there is not a one-to-one match in all tests between the development test suite and the evaluation test suite. In some tests, parameters were changed, and in some cases, contrastive tests were dropped or added.
The following is a copy of the CCCC Hub and Spoke Test Specifications for the November '93 ARPA CSR benchmark tests. It has been edited to remove details pertaining only to the November 1993 evaluation and has been annotated to indicate the appropriate test data index file for each test.
Rev 14: 10-21-93
MOTIVATION
This evaluation proposal attempts to accommodate research over a broad variety of important problems in CSR, to maintain a clear program-wide focus, and to extract as much information as possible from the results. It consists of a compact 'Hub' test, on which every participating site evaluates, and a variety of problem-specific 'Spoke' tests, which are run at the discretion of the sites ("different spokes for different folks").
Speaker Sets
Speakers are balanced for gender in each dataset below. In total, there will be 30 different speakers used for this evaluation -- 10 for S3 (SI Recognition Outliers), 10 for S9 (Spontaneous WSJ Dictation), and 10 for all the rest of the test and rapid enrollment data. These speaker sets are labeled A (test), B (outliers), and C (spontaneous) below. An additional set, labeled D, consists of speakers from the devtest dataset and is to be used for microphone adaptation in Spokes S6, S7, and S8.
Terminology
A 'session' implies that the speaker, microphone, and acoustic environment remain constant for a group of utterances.
A 'static SI' test does not offer session boundaries or utterance order as side information to the system, and therefore implies that the speaker, microphone, and environment may change from utterance to utterance. Functionally, it implies that each utterance must be recognized independently of all others, yielding the same answers for any utterance order (or the same expectation, in the case of a non-deterministic recognizer).
'Unsupervised incremental adaptation' means that the system is allowed to use any information it can extract from test data that it has already recognized. It implies that session boundaries and utterance order are known to the system as side information.
'Supervised incremental adaptation' means that the correct transcription is made available to the system after each utterance has been recognized.
'Transcription-mode adaptation' means that the speech from an entire test session is available to the system prior to recognition. It is a non-causal and unsupervised mode.
Training Data
Unless otherwise noted, there are no restrictions on acoustic or language model training data.
Privately acquired data may be used as long as it can be made available to the LDC in a form suitable for publication as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if ARPA elects to have it published. Delivery of the data to LDC may be done after the evaluation, as long as it is done in a timely fashion.
Use of Calibration Waveforms
Use of all calibration waveforms from the devtest is allowed. However, the calibration data for the evaluation test cannot be used.
Default Side Information
These are the defaults for side information given to the system:

   Speaking style is always known.
   General environment conditions (quiet or noisy) are always known.
   Microphone identity is known unless noted otherwise.

   Speaker gender is always unknown.
   Specific environment conditions (room code) are always unknown.
   Session boundaries and utterance order are unknown unless noted otherwise.
This implies that static SI conditions are the default.
Finally, unless an exception is noted, supervised adaptation and transcription-mode adaptation are not allowed, except as optional contrasts in addition to the required runs.
Legend
Each test below consists of a primary condition and a variable number of
required and optional contrast conditions.
P0 indicates the primary test, CX indicates a contrastive one (X = 1,2,3...).
(req) indicates a required condition, (opt) an optional one.
TEST SIZE indicates the total number of utterances that need to be run to
complete the required portion of a test.
SIDE INFO, where present, indicates only the changes to the default side
information conditions given above.
METRICS, where present, indicates that a measure other than just the standard
overall word error rate is to be used.
SIG TESTS indicates the pair-wise comparisons to be used in the standard set of
statistical significance tests performed by NIST.
Statistical Significance Tests
The standard set of significance tests can be performed across all systems for the controlled HX-C1 tests. This is the only formal comparison possible across all systems. The same tests can be made where appropriate, within each system, comparing each CX contrast test to the P0 primary test.
System Descriptions
Sites should publish a system description for the results published for each system run on any test.
Multiple Systems on a Single Test
In order to discourage the running of several systems on a single test to improve one's chances of scoring well, sites must designate one of them as the primary one if more than one system is run on a single test. The designation is to be made before looking at any results. Results must be reported for all systems run on any test.
Use of Prior Evaluation Data
Evaluation data from the past will not be used for any training, parameter optimization, or frequent development testing. Infrequent sampling of past evaluation test sets, to determine whether a result on development test data will generalize, is considered an acceptable use.
All sites are required to run one Hub test. Sites that can't handle the size of the H1 test may run on H2.
H1. Read WSJ Baseline.
----------------------
GOAL: improve basic SI performance on clean data.

DATA: 10 speakers * 20 utts = 200 utts
      64K-word read WSJ data, Sennheiser mic.

CONDITIONS:
  P0: (opt) any grammar or acoustic training, session boundaries and
      utterance order given as side information.  (INDEX h1_p0.ndx)
  C1: (req) Static SI test with standard 20K trigram open-vocab grammar and
      choice of either short-term or long-term speakers of both WSJ0 and
      WSJ1 (37.2K utts).  (INDEX h1_c1.ndx)
  C2: (opt) Static SI test with standard 20K bigram open-vocab grammar and
      choice of either short-term or long-term speakers of both WSJ0 and
      WSJ1 (37.2K utts).  (INDEX h1_c2.ndx)

SIDE INFO: session boundaries and utterance order are known for H1-P0 only.
SIG TESTS: P0:C1, C1:C2 and a pairwise comparison of all systems for H1-C1.
TEST SIZE: 200 utts

H2. 5K-Word Read WSJ Baseline
-----------------------------
GOAL: improve basic SI performance on clean data.

DATA: 10 speakers * 20 utts = 200 utts
      5K-word read WSJ data, Sennheiser mic.

CONDITIONS:
  P0: (opt) any grammar or acoustic training, session boundaries and
      utterance order given as side information.  (INDEX h2_p0.ndx)
  C1: (req) Static SI test with standard 5K bigram closed-vocab grammar and
      choice of either short-term or long-term speakers from WSJ0
      (7.2K utts).  (INDEX h2_c1.ndx)

SIDE INFO: session boundaries and utterance order are known for H2-P0 only.
SIG TESTS: P0:C1 and a pairwise comparison of all systems for H2-C1.
TEST SIZE: 200 utts
For the 5K vocab test sets (Spokes S3-S8) it is assumed, but not required, that a 5K closed LM will be used. A de facto standard 5K closed bigram and trigram have been contributed by MIT Lincoln Laboratory for use by any participating site.
Spokes S1-S4 support problems in adaptation.
S1. Language Model Adaptation.
------------------------------
GOAL: evaluate an incremental supervised LM adaptation algorithm on a
problem of sublanguage adaptation.

DATA: 4 A spkrs * 1-5 articles (~100 utts) = 400 utts
      Read unfiltered WSJ data from 1990 publications in the TIPSTER corpus,
      Sennheiser mic, minimum of 20 sentences per article.

CONDITIONS:
  P0: (req) incremental supervised LM adaptation, closed vocabulary, any LM
      trained from 1987-89 WSJ0 texts  (INDEX s1_p0.ndx)
  C1: (req) S1-P0 system with LM adaptation disabled  (INDEX s1_c1.ndx)
  C2: (opt) incremental unsupervised LM adaptation  (INDEX s1_c2.ndx)

SIDE INFO: session boundaries and utterance order are known
SIG TESTS: P0:C1, optionally P0:C2, C2:C1
NOTE: the sign test and the Wilcoxon signed-rank test will not be done
TEST SIZE: 800 utts
METRICS: Partition the data into 4 equal parts distinguished by length of
context (e.g. 0-5 sents, 6-10 sents, 11-20 sents, 20+ sents).  For each
part, report the standard measure and perplexity.

S2. Domain-Independence.
------------------------
GOAL: evaluate techniques for dealing with a newspaper domain different
from training.

DATA: 10 A spkrs * 1 article (~20 utts) = 200 utts
      Sennheiser mic data from the San Jose Mercury, minimum of 20 sentences
      per article.

CONDITIONS:
  P0: (req) any grammar or acoustic training BUT no training whatsoever from
      the Mercury, nor any use of the knowledge of the paper's identity.
      (INDEX s2_p0.ndx)
  C1: (req) S2-P0 system on H1 data  (INDEX s2_c1.ndx)
  C2: (req) H1-C1 system on S2 data  (INDEX s2_c2.ndx)

SIDE INFO: session boundaries and utterance order are known
SIG TESTS: H1-P0:C1, P0:C2
TEST SIZE: 600 utts

S3. SI Recognition Outliers.
----------------------------
GOAL: evaluate a rapid enrollment speaker adaptation algorithm on difficult
speakers.

DATA: 10 B spkrs * 40 utts = 400 utts (test)
      10 B spkrs * 40 utts = 400 utts (rapid enrollment from S3 speakers,
                                       used for S3-P0)
      10 A spkrs * 40 utts = 400 utts (rapid enrollment from Hub speakers,
                                       used for S3-C2)
      5K-word read WSJ data, Sennheiser mic, collected from non-native
      speakers of American English (British, European, Asian dialects, etc.).

CONDITIONS:
  P0: (req) rapid enrollment speaker adaptation  (INDEX s3_p0.ndx)
  C1: (req) S3-P0 system with speaker adaptation disabled  (INDEX s3_c1.ndx)
  C2: (req) S3-P0 system on H2 data  (INDEX s3_c2.ndx)
  C3: (opt) incremental unsupervised adaptation  (INDEX s3_c3.ndx)

SIDE INFO: speaker identity is known for P0, C1, and C2; session boundaries
and utterance order are known for C3.
SIG TESTS: P0:C1, H2-P0:C2, optionally P0:C3, C1:C3
TEST SIZE: 1000 utts

S4. Incremental Speaker Adaptation.
-----------------------------------
GOAL: evaluate an incremental speaker adaptation algorithm.

DATA: 4 A spkrs * 100 utts = 400 utts (test)
      4 A spkrs * 40 utts = 160 utts (rapid enrollment from A speakers in S3)
      5K-word read WSJ data, Sennheiser mic.

CONDITIONS:
  P0: (req) incremental unsupervised speaker adaptation  (INDEX s4_p0.ndx)
  C1: (req) S4-P0 system with speaker adaptation disabled  (INDEX s4_c1.ndx)
  C2: (opt) incremental supervised adaptation  (INDEX s4_c2.ndx)
  C3: (opt) rapid enrollment speaker adaptation  (INDEX s4_c3.ndx)

SIDE INFO: for all conditions: session boundaries and utterance order are
known; additionally for C2: the correct transcription is known after the fact.
SIG TESTS: P0:C1, optionally P0:C2, P0:C3
NOTE: the sign test and the Wilcoxon signed-rank test will not be done
TEST SIZE: 800 utts
METRICS: standard measure on each quarter of the data in sequence, plus the
ratio: total_runtime(S4-P0)/total_runtime(S4-C1).

Spokes S5-S8 support problems in channel and noise compensation.

S5. Microphone-Independence.
----------------------------
GOAL: evaluate an unsupervised channel compensation algorithm.

DATA: 10 A spkrs * 20 utts = 200 utts (2 channels, same speech as H2)
      5K-word read WSJ data, 10 different mics not in training or
      development test.
NOTE: No speech from the test microphones can be used.

CONDITIONS:
  P0: (req) unsupervised channel compensation enabled on wv2 data
      (INDEX s5_p0.ndx)
  C1: (req) S5-P0 system with compensation disabled on wv2 data
      (INDEX s5_c1.ndx)
  C2: (req) S5-P0 system on Sennheiser (wv1) data  (INDEX s5_c2.ndx)
  C3: (opt) S5-C1 system on Sennheiser (wv1) data  (INDEX s5_c3.ndx)

SIDE INFO: Microphone identities are not known
SIG TESTS: P0:C1, P0:C2, C1:C2, optionally P0:C3, C1:C3, C2:C3
TEST SIZE: 600 utts

S6. Known Alternate Microphone.
-------------------------------
GOAL: evaluate a known microphone adaptation algorithm.

DATA: 10 A spkrs * 20 utts * 2 mics = 400 utts (test, 2 channels)
      10 D spkrs * 40 utts * 2 mics = 800 utts (mic-adaptation from devtest,
                                                2 channels)
      5K-word read WSJ data, from an Audio-Technica directional stand-mounted
      mic and a telephone handset over external lines, plus stereo mic
      adaptation data.
NOTE: the 800 stereo microphone adaptation utterances will come from the
devtest and are the only data from the target mics that are allowed.

CONDITIONS:
  P0: (req) supervised mic adaptation enabled on wv2 data  (INDEX s6_p0.ndx)
  C1: (req) S6-P0 system with mic adaptation disabled on wv2 data
      (INDEX s6_c1.ndx)
  C2: (req) S6-C1 system on Sennheiser (wv1) data  (INDEX s6_c2.ndx)

SIDE INFO: Microphone identities are known.  Use of the stereo mic-adaptation
data will be allowed for the S6-P0 condition only
SIG TESTS: P0:C1, P0:C2, C1:C2
TEST SIZE: 1200 utts
METRICS: Separate error rates will be reported for each mic.

S7. Noisy Environments.
-----------------------
GOAL: evaluate a noise compensation algorithm with known alternate mic.

DATA: 10 A spkrs * 10 utts * 2 mics * 2 envs = 400 utts (test, 2 channels)
      5K-word read WSJ data, same 2 secondary mics as in S6, collected in two
      environments with a background A-weighted noise level of about
      55-68 dB.
NOTE: the 800 stereo microphone adaptation utterances will come from the
devtest and are the only data from the target mics that is allowed.  The
only data available for adaptation to the environment will be from the S7
Spoke of the devtest data.

CONDITIONS:
  P0: (req) noise compensation enabled on wv2 data  (INDEX s7_p0.ndx)
  C1: (req) S7-P0 system with compensation disabled on wv2 data
      (INDEX s7_c1.ndx)
  C2: (req) S7-P0 system on Sennheiser (wv1) data  (INDEX s7_c2.ndx)

SIDE INFO: Microphone identities are known.  Use of the stereo
environment-adaptation data will be allowed for the S7-P0 condition only
SIG TESTS: P0:C1, P0:C2, C1:C2
TEST SIZE: 1200 utts
METRICS: Separate error rates will be reported for each mic/environment pair.

S8. Calibrated Noise Sources.
-----------------------------
GOAL: evaluate a noise compensation algorithm with known alternate mic on
data corrupted with calibrated noise sources.

DATA: 10 A spkrs * 10 utts * 2 sources * 3 levels = 600 utts (test, 2 channels)
      5K-word read WSJ data collected with competing recorded music or talk
      radio in the background at 0, 10, and 20 dB SNR, using the
      Audio-Technica directional stand-mounted mic from S6.
NOTE: the 400 stereo microphone adaptation utterances will come from the
devtest and are the only data from the target mic that is allowed.

CONDITIONS:
  P0: (req) noise compensation enabled on wv2 data  (INDEX s8_p0.ndx)
  C1: (req) S8-P0 system with compensation disabled on wv2 data
      (INDEX s8_c2.ndx)
  C2: (req) S8-P0 system on Sennheiser (wv1) data  (INDEX s8_c2.ndx)
  C3: (opt) S8-C1 system on Sennheiser (wv1) data  (INDEX s8_c3.ndx)

SIDE INFO:
SIG TESTS: P0:C1, P0:C2, C1:C2 and optionally P0:C3, C1:C3, C2:C3
TEST SIZE: 1800 utts
METRICS: Separate error rates will be reported for each source/level pair.

S9. Spontaneous WSJ Dictation.
------------------------------
GOAL: improve basic SI performance on spontaneous dictation-style speech.

DATA: 10 C speakers * 20 utts = 200 utts
      Spontaneous WSJ-like dictations (business news stories), Sennheiser mic.

CONDITIONS:
  P0: (req) any grammar or acoustic training  (INDEX s9_p0.ndx)
  C1: (req) S9-P0 system on H1 data  (INDEX s9_c1.ndx)
  C2: (req) H1-C1 system on S9 data  (INDEX s9_c2.ndx)

SIG TESTS: H1-P0:C1, P0:C2
TEST SIZE: 600 utts
In addition to online documentation, Disc 13-32.1 contains software packages useful in processing the speech corpora and tabulating speech recognition scores. The top-level directory of Disc 13-32.1 contains the following major subdirectories:
hgrep/      Utility to search a collated SPHERE header contents file.

wsj1/       Test corpora and documentation.

score/      NIST speech recognition scoring software.  Includes dynamic
            string-alignment scoring code and statistical significance tests.

sphere/     NIST SPeech HEader REsources toolkit.  Provides command-line and
            programmer interface to NIST-headered speech waveform files.
            Also provides for automatic decompression of Shorten-compressed
            WSJ1 waveform files.

tranfilt/   Directory containing a UNIX shell script used to perform the
            post-adjudication transcription filter process.

General information files named "readme.doc" have been included in each of
the high-level directories and throughout the documentation directory
("wsj1/doc") on Disc 13-32.1 and describe the contents of the directories.
Three text files are included in the root directory of each of the 2
discs; these files contain descriptors for the contents of the discs.
The file, "
The following is an example of the contents of one of these sets of files
(filenames - discinfo.txt and 13_33_1.txt):
3.1 Directory Structure
The following depicts the directory structure of the corpora on the
two discs:
data level: <FILES> (corpora files, see below for format and types)
3.2 Filenaming Formats
The filenames and filetypes follow standard CSR WSJ1 conventions.
Data types are differentiated by unique filename extensions. All
files associated with the same utterance have the same basename. All
filenames are unique across all WSJ corpora. Speech waveform (.wv?)
files are utterance-level files and prompt (.ptx), transcription (.dot
and .lsn), and SPQA (.spq) files are session-level files and,
therefore, contain texts for multiple waveform files. The filename
format is as follows:
<UTTERANCE-ID>.<XXX>
where,
3.3.1 Waveforms (.wv?)
The waveforms are SPHERE-headered, digitized, and compressed using the
lossless Cambridge University "Shorten" algorithm under SPHERE.
Version 2.1 of SPHERE has been included in this disc set and permits
the waveform files to be decompressed automatically as they are
accessed.  See the files under the "/sphere" directory on Disc
13-32.1.
The filename extension for the waveforms contains the characters,
"wv", followed by a 1-character code to identify the channel. The
headers contain the following fields/types:
A detailed orthographic transcription (.dot) containing lexical and
non-lexical elements has been generated for each utterance. The
specifications for the format of the detailed orthographic
transcriptions are located in the file, "dot_spec.doc", under the
"/wsj1/doc" directory on Disc 13-32.1.
The transcriptions for all utterances in a session are concatenated
into a single file of the form, "<SSS><T><EE>00.dot" and each
transcription includes a corresponding utterance-ID code. The format
for a single utterance transcription entry in this table is as
follows:
(new-line added for readability)
The .dot transcriptions for the test corpora were corrected during an
adjudication process which followed the November 1993 tests. Only the
corrected transcriptions are included on these discs.
3.3.3 Lexical SNOR Transcriptions (.lsn)
The lexical Standard Normal Orthographic Representation (lexical SNOR)
(.lsn) transcriptions are word-level transcriptions derived from the
".dot" transcriptions with capitalization, non-speech markers,
prosodic markings, fragments, and "\" character escapes filtered out.
The .lsn transcriptions are of the same form as the .dot
transcriptions and will be identified by a ".lsn" filename extension.
example:
(new-line added for readability)
The .dot (and derivative .lsn) transcriptions for the test corpora
were corrected during an adjudication process which followed the
November 1993 tests. Only the corrected transcriptions are included
on these discs.
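As a rough illustration of the .dot-to-.lsn mapping described above, the
following Python sketch is NOT the official filter (the authoritative rules
are given in "dot_spec.doc" under "wsj1/doc").  It assumes bracketed
non-speech/prosodic markers (e.g. [breath]), "\"-escaped characters, and
hyphen-final word fragments, and simply uppercases whatever remains:

import re

def dot_to_lsn(dot_text):
    # Assumed conventions, for illustration only: markers in square brackets,
    # "\" character escapes, and word fragments ending in "-".
    text = re.sub(r"\[[^\]]*\]", " ", dot_text)   # drop non-speech/prosodic markers
    text = text.replace("\\", "")                 # drop "\" character escapes
    words = [w for w in text.split() if not w.endswith("-")]   # drop fragments
    return " ".join(words).upper()                # SNOR text is uppercased

print(dot_to_lsn("Speculation in Tokyo was that the yen "
                 "could rise because of the realignment"))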
3.3.4 Prompting Texts (.ptx)
The prompting texts for all read Wall Street Journal utterances in a
session including the utterances' utterance-IDs and prompt IDs are
concatenated into a single file of the form, "<SSS><T><EE>00.ptx".
The prompt ID is Doug Paul's Wall Street Journal sentence index.
The format for this index is:
(new-line added for readability)
There is one .ptx file for each read speaker-session.
3.3.5 SPeech Quality Assurance Reports (.spq)
The data collectors at SRI screened all of the test data using the
NIST SPeech Quality Assurance (SPQA) software. A SPQA report was
generated for each speaker-session and is included in files of the
form, "<SSS><T><EE>00.spq".
The SPQA software scans digitized speech waveforms for signal defects
and anomalies. The version used, SPQA 2.2, scanned the test waveform
files and determined the peak speech power, mean noise power,
signal-to-noise ratio (SNR), and ratio of speech duration to total
recording duration. The software also checked for DC bias, clipping,
and 60-Hz EM-interference hum.
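The sketch below is in no way the SPQA 2.2 algorithm; it only illustrates,
under stated assumptions, the kinds of checks described above.  It expects a
non-empty sequence of 16-bit PCM sample values that has already been
decompressed (for example, with the included SPHERE/"Shorten" tools):

import math

def quick_signal_checks(samples, full_scale=32767, frame=320):   # 320 samples = 20 ms at 16 kHz
    n = len(samples)
    dc_bias = sum(samples) / float(n)                             # mean offset from zero
    clipped = sum(1 for s in samples if abs(s) >= full_scale)     # samples at full scale
    # Very rough speech/noise split: compare the loudest and quietest frame deciles.
    energies = sorted(sum(x * x for x in samples[i:i + frame]) / float(frame)
                      for i in range(0, n - frame, frame))
    k = max(1, len(energies) // 10)
    noise = sum(energies[:k]) / float(k)
    speech = sum(energies[-k:]) / float(k)
    snr_db = 10.0 * math.log10(speech / noise) if noise > 0 else float("inf")
    return {"dc_bias": dc_bias, "clipped_samples": clipped, "approx_snr_db": snr_db}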
3.4 Online Documentation
In addition to prompts and transcriptions, Disc 13-32.1 contains
online documentation for the test corpora. The documentation is
located under the "wsj1/doc" directory and consists of Hub and Spoke
evaluation test indices, data collection information, a summary of the
CD-ROM distribution, directories of each CD-ROM, specifications for
the transcription format, collated waveform headers for the test
corpora, source texts, vocabularies, and language models for the read
WSJ material.
PLEASE NOTE: IF YOU INTEND TO COMPLY WITH THE RULES OF THE NOVEMBER
1993 ARPA CSR HUB AND SPOKE EVALUATION TESTS IN IMPLEMENTING TESTS ON
THIS MATERIAL AT YOUR SITE, THE ONLINE DOCUMENTATION SHOULD NOT BE
EXAMINED PRIOR TO RUNNING TESTS UNLESS IT IS SPECIFICALLY PERMITTED BY
THE TEST GUIDELINES UNDER SECTION 2.0.
3.5 Indices
Index files have been built for each of the 36 hub and spoke tests to
indicate the corpora to be used in each test. The files are located
in the "wsj1/doc/indices" directory on Disc 13-32.1 and are named so
as to clearly indicate the tests they pertain to (e.g., "h1_p0.ndx").
Index files for required baseline training conditions have also been
included. Each index file contains a header which describes the
test/training set. Header lines are all preceded by ";;". Each line
following the header indicates the disc, path, and waveform file for
an utterance in the test set (e.g.,
SPECIAL NOTE: Each test-set corpus directory contains slightly more
data than is specified in the CCCC Hub and Spoke Test specifications
in Sections 2.0 and 4.0. This extra data was collected so as to avoid
truncating WSJ or SJM news articles. To minimize the burden of
processing the already bulky Spoke 8 data, the index file for each of
the Spoke 8 tests has been edited to include exactly 600 utterances
(as specified in the CCCC Hub and Spoke Test specs) rather than the
full 710 utterances which have been collected and are included on disc
13-32.1. Only these 600 utterances should be tested on.
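As a convenience, the following Python sketch (a hypothetical helper, not
part of the NIST software) reads one of the index files described above and
returns the (disc, path) pairs for the waveforms in a test set:

def read_ndx(ndx_path):
    entries = []
    with open(ndx_path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith(";;"):
                continue                          # skip the descriptive header lines
            disc, _, wav_path = line.partition(":")
            entries.append((disc, wav_path))      # e.g. ("13_33_1", "wsj1/si_et_h1/...")
    return entries

utts = read_ndx("wsj1/doc/indices/h1_p0.ndx")
print(len(utts), "waveforms listed")              # should match the H1 TEST SIZE (200 utts)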
The following is a copy of the CCCC specifications for the collection of
the corpora to support the November 1993 ARPA CSR Hub and Spoke Tests.
It is included to provide the details of the test data collection structure
for those interested, and is not required reading for those who intend to
implement the tests.
This document specifies the contents and dimensions for each of the 2 Hub
and 9 Spoke tests that have been approved for the 1993 CSR evaluation.
Here are the overall numbers for this proposal (including the Hub and S9).
The changes in waveforms, relative to the devtest, are listed here:
Note that all procedures defined for the devtest will hold for the evaltest
as well, including those defined for Spoke 8 (Calibrated Noise Sources).
In addition, Spoke 8 will require calibration waveforms from the Sennheiser
and A-weighted sound level measurements at the subject's position.
Also note that there will be a total of 30 speakers required for the evaltest.
Ten of these are to be journalist speakers for the spontaneous spoke, S9.
For completeness, rapid enrollment data for these speakers has been added.
Finally, note that the 10 alternate mics for Spoke S5 need to be different
than any used previously for training or development test.
There are two tests in the Hub, differentiated by vocabulary, that evaluate
basic SI recognition technology. Additionally, there are nine specific
problem areas supported by the Spoke tests.
Test Datasets
In each dataset below the speakers should be balanced for gender.
In total, there should be 30 different speakers used to satisfy the
requirements of the entire spec -- 10 non-native speakers for S3 (SI
Recognition Outliers), 10 for S9 (Spontaneous WSJ Dictation) and 10 for all
the rest of the eval test and rapid enrollment data. These speaker sets are
labeled, A (devtest), B (non-native), and C (spontaneous) below.
Note that H1 and S5 share the wv1 channel data, S3 and S4 share rapid
enrollment data, and the mic-adaptation data for S6 will come from the
existing devtest data. A table of all data segments required in this
spec can be made by extracting all lines containing the string 'DATA:'.
This should yield totals of 5000 test wavs (3400 utts) and 1200 rapid
enrollment wavs (1200 utts).
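For example, the extraction suggested above can be done with a few lines of
Python (the filename "hub_spoke_spec.txt" is a hypothetical local copy of
this specification):

with open("hub_spoke_spec.txt") as f:
    data_lines = [line.rstrip() for line in f if "DATA:" in line]
print("\n".join(data_lines))      # one line per required data segment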
Read Text Sources
The 5K-word devtest data specified here is to be generated from the existing
text pools constructed for the WSJ0 pilot corpus. The 64K-word devtest data
specified here is to be generated from the existing text pools constructed for
the WSJ0 pilot that were used to support the 20K standard grammar conditions.
Prompting texts should be sampled randomly from the appropriate text pool for
each speaker/condition combination.
Microphone Documentation
All microphones used should be documented according to the guidelines approved
by the CCCC and Data-Quality Committee. These guidelines are reproduced here:
For each microphone used in an ARPA-sponsored common corpus, there should
be the following documentation:
Each environment used should be documented in the following three ways:
Differences Between Eval and Dev Test Data
Every speaker, room, and microphone must be physically different between these
two test corpora.
The only exception to this is the 'sortation' lab environment specified in S7,
which will be used again. The position of the subject and the mix of
machinery used should be varied to be different than that used in the devtest.
H1. Read WSJ Baseline.
Read 64K-word WSJ data, Sennheiser mic.
H1 DATA: 10 A spkrs * 20 utts = 200 test wavs (200 utts)
H2. 5K-Word Read WSJ Baseline
Read 5K-word WSJ data, Sennheiser mic.
This data is from the wv1 channel in Spoke S5.
S1. Language Model Adaptation.
Read unfiltered 1990 WSJ data (in TIPSTER corpus), Sennheiser mic.
Whole articles selected at random, minimum of 20 sentences per article.
Texts should come from the set-aside portion of the devtest pool at LDC.
S1 DATA: 4 A spkrs * 1-5 articles (~100 utts) = 400 test wavs (400 utts)
S2. Domain-Independence.
Read unfiltered data from the San Jose Mercury, Sennheiser mic.
Whole articles selected at random, minimum of 20 sentences per article.
Texts should come from the set-aside portion of the devtest pool at LDC.
S2 DATA: 10 A spkrs * 1 article (~20 utts) = 200 test wavs (200 utts)
S3. SI Recognition Outliers.
5K-word read WSJ data, Sennheiser mic, collected from non-native speakers of
American English (British, European, Asian dialects, etc.).
Native speakers with very marked dialects are also OK.
The rapid adapt data for the A speakers is to satisfy condition S3-C2.
It will also be used for S4.
S4. Incremental Speaker Adaptation.
5K-word read WSJ data, Sennheiser mic.
Rapid adaptation data for these speakers will come from S3.
S4 DATA: 4 A spkrs * 100 utts = 400 test wavs (400 utts)
S5. Microphone-Independence.
5K-word read WSJ data, from up to 10 different mics that are not in training
or in any previous development test data.
Stereo Sennheiser test data will also be collected for contrastive tests.
The wv1 channel data from this Spoke will be used in the Hub test, H2.
S5 DATA: 10 A spkrs * 20 utts * 2 chans = 400 test wavs (200 utts)
S6. Known Alternate Microphone.
5K-word read WSJ data, from 2 secondary mics -- the Audio-Technica UniPoint
AT853a stand-mounted mic and the AT&T 712 speaker phone handset mic.
The Audio-Technica mic should sit between the console and the keyboard and
project about 8 inches forward from the front surface of the console toward
the subject, so that the tip of the mic is about 8-18 inches from the
subject's mouth.
The telephone handset data is to be collected over external phone lines.
Stereo Sennheiser test data will also be collected for contrastive tests.
The stereo mic adaptation data for this Spoke will come from the devtest.
S6 DATA: 10 A spkrs * 20 utts * 2 mics * 2 chans = 800 test wavs (400 utts)
S7. Noisy Environments.
5K-word read WSJ data, same 2 secondary mics as in S6, collected in two
environments with a background A-weighted sound level of about 55-68 dB.
One of these environments will be the 'sortation' lab that was used for the
devtest, but here the subject location and machinery mix will be changed.
The other environment will be new and at the lower end of the range.
Stereo Sennheiser data will be collected for comparative tests.
S7 DATA: 10 A spkrs * 10 utts * 2 mics * 2 envs * 2 chans = 800 test wavs
S7 DATA: (400 utts)
S8. Calibrated Noise Sources.
5K-word read WSJ data collected with competing recorded music or talk radio
playing in the background at 3 SNR levels (0, 10 and 20 dB) recorded through
the Audio-Technica stand-mounted mic specified in S6.
The noise source should be positioned to the side of the subject and mic at
a comfortable (arm's reach) distance on the workstation desktop.
The environment should be a typical quiet office.
Noise levels are to be set by determining rough A-weighted sound levels through
the A-T mic necessary to yield approximate desired SNR levels.
The calibration procedure should be identical to that used for the devtest.
This procedure is documented in the file, noise-calibration.procedure, that
was delivered along with the devtest data.
In addition, A-weighted sound level of the noise source should be measured at
the subject's ear closest to the noise source.
Each speaker will produce 30 utts, 10 each at 0, 10 and 20 dB SNR.
A stereo Sennheiser channel will also be collected for contrastive tests.
SNR measurements should be included for both channels.
S8 DATA: 10 A spkrs * 10 utts * 2 sources * 3 levels * 2 chans = 1200 test wavs
S8 DATA: (600 utts)
S9. Spontaneous WSJ Dictation.
Spontaneous WSJ-like dictations (business news stories) from journalists,
Sennheiser mic.
If it appears that procuring 10 journalist subjects will not be possible
within 30 days, then each of the first subjects should collect 40 utts
in case the total number of subjects falls short of 10.
S9 DATA: 10 C spkrs * ~20 utts per dictation = 200 test wavs (200 utts)
S9 DATA: 10 C spkrs * 40 utts = 400 rapid enrollment wavs (400 utts)
This section provides background for those who intend to duplicate the
test and scoring conditions for the November 1993 tests. This section
begins by describing how the November 1993 ARPA CSR Hub and Spoke
Benchmark Tests were conducted. It then covers how the November 1993
scoring protocols were implemented, and concludes with sample output
and scored results from two of the participants in the November 1993
tests.
5.1 Test Data Distribution
The test material was distributed on 2 recordable CD-ROMs to
the sites participating in the tests on November 1, 1993. Documentation
similar to that which is included on this disc was distributed on the
recordable CD-ROMs and via anonymous ftp and email.
5.2 Test Protocols
The tests were conducted according to the protocols specified in the
ARPA CCCC document: "Specification for the 1993 CSR Evaluation -- Hub
and Spoke Paradigm", Rev 14: 10-21-93, which was originally
distributed in email and is included in this document in Section 2.0.
5.3 Initial Scoring
Results (recognition output) for the Hub Primary tests (H1, H2 (P0)) and Spoke
Primary (S1 - S9 (P0)) tests were due at NIST on November 22, 1993. This
allowed the test sites 3 weeks to process the test corpora and package
the results for scoring at NIST. The sites were permitted to run each test
only once, and results received after the 3-week deadline were marked as
"late".
The results of the initial scoring run by NIST were made available to the
sites on December 8, 1993.
The balance of the test results for the contrastive tests (H1,H2, S1 -
S9 (C*)) were due at NIST on December 13.
5.4 Adjudication
During the 1-month period between December 8, 1993 and January 7, 1994,
sites were permitted to "contest" the transcriptions used in scoring
their recognition output. Specially-designed bug report forms were
used to submit requests for "adjudication" to NIST. The
adjudication requests generally asked for transcription modifications
to correct transcription errors or to suggest transcription
alternatives.
NIST "adjudicators" considered each request and made decisions on
whether or not to modify the transcription(s) in question. Of the
transcriptions which were revised, most were the result of judgements
by the adjudicators that the transcriptions contained words which
could have multiple orthographic representations or which were
lexically ambiguous. In many of these cases, both the original
transcription and an alternative transcription were permitted. This
was implemented by mapping alternate word forms to a single form in
both the transcriptions and the recognized strings. The remaining
revisions were the result of corrections to simple transcription
errors.
5.5 Final Scoring
After the adjudication was completed and the final revisions were made
to the transcriptions, a final "official" scoring run was made on
all the test results and the final scores were reported to the
test participants via email/ftp on January 18, 1994.
The official results of the November 1993 ARPA CSR Hub and Spoke Benchmark
Tests were published in the proceedings of the ARPA Human Language Technology
Workshop, March 8-11, 1994.
5.6 LIMSI and CU-HTK November 1993 Sample Output and Scores
In order to provide sample input/output for calibrating the NIST
scoring software and a point of comparison for those who are new to
the ARPA/NIST CSR tests, LIMSI and Cambridge University have consented
to the inclusion of their test output and scored results for one of
the hub tests in this disc set.
The output for the LIMSI Hub-1, Contrast-1 (WSJ 20K open vocabulary,
trigram language model, 37.2K WSJ1 training utterances) is located in
the directory, "wsj1/doc/nov93_h1" on Disc 13-32.1.
The output for the Cambridge University/HTK Hub-2, Contrast-1 (WSJ 5K closed
vocabulary, bigram language model, 7,200 WSJ0 training utterances)
is located in the directory, "wsj1/doc/nov93_h2" on Disc 13-32.1.
Some of the tests in the CCCC Hub and Spoke test specifications in
Section 2.0 call for the use of "standard" baseline training sets.
These training sets are drawn from two major collections of corpora:
The CSR pilot corpus, WSJ0 (NIST speech discs 11-1.1 - 11-12.1) and
the CSR Phase II corpus, WSJ1 (NIST speech discs 13-1.1 - 13-34.1).
Indices have been developed for the following baseline training sets
and are located in the subdirectories under the directory,
"wsj1/doc/indices", on Disc 13-32.1:
wsj0/train: Indices for the WSJ0 (~7,200-utterance) Sennheiser training sets
tr_l_wv1.ndx WSJ0 SD/SI-long term training, Sennheiser mic
wsj1/train: Indices for the WSJ1 (~30,000-utterance) Sennheiser training sets
tr_l_wv1.ndx WSJ1 SI-long term training, Sennheiser mic
Some of the tests in the CCCC Hub and Spoke test specifications in
Section 2.0 call for the use of "standard" baseline bigram or trigram
language models. The baseline language models were developed by
MIT Lincoln Laboratory and are included in the directory,
"wsj1/doc/lng_modl/base_lm", on Disc 13-32.1. The baseline language
models are as follows:
Note: VP language models and 5K open and 20K closed trigram language
models are not used in these tests. All VP bigram language models are
included in the "base_lm" directory for completeness. 5K open and 20K
closed trigram language models were not available for inclusion.
This section describes the process used by NIST in scoring the November 1993
WSJ/CSR Hub and Spoke tests. The information in this section can also be used
by those who wish to duplicate the scoring methodology used.
New to the CSR tests in 1993 was an official adjudication period and the
allowance of some alternative transcriptions in scoring the results. To
implement the limited set of alternatives (see Section 5.4), multiple hypothesis and
reference transcriptions were mapped to a single representation in a
"prefiltering" step which occurred before the actual scoring. Therefore, no
modifications were made to the scoring software itself.
To implement the prefiltering process, first, the hypothesized transcriptions
generated by the recognition systems were mapped to single representations
according to rules defined by the adjudicators. After the hypothesized
transcripts were prefiltered, they were aligned and scored using the NIST
scoring package.
For a complete description of the NIST scoring package and its use, see the
file, "score/doc/score.rdm", on Disc 13-32.1.
7.1 Preparation of Hypothesized Transcripts
The system-generated hypothesized transcripts were formatted by the test sites
according to the Lexical SNOR (LSN) format used by the scoring package. See
Section 3.3.3 for a description of the LSN format.
7.2 Transcription Prefiltering
A utility to apply the mapping rules described in Section 7.0 has been
included on Disc 13-32.1 in the top-level directory, "tranfilt". The directory
contains a "readme.doc" file with compilation and installation instructions.
In the November 1993 tests, the hypothesized transcripts were prefiltered using
the Bourne Shell script, "nov93flt.sh". A copy of this script has been
included in the "tranfilt" directory above. The script operates as a simple
UNIX filter that reads the hypothesis transcriptions from "stdin" and writes
the filtered transcriptions to "stdout". The format for using the utility is
as follows:
where: INSTALL_DIR is the pathname of the compiled 'tranfilt'
directory.
In the November 1993 tests, the filtered hypothesized transcriptions were
scored using the standard NIST scoring package described in Section 7.0. The
scoring package has been included on Disc 13-32.1 in the top-level directory,
"score". The directory contains a "readme.doc" file with compilation and
installation instructions.
In order to score a hypothesis transcription against a reference transcription
using the NIST scoring software, an "alignment" file must be created. The
alignment file contains pairs of hypothesis and reference strings which
have been aligned using a DP string alignment procedure. The format for using
the "align" utility is:
where:
CONFIG_FILE contains a list of arguments to the
scoring software including the filespec for the
reference transcription file, lexicon file, and
command line switches.
HYP_FILE contains the LSN-formatted hypothesis
transcriptions.
ALIGNMENTS contains the output alignments.
./score/bin/align -cfg ./score/lib/wsj.cfg -hyp limsi_filt.hyp
-outfile limsi.ali
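For readers unfamiliar with DP string alignment, the following toy Python
sketch counts word errors with unit insertion/deletion/substitution costs.
It is NOT the NIST "align" implementation and ignores its weights and
options; it is included only to illustrate the idea:

def word_error_counts(ref_words, hyp_words):
    R, H = len(ref_words), len(hyp_words)
    # cost[i][j] = minimum edits to turn the first i reference words
    # into the first j hypothesis words.
    cost = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(1, R + 1):
        cost[i][0] = i
    for j in range(1, H + 1):
        cost[0][j] = j
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
            cost[i][j] = min(cost[i - 1][j - 1] + sub,   # correct or substitution
                             cost[i - 1][j] + 1,         # deletion
                             cost[i][j - 1] + 1)         # insertion
    return cost[R][H]

ref = "SPECULATION IN TOKYO WAS THAT THE YEN COULD RISE".split()
hyp = "SPECULATION IN TOKYO WAS THE YEN WOULD RISE".split()
print("word errors:", word_error_counts(ref, hyp), "of", len(ref), "reference words")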
The actual tabulation of the scores is generated using the above alignment file
as input. The "score" program with the "-ovrall" switch creates a by-speaker
summary table of the error rates (insertions, deletions, etc.). The format
for using the "score" utility is:
where:
CONFIG_FILE contains a list of arguments to the
scoring software including the filespec for the
reference transcription file, lexicon file, and
command line switches.
ALIGNMENTS contains the output alignments.
REPORT_OPTIONS are switches to generate different
reports.
./score/bin/score -cfg ./score/lib/wsj.cfg -align limsi.ali -ovrall
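The filter, align, and score steps described in Sections 7.2 and 7.3 can, of
course, be driven from a script.  The Python sketch below is a hedged
example only; it assumes the "tranfilt" and "score" packages have been
compiled under the current directory, and the file names "limsi.hyp",
"system.filt.hyp", and "system.ali" are hypothetical:

import subprocess

def score_hypotheses(raw_hyp="limsi.hyp", cfg="./score/lib/wsj.cfg"):
    filt_hyp, alignments = "system.filt.hyp", "system.ali"
    # 1. Prefilter the hypothesis transcriptions (Section 7.2).
    with open(raw_hyp) as fin, open(filt_hyp, "w") as fout:
        subprocess.run(["./tranfilt/nov93flt.sh"], stdin=fin, stdout=fout, check=True)
    # 2. Align the filtered hypotheses against the references (Section 7.3).
    subprocess.run(["./score/bin/align", "-cfg", cfg,
                    "-hyp", filt_hyp, "-outfile", alignments], check=True)
    # 3. Tabulate the by-speaker summary report.
    subprocess.run(["./score/bin/score", "-cfg", cfg,
                    "-align", alignments, "-ovrall"], check=True)

score_hypotheses()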
Note: The "score" program can produce several other reports of greater or
lesser detail using other command line switches. See the manual page for
"score" for a description of its other uses.
7.4 System Descriptions
As part of the November 1993 CSR Tests, each test site was required to
generate a description of the systems used in each hub and spoke test
according to a prescribed format. If you intend to publish results
using this test material, you should provide such a system description
along with your results. The format for the system description is as
follows:
discid: 13_33_1
data_types: si_et_h1:10:223, si_et_h2:10:635, si_et_s1:4:428, \
si_et_s2:10:214, si_et_s3:10:836, si_et_s4:4:418, si_et_s5:10:225, \
si_et_s6:10:908, si_et_s7:10:984, si_et_s8:10:2020
channel_ids: 1,2
The first field, "disc_id", identifies the disc number. The second
field, "data_types", contains entries for subcorpora (directories)
separated by commas with each subfield containing an entry identifying
the subcorpora, number of speakers, and number of waveforms. The
third field, "channel_ids", contains a comma-separated list of the
channels contained on the disc. This field normally has a value of
"1" (Sennheiser) or "2" (Other mic.) for this corpus.
top level: wsj1/ (Phase 2 corpus)
2nd level: doc/ (online documentation [disc 13-32.1 only])
                / si_et_h1/wsj64k/  (Hub 1 - 64K vocabulary test data)
                |
                | si_et_h2/wsj5k/   (Hub 2 - 5K vocabulary test data)
                |
                | si_et_s1/wsj/     (Spoke 1 - language model adaptation WSJ
                |                    test data)
                |
                | si_et_s2/sjm/     (Spoke 2 - domain-indep. San Jose Mercury
                |                    test data)
                |
                | si_et_s3/non_nat/ (Spoke 3 - non-native speakers test data)
                |
  Disc          | si_et_s4/inc_adp/ (Spoke 4 - incremental speaker adaptation
  13-33.1 ----> |                    test data)
                |
                | si_et_s5/mic_ind/ (Spoke 5 - microphone independence test
                |                    data)
                |
                | si_et_s6/         (Spoke 6 - known alternate mic test data:)
                |   at_te/            (Audio Technica mic)
                |   th_te/            (telephone handset)
                |
                | si_et_s7/         (Spoke 7 - noisy environments test data:)
                |   at_e1/            (Audio Technica mic, noise environment 1)
                |   at_e2/            (Audio Technica mic, noise environment 2)
                |   th_e1/            (telephone handset, noise environment 1)
                |   th_e2/            (telephone handset, noise environment 2)
                |
                | si_et_s8/         (Spoke 8 - calibrated noise sources:)
                |   mu_0/             (competing music, 0 dB SNR)
                |   mu_10/            (competing music, 10 dB SNR)
                |   mu_20/            (competing music, 20 dB SNR)
                |   tr_0/             (competing talk radio, 0 dB SNR)
                |   tr_10/            (competing talk radio, 10 dB SNR)
                \   tr_20/            (competing talk radio, 20 dB SNR)

  Disc
  13-32.1 --->    si_et_s9/journ/   (Spoke 9 - spontaneous WSJ-style dictation)
speaker level: <XXX>/ (speaker-ID, where XXX = "001" to "zzz", base 36)
UTTERANCE-ID ::= <SSS><T><EE><UU>
where,
SSS ::= 001 | ... | zzz (base-36 speaker ID)
T ::= (speech type code)
c (Common read) |
s (Spontaneous) |
a (Adaptation read) |
x (calibration recording)
EE ::= 01 | ... | zz (base-36 session ID)
UU ::= 01 | ... | zz (base-36 within-session sequential speaker
utterance code - always "00" for .ptx, .dot
and .lsn session-level files)
XXX ::= (data type)
.wv1 (channel 1 - Sennheiser waveform)
.wv2 (channel 2 - Other mic waveform)
.ptx (prompting text for read material)
.dot (detailed orthographic transcription)
.lsn (Lexical SNOR transcription derived from .dot)
.spq (output from SPeech Quality Assurance software)
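A minimal Python sketch (hypothetical helper, not part of the distribution)
that splits a WSJ1 filename into the fields defined above:

import os

SPEECH_TYPES = {"c": "common read", "s": "spontaneous",
                "a": "adaptation read", "x": "calibration recording"}

def parse_wsj1_filename(path):
    base, ext = os.path.splitext(os.path.basename(path))
    if len(base) != 8:
        raise ValueError("expected an 8-character SSSTEEUU basename: %r" % base)
    return {"speaker_id": base[0:3],                             # SSS (base 36)
            "speaker_no": int(base[0:3], 36),                    # numeric form of SSS
            "speech_type": SPEECH_TYPES.get(base[3], base[3]),   # T
            "session_id": base[4:6],                             # EE (base 36)
            "utterance_no": base[6:8],                           # UU ("00" = session-level file)
            "data_type": ext.lstrip(".")}                        # wv1, wv2, ptx, dot, lsn, spq

print(parse_wsj1_filename("wsj1/si_et_h1/wsj64k/4oa/4oac0201.wv1"))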
3.3 Data Types
Field Type Description - Probable defaults marked in ()
----------------------- ------- ---------------------------------------------
microphone string microphone description ("Sennheiser HMD410",
"Crown PCC160", etc.)
recording_site string recording site ("SRI")
database_id string database (corpus) identifier ("wsj1")
database_version string database (corpus) revision ("1.0")
recording_environment string text description of recording environment
speaker_session_number string 2-char. base-36 session ID from filename
session_utterance_number string 2-char. base-36 utterance number within
session from the filename
prompt_id string WSJ source sentence text ID - see .ptx
description below for format (only in read
data).
utterance_id string utterance ID from filename of the form
SSSTEEUU as described in the filename
section above.
speaking_mode string speaking mode ("spontaneous","read-common",
"read-adaptation", etc.)
speaker_id string 3-char. speaker ID from filename
sample_count integer number of samples in waveform
sample_min integer minimum sample value in waveform
sample_max integer maximum sample value in waveform
sample_checksum integer checksum obtained by the addition of all
(uncompressed) samples into an unsigned
16-bit (short) and discarding overflow.
recording_date string beginning of recording date stamp of the
form DD-MMM-YYYY.
recording_time string beginning of recording time stamp of the
form HH:MM:SS.HH.
channel_count integer number of channels in waveform ("1")
sample_rate integer waveform sampling rate ("16000")
sample_n_bytes integer number of bytes per sample ("2")
sample_byte_format string byte order (MSB/LSB -> "10", LSB/MSB -> "01")
sample_sig_bits integer number of significant bits in each sample
("16")
sample_coding string waveform encoding ("pcm,embedded-shorten-v1.09")
end_head
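The SPHERE header is plain text in the "NIST_1A" convention: the first line
is "NIST_1A", the second gives the header size in bytes, and each following
line holds a field name, a type code, and a value, ending at "end_head".
The Python sketch below (not the SPHERE library itself) reads the fields
listed above from a waveform file; the filename used is a hypothetical
local copy:

def read_nist_header(path):
    with open(path, "rb") as f:
        magic = f.readline().strip()                             # b"NIST_1A"
        if magic != b"NIST_1A":
            raise ValueError("not a SPHERE-headered file: %s" % path)
        header_size = int(f.readline().decode("ascii").strip())  # e.g. 1024
        f.seek(0)
        header = f.read(header_size).decode("ascii", "replace")
    fields = {}
    for line in header.splitlines()[2:]:
        if not line.strip():
            continue
        if line.strip() == "end_head":
            break
        # e.g. "sample_rate -i 16000" or "database_id -s4 wsj1"
        name, ftype, value = line.split(None, 2)
        fields[name] = int(value) if ftype == "-i" else value
    return fields

hdr = read_nist_header("4oac0201.wv1")                           # hypothetical local copy
print(hdr.get("sample_coding"), hdr.get("sample_count"), hdr.get("speaker_id"))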
3.3.2 Detailed Orthographic Transcriptions (.dot)
<TRANSCRIPTION-TEXT> (<UTTERANCE-ID>)<NEW-LINE>
example:
Speculation in Tokyo was that the yen could rise because of the
realignment (4oac0201)
There is one ".dot" file for each speaker-session.
SPECULATION IN TOKYO WAS THAT THE YEN COULD RISE BECAUSE OF THE
REALIGNMENT (4OAC0201)
There is one ".lsn" file for each speaker-session.
<YEAR>.<FILE-NUMBER>.<ARTICLE-NUMBER>.<PARAGRAPH-NUMBER>.<SENTENCE-NUMBER>
The format for a single prompting text entry in the .ptx file is as follows:
<PROMPTING-TEXT> (<UTTERANCE-ID> <PROMPT-ID>)
example:
Speculation in Tokyo was that the yen could rise because of the
realignment. (4oac0201 87.051.870113-0174.6.1)
The inclusion of both the utterance ID and prompt ID allows the utterance to be
mapped back to its source sentence text and surrounding paragraph.
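A minimal Python sketch (hypothetical helper) that splits a single .dot,
.lsn, or .ptx entry into its text and the ID(s) in the trailing parentheses:

import re

ENTRY = re.compile(r"^(?P<text>.*)\((?P<ids>[^()]+)\)\s*$")

def parse_entry(line):
    m = ENTRY.match(line.strip())
    if not m:
        raise ValueError("unrecognized entry: %r" % line)
    ids = m.group("ids").split()
    return {"text": m.group("text").strip(),
            "utterance_id": ids[0],
            "prompt_id": ids[1] if len(ids) > 1 else None}   # present in .ptx entries only

print(parse_entry("Speculation in Tokyo was that the yen could rise "
                  "because of the realignment. (4oac0201 87.051.870113-0174.6.1)"))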
"13_33_1:wsj1/si_et_h1/wsj64k/4oa/4oac0201.wv1"). Note that auxiliary
files such as calibration recordings and adaptation utterances which
are not part of the test set are not included in the indices.
4.0 Hub and Spoke Test Data Specifications
Specification of Evaluation Test Data in Support of the Hub and Spoke Paradigm
for the November 1993 ARPA-sponsored CSR Evaluation.
Rev 4: 9-7-93
Introduction
TOTAL EVALTEST DATA PROPOSED:
5000 wavs 3400 utts test
1200 wavs 1200 utts rapid enrollment
--------- ---------
6200 wavs 4600 utts grand totals for the evaltest
Compare to:
6760 wavs 4360 utts grand totals for the devtest
So the total number of waveforms required is smaller than the total required
for the 8-Spoke devtest (560 fewer wavs), but the number of utterances is
larger (240 utts more).
+ 200 test for H1
+ 400 rapid enrollment for S3
- 160 rapid enrollment for S4
- 1600 stereo mic adaptation for S6
+ 200 test for S9
+ 400 rapid enrollment for S9
- 560 total reduction
GENERAL REMARKS
Environment Documentation
Note the environment calibration waveforms will not be distributed before the
test. They are included here for documentation purposes only.
THE HUB
THE SPOKES
S3 DATA: 10 B spkrs * 40 utts = 400 test wavs (400 utts)
S3 DATA: 10 B spkrs * 40 utts = 400 rapid enrollment wavs (400 utts)
S3 DATA: 10 A spkrs * 40 utts = 400 rapid enrollment wavs (400 utts)
5.0 November 1993 CSR Test Overview
6.0 Baseline Training and Language Model Data
tr_s_wv1.ndx WSJ0 SI-short term training, Sennheiser mic
tr_s_wv1.ndx WSJ1 SI-short term training, Sennheiser mic
bcb05cnp.z   5K closed NVP bigram LM
bcb05onp.z   5K open NVP bigram LM
bcb20cnp.z   20K closed NVP bigram LM
bcb20onp.z   20K open NVP bigram LM
tb05cnp.z    5K closed NVP trigram LM
tb20onp.z    20K open NVP trigram LM

To conserve disc space, the language model files have been compressed
using the standard UNIX "compress" utility and must be decompressed to
be used.
7.0 Test Scoring
<INSTALL_DIR>/tranfilt/nov93flt.sh < system.hyp > system.filt.hyp
Example using the sample LIMSI data in the "wsj1/doc/nov93_h1" directory,
where "tranfilt" is located under the current directory:
./tranfilt/nov93flt.sh < 13_32.1:/wsj1/doc/nov93_h1/limsi.hyp > limsi1_filt.hyp
7.3 Scoring Results
<INSTALL_DIR>/bin/align -cfg <CONFIG_FILE> -hyp <HYP_FILE>
-outfile <ALIGNMENTS>
Example using the example output from "tranfilt" in Section 7.2, where
"score" is located under the current directory:
INSTALL_DIR is the pathname to the compiled "score"
directory.
<INSTALL_DIR>/bin/score -cfg <CONFIG_FILE> -align <ALIGNMENTS>
<REPORT_OPTIONS>
Example using the example output from "align" above, where "score" is located
under the current directory:
INSTALL_DIR is the pathname to the compiled "score"
directory.
SITE/SYSTEM NAME
HUB OR SPOKE TEST DESIGNATION
1) PRIMARY TEST SYSTEM DESCRIPTION:
2) ACOUSTIC TRAINING:
3) GRAMMAR TRAINING:
4) RECOGNITION LEXICON DESCRIPTION:
5) DIFFERENCES FOR EACH CONTRASTIVE TEST:
6) NEW CONDITIONS FOR THIS EVALUATION:
7) REFERENCES:
See the sample results for the LIMSI or CU-HTK systems in the files,
"/wsj1/doc/nov93_h[12]/*.txt" on Disc 13_32.1 for examples of
completed system descriptions.