Wall Street Journal-based Continuous Speech Recognition (CSR) Corpus
                                Phase II
                                 (WSJ1)

         Training and Development Test Texts and Documentation

                              April 1994


Contents
--------

1.0   Introduction
2.0   Training data
3.0   "Generic" development test data
4.0   Hub and Spoke development test suite
5.0   The ARPA CCCC Hub and Spoke test paradigm
6.0   CD-ROM data distribution
7.0   Directory structure
8.0   Filenaming formats 
9.0   Data types
9.1   Waveforms (.wv?)
9.2   Detailed Orthographic Transcriptions (.dot)
9.3   Lexical SNOR transcriptions (.lsn)
9.4   Prompting texts (.ptx)
10.0  Online documentation


1.0  Introduction
-----------------

These 34 discs contain a corpus of speech collected to facilitate the
development and evaluation of large vocabulary, speaker-independent,
continuous speech recognition systems.  This is the second phase in
the collection of such corpora - a Phase I pilot corpus (WSJ0) was
collected in 1991 and was used in ARPA benchmark tests late in 1991
and in the Fall of 1992.  Collection of this corpus began in the Fall
of 1992 and was completed during the Summer of 1993.  This corpus (in
conjunction with an evaluation test suite, available separately) was
used in the November 1993 ARPA benchmark tests.

Unlike the pilot corpus, WSJ1 contains no verbal punctuation and the
prompting texts for read portions of the corpus have not been
"pre-filtered" to insure unambiguous pronunciations of words.

WSJ1 contains approximately 78,000 training utterances (~73 hours of
speech), 4,000 of which are the result of spontaneous dictation by
journalists with varying degrees of experience in dictation.  The
corpus contains approximately 8,200 (5,000-word and 20,000-word
vocabulary) "generic" development test utterances (~ 8 hours of
speech), 6,800 of which are from spontaneous dictation.  As with WSJ0,
all of the training portion of the corpus was collected using 2
microphones: a Sennheiser close-talking head-mounted microphone, and a
secondary microphone of varying types.

In early 1993, the ARPA CSR Corpus Coordinating Committee (CCCC)
designed a "Hub and Spoke" test paradigm.  Similarly designed
development test and evaluation test suites were collected in
mid-1993.  The Hub and Spoke development test suite is included in
this release.  The Hub and Spoke evaluation test suite is available
separately.  The Hub and Spoke development and evaluation test suites
each contain approximately 7,500 waveforms (~11 hours of speech).

To minimize the storage requirements for such a large corpus, the
waveforms have been compressed using the SPHERE-embedded "Shorten"
compression algorithm which was developed at Cambridge University.
The use of "Shorten" has approximately halved the storage requirements
for WSJ1.

This disc, NIST speech disc 13-34.1, contains all of the prompts,
transcriptions, and documentation for the entire WSJ1 training and
development test corpora.  The MIT Lincoln Laboratory WSJ '87-89
language models have also been included as well as a collation of all
speech waveform file headers and a program to search them.  The disc
also contains indices for each Hub and Spoke development test set as
well as an index for the "standard" WSJ1 training set.  See the
"readme.doc" file in each high-level directory of the disc for more
information.

The collection and publication of the Phase I and Phase II corpora
have been sponsored by the Advanced Research Projects Agency Software
and Intelligent Systems Technology Office (ARPA-SISTO) and the
Linguistic Data Consortium (LDC).  Guidance was provided on the design
of the corpus by the ARPA Continuous Speech Recognition Corpus
Coordinating Committee (CCCC).  MIT Lincoln Laboratory developed the
text selection tools and the WSJ '87-89 language models.  The corpus
was collected at SRI International and produced on CD-ROM by the LDC
and the National Institute of Standards and Technology (NIST).


2.0  Training Data
------------------

The training portion of the corpus consists of read and spontaneous
speech components amounting to approximately 78,000 utterances.
Subjects were recruited by the SRI data collectors to read Wall Street
Journal article paragraphs excerpted from the ACL/DCI CD-ROM.  The
read texts were pseudo-randomly selected using MIT Lincoln
Laboratory's "parselct" and "pargrep" utilities.  A subset of the
subjects, who were journalists with varying degrees of experience in
dictation, also dictated spontanous news articles on various selected
topics.

The training portion of the corpus is apportioned as follows.  The
numbers are approximate since the number of sentences each subject
read for each session was rounded to the nearest paragraph boundary.
For the read WSJ data, the prompts were evenly selected from 5K and 20K
vocabularies.


     Training [77,800 utts]:

          for 200 non-journalist subjects:
               block adaptation - 40 predefined sentences [8,000 utts]
               read WSJ speech - 150 sentences [30,000 utts]

          for 25 non-journalist subjects:
               block adaptation - 40 predefined sentences [1,000 utts]
               read WSJ speech - 1200 sentences [30,000 utts]

          for 20 journalist subjects:
                    block adaptation - 40 predefined sentences [800 utts]
                    read WSJ speech - 200 sentences [4,000 utts]
                    spontaneous speech - 200 sentences (minimum) [4,000 utts]


3.0  "Generic" Development Test Data
------------------------------------

A "generic" development test set was created at the inception of the
corpus and collected with the training data.  However, the CCCC Hub
and Spoke development test suite described below supplants this
corpora and actually makes use of some of it.  The entire original
"generic" development test corpora have been included for completeness
and is described as follows:

     "Generic" Development Test [8,200 utts]: 

          for 10 non-journalist subjects:
                    block adaptation - 40 predefined sentences [400 utts]
                    read WSJ speech - 100 sentences [1,000 utts]
  
          for 20 journalist subjects:
                    block adaptation - 40 predefined sentences [800 utts]
                    read WSJ speech - 100 sentences [2,000 utts]
                    spontaneous speech - 200 sentences (minimum) [4,000 utts]


4.0  Hub and Spoke Development Test Suite
-----------------------------------------

The ARPA CSR Corpus Coordinating Committee (CCCC) designed a "Hub and
Spoke" test paradigm which consists of general "hub" core tests and
optional "spoke" tests to probe specific areas of interest and/or
difficulty.

Two "hub" test sets were designed and speech data was collected for them:

	1.  64,000-word lexicon WSJ read baseline (Sennheiser mic)
	2.  5,000-word lexicon WSJ read baseline (Sennheiser mic)

Nine "spoke" test sets were designed and speech data was collected for
them:
	
	1.  Language model adaptation (Sennheiser mic)
	2.  Domain-independence (Sennheiser mic)
	3.  SI Recognition Outliers - non-native speakers (Sennheiser mic)
	4.  Incremental speaker adaptation (Sennheiser mic)
	5.  Microphone independence (Sennheiser + Second mic of unknown 
            varying type)
	6.  Known alternate microphone (Sennheiser + Audio Technica/telephone)
	7.  Noisy environments (Sennheiser + Audio Technica/telephone)
	8.  Calibrated noise sources (Sennheiser + Audio Technica)
	9.  Spontaneous WSJ-style dictation (Sennheiser mic)

A set of test corpora exists for each hub and spoke test.  Indices for
each test set have been created to indicate the location of the
test data on disc.  The indices are located in the "/wsj1/doc/indices"
directory.


5.0  The ARPA CCCC Hub and Spoke Test Paradigm
----------------------------------------------

The following is the Evaluation paradigm developed by the ARPA CCCC
committee which describes the usage of the Hub and Spoke Development
Test data.

=======================================================================
-------------------------------------------------------------------------------
Final Proposal for the 1993 CSR Evaluation -- Hub and Spoke Paradigm.
-------------------------------------------------------------------------------
Rev 9: 6-10-93

==========
MOTIVATION
==========
This evaluation proposal attempts to accomodate research over a broad variety
of important problems in CSR, to maintain a clear program-wide focus, and to
extract as much information from the results as possible.  It consists of a
compact 'Hub' test, on which every site would evaluate, and a variety of
problem-specific 'Spoke' tests, which would be run at the discretion of the 
participating sites.

===============
GENERAL REMARKS
===============
Participating sites will be asked to commit to evaluate on the appropriate
Hub test set and a specific set of Spoke tests of their choosing.  Firm 
commitments for the Spoke tests will be solicited before the evaluation data
is collected (tentatively scheduled to begin in August '93).

Site commitments are used to control evaluation and to manage evaluation 
resources.  It is imperative that sites honor their commitments in order for 
the evaluation to have beneficial R&D impact.  Sites must notify the CCCC 
chairman as soon as possible, prior to the distribution of the evaluation data,
if it appears that a commitment may not be honored.  Defaulting on a commitment
may jeopardize ARPA support for participation in subsequent evaluations.

Results from all primary conditions (P0) are due at NIST by November 22, 1993.
Results from all contrast conditions (both required and optional) are due at
NIST any time before December 13, 1993.

The 'total required utts' listed below for each test set indicates the number
of utterances that would need to be run to complete the required portion of 
the test.  P0 indicates the primary test condition, CX indicates a contrastive 
test condition, (req) indicates a required condition, and (opt) indicates an 
optional one.

Speakers are balanced for gender in each dataset below.  In total, there
will be only 40 different speakers used in this proposal -- 10 for S3. (SI
Recognition Outliers), 10 for microphone-adaptation in S6. (Known Alternate
Microphone), 10 for the ATIS data in S2. (Domain-Independence), and 10 for all
the rest of the test and rapid enrollment data.  These speaker sets are 
labeled, A (test), B (ATIS), C (outliers), and D (mic-adapt) below.

=======
THE HUB
=======
All sites are required to run on H1.  Sites that can't handle the size of
the H1 test may run on H2.

H1. Read WSJ Baseline.
----------------------
DATA: 10 speakers * 20 utts = 200 utts (500 utts collected)
 64K-word read WSJ data, Sennheiser mic.
CONDITIONS: total required utts = 200 
 P0: (opt) any test paradigm, grammar, and acoustic training.
 C1: (req) Static SI test with standard 20K trigram open-vocab grammar and
           choice of either SI-few or SI-many of both WSJ0 and WSJ1 (37.2K utts).
 C2: (opt) Static SI test with standard 20K bigram open-vocab grammar and
           choice of either SI-few or SI-many of both WSJ0 and WSJ1 (37.2K utts).

H2. 5K-Word Read WSJ Baseline (for sites that can't handle H1).
---------------------------------------------------------------
DATA: 10 speakers * 20 utts = 200 utts (500 utts collected)
 5K-word read WSJ data, Sennheiser mic.
CONDITIONS: total required utts = 200
 P0: (opt) any test paradigm, grammar, and acoustic training.
 C1: (req) Static SI test with standard 5K bigram closed-vocab grammar and
           choice of either SI-few or SI-many subcorpus from WSJ0 (7.2K utts).

==========
THE SPOKES
==========
Sites will commit in advance to evaluate on some number of Spoke tests.  
The number of Spokes supported for the evaluation is expected to shrink to 4-5.
The final set should be determined in early August.

For the 5K vocab test sets (Spokes S3-S8) it is assumed, but not required, that
a 5K closed LM will be used.

The SITE field below is included only to show who might participate if the 
Spoke were supported in the November '93 evaluation.  If a site name includes 
a digit, it indicates priority with 1 being of highest interest.  A ?? mark 
indicates a potential for participation and constitutes a placeholder.
The ARPA PM's ranking is also included in these lists.

When present below, METRICS, indicates that a measure other than the standard
overall word error rate is recommended.  

Spokes S1 through S4 support problems in adaptation.

S1. Language Model Adaptation.
------------------------------
DATA: 4 A spkrs * 1-5 articles (~100 utts) = 400 utts
 Read unfiltered WSJ data from 1990 publications in TIPSTER corpus,
 Sennheiser mic, minimum of 20 sentences per article.
 [NOTE: 1993 WSJ texts may be used for the evaluation]
GOAL: evaluate an incremental LM adaptation algorithm.
CONDITIONS: total required utts = 800
 P0: (req) incremental supervised LM adaptation,
           closed vocab, any LM trained from 1987-89 WSJ0
 C1: (req) S1-P0 system with LM adaptation disabled
 C2: (opt) S1-P0 system with LM and acoustic adaptation disabled
 C3: (opt) incremental supervised LM adaptation with open vocabulary
 C4: (opt) incremental unsupervised LM adaptation
METRICS: standard measures as function of utt context.


S2. Domain-Independence.
------------------------
DATA: 10 B spkrs * 1-3 sessions (~20 utts) = 200 utts (ATIS)
      10 A spkrs * 1 article (~20 utts) = 200 utts (Mercury)
 Sennheiser mic data from ATIS and San Jose Mercury, minimum of 7 queries per
 session from ATIS and 20 sentences per article from Mercury.
GOAL: evaluate techniques for dealing with a domain different from training.
CONDITIONS: total required utts = 800
 P0: (req) any test paradigm, grammar, and acoustic training
           BUT no training whatsoever from the 2 test domains.
 C1: (req) S2-P0 system on H1 data


S3. SI Recognition Outliers.
----------------------------
DATA: 10 C spkrs * 40 utts = 400 utts (test)
      10 C spkrs * 40 utts = 400 utts (rapid enrollment from test speakers)
 5K-word read WSJ data, Sennheiser mic, collected from non-native 
 speakers of American English (British, European, Asian dialects, etc.).
GOAL: evaluate a speaker adaptation algorithm.
CONDITIONS: total required utts = 1200
 P0: (req) some form of speaker adaptation
 C1: (req) S3-P0 system with speaker adaptation disabled
 C2: (req) S3-P0 system on H2 data


S4. Incremental Speaker Adaptation.
-----------------------------------
DATA: 4 A spkrs * 100 utts = 400 utts (test)
      4 A spkrs * 40  utts = 160 utts (rapid enrollment from test speakers)
 5K-word read WSJ data, Sennheiser mic.
GOAL: evaluate an incremental speaker adaptation algorithm.
CONDITIONS: total required utts = 1200
 P0: (req) incremental unsupervised speaker adaptation,
 C1: (req) S4-P0 system with speaker adaptation disabled
 C2: (req) S4-P0 system on H2 data
 C3: (opt) incremental supervised adaptation
 C4: (opt) rapid enrollment speaker adaptation
METRICS: standard measures on each quarter of the data in sequence,
         plus total run time for each condition.


Spokes S5 through S8 support problems in channel and noise compensation.

S5. Microphone-Independence.
----------------------------
DATA: 10 A spkrs * 20 utts = 200 utts (second channel from H2)
 5K-word read WSJ data, from 10 different mics not in training.
GOAL: evaluate an unsupervised channel compensation algorithm.
CONDITIONS: total required utts = 600
 P0: (req) unsupervised channel compensation enabled
 C1: (req) S5-P0 system with compensation disabled
 C2: (req) S5-P0 system on Sennheiser data
 C3: (opt) S5-C1 system on Sennheiser data
METRICS: augment standard with %change between contrasts and primary.


S6. Known Alternate Microphone.
-------------------------------
DATA: 10 A spkrs * 20 utts * 2 mics = 400 utts (test, 2 channels)
      10 D spkrs * 40 utts * 2 mics = 800 utts (mic-adapt, 2 channels)
 5K-word read WSJ data, from an Audio-Technica directional stand-mounted mic
 and telephone handset over external lines, plus stereo mic adaptation data.
GOAL: evaluate a supervised microphone adaptation algorithm.
CONDITIONS: total required utts = 1200
 P0: (req) supervised mic adaptation enabled
 C1: (req) S6-P0 system with mic adaptation disabled
 C2: (req) S6-C1 system on Sennheiser data
METRICS: augment standard with %change between contrasts and primary.


S7. Noisy Environments.
-----------------------
DATA: 10 A spkrs * 10 utts * 2 mics * 2 envs = 400 utts (test, 2 channels)
 5K-word read WSJ data, same 2 secondary mics as in S6, collected in two
 environments with a background A-weighted noise level of about 47-61 dB.
GOAL: evaluate a noise compensation algorithm with known alternate mic.
CONDITIONS: total required utts = 1200
 P0: (req) noise compensation enabled
 C1: (req) S7-P0 system with compensation disabled
 C2: (req) S7-P0 system on Sennheiser data
 C3: (opt) S7-C1 system on Sennheiser data
METRICS: augment standard with %change between contrasts and primary.


S8. Calibrated Noise Sources.
-----------------------------
DATA: 10 A spkrs * 10 utts * 2 sources * 3 levels = 600 utts (test, 2 channels)
 5K-word read WSJ data collected with competing recorded music or talk radio
 in the background at 0, 10, and 20 dB SNR, same stand-mounted mic from S6.
GOAL: evaluate a noise compensation algorithm with known alternate mic.
CONDITIONS: total required utts = 1800
 P0: (req) noise compensation enabled
 C1: (req) S8-P0 system with compensation disabled
 C2: (req) S8-P0 system on Sennheiser data
 C3: (opt) S8-C1 system on Sennheiser data
METRICS: augment standard with %change between contrasts and primary.


S9. Spontaneous WSJ Dictation.
------------------------------
DATA: 10 A speakers * 20 utts = 200 utts
 Spontaneous WSJ-like dictations, Sennheiser mic.
CONDITIONS: total required utts = 400
 P0: (req) any test paradigm, grammar, and acoustic training
 C1: (req) S9-P0 system on H1 data


=======================================================================


6.0  CD-ROM Data Distribution
-----------------------------

This corpus is contained on 34 discs.  Discs 1 - 31 contain waveform
data only.  This disc, 13-34.1, contains the prompts, transcriptions,
language model, and indicies for all of the waveform data on discs 
1 - 31.  Discs 13-32.1 and 13-33.1 contain the data and instructions 
for implementing the ARPA November 1993 CSR Hub and Spoke Benchmark 
Tests and are available separately.

Three text files are included in the root directory of each waveform
disc and contain descriptors for the contents of the disc. 

The file, "<DISC-ID>.dir" contains a list of all directories and 
files on the disc.

The file, "discinfo.txt" and "<DISC-ID>.txt" both contain a high-level
description of the contents of the disc.  The static filename,
"discinfo.txt" is used across all discs; and a variable filename
determined by the disc ID are unique for each disc - this allows
flexibility in using the information.  	

The following is an example of the contents of one of
these sets of files (filenames - discinfo.txt and 13_5_1.txt):

     disc_id: 13_5_1
     data_types: si_tr_s:39:7466
     channel_ids: 1

The first field, "disc_id", identifies the disc number.  The second
field, "data_types", contains entries for subcorpora (directories)
separated by commas with each subfield containing an entry identifying
the subcorpora, number of speakers, and number of waveforms.  The
third field, "channel_ids", contains a comma-separated list of the
channels contained on the disc.  This field normally has a value
value of "1" (Sennheiser) or "2" (Other mic.) for this corpora.


7.0  Directory Structure
------------------------

The following depicts the directory structure of the subcorpora on
different discs.  Subcorpora categories are denoted by the directory
names in level 2.  

top level: wsj1/     (Phase 2 corpus)

2nd level: doc/      (online documentation - on text disc)

     *     trans/    (prompts and transcriptions - on text disc)

           si_tr_s/           (SI, training, 150 WSJ sentences)
           si_tr_l/           (SI, training, 1200 WSJ sentences)
           si_tr_j/           (SI, training, journalists, 200 WSJ sentences)
           si_tr_jd/          (SI, training, journalists, spon. dictation)

           si_dt_20/          (Hub 1 test data)
           si_dt_05/          (Hub 2 test data)
           si_dt_jd/          (Spoke 9 test data)  
           si_dt_s1/wsj/      (Spoke 1 test data)

           si_dt_s2/sjm/      (Spoke 2 test data)
     **            atis/     
           si_dt_s3/non_nat/  (Spoke 3 test data)
           si_dt_s4/inc_adp/  (Spoke 4 test data)
           si_dt_s5/mic_ind/  (Spoke 5 test data)
           si_dt_s6/at_ad/    (Spoke 6 test data)
                    at_te/
                    th_ad/
                    th_te/
           si_dt_s7/at_e1/    (Spoke 7 test data)
                    at_e2/
                    th_e1/
                    th_e2/
           si_dt_s8/mu_0/     (Spoke 8 test data)
                    mu_10/
                    mu_20/
                    tr_0/
                    tr_10/
                    tr_20/

speaker level:  <XXX>/     (speaker-ID, where XXX = "001" to "zzz", base 36)
               
               
data level:  <FILES>    (corpora files, see below for format and types)

    * The directory structure under the "trans" directory on the text disc,
      which contain all of the prompts and transcriptions for the training
      and development test portions of the corpus, matches the directory 
      structure on the waveform discs.  Therefore, the waveforms (on 
      multiple discs) and the texts (on the text disc) can be logically 
      merged by directory and filename.

   ** Note: the Spoke 2 test set contains native ATIS2 data which is formatted
            according to ATIS2 filenaming, waveform, and transcription
            conventions.  Please see the documentation disc for ATIS2
            (CD12-1) in the ATIS2 CD-ROM set for information about the 
            structure and contents of these files.


8.0  Filenaming Formats 
----------------------- 

Data types are differentiated by unique filename extensions.  All
files associated with the same utterance have the same basename.
All Filenames are unique across all WSJ corpora.  Utterance ID's
(basenames) are not be re-used.  The filename format is as follows:

<UTTERANCE-ID>.<XXX> 

where,

     UTTERANCE-ID ::= <SSS><T><EE><UU>

     where,

          SSS ::= 001 | ... | zzz (base-36 speaker ID)
          T ::= (speech type code)
                    c (Common read) |
                    s (Spontaneous) |
                    a (Adaptation read) |
                    x (calibration recording )
                    
          EE ::= 01 | ... | zz (base-36 session ID)
          UU ::= 01 | ... | zz (base-36 within-session sequential speaker
                                utterance code - always "00" for .ptx, .dot 
                                and .lsn session-level files)

          XXX ::= (data type)

               .wv1 (channel 1 - Sennheiser waveform)
               .wv2 (channel 2 - Other mic waveform)

               .ptx (prompting text for read material)
               .dot (detailed orthographic transcription)
               .lsn (Lexical SNOR transcription derived from .dot)


9.0  Data Types
---------------

9.1  Waveforms (.wv?)
---------------------

The waveforms are SPHERE-headered, digitized, and compressed using the
Cambridge University "Shorten" algorithm under SPHERE.  Version 2.1 of
SPHERE has been included in this disc which will permit the waveform
files to be decompressed automatically as they are accessed.  See the
files under the "/sphere" directory.

The filename extension for the waveforms contains the characters,
"wv", followed by a 1-character code to identify the channel.  The
headers contain the following fields/types:

Field                    Type     Description - Probable defaults marked in ()
-----------------------  -------  ---------------------------------------------
speaker_id               string   3-char. speaker ID from filename
speaking_mode            string   speaking mode ("spontaneous","read-common",
                                  "read-adaptation", etc.)  
recording_site           string   recording site ("SRI")
recording_date           string   beginning of recording date stamp of the
                                  form DD-MMM-YYYY.  
recording_time -s11      string   beginning of recording time stamp of the
                                  form HH:MM:SS.HH.  
recording_environment    string   text description of recording environment
microphone               string   microphone description ("Sennheiser HMD-410",
                                  "Crown PCC-160", etc.)
utterance_id             string   utterance ID from filename of the form
                                  SSSTEEUU as described in the filenames 
                                  section above.
prompt_id                string   WSJ source sentence text ID - see .ptx
                                  description below for format (only in read 
                                  data).
database_id              string   database (corpus) identifier ("wsj0" or 
                                  "wsj1")
database_version         string   database (corpus) revision ("1.0")
channel_count            integer  number of channels in waveform ("1")
speaker_session_number   string   2-char. base-36 session ID from filename
sample_count             integer  number of samples in waveform
sample_max               integer  maximum sample value in waveform
sample_min               integer  minimum sample value in waveform
sample_rate              integer  waveform sampling rate ("16000")
sample_n_bytes           integer  number of bytes per sample ("2")
sample_byte_format       string   byte order (MSB/LSB -> "10", LSB/MSB -> "01")
sample_coding            string   waveform encoding ("embedded-wavpack-v1.0")
sample_checksum          integer  checksum obtained by the addition of all
                                  (uncompressed) samples into an unsigned 
                                  16-bit (short) and discarding overflow.  
sample_sig_bits          integer  number of significant bits in each sample
                                  ("16")
session_utterance_number string   2-char. base-36 utterance number within 
                                  session from the filename
end_head                 none     end of header identifier 


9.2  Detailed Orthographic Transcriptions (.dot)
------------------------------------------------

The specifications for the format of the detailed orthographic
transcriptions are located in the file, "dot_spec.doc",
under the "/wsj1/doc" directory.

The transcriptions for all utterances in a session are concatenated
into a single file of the form, "<SSS><T><EE>00.dot" and each
transcription includes a corresponding utterance-ID code.  The format
for a single utterance transcription entry in this table is as
follows:

        <TRANSCRIPTION-TEXT> (<UTTERANCE-ID>)<NEW-LINE>

example:

     The December contract rose one point oh seven cents a pound to sixty eight
     point six two cents at the Chicago Mercantile Exchange (013c020l)


There is one ".dot" file for each speaker-session.


9.3  Lexical SNOR Transcriptions (.lsn)
---------------------------------------

The lexical Standard Normal Orthographic Representation (lexical SNOR)
(.lsn) transcriptions are word-level transcriptions derived from the
".dot" transcriptions with capitalization, non-speech markers,
prosodic markings, fragments, and "\" character escapes filtered out.

The .lsn transcriptions are of the same form as the .dot
transcriptions and will be identified by a ".lsn" filename extension.

example:

     THE DECEMBER CONTRACT ROSE ONE POINT OH SEVEN CENTS A POUND TO SIXTY EIGHT
     POINT SIX TWO CENTS AT THE CHICAGO MERCANTILE EXCHANGE (013C00L)

There is one ".lsn" file for each speaker-session.


9.4  Prompting Texts (.ptx)
---------------------------

The prompting texts for all read Wall Street Journal utterances in a
session including the utterances' utterance-IDs and prompt IDs are
concatenated into a single file of the form, "<SSS><T><EE>00.ptx".
The prompt ID is Doug Paul's Wall Street Journal sentence index.
The format for this index is:

     <YEAR>.<FILE-NUMBER>.<ARTICLE-NUMBER>.<PARAGRAPH-NUMBER>.<SENTENCE-NUMBER>

The format for a single prompting text entry in the .ptx file is as follows:

     <PROMPTING-TEXT> (<UTTERANCE-ID> <PROMPT-ID>)

example:

     The December contract rose one point oh seven cents a pound to sixty-eight
     point six two cents at the Chicago Mercantile Exchange (013c020l
     87.120.871013-0032.14.2)
     

The inclusion of both the utterance ID and prompt ID allows the utterance to be
mapped back to its source sentence text and surrounding paragraph.

There is be one .ptx file for each read speaker-session.


10.0  Online Documentation
--------------------------

In addition to prompts and transcriptions, this disc, NIST Speech Disc
13-34.1, contains online documentation for the WSJ1 corpus.  The
documentation is located under the "wsj1/doc" directory and consists
of training and development test indices, data collection information,
a summary of the CD-ROM distribtion, directories of each CD-ROM,
specifications for the transcription format, collated waveform headers
for the entire corpus, source texts, vocabularies, and a language
model for the read material.