Wall Street Journal-based Continuous Speech Recognition (CSR) Corpus Phase II (WSJ1) Training and Development Test Texts and Documentation April 1994 Contents -------- 1.0 Introduction 2.0 Training data 3.0 "Generic" development test data 4.0 Hub and Spoke development test suite 5.0 The ARPA CCCC Hub and Spoke test paradigm 6.0 CD-ROM data distribution 7.0 Directory structure 8.0 Filenaming formats 9.0 Data types 9.1 Waveforms (.wv?) 9.2 Detailed Orthographic Transcriptions (.dot) 9.3 Lexical SNOR transcriptions (.lsn) 9.4 Prompting texts (.ptx) 10.0 Online documentation 1.0 Introduction ----------------- These 34 discs contain a corpus of speech collected to facilitate the development and evaluation of large vocabulary, speaker-independent, continuous speech recognition systems. This is the second phase in the collection of such corpora - a Phase I pilot corpus (WSJ0) was collected in 1991 and was used in ARPA benchmark tests late in 1991 and in the Fall of 1992. Collection of this corpus began in the Fall of 1992 and was completed during the Summer of 1993. This corpus (in conjunction with an evaluation test suite, available separately) was used in the November 1993 ARPA benchmark tests. Unlike the pilot corpus, WSJ1 contains no verbal punctuation and the prompting texts for read portions of the corpus have not been "pre-filtered" to insure unambiguous pronunciations of words. WSJ1 contains approximately 78,000 training utterances (~73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 (5,000-word and 20,000-word vocabulary) "generic" development test utterances (~ 8 hours of speech), 6,800 of which are from spontaneous dictation. As with WSJ0, all of the training portion of the corpus was collected using 2 microphones: a Sennheiser close-talking head-mounted microphone, and a secondary microphone of varying types. In early 1993, the ARPA CSR Corpus Coordinating Committee (CCCC) designed a "Hub and Spoke" test paradigm. Similarly designed development test and evaluation test suites were collected in mid-1993. The Hub and Spoke development test suite is included in this release. The Hub and Spoke evaluation test suite is available separately. The Hub and Spoke development and evaluation test suites each contain approximately 7,500 waveforms (~11 hours of speech). To minimize the storage requirements for such a large corpus, the waveforms have been compressed using the SPHERE-embedded "Shorten" compression algorithm which was developed at Cambridge University. The use of "Shorten" has approximately halved the storage requirements for WSJ1. This disc, NIST speech disc 13-34.1, contains all of the prompts, transcriptions, and documentation for the entire WSJ1 training and development test corpora. The MIT Lincoln Laboratory WSJ '87-89 language models have also been included as well as a collation of all speech waveform file headers and a program to search them. The disc also contains indices for each Hub and Spoke development test set as well as an index for the "standard" WSJ1 training set. See the "readme.doc" file in each high-level directory of the disc for more information. The collection and publication of the Phase I and Phase II corpora have been sponsored by the Advanced Research Projects Agency Software and Intelligent Systems Technology Office (ARPA-SISTO) and the Linguistic Data Consortium (LDC). Guidance was provided on the design of the corpus by the ARPA Continuous Speech Recognition Corpus Coordinating Committee (CCCC). MIT Lincoln Laboratory developed the text selection tools and the WSJ '87-89 language models. The corpus was collected at SRI International and produced on CD-ROM by the LDC and the National Institute of Standards and Technology (NIST). 2.0 Training Data ------------------ The training portion of the corpus consists of read and spontaneous speech components amounting to approximately 78,000 utterances. Subjects were recruited by the SRI data collectors to read Wall Street Journal article paragraphs excerpted from the ACL/DCI CD-ROM. The read texts were pseudo-randomly selected using MIT Lincoln Laboratory's "parselct" and "pargrep" utilities. A subset of the subjects, who were journalists with varying degrees of experience in dictation, also dictated spontanous news articles on various selected topics. The training portion of the corpus is apportioned as follows. The numbers are approximate since the number of sentences each subject read for each session was rounded to the nearest paragraph boundary. For the read WSJ data, the prompts were evenly selected from 5K and 20K vocabularies. Training [77,800 utts]: for 200 non-journalist subjects: block adaptation - 40 predefined sentences [8,000 utts] read WSJ speech - 150 sentences [30,000 utts] for 25 non-journalist subjects: block adaptation - 40 predefined sentences [1,000 utts] read WSJ speech - 1200 sentences [30,000 utts] for 20 journalist subjects: block adaptation - 40 predefined sentences [800 utts] read WSJ speech - 200 sentences [4,000 utts] spontaneous speech - 200 sentences (minimum) [4,000 utts] 3.0 "Generic" Development Test Data ------------------------------------ A "generic" development test set was created at the inception of the corpus and collected with the training data. However, the CCCC Hub and Spoke development test suite described below supplants this corpora and actually makes use of some of it. The entire original "generic" development test corpora have been included for completeness and is described as follows: "Generic" Development Test [8,200 utts]: for 10 non-journalist subjects: block adaptation - 40 predefined sentences [400 utts] read WSJ speech - 100 sentences [1,000 utts] for 20 journalist subjects: block adaptation - 40 predefined sentences [800 utts] read WSJ speech - 100 sentences [2,000 utts] spontaneous speech - 200 sentences (minimum) [4,000 utts] 4.0 Hub and Spoke Development Test Suite ----------------------------------------- The ARPA CSR Corpus Coordinating Committee (CCCC) designed a "Hub and Spoke" test paradigm which consists of general "hub" core tests and optional "spoke" tests to probe specific areas of interest and/or difficulty. Two "hub" test sets were designed and speech data was collected for them: 1. 64,000-word lexicon WSJ read baseline (Sennheiser mic) 2. 5,000-word lexicon WSJ read baseline (Sennheiser mic) Nine "spoke" test sets were designed and speech data was collected for them: 1. Language model adaptation (Sennheiser mic) 2. Domain-independence (Sennheiser mic) 3. SI Recognition Outliers - non-native speakers (Sennheiser mic) 4. Incremental speaker adaptation (Sennheiser mic) 5. Microphone independence (Sennheiser + Second mic of unknown varying type) 6. Known alternate microphone (Sennheiser + Audio Technica/telephone) 7. Noisy environments (Sennheiser + Audio Technica/telephone) 8. Calibrated noise sources (Sennheiser + Audio Technica) 9. Spontaneous WSJ-style dictation (Sennheiser mic) A set of test corpora exists for each hub and spoke test. Indices for each test set have been created to indicate the location of the test data on disc. The indices are located in the "/wsj1/doc/indices" directory. 5.0 The ARPA CCCC Hub and Spoke Test Paradigm ---------------------------------------------- The following is the Evaluation paradigm developed by the ARPA CCCC committee which describes the usage of the Hub and Spoke Development Test data. ======================================================================= ------------------------------------------------------------------------------- Final Proposal for the 1993 CSR Evaluation -- Hub and Spoke Paradigm. ------------------------------------------------------------------------------- Rev 9: 6-10-93 ========== MOTIVATION ========== This evaluation proposal attempts to accomodate research over a broad variety of important problems in CSR, to maintain a clear program-wide focus, and to extract as much information from the results as possible. It consists of a compact 'Hub' test, on which every site would evaluate, and a variety of problem-specific 'Spoke' tests, which would be run at the discretion of the participating sites. =============== GENERAL REMARKS =============== Participating sites will be asked to commit to evaluate on the appropriate Hub test set and a specific set of Spoke tests of their choosing. Firm commitments for the Spoke tests will be solicited before the evaluation data is collected (tentatively scheduled to begin in August '93). Site commitments are used to control evaluation and to manage evaluation resources. It is imperative that sites honor their commitments in order for the evaluation to have beneficial R&D impact. Sites must notify the CCCC chairman as soon as possible, prior to the distribution of the evaluation data, if it appears that a commitment may not be honored. Defaulting on a commitment may jeopardize ARPA support for participation in subsequent evaluations. Results from all primary conditions (P0) are due at NIST by November 22, 1993. Results from all contrast conditions (both required and optional) are due at NIST any time before December 13, 1993. The 'total required utts' listed below for each test set indicates the number of utterances that would need to be run to complete the required portion of the test. P0 indicates the primary test condition, CX indicates a contrastive test condition, (req) indicates a required condition, and (opt) indicates an optional one. Speakers are balanced for gender in each dataset below. In total, there will be only 40 different speakers used in this proposal -- 10 for S3. (SI Recognition Outliers), 10 for microphone-adaptation in S6. (Known Alternate Microphone), 10 for the ATIS data in S2. (Domain-Independence), and 10 for all the rest of the test and rapid enrollment data. These speaker sets are labeled, A (test), B (ATIS), C (outliers), and D (mic-adapt) below. ======= THE HUB ======= All sites are required to run on H1. Sites that can't handle the size of the H1 test may run on H2. H1. Read WSJ Baseline. ---------------------- DATA: 10 speakers * 20 utts = 200 utts (500 utts collected) 64K-word read WSJ data, Sennheiser mic. CONDITIONS: total required utts = 200 P0: (opt) any test paradigm, grammar, and acoustic training. C1: (req) Static SI test with standard 20K trigram open-vocab grammar and choice of either SI-few or SI-many of both WSJ0 and WSJ1 (37.2K utts). C2: (opt) Static SI test with standard 20K bigram open-vocab grammar and choice of either SI-few or SI-many of both WSJ0 and WSJ1 (37.2K utts). H2. 5K-Word Read WSJ Baseline (for sites that can't handle H1). --------------------------------------------------------------- DATA: 10 speakers * 20 utts = 200 utts (500 utts collected) 5K-word read WSJ data, Sennheiser mic. CONDITIONS: total required utts = 200 P0: (opt) any test paradigm, grammar, and acoustic training. C1: (req) Static SI test with standard 5K bigram closed-vocab grammar and choice of either SI-few or SI-many subcorpus from WSJ0 (7.2K utts). ========== THE SPOKES ========== Sites will commit in advance to evaluate on some number of Spoke tests. The number of Spokes supported for the evaluation is expected to shrink to 4-5. The final set should be determined in early August. For the 5K vocab test sets (Spokes S3-S8) it is assumed, but not required, that a 5K closed LM will be used. The SITE field below is included only to show who might participate if the Spoke were supported in the November '93 evaluation. If a site name includes a digit, it indicates priority with 1 being of highest interest. A ?? mark indicates a potential for participation and constitutes a placeholder. The ARPA PM's ranking is also included in these lists. When present below, METRICS, indicates that a measure other than the standard overall word error rate is recommended. Spokes S1 through S4 support problems in adaptation. S1. Language Model Adaptation. ------------------------------ DATA: 4 A spkrs * 1-5 articles (~100 utts) = 400 utts Read unfiltered WSJ data from 1990 publications in TIPSTER corpus, Sennheiser mic, minimum of 20 sentences per article. [NOTE: 1993 WSJ texts may be used for the evaluation] GOAL: evaluate an incremental LM adaptation algorithm. CONDITIONS: total required utts = 800 P0: (req) incremental supervised LM adaptation, closed vocab, any LM trained from 1987-89 WSJ0 C1: (req) S1-P0 system with LM adaptation disabled C2: (opt) S1-P0 system with LM and acoustic adaptation disabled C3: (opt) incremental supervised LM adaptation with open vocabulary C4: (opt) incremental unsupervised LM adaptation METRICS: standard measures as function of utt context. S2. Domain-Independence. ------------------------ DATA: 10 B spkrs * 1-3 sessions (~20 utts) = 200 utts (ATIS) 10 A spkrs * 1 article (~20 utts) = 200 utts (Mercury) Sennheiser mic data from ATIS and San Jose Mercury, minimum of 7 queries per session from ATIS and 20 sentences per article from Mercury. GOAL: evaluate techniques for dealing with a domain different from training. CONDITIONS: total required utts = 800 P0: (req) any test paradigm, grammar, and acoustic training BUT no training whatsoever from the 2 test domains. C1: (req) S2-P0 system on H1 data S3. SI Recognition Outliers. ---------------------------- DATA: 10 C spkrs * 40 utts = 400 utts (test) 10 C spkrs * 40 utts = 400 utts (rapid enrollment from test speakers) 5K-word read WSJ data, Sennheiser mic, collected from non-native speakers of American English (British, European, Asian dialects, etc.). GOAL: evaluate a speaker adaptation algorithm. CONDITIONS: total required utts = 1200 P0: (req) some form of speaker adaptation C1: (req) S3-P0 system with speaker adaptation disabled C2: (req) S3-P0 system on H2 data S4. Incremental Speaker Adaptation. ----------------------------------- DATA: 4 A spkrs * 100 utts = 400 utts (test) 4 A spkrs * 40 utts = 160 utts (rapid enrollment from test speakers) 5K-word read WSJ data, Sennheiser mic. GOAL: evaluate an incremental speaker adaptation algorithm. CONDITIONS: total required utts = 1200 P0: (req) incremental unsupervised speaker adaptation, C1: (req) S4-P0 system with speaker adaptation disabled C2: (req) S4-P0 system on H2 data C3: (opt) incremental supervised adaptation C4: (opt) rapid enrollment speaker adaptation METRICS: standard measures on each quarter of the data in sequence, plus total run time for each condition. Spokes S5 through S8 support problems in channel and noise compensation. S5. Microphone-Independence. ---------------------------- DATA: 10 A spkrs * 20 utts = 200 utts (second channel from H2) 5K-word read WSJ data, from 10 different mics not in training. GOAL: evaluate an unsupervised channel compensation algorithm. CONDITIONS: total required utts = 600 P0: (req) unsupervised channel compensation enabled C1: (req) S5-P0 system with compensation disabled C2: (req) S5-P0 system on Sennheiser data C3: (opt) S5-C1 system on Sennheiser data METRICS: augment standard with %change between contrasts and primary. S6. Known Alternate Microphone. ------------------------------- DATA: 10 A spkrs * 20 utts * 2 mics = 400 utts (test, 2 channels) 10 D spkrs * 40 utts * 2 mics = 800 utts (mic-adapt, 2 channels) 5K-word read WSJ data, from an Audio-Technica directional stand-mounted mic and telephone handset over external lines, plus stereo mic adaptation data. GOAL: evaluate a supervised microphone adaptation algorithm. CONDITIONS: total required utts = 1200 P0: (req) supervised mic adaptation enabled C1: (req) S6-P0 system with mic adaptation disabled C2: (req) S6-C1 system on Sennheiser data METRICS: augment standard with %change between contrasts and primary. S7. Noisy Environments. ----------------------- DATA: 10 A spkrs * 10 utts * 2 mics * 2 envs = 400 utts (test, 2 channels) 5K-word read WSJ data, same 2 secondary mics as in S6, collected in two environments with a background A-weighted noise level of about 47-61 dB. GOAL: evaluate a noise compensation algorithm with known alternate mic. CONDITIONS: total required utts = 1200 P0: (req) noise compensation enabled C1: (req) S7-P0 system with compensation disabled C2: (req) S7-P0 system on Sennheiser data C3: (opt) S7-C1 system on Sennheiser data METRICS: augment standard with %change between contrasts and primary. S8. Calibrated Noise Sources. ----------------------------- DATA: 10 A spkrs * 10 utts * 2 sources * 3 levels = 600 utts (test, 2 channels) 5K-word read WSJ data collected with competing recorded music or talk radio in the background at 0, 10, and 20 dB SNR, same stand-mounted mic from S6. GOAL: evaluate a noise compensation algorithm with known alternate mic. CONDITIONS: total required utts = 1800 P0: (req) noise compensation enabled C1: (req) S8-P0 system with compensation disabled C2: (req) S8-P0 system on Sennheiser data C3: (opt) S8-C1 system on Sennheiser data METRICS: augment standard with %change between contrasts and primary. S9. Spontaneous WSJ Dictation. ------------------------------ DATA: 10 A speakers * 20 utts = 200 utts Spontaneous WSJ-like dictations, Sennheiser mic. CONDITIONS: total required utts = 400 P0: (req) any test paradigm, grammar, and acoustic training C1: (req) S9-P0 system on H1 data ======================================================================= 6.0 CD-ROM Data Distribution ----------------------------- This corpus is contained on 34 discs. Discs 1 - 31 contain waveform data only. This disc, 13-34.1, contains the prompts, transcriptions, language model, and indicies for all of the waveform data on discs 1 - 31. Discs 13-32.1 and 13-33.1 contain the data and instructions for implementing the ARPA November 1993 CSR Hub and Spoke Benchmark Tests and are available separately. Three text files are included in the root directory of each waveform disc and contain descriptors for the contents of the disc. The file, ".dir" contains a list of all directories and files on the disc. The file, "discinfo.txt" and ".txt" both contain a high-level description of the contents of the disc. The static filename, "discinfo.txt" is used across all discs; and a variable filename determined by the disc ID are unique for each disc - this allows flexibility in using the information. The following is an example of the contents of one of these sets of files (filenames - discinfo.txt and 13_5_1.txt): disc_id: 13_5_1 data_types: si_tr_s:39:7466 channel_ids: 1 The first field, "disc_id", identifies the disc number. The second field, "data_types", contains entries for subcorpora (directories) separated by commas with each subfield containing an entry identifying the subcorpora, number of speakers, and number of waveforms. The third field, "channel_ids", contains a comma-separated list of the channels contained on the disc. This field normally has a value value of "1" (Sennheiser) or "2" (Other mic.) for this corpora. 7.0 Directory Structure ------------------------ The following depicts the directory structure of the subcorpora on different discs. Subcorpora categories are denoted by the directory names in level 2. top level: wsj1/ (Phase 2 corpus) 2nd level: doc/ (online documentation - on text disc) * trans/ (prompts and transcriptions - on text disc) si_tr_s/ (SI, training, 150 WSJ sentences) si_tr_l/ (SI, training, 1200 WSJ sentences) si_tr_j/ (SI, training, journalists, 200 WSJ sentences) si_tr_jd/ (SI, training, journalists, spon. dictation) si_dt_20/ (Hub 1 test data) si_dt_05/ (Hub 2 test data) si_dt_jd/ (Spoke 9 test data) si_dt_s1/wsj/ (Spoke 1 test data) si_dt_s2/sjm/ (Spoke 2 test data) ** atis/ si_dt_s3/non_nat/ (Spoke 3 test data) si_dt_s4/inc_adp/ (Spoke 4 test data) si_dt_s5/mic_ind/ (Spoke 5 test data) si_dt_s6/at_ad/ (Spoke 6 test data) at_te/ th_ad/ th_te/ si_dt_s7/at_e1/ (Spoke 7 test data) at_e2/ th_e1/ th_e2/ si_dt_s8/mu_0/ (Spoke 8 test data) mu_10/ mu_20/ tr_0/ tr_10/ tr_20/ speaker level: / (speaker-ID, where XXX = "001" to "zzz", base 36) data level: (corpora files, see below for format and types) * The directory structure under the "trans" directory on the text disc, which contain all of the prompts and transcriptions for the training and development test portions of the corpus, matches the directory structure on the waveform discs. Therefore, the waveforms (on multiple discs) and the texts (on the text disc) can be logically merged by directory and filename. ** Note: the Spoke 2 test set contains native ATIS2 data which is formatted according to ATIS2 filenaming, waveform, and transcription conventions. Please see the documentation disc for ATIS2 (CD12-1) in the ATIS2 CD-ROM set for information about the structure and contents of these files. 8.0 Filenaming Formats ----------------------- Data types are differentiated by unique filename extensions. All files associated with the same utterance have the same basename. All Filenames are unique across all WSJ corpora. Utterance ID's (basenames) are not be re-used. The filename format is as follows: . where, UTTERANCE-ID ::= where, SSS ::= 001 | ... | zzz (base-36 speaker ID) T ::= (speech type code) c (Common read) | s (Spontaneous) | a (Adaptation read) | x (calibration recording ) EE ::= 01 | ... | zz (base-36 session ID) UU ::= 01 | ... | zz (base-36 within-session sequential speaker utterance code - always "00" for .ptx, .dot and .lsn session-level files) XXX ::= (data type) .wv1 (channel 1 - Sennheiser waveform) .wv2 (channel 2 - Other mic waveform) .ptx (prompting text for read material) .dot (detailed orthographic transcription) .lsn (Lexical SNOR transcription derived from .dot) 9.0 Data Types --------------- 9.1 Waveforms (.wv?) --------------------- The waveforms are SPHERE-headered, digitized, and compressed using the Cambridge University "Shorten" algorithm under SPHERE. Version 2.1 of SPHERE has been included in this disc which will permit the waveform files to be decompressed automatically as they are accessed. See the files under the "/sphere" directory. The filename extension for the waveforms contains the characters, "wv", followed by a 1-character code to identify the channel. The headers contain the following fields/types: Field Type Description - Probable defaults marked in () ----------------------- ------- --------------------------------------------- speaker_id string 3-char. speaker ID from filename speaking_mode string speaking mode ("spontaneous","read-common", "read-adaptation", etc.) recording_site string recording site ("SRI") recording_date string beginning of recording date stamp of the form DD-MMM-YYYY. recording_time -s11 string beginning of recording time stamp of the form HH:MM:SS.HH. recording_environment string text description of recording environment microphone string microphone description ("Sennheiser HMD-410", "Crown PCC-160", etc.) utterance_id string utterance ID from filename of the form SSSTEEUU as described in the filenames section above. prompt_id string WSJ source sentence text ID - see .ptx description below for format (only in read data). database_id string database (corpus) identifier ("wsj0" or "wsj1") database_version string database (corpus) revision ("1.0") channel_count integer number of channels in waveform ("1") speaker_session_number string 2-char. base-36 session ID from filename sample_count integer number of samples in waveform sample_max integer maximum sample value in waveform sample_min integer minimum sample value in waveform sample_rate integer waveform sampling rate ("16000") sample_n_bytes integer number of bytes per sample ("2") sample_byte_format string byte order (MSB/LSB -> "10", LSB/MSB -> "01") sample_coding string waveform encoding ("embedded-wavpack-v1.0") sample_checksum integer checksum obtained by the addition of all (uncompressed) samples into an unsigned 16-bit (short) and discarding overflow. sample_sig_bits integer number of significant bits in each sample ("16") session_utterance_number string 2-char. base-36 utterance number within session from the filename end_head none end of header identifier 9.2 Detailed Orthographic Transcriptions (.dot) ------------------------------------------------ The specifications for the format of the detailed orthographic transcriptions are located in the file, "dot_spec.doc", under the "/wsj1/doc" directory. The transcriptions for all utterances in a session are concatenated into a single file of the form, "00.dot" and each transcription includes a corresponding utterance-ID code. The format for a single utterance transcription entry in this table is as follows: () example: The December contract rose one point oh seven cents a pound to sixty eight point six two cents at the Chicago Mercantile Exchange (013c020l) There is one ".dot" file for each speaker-session. 9.3 Lexical SNOR Transcriptions (.lsn) --------------------------------------- The lexical Standard Normal Orthographic Representation (lexical SNOR) (.lsn) transcriptions are word-level transcriptions derived from the ".dot" transcriptions with capitalization, non-speech markers, prosodic markings, fragments, and "\" character escapes filtered out. The .lsn transcriptions are of the same form as the .dot transcriptions and will be identified by a ".lsn" filename extension. example: THE DECEMBER CONTRACT ROSE ONE POINT OH SEVEN CENTS A POUND TO SIXTY EIGHT POINT SIX TWO CENTS AT THE CHICAGO MERCANTILE EXCHANGE (013C00L) There is one ".lsn" file for each speaker-session. 9.4 Prompting Texts (.ptx) --------------------------- The prompting texts for all read Wall Street Journal utterances in a session including the utterances' utterance-IDs and prompt IDs are concatenated into a single file of the form, "00.ptx". The prompt ID is Doug Paul's Wall Street Journal sentence index. The format for this index is: .... The format for a single prompting text entry in the .ptx file is as follows: ( ) example: The December contract rose one point oh seven cents a pound to sixty-eight point six two cents at the Chicago Mercantile Exchange (013c020l 87.120.871013-0032.14.2) The inclusion of both the utterance ID and prompt ID allows the utterance to be mapped back to its source sentence text and surrounding paragraph. There is be one .ptx file for each read speaker-session. 10.0 Online Documentation -------------------------- In addition to prompts and transcriptions, this disc, NIST Speech Disc 13-34.1, contains online documentation for the WSJ1 corpus. The documentation is located under the "wsj1/doc" directory and consists of training and development test indices, data collection information, a summary of the CD-ROM distribtion, directories of each CD-ROM, specifications for the transcription format, collated waveform headers for the entire corpus, source texts, vocabularies, and a language model for the read material.