DARPA Extended Resource Management Continuous Speech
Speaker-Dependent Corpus (RM2)

Training, Extended Training, Development Test and Evaluation Test Data


Scope of Corpus:
---------------

This corpus is a speaker-dependent extension to the Resource Management
(RM1) corpus. It contains Resource Management domain sentences for 4
speakers, 2 female and 2 male. The following material is included for
each of the 4 speakers:

     2  "SA" dialect calibration sentences
    10  "SB" rapid adaptation sentences
   600  "SR" standard RM1 speaker-dependent training sentences
  1800  "T"  newly-generated extended training sentences
   120  "T"  newly-generated development-test sentences
   120  "T"  newly-generated evaluation-test sentences to be used as
             the test material for the June 1990 DARPA benchmark test
  ----
  2652  Total utterances per speaker X 4 speakers = 10,608 sentence
        utterances


CD-ROM File Organization:
------------------------

Filenaming conventions for RM2 are identical to those of RM1 with one
exception. In RM2, sentence texts (prompts) were generated specifically
for each speaker. Sentence IDs/filenames identify the speaker as well as
the prompt. The speaker identifiers in the sentence IDs consist of the
letters a-d and are mapped to the speaker initials as follows:

    a - jrm0
    b - bjw0
    c - lpn0
    d - jls0

Example sentence ID: ta0231 (speaker a, sentence 0231)

This new coding scheme was implemented by Texas Instruments because the
existing RM1 coding scheme could not accommodate the number of prompts
in RM2. Please note that the numbers after the speaker identifier in the
sentence IDs are NOT unique and must be combined with the speaker
identifier to uniquely identify the source prompt (i.e., ta0001 was not
derived from the same prompt as tb0001). Also note that the same
sentence text may be identified by different prompt numbers, both across
and within speakers.
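As an illustration of the "T" sentence ID coding scheme described above,
the following is a minimal Python sketch, not part of the corpus
software. The function name parse_t_sentence_id and the table name
SPEAKER_BY_LETTER are hypothetical; the letter-to-speaker mapping and the
four-digit prompt number are taken from the text above.

```python
import re

# Speaker letter -> speaker initials, as listed in this README.
SPEAKER_BY_LETTER = {"a": "jrm0", "b": "bjw0", "c": "lpn0", "d": "jls0"}

# "t", then a speaker letter a-d, then a four-digit prompt number.
_T_ID = re.compile(r"t([a-d])(\d{4})")


def parse_t_sentence_id(sentence_id):
    """Return (speaker, prompt_number) for an RM2 "T" sentence ID.

    The pair, not the number alone, identifies the source prompt.
    """
    match = _T_ID.fullmatch(sentence_id.lower())
    if match is None:
        raise ValueError(f"not an RM2 'T' sentence ID: {sentence_id!r}")
    letter, number = match.groups()
    return SPEAKER_BY_LETTER[letter], int(number)


if __name__ == "__main__":
    print(parse_t_sentence_id("ta0231"))
    # Same number, different speaker letter: different source prompts.
    print(parse_t_sentence_id("ta0001") == parse_t_sentence_id("tb0001"))
```

Keeping the speaker in the returned key reflects the note above that the
prompt number alone does not identify a prompt.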
NIST has discovered that some overlap exists in prompts between the
training, extended training, and test material for each speaker: a given
sentence text (under different prompt numbers) may be spoken by the same
speaker more than once. This is unfortunate, but it is something we will
have to live with. NIST is currently determining the extent of the
overlap and will advise the community of its findings.

Because of CD-ROM space constraints, all of the speech material for the
two female speakers (bjw0 and jrm0) was placed on the first disc, and all
of the speech material for the two male speakers (lpn0 and jls0) was
placed on the second disc. The directory structure for RM2 material is
the same on both discs except for the level containing the speaker
identifiers. The speech material is organized on each CD-ROM as follows:

CD3-1.1:

/rm2/dev_tst/bjw0_2/(120 .wav files)    {development-test material}
             jrm0_8/(120 .wav files)

/rm2/evl_tst1/bjw0_2/(120 .wav files)   {evaluation-test 1 material
              jrm0_8/(120 .wav files)    for June 1990 DARPA benchmark
                                         test}

/rm2/ex_train/bjw0_2/(1800 .wav files)  {extended training material}
              jrm0_8/(1800 .wav files)

/rm2/train/bjw0_2/(612 .wav files)      {standard training material
           jrm0_8/(612 .wav files)       including dialect and
                                         rapid-adapt. sentences}

CD3-2.1:

/rm2/dev_tst/lpn0_7/(120 .wav files)    {development-test material}
             jls0_4/(120 .wav files)

/rm2/evl_tst1/lpn0_7/(120 .wav files)   {evaluation-test 1 material
              jls0_4/(120 .wav files)    for June 1990 DARPA benchmark
                                         test}

/rm2/ex_train/lpn0_7/(1800 .wav files)  {extended training material}
              jls0_4/(1800 .wav files)

/rm2/train/lpn0_7/(612 .wav files)      {standard training material
           jls0_4/(612 .wav files)       including dialect and
                                         rapid-adapt. sentences}

In summary, the speech directories on both discs are structured as
follows:

<PATH>        ::= /rm2/<USAGE>/<SPEAKER-DIR>/<FILENAME>
<USAGE>       ::= dev_tst | evl_tst1 | ex_train | train
<SPEAKER-DIR> ::= <SPEAKER-ID>_<DIGIT>
<SPEAKER-ID>  ::= bjw0 | lpn0 | jls0 | jrm0
<DIGIT>       ::= 1 | 2 | ... | 8
<FILENAME>    ::= sa[1-2].wav | sb[01-10].wav | sr[001-600].wav |
                  t[a-d][0001-2400].wav

The CD-ROMs' directory hierarchy is structured so that a FULL
path/filename UNIQUELY IDENTIFIES an utterance (the database, data usage,
speaker, and sentence).

Note: Identical filenames DO EXIST and, therefore, care must be taken
when copying files from the CD-ROM to other media. However, if the
original path/filename is lost, a file may be disambiguated by the
utterance identifier in the header.


Online Documentation:
--------------------

The directory /rm2/doc contains documentation pertaining to the corpus.
As in RM1, sentence text prompts have been included, but "official"
transcriptions (orthographic, phonetic, etc.) do not exist and have,
therefore, not been included. For the purpose of system testing, it has
been assumed that the Standard Normalized Orthographic Representation
(SNOR) version of the prompts represents an accurate orthographic
transcription of the utterances. In addition to speech corpora
information, a file has been included which outlines the implementation
of the June 1990 DARPA benchmark test.

The following documentation files can be found in the /rm2/doc directory:

al_sents.snr - Complete listing of all RM2 sentences on CD-ROMs in SNOR
               form. SNOR, an acronym for Standard Normalized
               Orthographic Representation, is a uniform way of writing
               the words and sentences in this corpus. SNOR-format
               sentence texts are required as reference material for the
               DARPA/NIST standard scoring software.

al_sents.txt - Complete listing of all RM2 sentences on CD-ROMs in prompt
               form.

al_spkrs.txt - Listing of RM2 speakers and their attributes.

header.def   - NIST header object definitions.

lexicon.snr  - Complete Resource Management lexicon in SNOR form.
               Note: the lexicon is identical for RM1 and RM2.

rm2_evl1.txt - Outline of instructions for implementing the June 1990
               DARPA benchmark test using RM2.
               See CD3-2.1: /score/readme.txt for instructions for
               formatting the recognizer output for scoring and for
               implementing the NIST scoring software.

spkr_map.sam - Mapping of all DARPA TIMIT, RM1, and RM2 speaker codes
               into a format compatible with ESPRIT SAM requirements;
               used by "convert" software.

wp_gram.txt  - Resource Management Word-Pair Grammar (developed at BBN).
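
The speech directory structure summarized under "CD-ROM File
Organization" can be checked mechanically. The following is a minimal
Python sketch under stated assumptions, not part of the corpus software:
the function name parse_rm2_path and the dictionary keys are hypothetical,
paths are assumed to use "/" separators exactly as written in this file,
and the meaning of the digit after the speaker identifier is not stated
here, so it is returned uninterpreted as "suffix_digit".

```python
import re

# Path layout from this README: /rm2/<usage>/<speaker>_<digit>/<file>.wav
_RM2_PATH = re.compile(
    r"/rm2/(dev_tst|evl_tst1|ex_train|train)"
    r"/(bjw0|lpn0|jls0|jrm0)_([1-8])"
    r"/(sa|sb|sr|t[a-d])(\d+)\.wav"
)

# (lowest number, highest number, digit count) per filename prefix.
_RANGES = {"sa": (1, 2, 1), "sb": (1, 10, 2), "sr": (1, 600, 3)}
_T_RANGE = (1, 2400, 4)  # applies to ta, tb, tc, td


def parse_rm2_path(path):
    """Split a full RM2 path into its parts, or return None if it does
    not follow the layout described in this README."""
    match = _RM2_PATH.fullmatch(path.lower())
    if match is None:
        return None
    usage, speaker, digit, prefix, number = match.groups()
    low, high, width = _RANGES.get(prefix, _T_RANGE)
    if len(number) != width or not low <= int(number) <= high:
        return None
    return {
        "usage": usage,
        "speaker": speaker,
        "suffix_digit": int(digit),  # meaning not stated in this file
        "utterance": prefix + number,
    }


if __name__ == "__main__":
    # The same filename occurs under more than one speaker directory,
    # so only the full path distinguishes the two utterances.
    first = parse_rm2_path("/rm2/train/bjw0_2/sr001.wav")
    second = parse_rm2_path("/rm2/train/jrm0_8/sr001.wav")
    print(first)
    print(second)
    print(first == second)
```

Because identical filenames exist across speaker directories, a copy
script would need to retain the usage and speaker fields (or the full
path) rather than the bare filename.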