File: ffmtimit.doc, updated 6/2/95

                              FFMTIMIT
             Acoustic-Phonetic Continuous Speech Corpus
                  Far Field Microphone Recordings
                       Training and Test Data
                      NIST Speech Disc 21-1.1

The TIMIT corpus of read speech has been designed to provide speech data for
the acquisition of acoustic-phonetic knowledge and for the development and
evaluation of automatic speech recognition systems. The FFMTIMIT corpus
contains the previously unreleased secondary microphone recordings of the
TIMIT corpus. The speech recordings for TIMIT resulted from the joint
efforts of several sites under sponsorship from the Advanced Research
Projects Agency (ARPA). Text corpus design was a joint effort among the
Massachusetts Institute of Technology (MIT), Stanford Research Institute
(SRI), and Texas Instruments (TI). The speech was recorded at TI,
transcribed at MIT, and has been maintained, verified, and prepared for
CD-ROM production by the National Institute of Standards and Technology
(NIST). This file contains a brief description of the FFMTIMIT Speech
Corpus. Additional information including the referenced material and some
relevant reprints of articles may be found in the TIMIT companion booklet.


1. Corpus Speaker Distribution
-- ---------------------------

FFMTIMIT contains Bruel and Kjaer (B&K) microphone recordings for 613 of the
630 TIMIT corpus speakers (the B&K data for the remaining 17 speakers was
unrecoverable). FFMTIMIT contains a total of 6130 sentences, 10 sentences
spoken by each of 613 speakers from 8 major dialect regions of the United
States. Table 1 shows the number of speakers for the 8 dialect regions,
broken down by sex. The percentages are given in parentheses. A speaker's
dialect region is the geographical area of the U.S. where they lived during
their childhood years. The geographical areas correspond with recognized
dialect regions in the U.S.
(Language Files, Ohio State University Linguistics Dept., 1982), with the
exception of the Western region (dr7) in which dialect boundaries are not
known with any confidence and dialect region 8 where the speakers moved
around a lot during their childhood.

     Table 1:  Dialect distribution of speakers

      Dialect
      Region(dr)     #Male      #Female      Total
      ----------   ---------   ---------   ----------
          1         29 (63%)    17 (37%)    46   (8%)
          2         70 (71%)    29 (29%)    99  (16%)
          3         75 (77%)    23 (23%)    98  (16%)
          4         69 (69%)    31 (31%)   100  (16%)
          5         57 (62%)    35 (38%)    92  (15%)
          6         30 (65%)    16 (35%)    46   (8%)
          7         74 (75%)    25 (25%)    99  (16%)
          8         22 (67%)    11 (33%)    33   (5%)
        ------     ---------   ---------   ----------
          8        426 (69%)   187 (31%)   613 (100%)

     The dialect regions are:
          dr1:  New England
          dr2:  Northern
          dr3:  North Midland
          dr4:  South Midland
          dr5:  Southern
          dr6:  New York City
          dr7:  Western
          dr8:  Army Brat (moved around)


2. Corpus Text Material
-- --------------------

The text material in the TIMIT prompts (found in the file "prompts.txt")
consists of 2 dialect "shibboleth" sentences designed at SRI, 450
phonetically-compact sentences designed at MIT, and 1890
phonetically-diverse sentences selected at TI. The dialect sentences (the
SA sentences) were meant to expose the dialectal variants of the speakers
and were read by all 613 speakers. The phonetically-compact sentences were
designed to provide a good coverage of pairs of phones, with extra
occurrences of phonetic contexts thought to be either difficult or of
particular interest. Each speaker read 5 of these sentences (the SX
sentences) and each text was spoken by 7 different speakers (SEE NOTE #1).
The phonetically-diverse sentences (the SI sentences) were selected from
existing text sources - the Brown Corpus (Kucera and Francis, 1967) and the
Playwrights Dialog (Hultzen, et al., 1964) - so as to add diversity in
sentence types and phonetic contexts. The selection criteria maximized the
variety of allophonic contexts found in the texts.
Each speaker read 3 of these sentences, with each sentence being read only
by a single speaker. Table 2 summarizes the speech material in FFMTIMIT.

     Table 2:  FFMTIMIT speech material

  Sentence Type   #Sentences   #Speakers   Total   #Sentences/Speaker
  -------------   ----------   ---------   -----   ------------------
  Dialect (SA)          2         613       1226           2
  Compact (SX)        366           7       2562           5
                       83           6        498           .
                        1           5          5           .
  Diverse (SI)       1839           1       1839           3
  -------------   ----------   ---------   -----   ------------------
  Total              2291                   6130          10


3. Suggested Training/Test Subdivision
-- -----------------------------------

The speech material has been subdivided into portions for training and
testing. The criteria for the subdivision are described in the file
"testset.doc". THIS SUBDIVISION HAS NO RELATION TO THE DATA DISTRIBUTED ON
THE PROTOTYPE VERSION OF THE CD-ROM, AND THE SUGGESTED TRAINING AND TEST
SETS IDENTIFIED ON THIS RELEASE DIFFER SOMEWHAT FROM THOSE ON THE COMPLETE
TIMIT DISC, BECAUSE SOME B&K DATA COULD NOT BE RECOVERED.

Core Test Set:

The test data has a core portion containing 24 speakers, 2 male and 1 female
from each dialect region. The core test speakers are shown in Table 3. Each
speaker read a different set of SX sentences. Thus the core test material
contains 192 sentences, 5 SX and 3 SI for each speaker, each having a
distinct text prompt.

     Table 3:  The core test set of 24 speakers

     Dialect        Male         Female
     -------     ----------      ------
        1        DAB0, WBT0       ELC0
        2        TAS1, WEW0       PAS0
        3        JMP0, LNT0       PKT0
        4        LLL0, TLS0       JLM0
        5        BPM0, KLT0       NLP0
        6        CMJ0, JDH0       MGD0
        7        GRT0, NJM0       DHC0
        8        JLN0, PAM0       MLD0

Complete Test Set:

A more extensive test set was obtained by including the sentences from all
speakers that read any of the SX texts included in the core test set. In
doing so, no sentence text appears in both the training and test sets. This
complete test set contains a total of 162 speakers and 1296 utterances,
accounting for about 21% of the total speech material. The resulting
dialect distribution of the 162 speaker test set is given in Table 4.
The complete test material contains 606 distinct texts.

     Table 4:  Dialect distribution for complete test set

     Dialect    #Male   #Female   Total
     -------    -----   -------   -----
        1          6        3        9
        2         18        7       25
        3         22        3       25
        4         16       16       32
        5         15       11       26
        6          8        3       11
        7         15        8       23
        8          8        3       11
      -----     -----   -------   -----
      Total      108       54      162


4. CDROM FFMTIMIT Directory and File Structure
-- -------------------------------------------

The speech and associated data are organized on the CD-ROM according to the
following hierarchy:

/<CORPUS>/<USAGE>/<DIALECT>/<SEX><SPEAKER_ID>/<SENTENCE_ID>.<FILE_TYPE>

     where,

     CORPUS :== ffmtimit
     USAGE :== train | test
     DIALECT :== dr1 | dr2 | dr3 | dr4 | dr5 | dr6 | dr7 | dr8
         (see Table 1 for dialect code description)
     SEX :== m | f
     SPEAKER_ID :== <INITIALS><DIGIT>

          where,
          INITIALS :== speaker initials, 3 letters
          DIGIT :== number 0-9 to differentiate speakers with identical
                    initials

     SENTENCE_ID :== <TEXT_TYPE><SENTENCE_NUMBER>

          where,
          TEXT_TYPE :== sa | si | sx
              (see Section 2 for sentence text type description)
          SENTENCE_NUMBER :== 1 ... 2342

     FILE_TYPE :== wav | txt | wrd | phn
         (see Table 5 for file type description)

Examples:

     /ffmtimit/train/dr1/fcjf0/sa1.wav

     (FFMTIMIT corpus, training set, dialect region 1, female speaker,
      speaker-ID "cjf0", sentence text "sa1", speech waveform file)

     /ffmtimit/test/dr5/mbpm0/sx407.phn

     (FFMTIMIT corpus, test set, dialect region 5, male speaker,
      speaker-ID "bpm0", sentence text "sx407", phonetic transcription
      file)

Online documentation and tables are located in the directory
"/ffmtimit/doc". A brief description of each file in this directory can be
found in Section 6.


5. File Types
-- ----------

The FFMTIMIT corpus includes several files associated with each utterance.
In addition to a speech waveform file (.wav), three associated
transcription files (.txt, .wrd, .phn) exist. These associated files have
the form:

     <BEGIN_SAMPLE> <END_SAMPLE> <TEXT>
                       .
                       .
                       .
     <BEGIN_SAMPLE> <END_SAMPLE> <TEXT>
     where,

     BEGIN_SAMPLE :== The beginning integer sample number for the segment
                      (Note: The first BEGIN_SAMPLE of each file is
                      always 0)

     END_SAMPLE :== The ending integer sample number for the segment
                    (Note: Because of the transcription method used, the
                    last END_SAMPLE in each transcription file may be less
                    than the actual last sample in the corresponding .wav
                    file)

     TEXT :== <ORTHOGRAPHY> | <WORD_LABEL> | <PHONETIC_LABEL>

          where,
          ORTHOGRAPHY :== Complete orthographic text transcription
          WORD_LABEL :== Single word from the orthography
          PHONETIC_LABEL :== Single phonetic transcription code
                             (See "phoncode.doc" for description of codes)

     Table 5:  Utterance-associated file types

     File Type   Description
     ---------   ------------------------------------------------------
     .wav      - SPHERE-headered speech waveform file. (See the "/sphere"
                 directory for speech file manipulation utilities.)

     .txt      - Associated orthographic transcription of the words the
                 person said. (Usually this is the same as the prompt,
                 but in a few cases the orthography and prompt disagree.)

     .wrd      - Time-aligned word transcription. The word boundaries
                 were aligned with the phonetic segments using a dynamic
                 string alignment program (see the printed documentation
                 section "Notes on the Word Alignments" and the lexical
                 pronunciations given in "timitdic.txt"). Note also that
                 the time-alignments differ from those in the TIMIT
                 corpus to account for a propagation delay of 20 samples,
                 corresponding to the placement of the B&K microphone at
                 approximately 16 inches from the Sennheiser microphone.

     .phn      - Time-aligned phonetic transcription. (See the reprint of
                 the article by Seneff and Zue (1988), in the printed
                 documentation, and the section "Notes on Checking the
                 Phonetic Transcriptions" for more details on the
                 phonetic transcription protocols.) Note also that the
                 time-alignments differ from those in the TIMIT corpus to
                 account for a propagation delay of 20 samples,
                 corresponding to the placement of the B&K microphone at
                 approximately 16 inches from the Sennheiser microphone.
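
The three transcription file types share the same line layout, so a single
reader can handle all of them. The Python below is a minimal sketch of such
a reader, not a tool shipped with the corpus. The names Segment,
parse_transcription, and duration_ms are illustrative only. The 16 kHz
sampling rate is an assumption, since this file does not state it, and it
should be checked against the SPHERE header of the .wav file. The sample
lines come from the .txt and .wrd transcriptions of
"/ffmtimit/test/dr5/fnlp0/sa1.wav".

```python
from typing import List, NamedTuple

# Assumption: the sampling rate is not stated in this file; verify it
# against the SPHERE header of the corresponding .wav file.
SAMPLE_RATE_HZ = 16000


class Segment(NamedTuple):
    begin: int  # BEGIN_SAMPLE
    end: int    # END_SAMPLE
    text: str   # orthography, word label, or phonetic label


def parse_transcription(contents: str) -> List[Segment]:
    """Parse the contents of a .txt, .wrd, or .phn file into segments."""
    segments = []
    for line in contents.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split on the first two runs of whitespace only, so that the
        # orthography in a .txt file keeps its internal spaces.
        begin, end, text = line.split(None, 2)
        segments.append(Segment(int(begin), int(end), text))
    return segments


def duration_ms(segment: Segment, sample_rate_hz: int = SAMPLE_RATE_HZ) -> float:
    """Length of a segment in milliseconds at the given sampling rate."""
    return 1000.0 * (segment.end - segment.begin) / sample_rate_hz


# Lines copied from the example transcriptions for fnlp0/sa1.
wrd = parse_transcription("7490 11382 she\n11382 16020 had\n15440 17523 your\n")
txt = parse_transcription(
    "0 61748 She had your dark suit in greasy wash water all year.\n"
)
```

Under the same 16 kHz assumption, the 20-sample propagation delay mentioned
in Table 5 corresponds to 20 / 16000 s = 1.25 ms, which is what
duration_ms(Segment(0, 20, "delay")) evaluates to.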

Example transcriptions from the utterance in
"/ffmtimit/test/dr5/fnlp0/sa1.wav"

     Orthography (.txt):
          0 61748 She had your dark suit in greasy wash water all year.

     Word label (.wrd):
          7490 11382 she
          11382 16020 had
          15440 17523 your
          17523 23380 dark
          23380 28380 suit
          28380 30980 in
          30980 36991 greasy
          36991 42310 wash
          43140 47500 water
          49041 52204 all
          52204 58860 year

     Phonetic label (.phn):
     (Note: beginning and ending silence regions are marked with h#)
          0 7490 h#
          7490 9860 sh
          9860 11382 iy
          11382 12928 hv
          12928 14780 ae
          14780 15440 dcl
          15440 16020 jh
          16020 17523 axr
          17523 18560 dcl
          18560 18970 d
          18970 21073 aa
          21073 22220 r
          22220 22760 kcl
          22760 23380 k
          23380 25335 s
          25335 27663 ux
          27663 28380 tcl
          28380 29292 q
          29292 29952 ih
          29952 30980 n
          30980 31890 gcl
          31890 32570 g
          32570 33273 r
          33273 34680 iy
          34680 35910 z
          35910 36991 iy
          36991 38411 w
          38411 40710 ao
          40710 42310 sh
          42310 43140 epi
          43140 43926 w
          43926 45500 ao
          45500 46060 dx
          46060 47500 axr
          47500 49041 q
          49041 51368 ao
          51368 52204 l
          52204 54167 y
          54167 56674 ih
          56674 58860 axr
          58860 61700 h#


6. Online Documentation
-- --------------------

Compact documentation is located in the "/ffmtimit/doc" directory. Files in
this directory with a ".doc" extension contain freeform descriptive text and
files with a ".txt" extension contain tables of formatted text which can be
searched programmatically. Lines in the ".txt" files beginning with a
semicolon are comments and should be ignored on searches.
The following is a brief description of their contents:

     phoncode.doc - Table of phone symbols used in phonemic dictionary and
                    phonetic transcriptions
     prompts.txt  - Table of sentence prompts and sentence-ID numbers
     spkrinfo.txt - Table of speaker attributes
     spkrsent.txt - Table of sentence-ID numbers for each speaker
     testset.doc  - Description of suggested train/test subdivision
     timitdic.doc - Description of phonemic lexicon
     timitdic.txt - Phonemic dictionary of all orthographic words in prompts

A more extensive description of corpus design, collection, and transcription
can be found in the printed documentation.


NOTES
=====

#1) Because only 613 of the original 630 speakers were recovered, not all
    of the phonetically-compact sentences were spoken by exactly seven
    speakers. Listed below are the sentences (sx) in exception.

    Sentences Spoken 5 times:

         277

    Sentences Spoken 6 times:

           3   4   5   6   7   8   9  10  11  21
          56  57  58  59  60
          90  91  92  93  94  95  96  97  98  99 100 101
         146 147 148 149 150
         180 181 182 183 184 185 186 187 188 189 190 191
         236 237 238 239 240
         270 271 272 273 274 275 276 278 279 280 281
         326 328 329 330
         360 361 362 363 364 365 367 368 369 370 371
         416 417 418 419 420
         450 451 452
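
The exception lists in Note #1 can be checked against the SX rows of
Table 2. The Python below is a small arithmetic sketch of that check; the
variable names are illustrative, and the sentence numbers are copied from
the lists above. It assumes that every SX text not listed in Note #1 was
spoken by exactly 7 speakers, as stated in Section 2.

```python
# Sentence numbers copied from Note #1.
SPOKEN_5_TIMES = [277]
SPOKEN_6_TIMES = [
    3, 4, 5, 6, 7, 8, 9, 10, 11, 21,
    56, 57, 58, 59, 60,
    90, 91, 92, 93, 94, 95, 96, 97, 98, 99, 100, 101,
    146, 147, 148, 149, 150,
    180, 181, 182, 183, 184, 185, 186, 187, 188, 189, 190, 191,
    236, 237, 238, 239, 240,
    270, 271, 272, 273, 274, 275, 276, 278, 279, 280, 281,
    326, 328, 329, 330,
    360, 361, 362, 363, 364, 365, 367, 368, 369, 370, 371,
    416, 417, 418, 419, 420,
    450, 451, 452,
]

SPEAKERS = 613   # speakers with recovered B&K data (Section 1)
SX_TEXTS = 450   # phonetically-compact sentence texts (Section 2)

# Assumption: every SX text not listed in Note #1 was spoken 7 times.
spoken_7_times = SX_TEXTS - len(SPOKEN_6_TIMES) - len(SPOKEN_5_TIMES)

# Utterance counts per sentence type.
sx_utterances = (7 * spoken_7_times
                 + 6 * len(SPOKEN_6_TIMES)
                 + 5 * len(SPOKEN_5_TIMES))
sa_utterances = 2 * SPEAKERS   # 2 SA sentences read by every speaker
si_utterances = 3 * SPEAKERS   # 3 SI sentences, each read by one speaker

total_utterances = sa_utterances + sx_utterances + si_utterances
distinct_texts = 2 + SX_TEXTS + si_utterances

print(spoken_7_times, len(SPOKEN_6_TIMES), len(SPOKEN_5_TIMES))
print(sx_utterances, total_utterances, distinct_texts)
```

The counts agree with Table 2: 366, 83, and 1 SX texts spoken 7, 6, and 5
times respectively, 3065 SX utterances (5 per speaker), 6130 utterances in
total, and 2291 distinct sentence texts.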