NIST Speech Disc 6-1.1
September, 1991
The "Road Rally" corpora were designed for the development and testing of word-spotting systems and were collected in a conversational domain using a road rally planning task as the topic. The corpora actually consist of two sub-corpora, "Stonehenge" and "Waterloo". The Stonehenge corpus contains road rally planning conversations as well as some read speech, collected using high-quality microphones and a telephone-simulating filter. The Waterloo corpus contains read speech in the road rally planning domain, collected over actual telephone lines.
The Stonehenge corpus was collected from subjects using telephone handsets which were modified to contain a high-quality microphone. To gather conversational data, two talkers were located in separate rooms, given a road map, and asked to participate in a road rally planning task. Their objective was to form a path between two locations on the map that would maximize their road rally point score. They were also given a time limit for completing the task to increase their responsiveness. Their speech was recorded on a stereo tape recorder with each subject's speech on a separate track. The tracks were digitized and the speech was edited to remove silences longer than about one second, yielding approximately 3 minutes of continuous speech per subject. The speech was filtered using a 300 Hz to 3300 Hz PCM FIR bandpass filter to simulate telephone bandwidth. (See the file "fltrcoef.txt" for the filter coefficients.) Over 100 subjects participated in the effort, but some were excluded from the final corpus because of unsuitable responses or technical problems.
Note: The conversational speech files contain single-channel speech. Therefore, these files contain only the speech for one of the speakers in a conversation. The two speech files which make up a conversation are not identified and it is possible that they do not both exist on this CD-ROM.
Twenty words were identified in Stonehenge as "key" words, and most of the speech files in the corpus have corresponding text files which identify key word occurrences and locations. See Section 3.2, "Key Word Marking Files", below.
The Stonehenge corpus contains three "styles" of speech data:
              Speaker/Speech Style Distribution

     Speakers   Style
     --------   --------------------------------
      01-16     conversation
      33-64     conversation, paragraph, carrier
      65-96     paragraph, carrier
     201-204    carrier (no marking files)
The Waterloo corpus was collected as an extension to Stonehenge to provide similar-domain speech under different conditions. The corpus was collected from subjects using conventional telephones and dialed-up telephone lines in the Massachusetts area. Unlike the Stonehenge speech, the Waterloo speech is naturally band-limited by the telephones and lines, but for consistency it was also filtered using the Stonehenge 300 Hz to 3300 Hz PCM FIR bandpass filter. The corpus consists of 56 speakers (28 males and 28 females), each reading a paragraph of road rally domain speech.
NOTE: Although numbered similarly, the Waterloo subjects have no relation to the Stonehenge subject population. Also, the read paragraph used in Waterloo is not the same text read by the Stonehenge subjects.
The 20 Stonehenge "key" words were also identified as the key words in Waterloo. All of the speech files for the 56 speakers have corresponding text files which identify the key word occurrences and locations. See Section 3.2, "Key Word Marking Files", below.
The road rally corpora are made up of two sub-corpora, Stonehenge and Waterloo. This division is reflected in the CD-ROM directory structure as follows:
     /rdrally1            (top level directory)
     /rdrally1/stonheng   (subdirectory containing Stonehenge corpora)
     /rdrally1/waterloo   (subdirectory containing Waterloo corpora)

Each of these subdirectories contains two file types: (1) speech waveform files, and (2) auxiliary "key" word marking files. The format for these file types is consistent between Stonehenge and Waterloo. Waveform and marking files are named with a unique utterance ID code and the file types are identified by a unique filename extension as follows:
     ROADRALLY-FILE ::= <UTTERANCE-ID>.<FILE-TYPE>

     where,

        UTTERANCE-ID ::= <SUB-CORPORA-ID><SEX><SPEAKER-ID><UTTERANCE-TYPE>

        where,

           SUB-CORPORA-ID ::= s (Stonehenge) | w (Waterloo)

           SEX ::= f (female) | m (male)

           SPEAKER-ID ::= [01 | ... | 96 | 201 | ... | 204] (Stonehenge) |
                          [01 | ... | 56] (Waterloo)

           UTTERANCE-TYPE ::= c (conversation) |
                              p (read paragraph) |
                              s (read "carrier" sentences)
                              (NOTE: Waterloo files contain no utterance
                              type identifier)

        FILE-TYPE ::= wav (speech waveform) | mrk (key word marking file)

     examples:

        1. sf01c.wav  (Stonehenge corpus, female speaker, speaker-ID 1,
                       conversation, speech waveform file)

        2. wm35.mrk   (Waterloo corpus, male speaker, speaker-ID 35,
                       marking file)

The file types are described in more detail below.
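The naming convention above can be decoded mechanically. The following is a minimal sketch in C; the struct and function names (rr_name, rr_parse) are illustrative and not part of the corpus distribution.

```c
/* Sketch of decoding a Road Rally filename (e.g. "sf01c.wav") into its
 * component fields, per the naming convention above. */
#include <stdio.h>
#include <string.h>

struct rr_name {
    char corpus;        /* 's' = Stonehenge, 'w' = Waterloo   */
    char sex;           /* 'f' = female, 'm' = male           */
    int  speaker;       /* speaker-ID                         */
    char utt_type;      /* 'c', 'p', 's', or 0 for Waterloo   */
    char ext[4];        /* "wav" or "mrk"                     */
};

/* Returns 1 on success, 0 if the name does not fit the convention. */
int rr_parse(const char *name, struct rr_name *out)
{
    int n = 0;
    memset(out, 0, sizeof *out);
    if (sscanf(name, "%c%c%d%n", &out->corpus, &out->sex,
               &out->speaker, &n) != 3)
        return 0;
    if (name[n] != '\0' && name[n] != '.')   /* Stonehenge utterance type */
        out->utt_type = name[n++];
    if (name[n] != '.')
        return 0;
    strncpy(out->ext, name + n + 1, 3);
    return 1;
}
```

For example, "sf01c.wav" decodes to corpus 's', sex 'f', speaker 1, utterance type 'c', extension "wav", while "wm35.mrk" has no utterance type field, as the Waterloo naming convention specifies.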
The speech waveform files are identified by a ".wav" extension. These files are formatted with the NIST SPHERE header structure. Briefly, the .wav files contain a 1024-byte ASCII header followed by 16-bit, 10 kHz, MSB-LSB sampled speech waveform data. The following is an example SPHERE header from the file "/rdrally1/stonheng/sf01c.wav":
     NIST_1A
        1024
     database_id -s8 RDRALLY1
     database_version -s3 1.0
     utterance_id -s5 sf01c
     channel_count -i 1
     sample_count -i 1981620
     sample_rate -i 10000
     sample_min -i -27728
     sample_max -i 29088
     sample_n_bytes -i 2
     sample_byte_format -s2 10
     sample_sig_bits -i 16
     end_head

The self-describing header contains information pertinent to file identification and basic D/A operations. See /sphere/readme.doc for more information about SPHERE.
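For programs that read the .wav files directly rather than through the SPHERE library, the fixed-size ASCII header can be scanned by hand. A minimal sketch follows, assuming only the "name -i value" field layout shown in the example header; this is not the official SPHERE API.

```c
/* Sketch of pulling one integer field out of a NIST SPHERE ASCII
 * header.  Field names follow the example header above; the line-by-
 * line scan is an illustrative approach, not the SPHERE library. */
#include <string.h>
#include <stdlib.h>

/* Return the integer value of a "-i" field, or -1 if absent. */
long sphere_get_int(const char *header, const char *field)
{
    const char *p = header;
    size_t flen = strlen(field);
    while ((p = strstr(p, field)) != NULL) {
        /* require the match to start a line and be followed by " -i " */
        if ((p == header || p[-1] == '\n') &&
            strncmp(p + flen, " -i ", 4) == 0)
            return strtol(p + flen + 4, NULL, 10);
        p += flen;
    }
    return -1;
}
```

A caller would read the first 1024 bytes of the .wav file into a buffer and then query, e.g., sphere_get_int(buf, "sample_rate") and sphere_get_int(buf, "sample_count") before seeking past the header to the waveform data.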
Auxiliary "key" word marking files are included for most of the speech files in Stonehenge and all of the speech files in Waterloo. These files identify key words and their locations in the speech (.wav) files. The marking files share the same filenames (utterance IDs) as their corresponding speech files but carry ".mrk" extensions.
The following words are marked as key words in the Road Rally corpora:
      1. Boonsboro      11. road
      2. Chester        12. secondary
      3. Conway         13. Sheffield
      4. interstate     14. Springfield
      5. look           15. Thicket
      6. Middleton      16. track
      7. minus          17. want
      8. mountain       18. Waterloo
      9. primary        19. Westchester
     10. retrace        20. backtrack

"Marking" (.mrk) files corresponding to speech waveform (.wav) files provide sample-number-aligned identification of occurrences of key words. The marking files are text files which contain the following tabular fields, separated by one or more spaces:
     <KEY-WORD> <VARIANT-ID> <UTTERANCE-ID> <BEG-SAMPLE> <END-OFFSET> <NOTE>

     where,

        KEY-WORD ::= one of the twenty key words listed above

        VARIANT-ID ::= 1 (base key word) |
                       2 ("s" plural form) |
                       3 ("ed" past tense) |
                       4 ("ing" present participle) |
                       81-84 (key word detected in the crosstalk of
                       the other talker) |
                       99 (mispronounced)

        UTTERANCE-ID ::= <SUB-CORPORA-ID><SEX><SPEAKER-ID><UTTERANCE-TYPE>

        where,

           SUB-CORPORA-ID ::= s (Stonehenge) | w (Waterloo)

           SEX ::= f (female) | m (male)

           SPEAKER-ID ::= [01 | ... | 96 | 201 | ... | 204] (Stonehenge) |
                          [01 | ... | 56] (Waterloo)

           UTTERANCE-TYPE ::= c (conversation) |
                              p (read paragraph) |
                              s (read "carrier" sentences)
                              (Note: Waterloo files contain no utterance
                              type identifier)

        BEG-SAMPLE ::= sample number of start of key word in utterance
                       (.wav) file

        END-OFFSET ::= offset from BEG-SAMPLE of end of key word

        NOTE ::= variant or key word and variant spelled out (this field
                 is not used consistently throughout the corpus)

The following is an example of the marking format from the file "/rdrally1/stonheng/sf01c.mrk":

     middleton    1   sf01c    32681   3999
     middleton    1   sf01c    39481   4200
     mountain     1   sf01c    75682   3199
     track        2   sf01c    78881   3199
     road         1   sf01c   101881   3199
     secondary    1   sf01c   116659   5326
     road         1   sf01c   122059   4121
     mountain     1   sf01c   145381   5499
     track        1   sf01c   150881   2599
     mountain     1   sf01c   168309   5571
     middleton    1   sf01c   208281   5400
     conway       1   sf01c   216900   6051
     interstate   1   sf01c   265434   5848
     minus        1   sf01c   325463   4314
     interstate   1   sf01c   330463   6399
     mountain     1   sf01c   376059   4002
     track        1   sf01c   380061   2400
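Since the .mrk records are whitespace-separated text, one record can be read with a single sscanf call. A minimal sketch, with an illustrative struct layout mirroring the field list above:

```c
/* Sketch of reading one .mrk record into its tabular fields. */
#include <stdio.h>

struct mrk_rec {
    char word[32];      /* KEY-WORD                       */
    int  variant;       /* VARIANT-ID                     */
    char uttid[16];     /* UTTERANCE-ID                   */
    long beg;           /* BEG-SAMPLE                     */
    long len;           /* END-OFFSET (length in samples) */
};

/* Returns 1 if the line held at least the five required fields
 * (the optional NOTE field, when present, is simply ignored). */
int mrk_parse(const char *line, struct mrk_rec *r)
{
    return sscanf(line, "%31s %d %15s %ld %ld",
                  r->word, &r->variant, r->uttid, &r->beg, &r->len) == 5;
}
```

The end sample of a key word is beg + len, and since the waveforms are sampled at 10 kHz, a sample count divided by 10000 gives seconds.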
Unfortunately, the prompts used in the read corpora and clean transcriptions of the conversational corpora were not available to NIST at the time of this CD-ROM release. The following texts approximate the prompts for the read Stonehenge and Waterloo corpora; they were derived at NIST by transcribing and comparing several of the read passages. Conformance to the texts was loosely enforced, and some of the material contains rewording, false starts, verbal editing, and even dialogue with the recording engineer. Note that the Stonehenge speakers stated their speaker-ID code, the date of recording, and the name of the text at the beginning of each of the read texts.
Stonehenge Road Rally Passage
I am speaker <SPEAKER-ID>.
Today's date is <MONTH> <DATE>, <YEAR>.
Road Rally Passage.
A good solution to the Road Rally task requires several decisions. Here are some broad guidelines you ought to follow. You should use tracks that involve mountain roads, secondary roads, and ferries whenever the travel time is not excessive and avoid backtracking. The first leg starting at Middleton and ending in Boonsboro has a target time of 7 1/2 hours. The highest scoring path for this leg goes through Conway and Thicket on the primary road. The route through Springfield looks attractive since it results in the shortest travel time. However, the minus score for the interstate highway results in an overall low score. The best track for the second leg goes through Chester and Sheffield resulting in a score of 20 points. The leg from Westchester to Waterloo has a top score of 110 points. You don't want to retrace roads between towns on the last four legs.
Stonehenge Carrier Sentences
I am speaker <SPEAKER-ID>.
Today's date is <MONTH> <DATE>, <YEAR>.
Key word sentences.
Speak backtrack please. Say backtrack again. Backtrack.
Speak Boonsboro please. Say Boonsboro again. Boonsboro.
Speak Chester please. Say Chester again. Chester.
Speak Conway please. Say Conway again. Conway.
Speak interstate please. Say interstate again. Interstate.
Speak look please. Say look again. Look.
Speak Middleton please. Say Middleton again. Middleton.
Speak minus please. Say minus again. Minus.
Speak mountain please. Say mountain again. Mountain.
Speak primary please. Say primary again. Primary.
Speak retrace please. Say retrace again. Retrace.
Speak road please. Say road again. Road.
Speak secondary please. Say secondary again. Secondary.
Speak Sheffield please. Say Sheffield again. Sheffield.
Speak Springfield please. Say Springfield again. Springfield.
Speak Thicket please. Say Thicket again. Thicket.
Speak track please. Say track again. Track.
Speak want please. Say want again. Want.
Speak Waterloo please. Say Waterloo again. Waterloo.
Speak Westchester please. Say Westchester again. Westchester.
Waterloo Road Rally Passage
To get from Middleton to Waterloo, you go on interstate 37 out of Middleton to Middleton Road. You want to turn right off the interstate primary onto Conway toward Chester. Then look for a primary road going up the mountain to Westchester Road. If you miss that, there is a secondary dirt track through Chester and you can backtrack to Westchester from there. When you get to Chester, take a sharp left turn that backtracks along the road you are on. If you go straight out of Chester, the track runs into a thicket and your car will get scratched up. The backtrack gets you to Westchester which is a primary road. Go south on Westchester to Conway Road. Look for a fork in the road. You want the right fork. The left fork is a secondary to Boonsboro. The secondary at Boonsboro runs through another thicket and dead ends. So if you get to Boonsboro, you want to retrace back through the thicket to Conway. Take Conway around Boonsboro through Sheffield to Springfield. On Conway, you look for Sheffield Road. You take a right off Conway Street. In Sheffield, you want to look for the Sheffield mountain track. Go across the secondary track, take a sharp right, and backtrack about a mile, looking for the entrance to the interstate. Take the primary interstate west into Springfield and get off at exit 2 on the other side of Springfield. The sign says Waterloo Road. You take Waterloo Road up the mountain and on the other side of the mountain you get to Waterloo. The longest it could take you from Middleton to Waterloo is about 5 hours. If you don't miss any turns, it would be 5 hours minus retracing from Chester, minus retracing through the thicket, minus the dead end in Boonsboro, minus retracing through the other thicket or about 3 hours.
Training Set:
     Read passages for all Waterloo speakers.

Augmented Training Set:
     The above training set plus male Stonehenge speakers 3-10 and
     13-16, and female Stonehenge speakers 1, 2, 11, 12, 42, and 58.

Male Test Set:
     Conversations for Stonehenge speakers 33-41 and 43 (26 minutes
     of material containing 405 key words).

Augmented Male Test Set:
     The above Male Test Set plus conversations for Stonehenge
     speakers 49-57 and 59 (an additional 25 minutes of material
     containing 422 key words, for a total of 51 minutes of material
     and 827 key words).

Female Test Set:
     Conversations for Stonehenge speakers 44-48 and 60-64
     (approx. 25 minutes of material).

Cross-Sex Test Set:
     The above Male Test Set plus the above Female Test Set
     (approx. 50 minutes of material).
The following procedures for training, testing, and evaluating keyword-spotting systems were contributed by members of the word-spotting research community for inclusion with these corpora. Other word-spotting training, testing, and evaluation procedures have been proposed and are known to be in use. The use of other procedures in conjunction with the following procedures may provide a basis for informative comparisons of system benchmark performance.
Wordspotting Test Procedure
The following is a description of the procedure for testing and evaluating an algorithm designed to spot keywords in continuous speech. While other test and evaluation strategies can be used, the following regime permits comparison with other wordspotters. This procedure utilizes the Road Rally data base, which is composed of two sets of talkers, the Waterloo data base and the Stonehenge data base. Descriptions of these two data bases are attached.
The Waterloo data base was constructed to train models for keywords in the conversational Stonehenge data base. It consists of 56 talkers reading a paragraph which contains several examples of each keyword in varying contexts, recorded over live AT&T telephone lines and phones. Each talker's speech is stored in a separate pcm file, and there are associated mark files which contain the location and duration of each of the keywords. An important feature of this data base is that it consists of 28 female talkers followed by 28 male talkers. In running sex-specific tests it is permissible to train only on talkers of the same sex as the test talkers; however, both male and female Waterloo talkers may be used in training. Additional talkers which may be used to augment the training data are male Stonehenge talkers 3-10 and 13-16 and female Stonehenge talkers 1, 2, 11, 12, 42, and 58. The combination of these Stonehenge talkers with all the Waterloo talkers provides enough data to allow splitting it into two training sets. This allows training of any secondary classifier, adjusting any word-dependent thresholds required to equalize false alarm rates across words, and setting any other classifier parameters.
The testing is to be done on talkers from the Stonehenge data base. The wordspotting algorithm is to search each of the prescribed test talkers' speech files, assigning a score to putative locations of the keywords. Algorithmic thresholds and parameters used to determine putative keyword locations are to be set automatically or fixed before the test begins. For each keyword, the locations and associated scores, combined across all test talkers, are to be output to an ASCII file for evaluation. The following sets of talkers from the Stonehenge data base are suggested for use in three experiments:

     1. Talkers 33-41 and 43, which are 26 minutes of male speakers
        containing 405 keywords.

     2. Talkers 33-41, 43, 49-57, and 59, which are 51 minutes of male
        speakers containing 827 keywords.

     3. Talkers 33-41 and 43 with talkers 44-48 and 60-64, which are
        10 male and 10 female talkers.
The evaluation is done in a way that eliminates the need to compare scores across keywords. The putative keyword locations are first ordered by score from best to worst across all talkers for each individual keyword. Then a tally is made of the number of true hits found as the 1st, 2nd, etc. false alarm for each keyword is encountered; these are the numbers of words which would be found if the detection threshold were set at the score of each false alarm in turn. At each false alarm level the tallies are added across keywords and expressed as a percentage of the total number of keywords in the test material. A graph is then formed of the percentage found (the y axis) up to each of the false alarms (the x axis). This yields an ideal performance curve (ROC) based upon a posteriori thresholds set for each keyword and each false alarm level.
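The per-keyword tally step above can be sketched as follows; the function and array names are illustrative. For one keyword, the putative hits are assumed already sorted best score first and labelled true hit (1) or false alarm (0):

```c
/* Sketch of the per-keyword tally: given one keyword's putative hits
 * in score order, count how many true hits precede each false alarm. */
#include <stddef.h>

/* labels[i] is 1 for a true hit, 0 for a false alarm, best score
 * first.  hits_before_fa[k] receives the number of true hits found
 * before the (k+1)th false alarm; returns the number of false alarms
 * tallied (at most max_fa). */
size_t tally_keyword(const int *labels, size_t n,
                     size_t *hits_before_fa, size_t max_fa)
{
    size_t hits = 0, fa = 0;
    for (size_t i = 0; i < n && fa < max_fa; i++) {
        if (labels[i])
            hits++;
        else
            hits_before_fa[fa++] = hits;
    }
    return fa;
}
```

Summing these tallies across keywords at each false alarm level, and dividing by the total keyword count in the test material, gives the percentages plotted on the ROC's y axis.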
If the putative keyword list resulted from spotting on a fraction of an hour (T) and it is desired to scale the performance curve so that the x axis reads in terms of false alarms per hour, the following two concepts are applied. First, if the ideal threshold is placed at the score of the Nth false alarm, the percentage of true hits found up to that level is reported at the (N - 1/2) false alarm point. A heuristic justification is that setting the threshold epsilon higher would have produced the same percentage with N-1 false alarms encountered, while setting it epsilon lower would still have produced the same percentage with N false alarms encountered. Setting the threshold at the Nth false alarm level is therefore considered the average of the N-1 and N false alarm points, which is the N-1/2 false alarm level. The second concept is that the number of false alarms encountered at a given performance level is proportional to the fraction of an hour (T) that was spotted. Thus the performance percentage at the Nth false alarm, in terms of false alarms per hour, is labelled on the x axis as the (N-1/2)/T false alarms per hour level.
A single figure of merit (FOM) can be calculated as the average of the scores up to 10 false alarms per keyword per hour as follows:
Let pi be the percentage of true hits found before the ith false alarm. Then:

     FOM = (p1 + p2 + p3 + ... + pN + a*pN+1) / (10T)

where:

     T = the fraction of an hour of test talker material
     N = the first integer >= 10T - 1/2
     a = 10T - N (a factor that interpolates to 10 false alarms
         per hour)

This FOM yields a mean performance over zero to ten false alarms per keyword per hour, which is a useful statistic. Reporting at zero or one false alarm is very important, but these statistics tend to have a large variance over test data, while the performance at 10 false alarms is less interesting but more stable. The above FOM is a reasonable compromise.
Note that the evaluation using a posteriori thresholding just described allows comparison of wordspotting performance but does not address the issue of a priori threshold setting. To test threshold stability between training and test data, a method of evaluation using a priori threshold setting is suggested. A priori thresholds are set at each false alarm level from zero to 10 false alarms per hour using a second training data set. For each word tested, two curves are constructed using the training false alarm levels as the abscissa. The first is an ROC from the test data of the percentage of hits found at the a priori thresholds. The second is a curve of the number of false alarms occurring in the test data at the a priori thresholds.
An example of the format suggested for use in outputting the putative keyword locations follows:
          want   sm33c   39000   1400   -14.893
        conway   sm33c   38300   2900   -25.639
      waterloo   sm33c   39000   2200   -28.499
     middleton   sm33c   37600   3800    -6.466
      mountain   sm33c   37700   3800    10.409
    interstate   sm33c   36600   5000   -28.135
       thicket   sm33c   40000   2400   -38.437
       retrace   sm33c   39900   4400   -17.553
     backtrack   sm33c   40800   4300   -40.514
     boonsboro   sm33c   40100   5300   -18.945
          want   sm33c   44200   1200   -19.260
      waterloo   sm33c   43400   2000   -33.477
          look   sm33c   44500   1000   -14.461
         minus   sm33c   43300   2200   -22.635
      mountain   sm33c   43800   1900   -24.615
       primary   sm33c   43500   2600   -35.958
          road   sm33c   42500   3800     8.871
         track   sm33c   41200   5100    15.970
        conway   sm33c   42100   4500   -16.875

The corresponding C format statement is:

     fprintf(out_fp, "%23s %8s %10d %5d %10.3f\n",
             word, pcmroot, hstart, hlength, score);

where:

     word      is the putative keyword detected.
     pcmroot   is the root portion of the pcm file name containing
               the putative keyword.
     hstart    is the file's pcm sample number marking the beginning
               of the putative keyword.
     hlength   is the length in samples of the putative keyword.
     score     indicates the relative likelihood of this putative
               keyword being a true keyword hit.
The following is a comparison of the two Road Rally sub-corpora:
                     Stonehenge                    Waterloo
                     ---------------------------   ---------------------------
     Sample format   16-bit 2's complement         same as Stonehenge

     Sample rate     10,000 Hz                     same as Stonehenge

     Microphone      high-quality mic in           dialed-up telephone lines
                     telephone handset             and conventional telephone
                                                   handsets

     Filtering       PCM 300Hz to 3300Hz           same as Stonehenge
                     (telephone bandwidth)

     Domain          road rally planning task      same as Stonehenge

     Talkers         84 (56 males/28 females)      56 (28 males/28 females)
                     (96 - 16 withheld +           (no overlap with
                     4 extra carrier speakers)     Stonehenge)

     Speech style    conversational (spkrs 1-64)   read passage (not same as
                     read passage (spkrs 33-96)    Stonehenge)
                     carrier sents. (spkrs 33-96)

     Target words    20                            same as Stonehenge

     Target          word, variant, uttid, begin   same as Stonehenge
     labeling        sample, offset of end         (all utterances are
                     sample from begin sample,     labeled)
                     note (spkrs 65-204 are not
                     labeled)