The Road Rally Word-Spotting Corpora
(RDRALLY1)

NIST Speech Disc 6-1.1
September, 1991

The "Road Rally" corpora were designed for the development and testing of word-spotting systems and were collected in a conversational domain using a road rally planning task as the topic. The corpora actually consist of two sub-corpora, "Stonehenge" and "Waterloo". The Stonehenge corpus contains road rally planning conversations as well as some read speech, collected using high-quality microphones and a telephone-simulating filter. The Waterloo corpus contains read road-rally-domain speech which was collected over actual telephone lines.

Table of Contents

  1. Stonehenge
  2. Waterloo
  3. Road Rally directory and file structure
    3.1 Speech waveform files
    3.2 Key word marking files
  4. Text corpora
  5. Suggested training and test material
  6. Suggested wordspotting training and test procedures
  7. Comparison of Stonehenge and Waterloo

1 Stonehenge

The Stonehenge corpus was collected from subjects using telephone handsets which were modified to contain a high quality microphone. To gather conversational data, 2 talkers were located in separate rooms, given a road map, and asked to participate in a road rally planning task. Their objective was to form a path between two locations on the map which would maximize their road rally point score. They were also given a time limit in which to complete the task to increase their responsiveness. Their speech was recorded on a stereo tape recorder with each subject's speech on a separate track. The tracks were digitized and the speech was edited to remove silences longer than a second or so. This resulted in approximately 3 minutes of continuous speech per subject. The speech was filtered using a 300Hz to 3300Hz PCM FIR bandpass filter to simulate telephone bandwidth quality. (See the file, "fltrcoef.txt", for the filter coefficients.) Over 100 subjects participated in the effort, but some were excluded from the final corpus because of unsuitable response or technical problems.

Note: The conversational speech files contain single-channel speech. Therefore, these files contain only the speech for one of the speakers in a conversation. The two speech files which make up a conversation are not identified and it is possible that they do not both exist on this CD-ROM.

Twenty words were identified in Stonehenge as "key" words, and most of the speech files in the corpus have corresponding text files which identify key word occurrences and locations. See Section 3.2, "Key Word Marking Files", below.

The Stonehenge corpus contains three "styles" of speech data:

  1. the spontaneous conversations described above
  2. a read paragraph of road rally domain speech which contains at least 1 occurrence of each of the 20 key words (about 1 minute long)
  3. a set of read "carrier" sentences which list the key words in vowel contexts, consonant contexts, and in isolation.

The complete Stonehenge corpus consists of speech from 96 speakers (speaker codes 01-96). The speech for 16 of these speakers has been reserved for future testing. This distribution, therefore, contains speech for the remaining 80 speakers: 28 females and 52 males. In addition, the carrier sentences were recorded for 4 male auxiliary speakers (speaker codes 201-204) and are also included. The distribution of speakers and corpora is listed in the table below.

        Speaker/Speech Style Distribution

Speakers	Style				
--------	--------------------------------
01-16		conversation 
33-64		conversation, paragraph, carrier
65-96		paragraph, carrier 
201-204		carrier (no marking files)

2 Waterloo

The Waterloo corpus was collected as an extension to Stonehenge to provide similar domain speech under different conditions. The corpus was collected from subjects using conventional telephones and dialed-up telephone lines in the Massachusetts area. Unlike the Stonehenge speech, the Waterloo speech is naturally band-limited by the telephones and lines but, for consistency, the speech was also filtered using the Stonehenge 300Hz to 3300Hz PCM FIR bandpass filter. The corpus consists of 56 speakers (28 males and 28 females) each speaking a (read) paragraph of road rally domain speech.

NOTE: Although numbered similarly, the Waterloo subjects have no relation to the Stonehenge subject population. Also, the read paragraph used in Waterloo is not the same text read by the Stonehenge subjects.

The 20 Stonehenge "key" words were also identified as the key words in Waterloo. All of the speech files for the 56 speakers have corresponding text files which identify the key word occurrences and locations. See Section 3.2, "Key Word Marking Files", below.

3 Road Rally Directory and File Structure

The road rally corpora are made up of two sub-corpora, Stonehenge and Waterloo. This division is reflected in the CD-ROM directory structure as follows:

/rdrally1           (top level directory)
    
/rdrally1/stonheng  (subdirectory containing Stonehenge corpora)
/rdrally1/waterloo  (subdirectory containing Waterloo corpora)
Each of these subdirectories contains two file types: (1) speech waveform files, and (2) auxiliary "key" word marking files. The format for these file types is consistent between Stonehenge and Waterloo. Waveform and marking files are named with a unique utterance ID code and the file types are identified by a unique filename extension as follows:
    ROADRALLY-FILE ::= <UTTERANCE-ID>.<FILE-TYPE>

    where,

        UTTERANCE-ID ::= <SUB-CORPORA-ID><SEX-ID><SPEAKER-ID><UTTERANCE-TYPE>

        where,

            SUB-CORPORA-ID ::= s (Stonehenge) |
                               w (Waterloo)
            SEX-ID ::= f (female) |
                       m (male)
            SPEAKER-ID ::= [01 | ... | 96 | 201 | ... | 204] (Stonehenge) |
                           [01 | ... | 56] (Waterloo)
            UTTERANCE-TYPE ::= c (conversation) |
                               p (read paragraph) |
                               s (read "carrier" sentences)
                               (NOTE: Waterloo files contain no utterance type
                                      identifier)

        FILE-TYPE ::= wav (speech waveform) |
                      mrk (key word marking file)

examples:

  1.  sf01c.wav

      (Stonehenge corpus, female speaker, speaker-ID 1, conversation, 
      speech waveform file)

  2. wm35.mrk

     (Waterloo corpus, male speaker, speaker-ID 35, marking file)

The file types are described in more detail below.
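
The utterance-ID fields described by the grammar above can be decoded mechanically. The following is a minimal C sketch; the function name is illustrative and is not part of the corpus software:

```c
#include <stdio.h>
#include <string.h>

/* Decode a Road Rally utterance ID such as "sf01c" or "wm35".
 * The trailing utterance-type letter is absent in Waterloo IDs.
 * Returns 1 on success, 0 on a malformed ID. */
int parse_utterance_id(const char *id, char *corpus, char *sex,
                       int *speaker, char *utt_type)
{
    size_t n = strlen(id);
    if (n < 4)
        return 0;
    *corpus = id[0];                 /* 's' = Stonehenge, 'w' = Waterloo */
    *sex = id[1];                    /* 'f' = female, 'm' = male */
    if (*corpus == 'w') {            /* Waterloo: no utterance-type letter */
        *utt_type = '\0';
        return sscanf(id + 2, "%d", speaker) == 1;
    }
    /* Stonehenge: the last character is c (conversation), p (paragraph),
     * or s (carrier sentences); the digits in between are the speaker ID. */
    *utt_type = id[n - 1];
    return sscanf(id + 2, "%d", speaker) == 1;
}
```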

3.1 Speech Waveform Files

The speech waveform files are identified by a ".wav" extension. These files are formatted with the NIST SPHERE header structure. Briefly, the .wav files contain a 1024-byte ASCII header followed by 16-bit, 10kHz, MSB-LSB sampled speech waveform data. The following is an example SPHERE header from the file, "/rdrally1/stonheng/sf01c.wav":

 
    NIST_1A
       1024
    database_id -s8 RDRALLY1
    database_version -s3 1.0
    utterance_id -s5 sf01c
    channel_count -i 1
    sample_count -i 1981620
    sample_rate -i 10000
    sample_min -i -27728
    sample_max -i 29088
    sample_n_bytes -i 2
    sample_byte_format -s2 10
    sample_sig_bits -i 16
    end_head
   
The self-describing header contains information pertinent to file identification and basic D/A operations. See /sphere/readme.doc for more information about SPHERE.
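
As an illustration only (the SPHERE package under /sphere is the authoritative reader; the helper names below are hypothetical), the integer header fields and the MSB-LSB sample data can be handled in a few lines of C:

```c
#include <stdio.h>
#include <string.h>
#include <stdint.h>

/* Pull an integer-valued field such as "sample_count -i 1981620" out of
 * the 1024-byte ASCII header. Returns 1 if the field is found. */
int sphere_get_int(const char *header, const char *field, long *value)
{
    const char *p = strstr(header, field);
    if (p == NULL)
        return 0;
    return sscanf(p + strlen(field), " -i %ld", value) == 1;
}

/* Assemble one 16-bit sample from two raw data bytes. The data is
 * MSB-LSB (sample_byte_format "10"), so this works regardless of the
 * host machine's byte order. */
int16_t sample_from_bytes(unsigned char msb, unsigned char lsb)
{
    return (int16_t)(((unsigned)msb << 8) | lsb);
}
```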

3.2 Key Word Marking Files

Auxiliary "Key" word marking files are included for most of the speech corpora in Stonehenge and all of the corpora in Waterloo. These files identify key words and their locations in the speech (.wav) files. The marking files share the same filenames (utterance ID's) as their corresponding speech files but contain ".mrk" extensions.

The following words are marked as key words in the Road Rally corpora:

	1.  Boonsboro
	2.  Chester
	3.  Conway
	4.  interstate
	5.  look
	6.  Middleton
	7.  minus
	8.  mountain
	9.  primary
	10.  retrace
	11.  road
	12.  secondary
	13.  Sheffield
	14.  Springfield
	15.  Thicket
	16.  track
	17.  want
	18.  Waterloo
	19.  Westchester
	20.  backtrack

"Marking" (.mrk) files corresponding to speech waveform (.wav) files provide sample-number-aligned identification of occurrences of key words. The marking files are text files which contain the following tabular fields, separated by one or more spaces:

    <KEY-WORD> <VARIANT-ID> <UTTERANCE-ID> <BEG-SAMPLE> <END-OFFSET> <NOTE>

where,

    KEY-WORD ::= one of the twenty key words listed above
    VARIANT-ID ::= 1 (base key word) |
                   2 ("s" plural form) |
                   3 ("ed" past tense) |
                   4 ("ing" present participle) |
                   81-84 (key word detected in the crosstalk of other talker) |
                   99 (mispronounced)
    UTTERANCE-ID ::= <SUB-CORPORA-ID><SEX-ID><SPEAKER-ID><UTTERANCE-TYPE>

    where,

        SUB-CORPORA-ID ::= s (Stonehenge) |
                           w (Waterloo)
        SEX-ID ::= f (female) |
                   m (male)
        SPEAKER-ID ::= [01 | ... | 96 | 201 | ... | 204] (Stonehenge) |
                       [01 | ... | 56] (Waterloo)
        UTTERANCE-TYPE ::= c (conversation) |
                           p (read paragraph) |
                           s (read "carrier" sentences)
                           (Note: Waterloo files contain no utterance type
                                  identifier)

    BEG-SAMPLE ::= sample number of start of key word in utterance (.wav) file
    END-OFFSET ::= offset from BEG-SAMPLE of end of key word
    NOTE ::= variant or key word and variant spelled out (this field is not
             used consistently throughout the corpus)

The following is an example of the marking format from the file,
"/rdrally1/stonheng/sf01c.mrk":

middleton             1 sf01c         32681  3999         
middleton             1 sf01c         39481  4200         
mountain              1 sf01c         75682  3199         
track                 2 sf01c         78881  3199         
road                  1 sf01c        101881  3199         
secondary             1 sf01c        116659  5326         
road                  1 sf01c        122059  4121         
mountain              1 sf01c        145381  5499         
track                 1 sf01c        150881  2599         
mountain              1 sf01c        168309  5571         
middleton             1 sf01c        208281  5400         
conway                1 sf01c        216900  6051         
interstate            1 sf01c        265434  5848         
minus                 1 sf01c        325463  4314         
interstate            1 sf01c        330463  6399         
mountain              1 sf01c        376059  4002         
track                 1 sf01c        380061  2400         
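
Given the field layout above, one line of a marking file can be split with a single sscanf call. This minimal sketch (the function name is illustrative) drops the optional NOTE field, which is not used consistently; note that the last sample of a key word is BEG-SAMPLE + END-OFFSET:

```c
#include <stdio.h>

/* Parse one .mrk line into its tabular fields. Any trailing NOTE text
 * is ignored. keyword must hold at least 32 bytes and uttid at least
 * 16. Returns 1 on success, 0 on a malformed line. */
int parse_mrk_line(const char *line, char *keyword, int *variant,
                   char *uttid, long *beg_sample, long *end_offset)
{
    return sscanf(line, "%31s %d %15s %ld %ld",
                  keyword, variant, uttid, beg_sample, end_offset) == 5;
}
```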

4 Text Corpora

Unfortunately, the prompts used in the read corpora and clean transcriptions for the conversational corpora were not available to NIST at the time of this CD-ROM release. The following texts are approximations of the prompts for the read Stonehenge and Waterloo corpora, derived at NIST by transcription and comparison of several of the read passages. Conformance to the texts was loosely enforced, and some of the material contains rewording, false starts, verbal editing, and even dialogue with the recording engineer. Note that the Stonehenge speakers stated their speaker-ID code, the date of recording, and the name of the text at the beginning of each of the read texts.

Stonehenge Road Rally Passage

I am speaker <SPEAKER-ID>.

Today's date is <MONTH> <DATE>, <YEAR>.

Road Rally Passage.

A good solution to the Road Rally task requires several decisions. Here are some broad guidelines you ought to follow. You should use tracks that involve mountain roads, secondary roads, and ferries whenever the travel time is not excessive and avoid backtracking. The first leg starting at Middleton and ending in Boonsboro has a target time of 7 1/2 hours. The highest scoring path for this leg goes through Conway and Thicket on the primary road. The route through Springfield looks attractive since it results in the shortest travel time. However, the minus score for the interstate highway results in an overall low score. The best track for the second leg goes through Chester and Sheffield resulting in a score of 20 points. The leg from Westchester to Waterloo has a top score of 110 points. You don't want to retrace roads between towns on the last four legs.

Stonehenge Carrier Sentences

I am speaker <SPEAKER-ID>.

Today's date is <MONTH> <DATE>, <YEAR>.

Key word sentences.

Speak backtrack please. Say backtrack again. Backtrack.
Speak Boonsboro please. Say Boonsboro again. Boonsboro.
Speak Chester please. Say Chester again. Chester.
Speak Conway please. Say Conway again. Conway.
Speak interstate please. Say interstate again. Interstate.
Speak look please. Say look again. Look.
Speak Middleton please. Say Middleton again. Middleton.
Speak minus please. Say minus again. Minus.
Speak mountain please. Say mountain again. Mountain.
Speak primary please. Say primary again. Primary.
Speak retrace please. Say retrace again. Retrace.
Speak road please. Say road again. Road.
Speak secondary please. Say secondary again. Secondary.
Speak Sheffield please. Say Sheffield again. Sheffield.
Speak Springfield please. Say Springfield again. Springfield.
Speak Thicket please. Say Thicket again. Thicket.
Speak track please. Say track again. Track.
Speak want please. Say want again. Want.
Speak Waterloo please. Say Waterloo again. Waterloo.
Speak Westchester please. Say Westchester again. Westchester.

Waterloo Road Rally Passage

To get from Middleton to Waterloo, you go on interstate 37 out of Middleton to Middleton Road. You want to turn right off the interstate primary onto Conway toward Chester. Then look for a primary road going up the mountain to Westchester Road. If you miss that, there is a secondary dirt track through Chester and you can backtrack to Westchester from there. When you get to Chester, take a sharp left turn that backtracks along the road you are on. If you go straight out of Chester, the track runs into a thicket and your car will get scratched up. The backtrack gets you to Westchester which is a primary road. Go south on Westchester to Conway Road. Look for a fork in the road. You want the right fork. The left fork is a secondary to Boonsboro. The secondary at Boonsboro runs through another thicket and dead ends. So if you get to Boonsboro, you want to retrace back through the thicket to Conway. Take Conway around Boonsboro through Sheffield to Springfield. On Conway, you look for Sheffield Road. You take a right off Conway Street. In Sheffield, you want to look for the Sheffield mountain track. Go across the secondary track, take a sharp right, and backtrack about a mile, looking for the entrance to the interstate. Take the primary interstate west into Springfield and get off at exit 2 on the other side of Springfield. The sign says Waterloo Road. You take Waterloo Road up the mountain and on the other side of the mountain you get to Waterloo. The longest it could take you from Middleton to Waterloo is about 5 hours. If you don't miss any turns, it would be 5 hours minus retracing from Chester, minus retracing through the thicket, minus the dead end in Boonsboro, minus retracing through the other thicket or about 3 hours.

5 Suggested Training and Test Material

The following is a suggested usage of the Road Rally corpora in training and comparative testing:
           Training Set:  Read passages for all Waterloo speakers.  

 Augmented Training Set:  The above training set plus male Stonehenge speakers
                          3-10 and 13-16, and female Stonehenge speakers
                          1, 2, 11, 12, 42, and 58.

          Male Test Set:  Conversations for Stonehenge speakers 33-41, and 
                          43 (26 minutes of material containing 405 key words).

Augmented Male Test Set:  The above Male Test Set plus conversations for
                          Stonehenge speakers 49-57, and 59 (additional 25 
                          minutes of material containing 422 key words for
                          a total of 51 minutes of material and 827 key words).

        Female Test Set:  Conversations for Stonehenge speakers 44-48 and 
                          60-64 (approx. 25 minutes of material).

     Cross-Sex Test Set:  The above Male Test Set plus the above Female 
                          Test Set. (approx. 50 minutes of material).

6 Suggested Wordspotting Training and Test Procedures

The following procedures for training, testing, and evaluating keyword-spotting systems were contributed by members of the word-spotting research community for inclusion with these corpora. Other word-spotting training, testing, and evaluation procedures have been proposed and are known to be in use. The use of other procedures in conjunction with the following procedures may provide a basis for informative comparisons of system benchmark performance.

Wordspotting Test Procedure

The following is a description of the procedure for testing and evaluating an algorithm designed to spot keywords in continuous speech. While other test and evaluation strategies can be used, the following regime permits comparison with other wordspotters. This procedure utilizes the Road Rally data base, which is itself composed of two sets of talkers, the Waterloo data base and the Stonehenge data base. Descriptions of these two data bases are given above.

The Waterloo data base was constructed to train models for keywords in the conversational Stonehenge data base. It consists of 56 talkers reading a paragraph which contains several examples of each keyword in varying contexts, recorded over live AT&T telephone lines and phones. Each talker is a separate pcm file, and there are associated mark files which contain the location and duration of each of the keywords. An important feature of this data base is that it consists of 28 female talkers followed by 28 male talkers. In the running of sex-specific tests it is permissible to train only on talkers of the same sex as the test talkers; however, both male and female Waterloo talkers may be used in training. Additional talkers which may be used to augment the training data are male Stonehenge talkers 3-10 and 13-16 and female Stonehenge talkers 1, 2, 11, 12, 42, and 58. The combination of these Stonehenge talkers with all the Waterloo talkers provides enough data to allow splitting it into two training sets. This allows training of any secondary classifier, adjusting any word-dependent thresholds required to equalize false alarm rates across words, and setting any other classifier parameters.

The testing is to be done on talkers from the Stonehenge data base. The wordspotting algorithm is to search each of the prescribed test talkers' speech files, assigning a score to putative locations of the keyword. Algorithmic thresholds and parameters used to determine putative keyword locations are to be set automatically or fixed before the test begins. For each keyword, the locations and associated scores, combined across all test talkers, are to be output to an ascii file for evaluation. The following sets of talkers from the Stonehenge data base are suggested for use in three experiments:

  1. Talkers 33-41 and 43: 26 minutes of male speakers containing 405 keywords.
  2. Talkers 33-41, 43, 49-57, and 59: 51 minutes of male speakers containing 827 keywords.
  3. Talkers 33-41 and 43 together with talkers 44-48 and 60-64: 10 male and 10 female talkers.

The evaluation is done in a way that eliminates the need to compare scores across keywords. The putative keyword locations are first ordered by score from best to worst across all talkers for each individual keyword. Then a tally is made of the number of words found as the 1st, 2nd, etc., false alarm for each keyword is encountered. These are the numbers of words which would be found if the detection threshold were set at the score of each false alarm in turn. At each false alarm level the tallies are added across keywords and expressed as a percentage of the total number of keywords in the test material. A graph is then formed of the percentage found (the y axis) up to each of the false alarms (the x axis). This yields an ideal performance curve (ROC) based upon a posteriori thresholds set for each keyword and each false alarm level.
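
The per-keyword tally just described can be sketched in C as follows, assuming the putative hits for one keyword have already been sorted by score, best first, and that hit/false-alarm truth is known from the .mrk files; the function name is illustrative:

```c
#include <stddef.h>

/* Tally, for a single keyword, the number of true hits found before
 * each false alarm. hits[] holds 1 for a true hit and 0 for a false
 * alarm, ordered by score from best to worst; tally[k] receives the
 * hit count seen before the (k+1)-th false alarm, k = 0..max_fa-1. */
void tally_hits_per_false_alarm(const int *hits, size_t n,
                                int *tally, size_t max_fa)
{
    size_t fa = 0, found = 0;

    for (size_t i = 0; i < n && fa < max_fa; i++) {
        if (hits[i])
            found++;                  /* a true hit above this threshold */
        else
            tally[fa++] = (int)found; /* threshold set at this false alarm */
    }
    /* If fewer than max_fa false alarms occur, the remaining levels
     * see every hit in the list. */
    for (; fa < max_fa; fa++)
        tally[fa] = (int)found;
}
```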

If the putative keyword list resulted from spotting on a fraction of an hour (T) and it is desired to scale the performance curve so that the x axis reads in false alarms per hour, the following two concepts are applied. First, if the ideal threshold is placed at the score of the Nth false alarm, the percentage of true hits found up to that level is reported at the (N - 1/2) false alarm. A heuristic justification is that setting the threshold epsilon higher would have produced the same percentage with N-1 false alarms encountered, while setting the threshold epsilon lower would have still produced the same percentage, but with N false alarms encountered. Therefore, setting the threshold at the Nth false alarm level is considered the average of the N-1 false alarm point and the N false alarm point, which is the (N - 1/2) false alarm level. The second concept is that the number of false alarms that would be encountered at a given performance level is proportional to the fraction of an hour (T) that was spotted. Thus the performance percentage at the Nth false alarm, in terms of false alarms per hour, is labelled on the x axis as the (N - 1/2)/T false alarms per hour level.

A single figure of merit (FOM) can be calculated as the average of the scores up to 10 false alarms per keyword per hour as follows:

 p(i) is the percentage of true hits found before the i-th false alarm.

    FOM = (p(1) + p(2) + ... + p(N) + a*p(N+1)) / (10T)

    where N is the first integer >= 10T - 1/2,
          T is the fraction of an hour of test talkers,
      and a = 10T - N (a factor that interpolates to 10 false alarms
          per hour).

This FOM yields a mean performance over zero to ten false alarms per keyword per hour, which is a useful statistic. Reporting at zero or one false alarm is very important, but these statistics tend to have a large variance over test data, while the performance at 10 false alarms is less interesting but more stable. The above FOM is a reasonable compromise.
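
The FOM formula above can be transcribed directly into C. A minimal sketch, with p[0], p[1], ... holding p(1), p(2), ... and figure_of_merit an illustrative name:

```c
/* Figure of merit: p[] holds the percentage of true hits found before
 * the 1st, 2nd, ... false alarm, and hours is T, the fraction of an
 * hour of test speech. p must hold at least N+1 entries. */
double figure_of_merit(const double *p, double hours)
{
    /* N is the first integer >= 10T - 1/2 (a ceiling, computed without
     * libm since 10T - 1/2 is positive for any realistic T). */
    int n = (int)(10.0 * hours - 0.5);
    if (10.0 * hours - 0.5 > (double)n)
        n++;

    double a = 10.0 * hours - n;   /* interpolates to 10 FA per hour */
    double sum = 0.0;
    for (int i = 0; i < n; i++)
        sum += p[i];
    sum += a * p[n];
    return sum / (10.0 * hours);
}
```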

Note that the evaluation using a posteriori thresholding just described allows comparison of wordspotting performance but does not address the issue of a priori threshold setting. To test threshold stability between training and test data, a method of evaluation using a priori threshold setting is suggested. A priori thresholds are set at each false alarm level from zero to 10 false alarms per hour using a second training data set. For each word tested, two curves are constructed using the training false alarm levels as the abscissa. The first is an ROC, from the test data, of the percentage of hits found at the a priori thresholds. The second is a curve of the number of false alarms occurring in the test data at the a priori thresholds.

An example of the format suggested for use in outputting the putative keyword locations follows:

want                    sm33c         39000  1400    -14.893
conway                  sm33c         38300  2900    -25.639
waterloo                sm33c         39000  2200    -28.499
middleton               sm33c         37600  3800     -6.466
mountain                sm33c         37700  3800     10.409
interstate              sm33c         36600  5000    -28.135
thicket                 sm33c         40000  2400    -38.437
retrace                 sm33c         39900  4400    -17.553
backtrack               sm33c         40800  4300    -40.514
boonsboro               sm33c         40100  5300    -18.945
want                    sm33c         44200  1200    -19.260
waterloo                sm33c         43400  2000    -33.477
look                    sm33c         44500  1000    -14.461
minus                   sm33c         43300  2200    -22.635
mountain                sm33c         43800  1900    -24.615
primary                 sm33c         43500  2600    -35.958
road                    sm33c         42500  3800      8.871
track                   sm33c         41200  5100     15.970
conway                  sm33c         42100  4500    -16.875

The corresponding C format statement (with the string fields left-justified, matching the example output above) is:
          fprintf(out_fp, "%-23s %-8s %10d %5d %10.3f\n",
                  word, pcmroot, hstart, hlength, score);

where:     word is the putative keyword detected.
           pcmroot is the root portion of the pcm file name containing 
                the putative keyword.
           hstart is the file's pcm sample number marking the
                beginning of the putative keyword.
           hlength is the length in samples of the putative keyword.
           score indicates the relative likelihood of this putative 
                keyword being a true keyword hit.

7 Comparison of Stonehenge and Waterloo

The following is a comparison of the two Road Rally sub-corpora:

                    Stonehenge                      Waterloo
                    ---------------------------     ---------------------------
Sample format       16-bit 2's complement           same as Stonehenge

Sample rate         10,000 Hz                       same as Stonehenge

Microphone          high-quality mic in             dialed-up telephone lines
                    telephone handset               and conventional telephone
                                                    handsets

Filtering           PCM 300Hz to 3300Hz             same as Stonehenge
                    (telephone bandwidth)

Domain              road rally planning task        same as Stonehenge

Talkers             84 (56 males/28 females)        56 (28 males/28 females)
                    (96 - 16 withheld +             (no overlap with 
                    4 extra carrier speakers)       Stonehenge)

Speech style        conversational (spkrs 1-64)     read passage (not same as
                    read passage (spkrs 33-96)      Stonehenge)
                    carrier sents. (spkrs 33-96)

Target words        20                              same as Stonehenge

Target labeling     word, variant, uttid, begin     same as Stonehenge
                    sample, offset of end sample    (all utterances are
                    from begin sample, note         labeled)
                    (spkrs 65-204 are not
                    labeled)