Editorial note: the following is a revised version of material prepared to document DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA speech recognition workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note is an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories on this CD-ROM in which the material may be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are enclosed within square brackets.

****************************************

[MEMO TO EXPECTED PARTICIPANTS IN]

October 1989 DARPA Resource Management Benchmark Speech Recognition Tests

[From: David S. Pallett, NIST, August 1989]

**********

What is the same as for previous tests?

Most of the details are the same. (See [FEBRUARY 1989 DARPA SPEECH RECOGNITION RESOURCE MANAGEMENT BENCHMARK TESTS ***AMENDED JANUARY 23, 1989***].) But...

What is different this time?

(1) New action dates (of course):
    Distribution of the test material: week of August 28.
    Required date for receipt of data at NIST: October 2, 1989.
(2) Protocols for testing/reporting results on speaker adaptive systems (see below).
(3) Change of filenaming convention for test material (see below).
(4) Changed format for reporting results to NIST for summary prior to the October meeting (see below).
(5) Optional use of the new "statistical class grammar" (see below).
(6) New test material (see below).

**********

[1] Action Dates:

NIST has been asked by Charles Wayne to provide a tabulation of results prior to the meeting, as we did for the February meeting. At that time, some sites did not send the data to us on time, nor in a convenient "standard" format that we could process efficiently through the scoring software. This time, we have asked for two weeks to process the data and prepare a summary, using some of the statistical tools we are developing (i.e., implementations of McNemar's test and the matched pairs test that Larry Gillick and Steve Cox advocate). Late submissions of data (at most a day or two late) are ONLY acceptable if the format convention described below is scrupulously adhered to.

[2] Protocols for Testing Speaker Adaptive/Short Training Systems:

Until recently, CMU (Kai-Fu Lee) had indicated that they (he) planned to report on performance of their speaker independent system using the 10 "rapid adaptation" sentences provided for each speaker. At present [August 1989] this appears unlikely. If other sites intend to report on "rapid adaptation" on this test set, please contact me (DSP) at NIST to describe the test protocol that you propose, and we can discuss this issue.

BBN has advised that they intend to report results on their speaker dependent system using a set of 40 sentences for "short training". Performance of systems fully trained on the set of 600 (570?) training sentences for each speaker is to be compared with that of systems trained only on the set of 40 sentences. If others intend to report on "short training", please contact me.

[3] Change of Test Filenaming Convention:

Some criticism of our benchmark tests has involved the fact that the identity of the filenames (and hence the reference transcriptions) and the speakers (for speaker independent systems) was known. One suggestion was to go to "blind filenames", revealing the text of the reference string and the identity of the speakers (for speaker independent systems) only after the tests had been run and the output submitted to NIST for "official scoring". Although we seriously entertained this idea, we have decided not to implement the suggestion at this time with this set of resource management data.

We have, however, chosen to modify the filenaming convention for this round of test material. The intent of the revision is to facilitate generation of test results in a standard format and processing of those results by NIST. The filename will contain the minimum amount of information required to uniquely identify the utterance and, as such, can be used as an unambiguous utterance identifier in the test result files. An original filename such as "tdde-pgh01-st0123-b.adc" would now be represented as "pgh0-st0123.adc".
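[Editorial comment: for sites generating the new names programmatically, the following shell fragment is a minimal sketch of the mapping. The variable names and the assumption that old-style names always carry four "-"-separated fields are illustrative, not part of the test specification. Note that the speaker field is truncated to four characters, consistent with the Appendix A scripts.]

# hypothetical example: derive a new-style name from an old-style name
old=tdde-pgh01-st0123-b.adc
new=`echo $old | awk -F- '{printf "%s-%s.adc\n", substr($2,1,4), $3}'`
echo $new          # prints "pgh0-st0123.adc"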
[4] Changed Format for Reporting Results to NIST:

(Note that these instructions override those in the January-February 1989 instructions.)

The format for Fall '89 test results submission is slightly different from the previous test format and should be easier to generate and score. The new format provides unambiguous utterance identification by adding a speaker identifier to the sentence identification field at the end of the hypothesis string (see below for an example). The new utterance identifier is intentionally identical to the utterance filename (excluding the ".adc" extension) and should be extracted from the filename when producing hypothesis strings.

The new format allows all of the speaker hypothesis strings to be stored in one submission file, easily split into separate test files, and then further parsed into "speaker" files for scoring. For your convenience, we have included "csh" and "sh" shell programs below [Appendix A] which will create multiple old-format speaker ".hyp" files from one input file containing hypothesis strings with the new utterance identifier. (Note: the input file must contain ONLY hypothesis strings from ONE test.) A future edition of the NIST scoring software will handle speaker parsing automatically.

Format for submission of Fall '89 recognition test results:

#
#
#
#
#-----
#1. [Test title 1][CR]
#2. [Test title 2][CR]
#.
#.
#.
#n. [Test title n][CR]
#-----
#1. [Test title 1][CR]
[hypothesis string] ([speaker id]-[sentence id])[CR]
[hypothesis string] ([speaker id]-[sentence id])[CR]
.
.
.
[hypothesis string] ([speaker id]-[sentence id])[CR]
#2. [Test title 2][CR]
.
.
.
etc.

***** Excerpt of an Example Submission *****

#The following data contains the results of two recognition tests.
#In the speaker-independent-no-grammar test please note that
# .....
#-----
#1. Speaker Independent/No Grammar
#2. Speaker Independent/Word-Pair Grammar
#-----
#1. Speaker Independent/No Grammar
IS JASON+S MAXIMUM SUSTAINED SPEED SLOWER THAN JUPITER+S (JDM2-ST0009)
TURN AREAS OFF AND REDRAW CURRENT AREA (JDM2-ST0029)
IS TRIPOLI IN THE HOOKED PORT (JDM2-ST0064)
.
.
.
HAS SWORDFISH REPORTED ANY TRAINING PROBLEMS (JDM2-ST0091)
WHAT SPEED IS EISENHOWER GOING (CMH1-ST0160)
HOW SOON CAN ESTEEM CHOP TO ATLANTIC FLEET (CMH1-ST0186)
.
.
.
DEFINE AREA ALERTS FOR GULF OF CALIFORNIA (CMH1-ST0203)
.
.
#2. Speaker Independent/Word-Pair Grammar
.
.
.
etc.
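[Editorial comment: since the Appendix A scripts require an input file containing hypothesis strings from only ONE test, a submission file in the format above must first be split at the "#n." section headers. The "sh"/awk sketch below is one way to do that and is offered for illustration only; the output filenames "test1.hyp", "test2.hyp", ... are an assumed convention, not part of the required procedure.]

#! /bin/sh
# split_tests.sh -- split a Fall '89 submission file ($1) into one file per
# test. Skips everything up through the second "#-----" separator (the
# leading comments and the list of test titles), then starts a new output
# file, test1.hyp, test2.hyp, ..., at each "#n." test header.
awk '
  /^#-----/           { seps++; next }
  seps < 2            { next }
  /^#[0-9][0-9]*\./   { out = "test" ++n ".hyp"; next }
  out != "" && NF > 0 { print > out }
' $1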
[5] Optional Use of New Grammar:

BBN has developed an alternative to the use of the "word-pair" grammar: a statistical class grammar. As you are undoubtedly aware, the grammar has been thoroughly documented and made available over the net. I think they are to be commended for developing and offering an alternative, one that may offer a number of advantages. Alan Derr has outlined the motivation for use of this grammar in a net note that some might not have received. We are advised that BBN will make use of this grammar (as a complement to the present "standard" use of the no-grammar and word-pair conditions) in the tests to be reported at this meeting. Thus its use is, for the time being, optional.

[6] Planned Test Material:

The test material to be distributed at the end of August consists of both new and previously used test material.

The new material consists of the following: (a) one set of 300 sentence utterances (25 sentence utterances/speaker X 12 speakers) chosen from the Speaker Dependent Evaluation subset, and (b) one set of 300 sentence utterances (30 sentence utterances/speaker X 10 speakers) chosen from the Speaker Independent Evaluation subset. This is comparable to what was distributed for the February 1989 tests. This test material will be new (previously unreleased).

The previously used test material has been drawn from the test set used in February. It consists of the following: (a) one set of 150 sentence utterances (25 sentence utterances/speaker X 6 speakers) chosen from the February '89 Speaker Dependent Evaluation subset, and (b) one set of 150 sentence utterances (30 sentence utterances/speaker X 5 speakers) chosen from the February '89 Speaker Independent Evaluation subset.

We have designated this a "Test/Retest Set", and are interested in using the results, in conjunction with the statistical tools, to see if we can identify significant progress since the last test. We recognize that the sample size may be too small, but hope that the increment of progress may be shown to be significant in the no-grammar case. Because it involves extra processing time, we request only that each site use it for one pass through what you believe to be your "best performing" system, in the "no grammar" condition. Since this amounts to processing 150 utterances once, it should not present a significant burden.

Note that in the February 1989 memo, in the section "USE OF THIS TEST MATERIAL", paragraph 6, it states that [the February 1989 material] "is to be used once and only once for each system configuration". We recognize that there may nonetheless have been some reuse of the test material, and we ask that you specify the use you may have made of it since February so that we can be alert to the likelihood of systems having been trained on the test material.
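[Editorial comment: for readers curious about the statistical tools mentioned in [1], the sketch below shows a textbook form of McNemar's test as it might be applied to paired per-sentence results from two systems. It is an illustration only, not the NIST scoring implementation, and the input format (one line per test utterance, with a 0 or 1 for each system) is assumed here.]

#! /bin/sh
# mcnemar.sh -- a minimal sketch of McNemar's test (not the NIST code).
# Input ($1): one line per test utterance, two columns, each 0 (sentence
# in error) or 1 (sentence correct); column 1 is system A, column 2 is
# system B.
awk '
  $1 == 0 && $2 == 1 { n01++ }               # A wrong, B right
  $1 == 1 && $2 == 0 { n10++ }               # A right, B wrong
  END {
    n = n01 + n10                            # discordant pairs
    if (n == 0) { print "no discordant pairs"; exit }
    d = n01 - n10; if (d < 0) d = -d
    chi = (d - 1) * (d - 1) / n              # continuity-corrected, 1 df
    printf "discordant pairs = %d, chi-square = %g\n", n, chi
    if (chi > 3.84) print "difference significant at the 0.05 level"
    else            print "difference not significant at the 0.05 level"
  }
' $1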
Appendix A

#
# make_spkr_hyp_files.csh
#
# Converts sentences from a file using the new "(speaker id-sentence id)"
# format into individual speaker .hyp files with the old "(sentence id)"
# format.
# Make sure that no .hyp files exist in the current directory prior to
# running this script, or they may be appended to.
#
if ($1 == "") then
    echo "Usage: make_spkr_hyp_files.csh input_file"
    exit
endif
set count = 1
set total_lines = `cat $1 | wc -l`
echo $total_lines                    # report the number of lines to process
while ($count <= $total_lines)
    # extract line $count, then derive the speaker file name from the
    # 4-character speaker id and rebuild the line with the old-format
    # (sentence id) identifier
    set line=`tail +$count $1 | head -1`
    set spkrfile=`echo $line | awk -F'(' '{print substr($2,1,4) ".hyp"}' -`
    set newline=`echo $line | awk -F'(' '{print $1 "(" substr($2,6,25)}' -`
    echo $newline >> $spkrfile
    echo $newline appended to $spkrfile
    @ count++
end

-----------------------------------------------------------------

#! /bin/sh
# make_spkr_hyp_files.sh
#
# Converts sentences from a file using the new "(speaker id-sentence id)"
# format into individual speaker .hyp files with the old "(sentence id)"
# format.
# Make sure that no .hyp files exist in the current directory prior to
# running this script, or they may be appended to.
#
if test $# -ne 1
then
    echo "Usage: make_spkr_hyp_files.sh input_file" > /dev/tty
    exit 0
fi
count=1
total_lines=`cat $1 | wc -l`
while test $count -le $total_lines
do
    # extract line $count, then derive the speaker file name from the
    # 4-character speaker id and rebuild the line with the old-format
    # (sentence id) identifier
    line=`tail +$count $1 | head -1`
    spkrfile=`echo $line | awk -F'(' '{print substr($2,1,4) ".hyp"}' -`
    newline=`echo $line | awk -F'(' '{print $1 "(" substr($2,6,25)}' -`
    echo $newline >> $spkrfile
    echo $newline appended to $spkrfile > /dev/tty
    count=`expr $count + 1`
done
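[Editorial comment: as a usage illustration, suppose the Speaker Independent/No Grammar hypothesis strings from the example submission above have been extracted into a file named si_ng.txt (a hypothetical name). Then

sh make_spkr_hyp_files.sh si_ng.txt

would create JDM2.hyp, CMH1.hyp, and so on, one old-format ".hyp" file per test speaker, in the current directory.]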