Editorial note: the following consists of a revised version of material prepared to document DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA speech recognition workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note consists of an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories in this CD-ROM in which the material might be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are included within square brackets.

****************************************

SELECTED TEST MATERIAL, TEST AND SCORING PROCEDURES
FOR THE JUNE 1988 DARPA BENCHMARK TESTS

David S. Pallett
Institute for Computer Sciences and Technology
National Bureau of Standards
Gaithersburg, MD 20899

ABSTRACT

This paper describes the test material, procedures and scoring conventions used for the June 1988 DARPA Benchmark Tests. [In contrast to other DARPA Benchmark Tests, only] one selected subset of recorded speech corpus material was identified, and this subset was used for tests of both Speaker Dependent and Speaker Independent technologies. [Further,] in contrast to previous DARPA Benchmark Tests, there were no "live" tests during this series of tests. The same [SNOR] convention used in previous tests was to be used for representing the output of the systems, and a standardized string matching and scoring convention was to be used in reporting the results of the Benchmark Tests.

INTRODUCTION

At the Fall 1986 DARPA Speech Recognition Meeting, plans were discussed for implementing Benchmark Tests of the Continuous Speech Recognition Systems being developed under DARPA sponsorship. These plans were further developed during the period October 1986 - March 1987. A preliminary implementation of the test procedures was conducted prior to the March 1987 Meeting. At that meeting, BBN reported on the performance of their Speaker-Dependent technology using a set of 100 sentence utterances (25 sentence utterances from each of 4 designated speakers) chosen from the Resource Management Development Test Set portion of the DARPA Resource Management Speech Corpus [rm1/dep/dev/]. Approximately one month later, CMU reported the results of similar tests [for the CMU ANGEL system] involving another set of 100 sentence utterances (10 sentence utterances from each of 10 designated speakers) [from rm1/ind/dev_aug]. Each site also reported on the results of tests conducted on a set of 90 sentence utterances provided by "live talkers" (30 utterances from each of three designated speakers). The results of these tests were analyzed at NBS and distributed within the DARPA community.

In preparation for the October 1987 Speech Recognition Meeting, NBS selected additional test material for similar tests.
NBS participated in the implementation of the "live talker" test procedures and developed and distributed standardized scoring software to be used in reporting performance results at the October 1987 and future Meetings. The results of these Benchmark Tests conducted by BBN and CMU were reported at the October 1987 Meeting. In the Spring of 1988, MIT's Lincoln Laboratories conducted Benchmark Tests using the specified test material and test protocol, and reported those results within the community [9].

In preparation for the June 1988 Speech Recognition Meeting, NBS selected additional previously unused test material [from rm1/dep/dev/], made minor modifications to the scoring software (primarily in providing a standardized format for data tabulation and minor revisions to the table of homophones), and distributed the test material and scoring software to several sites. This paper identifies the test material used for the Benchmark Tests conducted in June 1988, describes the test procedures, and outlines the characteristics of the scoring software.

RESOURCE MANAGEMENT SPEECH CORPUS TEST MATERIAL

Test Material
---- --------

Previous Tests
-------- -----

Speaker Independent Test Material
------- ----------- ---- --------

For the March '87 tests, a set of ten speakers was identified, drawn from the Speaker Independent Development Test Subset of the Resource Management Speech Corpus [rm1/ind/dev_aug/] recorded at TI [1]. Data on these speakers are given in papers presented at the March '87 Meeting [1,2]. Ten sentence utterances for each of these speakers were used for test purposes.

For the October '87 tests, an additional set of six speakers was identified from the Speaker Independent Development Test Subset, and 10 utterances were specified for each of these speakers, amounting to a total of 60 sentence utterances in this portion of the test material. [EDITORIAL NOTE: Of these six speakers, five were new, and one (subject gtw0) had been included in the March '87 tests.] An additional set of four new speakers was chosen from the Speaker Dependent Development Test Subset [rm1/dep/dev/], and 25 sentence utterances were specified for each of these speakers. An additional 100 sentence utterances were thus identified in this portion of the test material. A total of 160 sentence utterances from ten speakers was thus specified for use in the October '87 Benchmark Tests of Speaker Independent technology. [Editorial Note: Results for the CMU SPHINX Speaker-Independent System were reported in April of 1988, using a 15 speaker (150 sentence utterance) test set derived from the March and October 1987 DARPA Benchmark Test sets.]

Speaker Dependent Test Material
------- --------- ---- --------

For the March '87 tests, a set of four speakers was identified, including 3 males and 1 female [2]. Twenty-five sentence utterances for each of these speakers were specified [from rm1/dep/dev/] and used for test purposes. In previous descriptions of these tests [10], selection of particular sentence utterances was [originally] referred to as a "random process" because no effort was made to review or select specific sentence texts. This was not in fact a truly random selection process, since the selected sentence utterances were typically the first 25 sentences contained on the data tape for that individual. The selection of texts per se is believed to be random, but the selected sentence utterances were consecutively recorded within a recording session.
This proved to establish an unfortunate precedent, since there was no effort to randomize selection from within an individual recording session, and "within session effects" may have occurred. The most commonly observed within session effect seems to be relatively carefully read speech at the beginning of a session, and a tendency toward more casually read speech at the end of the session.

For the October '87 tests, another set of 25 sentence utterances from each of the same four speakers was specified [from rm1/dep/dev/] to be used for test purposes. For each speaker, 10 of these sentence utterances are the same as for the March '87 tests, and were selected as sentence utterances for which at least one error occurred under at least one of the test conditions in BBN's March '87 tests. The purpose of selecting these particular "difficult" sentence utterances was to facilitate demonstrations of incremental progress. (In fact, no analysis to identify such incremental progress was completed.) It is recognized that, by selecting particularly difficult sentence utterances, the performance statistics for the composite set of 25 sentences for each speaker would be biased somewhat lower than if all of the sentences had been randomly chosen, and that separate reporting for the subsets of 10 difficult sentences and 15 "randomly chosen" sentences would be appropriate.

An additional set of four (new) speakers was identified from the Speaker Dependent Development Test Subset [rm1/dep/dev/], and a set of 25 sentence utterances was specified for each of these speakers. This set of 100 sentence utterances is the same as that [also] used in tests of Speaker Independent technology in the October '87 tests. A total of 200 sentence utterances by eight speakers (40 "difficult" + 60 "randomly chosen" sentences by the four speakers used in the March '87 tests, plus 100 sentences for the four new speakers) was thus identified for use in the October '87 Benchmark Tests of Speaker Dependent technology.

June '88 Tests
--------------

For the June '88 tests, it was decided to select one set of test material drawn from the Speaker Dependent Development Test Set (designated tddd in the [original] corpus) [rm1/dep/dev/ in the CD-ROM version] and to use this one set of test material for both Speaker Independent and Speaker Dependent technologies. This was intended to permit comparisons of relative performance on identical test material. It was thought wise not to use the task domain Speaker Independent (and Dependent) Evaluation Test sets (tdie and tdde) materials [some of which are to be found in rm1/ind_eval and rm1/dep_eval] at this time, but to withhold them for future use. [EDITORIAL NOTE: Some of this material was released for the February '89 and October '89 Benchmark Tests: still more is presently unreleased and designated for future tests.]

It was also decided to increase the size of the designated test material to 300 test sentence utterances. Accordingly, following precedents established in the March and October '87 tests, blocks of 25 unused test sentence utterances were selected for each of the 12 speakers in the speaker dependent material [from rm1/dep/dev]. This resulted in the introduction of 4 new speakers, in addition to the 8 used in previous tests. Table 1 lists some information for the 12 speakers in this test material.
                                         Year of
Subject   Sex   Region           Race    Birth    Education
-------   ---   ------           ----    -----    ---------

(Speakers also used in previous Benchmark Tests)

BEF0      M     NORTH MIDLAND    WHT     1952     PHD
CMR0      F     NORTHERN         WHT     1951     MS
JWS0      M     SOUTH MIDLAND    WHT     1940     BS
RKM0      M     SOUTHERN         BLK     1956     BS
DTB0      M     NORTH MIDLAND    AMR*    1942     BS
DTD0      F     SOUTHERN*        BLK     1954     BS
PGH0      M     NEW ENGLAND      WHT     1963     BS
TAB0      M     WESTERN          WHT     1960     BS
-----------------------------------------------------------------

(Speakers not previously used for Benchmark Tests)

DAS1**    F     NORTHERN         WHT     1959     MS
DMS0      F     SOUTH MIDLAND    WHT     1954     BS
ERS0      M     WESTERN          WHT     1957     MS
HXS0      F     NEW YORK CITY    WHT     1941     BS
----------------------------------------------------------------

*  ("AMR" indicates American Indian)
** ("1" indicates second individual with initials DAS)

                          TABLE 1
         Test Speakers for June '88 Benchmark Tests

Perhaps unfortunately, no effort was made to randomize selection of the individual files, and the utterances [selected for use] in this test set may have been recorded later in the [original] recording session and may reflect "within session effects". In future selections of test material, efforts should be made to randomize all aspects of the test material (e.g. selection of texts and of utterances from within the recording session).

LIVE TALKER TESTS

In the March '87 and October '87 tests, a "live talker" test protocol was defined and implemented. This test protocol was described in material made available at the October '87 Meeting [10]. In general, these tests served the purpose of demonstrating the ability of systems at both BBN and CMU to accommodate direct input from a microphone and to return recognition results in about 10 times the duration of the utterance (typically 30 seconds or less). It was also shown that the results for the "live talkers" were comparable to those for the speakers in the corpus recorded at TI, but showed higher error rates than those for some of the highly experienced speakers used for demonstrations at some sites. It was noted that the effort to implement these tests was considerable, since dedicated facilities for real-time digitization and (in some cases) audio transmission lines had to be made available, and the live test speakers had to make site visits. For the June '88 tests, it was decided that it was not necessary to implement similar tests.

TEST PROCEDURES

Experimental Design
------------ ------

The June '88 Benchmark Tests are intended to be very similar to those implemented in March '87 and in October '87. The earlier procedures are described in a previous paper [3] and in the handout distributed at the October '87 Meeting [10]. Specified utterances for the designated speakers in the Resource Management Speech Corpus are to be processed with and without the use of imposed grammars. The case without the use of an imposed grammar has been termed the "all word" case or the "full branching" case. For the current test, the use of one specific imposed grammar is required: this is the word-pair grammar developed at BBN for the Resource Management Speech Corpus. It has a typical test set perplexity of about 60. Comparably detailed results are to be reported for both conditions. No other parameters are to be changed for these comparative tests. Use of the material for "rapid adaptation" is optional.

[On Speaker Independent System Training]
----------------------------------------

[At the May 1988 IEEE Arden House Speech Workshop, several DARPA researchers met to identify a consistent set of training material to be used for the June 1988 DARPA Benchmark Tests.
It was recognized that 8 of the 12 speakers in the June 1988 test material were also included in the Speaker Independent Development Test set (tdid), and that these speakers should not be included in the system training material. Accordingly, a 72 speaker "standard" training set was defined, excluding the speakers in the June '88 test set. This "standard 72 speaker" training set appears in the CD-ROM set in rm1/ind_trn/. Further discussion of the rationale for choice of these test sets is contained in the material describing the February 1989 Benchmark Tests.]

Vocabulary/Lexicon/Output Convention
------------------------- ----------

The conventions used for representing the system output, and, for comparison in scoring, for the reference strings, are described in a previous paper [3]. TI provided an implementation of these rules as Standard Normalized Orthographic Representations (SNORs) [rm1/doc/al_sents.snr], and these are to be used for scoring purposes. In this representation, there are a total of 991 distinct lexical entries [rm1/doc/lexicon.snr], as derived from the set of 2800 sentences developed at BBN. It has been noted that this lexicon is not logically complete, but it is all-inclusive in the sense that it covers all entries in the recorded corpus, and it should thus be provisionally sufficient for scoring purposes when the test material is derived from the set of 2800 sentences and/or from the recorded Resource Management Corpus.

Analysis of system responses provided by BBN and CMU for the March '87 tests disclosed that the different sites used different lexical conventions both for internal representations and for system output, complicating scoring. For example, at CMU there were instances of a lexical entry "CITRUS-1", presumed to represent one of the alternative pronunciations of "CITRUS". For such a system response to be scored as correct, there must either be post-processing of the responses or special adaptation of the scoring software. At BBN, the city (place name) San Diego, represented in the SNOR lexicon as "SAN-DIEGO", was represented as two entries, "SAN" and "DIEGO", giving rise to other scoring complications.

There is "a natural assumption that the units used for scoring should be as similar as possible to the lexical units used in a system" [4]. Given the differences among systems and the differing lexical representations used within them, there is a need for a standard representation of each word. For the Resource Management task, the SNOR convention and lexicon fill this role, and they were used for the October '87 tests and are to be used for the June '88 tests. However, it has become evident that no consistency is to be expected with regard to internal representations. To assist in understanding what is meant by an "N-Word System" (e.g. the 1000 Word systems presently under study in the DARPA program), it is proposed that the lexical units used by particular systems should always be specified (be they words, phrases, sentences, or combinations thereof). Mappings or postprocessing between the internal representations and the system output (used for evaluation by comparison with the reference strings) should be documented.
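To make the intent of such documentation concrete, the following is a minimal sketch, in C, of the kind of postprocessing a site might apply to its output before scoring. The suffix-stripping rule ("CITRUS-1" to "CITRUS") and the rejoining of "SAN" "DIEGO" into the single SNOR entry "SAN-DIEGO" are hypothetical illustrations based on the examples discussed above; they are not part of the SNOR convention itself, nor excerpts from any site's actual software.

    /*
     * Illustrative sketch only: maps hypothetical internal recognizer
     * tokens onto SNOR lexical entries before scoring.  The suffix rule
     * ("CITRUS-1" -> "CITRUS") and the multi-word join ("SAN" "DIEGO" ->
     * "SAN-DIEGO") correspond to the examples discussed in the text; they
     * are not part of the official SNOR convention.
     */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* Strip a trailing "-<digit>" pronunciation-variant marker, in place. */
    static void strip_variant_suffix(char *tok)
    {
        size_t n = strlen(tok);
        if (n >= 2 && tok[n - 2] == '-' && isdigit((unsigned char)tok[n - 1]))
            tok[n - 2] = '\0';
    }

    int main(void)
    {
        /* A hypothetical system output string, one token per array entry. */
        const char *raw[] = { "SHOW", "CITRUS-1", "NEAR", "SAN", "DIEGO" };
        int n = sizeof raw / sizeof raw[0];
        char tok[64];
        int i;

        for (i = 0; i < n; i++) {
            strncpy(tok, raw[i], sizeof tok - 1);
            tok[sizeof tok - 1] = '\0';
            strip_variant_suffix(tok);

            /* Rejoin the place name into its single SNOR lexical entry. */
            if (strcmp(tok, "SAN") == 0 && i + 1 < n &&
                strcmp(raw[i + 1], "DIEGO") == 0) {
                printf("SAN-DIEGO ");
                i++;                    /* consume "DIEGO" as well */
                continue;
            }
            printf("%s ", tok);
        }
        printf("\n");                   /* prints: SHOW CITRUS NEAR SAN-DIEGO */
        return 0;
    }

The particular rules shown are only examples; the point is that whatever mapping a site applies between its internal representation and its reported output should be written down in this level of detail.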
SCORING PROCEDURE

At the time the March '87 tests were implemented, no general agreement had been reached concerning the software to be used for scoring the system output. Scoring software was provided to NBS by BBN, CMU and TI for comparative usage with preliminary system output. Subsequently, additional software was provided by Lincoln Laboratories, and C-language code was written at NBS to implement what seemed to be the most attractive features of each software package, as well as to include some new capabilities. The purpose of developing this standardized scoring software is to provide a versatile and consistent set of scoring tools.

Dynamic Programming String Matching Algorithm
------- ----------- ------ -------- ---------

Scoring data are derived from comparisons of reference strings and system outputs, using a dynamic programming string alignment algorithm. The C-language string alignment procedure was adapted from code written by Doug Paul at Lincoln Laboratories (following discussions with Rich Schwartz and Francis Kubala at BBN). It is similar to the ERRCOM scoring utility written in Zetalisp at BBN. Both are "functionally identical dynamic programming algorithms for computing the lowest cost alignment between two strings (possibly not unique) given the following constraints on the cost function used to score the alignments:

(1) An exact match incurs no penalty.

(2) Deletion and insertion errors incur equal penalties.

(3) The sum of one deletion and one insertion error penalty is greater than one substitution error penalty.

For ties (multiple best alignments) an arbitrary choice is made. This decision cannot affect the alignment score but merely reorders adjacent substitution and deletion/insertion errors. It is worth mentioning that the NBS and BBN programs make this choice differently, therefore alignments may vary, but scores won't." [5]
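As an illustration of the alignment just described, the following C sketch computes a lowest-cost alignment under the three constraints quoted above. The penalty values (3 for an insertion, 3 for a deletion, 4 for a substitution) are chosen only to satisfy those constraints and are not necessarily the values used in the NBS, BBN, or Lincoln software; the word strings are invented.

    /*
     * Minimal sketch of a dynamic programming string alignment with the
     * cost constraints described in the text: an exact match costs 0,
     * insertions and deletions cost the same, and one substitution costs
     * less than one insertion plus one deletion.  The penalty values
     * below are illustrative only.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAXW    64
    #define INS_PEN 3
    #define DEL_PEN 3
    #define SUB_PEN 4

    int main(void)
    {
        /* Invented reference and hypothesis strings, one word per entry. */
        const char *ref[] = { "WHAT", "IS", "THE", "AVERAGE", "SPEED" };
        const char *hyp[] = { "WHAT", "IS", "AVERAGE", "SPEEDS", "NOW" };
        int nr = 5, nh = 5;
        int cost[MAXW + 1][MAXW + 1];
        int i, j;

        /* Fill the cumulative cost matrix. */
        for (i = 0; i <= nr; i++) cost[i][0] = i * DEL_PEN;
        for (j = 0; j <= nh; j++) cost[0][j] = j * INS_PEN;
        for (i = 1; i <= nr; i++) {
            for (j = 1; j <= nh; j++) {
                int match = (strcmp(ref[i - 1], hyp[j - 1]) == 0);
                int diag = cost[i - 1][j - 1] + (match ? 0 : SUB_PEN);
                int del  = cost[i - 1][j] + DEL_PEN;   /* reference word missed */
                int ins  = cost[i][j - 1] + INS_PEN;   /* extra hypothesis word */
                int best = diag;
                if (del < best) best = del;
                if (ins < best) best = ins;
                cost[i][j] = best;
            }
        }

        /* Trace back to count correct words and error types.  Ties are
           broken arbitrarily, as noted in the text; the counts and the
           total score are unaffected by that choice. */
        {
            int ncor = 0, nsub = 0, nins = 0, ndel = 0;
            i = nr; j = nh;
            while (i > 0 || j > 0) {
                if (i > 0 && j > 0 &&
                    cost[i][j] == cost[i - 1][j - 1] +
                        (strcmp(ref[i - 1], hyp[j - 1]) == 0 ? 0 : SUB_PEN)) {
                    if (strcmp(ref[i - 1], hyp[j - 1]) == 0) ncor++; else nsub++;
                    i--; j--;
                } else if (i > 0 && cost[i][j] == cost[i - 1][j] + DEL_PEN) {
                    ndel++; i--;
                } else {
                    nins++; j--;
                }
            }
            printf("correct %d  sub %d  del %d  ins %d  (total cost %d)\n",
                   ncor, nsub, ndel, nins, cost[nr][nh]);
        }
        return 0;
    }

For this example the sketch reports 3 correct words, 1 substitution, 1 deletion and 1 insertion; an equal-cost alignment exists that swaps which words are labelled the substitution and the insertion, which is exactly the kind of tie mentioned in the quotation above.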
Subsequent to the October '87 Meeting, Hunt has brought attention to certain inherent deficiencies in the use of dynamic programming string alignment for scoring [11, 12]. The imprecision or bias in dynamic programming string alignment arises because the string alignment process inevitably finds random sequences that can be lined up with correct sequences. Specifically, it has been noted that in the presence of a high rate of insertion errors, the dynamic programming algorithm will tend to seriously underestimate the true rate of insertion errors, and in general the error rates may be seriously underestimated.

Hunt has implemented an analysis to estimate the biases introduced by dynamic programming string alignment for the case of most direct relevance to these Benchmark Tests: string lengths (sentences) of 8 words and a vocabulary size of 1000 words. In Hunt's simulation of the DARPA case, using identical string alignment penalties, two cases of interest to the DARPA Benchmark Tests were considered: (1) 96% correctly recognized, with a substitution error rate of 2.5%, an insertion error rate of 2.5% and a deletion error rate of 1.5% (which is taken as representative of the best results obtained in the October '87 Benchmark Tests), and (2) 20% correctly recognized, with a substitution error rate of 78%, an insertion error rate of 75%, and a deletion error rate of 1.5%. This second case is taken as representative of results obtained without use of an imposed grammar, and is in some sense a worst case of the October tests.

For the first case, using 4000 strings, Hunt found that the dynamic programming based estimate of the proportion correct agreed perfectly with the true value. A DP estimate of insertion errors of 2.46% corresponded with a true value of 2.54%, and a DP deletion error estimate of 1.58% corresponded with a true value of 1.66%. These results confirm the generally valid observation that the proportions of deletion and insertion errors tend to be underestimated, even though the estimated proportion correct may be relatively more accurate. However, for high performance systems, and with short strings and large vocabularies, the bias is less severe than otherwise.

For the second (worst) case, however, there "is a different story. The difference between DP estimates of the insertion rate and the deletion rate is determined by the average difference between the true string lengths and the lengths of the recognizer output strings. If the recognizer output is mostly incorrect, and if an insertion plus a deletion costs more than a substitution [as is the case in the DARPA scoring software], the DP scoring will interpret as many errors as possible as substitutions. In the case of output strings that are longer than their correct counterparts, the estimated deletion error rate will be close to zero... The deletion and insertion error rates must be grossly underestimated" [12]. For the case of 400 strings with a true string length of 8 and a vocabulary size of 1000, with a true proportion correct of 19.0%, a deletion error rate of 26.9% and an insertion error rate of 100%, Hunt obtained DP estimates of 19.4% correct, a deletion error rate of 0.16% and an insertion error rate of 73.2%. Not only is the already high insertion error rate underestimated (73.2% vs. 100%), but the deletion error rate is underestimated by two orders of magnitude (0.16% vs. 26.9%). Note, however, that the proportion correct remains relatively accurate (19.4% vs. 19.0%). Hunt suggests that the proportion of words correctly recognized is a more reliable performance measure than the total number of errors expressed as a proportion of the total number of words.

Error Taxonomy and Statistics
----- -------- --- ----------

The standard error taxonomy resulting from use of the software includes data on the percentage of words (in the reference strings) that are correctly recognized, the percentage of substitutions, the percentage of deletions, the percentage of insertions, and the total percent error (where this total includes substitutions, deletions and insertions). In BBN's error taxonomy, "word accuracy" is taken to be [100 - (percent error)]. Note, however, that word accuracy is not in general equal to the percentage of correctly recognized words, because insertions add to the total error but do not reduce the number of reference words that are recognized correctly.
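For concreteness, the following sketch (using invented counts) shows how these summary statistics relate to one another. All percentages are computed relative to the number of words in the reference strings, which is why "word accuracy" can differ from the percentage of words correctly recognized whenever insertions occur.

    /*
     * Sketch of the summary statistics described above, with made-up
     * counts.  The reference words partition into correct, substituted
     * and deleted words (940 + 40 + 20 = 1000); insertions are counted
     * separately and contribute only to the total error.
     */
    #include <stdio.h>

    int main(void)
    {
        double nref = 1000.0;        /* words in the reference strings */
        double ncor = 940.0;         /* correctly recognized words     */
        double nsub = 40.0;          /* substitutions                  */
        double ndel = 20.0;          /* deletions                      */
        double nins = 25.0;          /* insertions                     */

        double pct_cor  = 100.0 * ncor / nref;                  /* 94.0 */
        double pct_err  = 100.0 * (nsub + ndel + nins) / nref;  /*  8.5 */
        double word_acc = 100.0 - pct_err;                      /* 91.5 */

        printf("correct %.1f%%  sub %.1f%%  del %.1f%%  ins %.1f%%\n",
               pct_cor, 100.0 * nsub / nref, 100.0 * ndel / nref,
               100.0 * nins / nref);
        printf("total error %.1f%%  word accuracy %.1f%%\n",
               pct_err, word_acc);
        return 0;
    }

With these invented counts, 94.0% of the reference words are correctly recognized while the word accuracy is only 91.5%, illustrating the distinction drawn above.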
Splits and Merges: Contractions
------ --- ------- ------------

Recent discussions of alternative error taxonomies that have appeared in the literature [6] include discussion of the occurrence of errors called "splits" and "merges". In our error taxonomy, a split is decomposed into consecutive substitution and insertion errors (or an insertion and a substitution). An example of such an error would be a contraction such as "ISN'T" being reported as "IS" and "NOT". Similarly, a "merge" would be decomposed into consecutive substitution and deletion errors (or a deletion followed by a substitution). Correspondingly, the sequence "IS NOT" might be reported as "ISN'T". In these DARPA Benchmark Tests, there is no explicit consideration of splits or merges, nor is there any special provision for reporting these errors, although the scoring software does permit analysis of the occurrence of these errors.

Special Classes of Insertions: Pre- and Post-Shadowing
------- ------- -- ----------- ---- --- --------------

Other references in the recent literature [6,7] cite the occurrence of "pre-shadowing" or "post-shadowing". Pre-shadowing might occur when an initial fragment of a poly-syllabic word such as "MAXIMUM" is reported as "MAX" and the system then (correctly) reports the full word, in this case "MAXIMUM". The NBS software does not detect the occurrence of these special cases of insertions, nor does our proposed error taxonomy include them.

Homophone Errors
--------- ------

A number of scoring software packages include special provision for scoring substitution errors involving homophones. This may be particularly appropriate for those cases in which there is no imposed grammar and the probability of homophone errors may be high. The NBS scoring software can refer to a table of homophones when classifying substitution errors, to determine whether the words involved are homophones. In previous DARPA Benchmark Tests, the option to score homophone substitution errors as correct was not implemented. There seems to be a growing consensus that this is unreasonable for the "no grammar" case, since the acoustic-phonetic evidence alone is not sufficient to disambiguate homophones. For the case of imposed grammars, however, since in general there is [or should be] additional information available to disambiguate homophones [e.g., probabilistic, syntactic or semantic information], it continues to seem reasonable to score substitution errors involving homophones as true errors. Thus for the current tests, homophone substitution errors are to be scored as correct for the "no grammar" case, but are not to be scored as correct for the case of imposed grammars. [EDITORIAL NOTE: This represents a change in scoring procedure from that used in the March and October '87 tests.] A revised table of acceptable homophones was included in the scoring software distributed prior to this meeting. The impact of this change of policy should be to slightly improve indicated performance in the "no grammar" case.

Synonyms
--------

Analysis of the March '87 results identified some errors involving substitutions of synonyms (e.g. "MAX" for "MAXIMUM"). It can be argued that these errors are semantically acceptable, and the NBS software can be set to classify substitution errors involving synonyms by reference to a table of acceptable synonyms. This option is not ordinarily used, but it is provided as a diagnostic tool.

Deletions of "THE"
--------- -- -----

"Deletions of the token "THE" are typically a large proportion of the errors observed in high performance systems and the majority of these deletions leave the semantic intention of the utterance intact" [5]. Thus it has been argued that it would be valuable to compute the proportion of errors of this type and, correspondingly, to score sentences whose only errors are deletions of the word "THE" as "semantically OK". The current NBS scoring software can account for this class of error.
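The following sketch illustrates, with an invented homophone table, the two optional classifications just described: treating a homophone substitution as correct (an option intended only for the "no grammar" case) and flagging a sentence as "semantically OK" when its only errors are deletions of "THE". It is an illustration of the logic only, not an excerpt from the NBS scoring software, and the homophone pairs shown are examples.

    /*
     * Illustrative sketch (not the NBS scoring software itself) of two
     * optional classifications: scoring a substitution as correct when
     * the reference and hypothesis words are listed as homophones, and
     * flagging a sentence as "semantically OK" when its only errors are
     * deletions of the word "THE".  The homophone pairs are examples.
     */
    #include <stdio.h>
    #include <string.h>

    static const char *homophones[][2] = {
        { "TWO", "TOO" }, { "FOUR", "FOR" }, { "THEIR", "THERE" }
    };

    static int is_homophone(const char *a, const char *b)
    {
        size_t k;
        for (k = 0; k < sizeof homophones / sizeof homophones[0]; k++) {
            if ((strcmp(a, homophones[k][0]) == 0 &&
                 strcmp(b, homophones[k][1]) == 0) ||
                (strcmp(a, homophones[k][1]) == 0 &&
                 strcmp(b, homophones[k][0]) == 0))
                return 1;
        }
        return 0;
    }

    int main(void)
    {
        /* Sentence-level counts as an aligner might report them: one
           deletion whose reference word was "THE", no other errors. */
        int ndel_the = 1, nsub = 0, nins = 0, ndel_other = 0;
        int semantically_ok;

        /* A substitution treated as correct only in the "no grammar" case. */
        printf("TWO/TOO homophones? %s\n",
               is_homophone("TWO", "TOO") ? "yes" : "no");

        semantically_ok = (nsub == 0 && nins == 0 && ndel_other == 0 &&
                           ndel_the > 0);
        printf("semantically OK (only \"THE\" deleted)? %s\n",
               semantically_ok ? "yes" : "no");
        return 0;
    }

Both classifications are reported separately from, and in addition to, the basic error counts; they do not alter the standard statistics.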
Time Criterion for Word Beginnings
---- --------- --- ---- ----------

It has been argued that the most appropriate criterion for scoring the output of a speech recognition system would include some measure of the accuracy with which the system reports word boundaries. One proposal to this effect suggests that the test material be (manually) marked with word boundary information (word beginning and end times) and that the scoring algorithm refer to this information, as well as to the identity of the word, in classifying a response as correct. Hunt's procedure for evaluating the performance of connected-word speech recognition systems makes use of end-point information, and defines a procedure for associating the boundaries found by the recognizer with the "actual start and end points of the words in the test data" [11]. In the DARPA Benchmark Tests, it was judged that reliance on manually labelled word boundary information is an unattractive aspect of such a scoring procedure, and that the costs associated with marking the speech corpus with word boundary information exceeded the benefits of increased precision and reduced bias, particularly for high performance systems.

Processing Times
---------- -----

In the Benchmark Tests, the processing times and system configuration are to be reported.

Sentence Level Scoring
-------- ----- -------

In the NBS scoring software, a sentence is scored as correctly recognized only if all words have been correctly recognized and there are no insertion or deletion errors. Supplemental analyses (such as the fraction of sentences for which the only errors involve deletions of the word "THE") are permitted, but only to supplement the basic data.

Characterizing the Imposed Grammar
-------------- --- ------- -------

At present, no general agreement has been reached on a completely unambiguous procedure for characterizing the imposed grammars in all systems. A proposal for characterizing the complexity of a language model in terms of the "test-set perplexity" has been circulated [8], but it appears that different descriptions of imposed grammars may be used by different sites for the present tests. This matter needs to be actively addressed.
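Although the precise formulation circulated in [8] is not reproduced here, test-set perplexity is generally computed as the geometric mean of the inverse conditional word probabilities assigned by the language model over a test set of N words:

    PP  =  P(w_1, ..., w_N)^(-1/N)
        =  2^{ -(1/N) \sum_{i=1}^{N} \log_2 P(w_i | w_1, ..., w_{i-1}) }

Under this definition, a grammar that allows roughly 60 equally probable word choices at each point in a sentence has a test-set perplexity of about 60, which is consistent with the figure quoted elsewhere in this paper for the BBN word-pair grammar.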
"Standard Scoring Procedure"
--------- ------- ----------

The standard scoring software distributed by NBS with the specified test files is to be used. The reference strings to be used for scoring are to be the SNOR representations [rm1/doc/al_sents.snr] developed from the lexical/output convention described in [3]. The lexical convention for system output for the present tests is thus that of the 991-word SNOR lexicon [rm1/doc/lexicon.snr].

The option to score homophone substitution errors as correct IS TO BE USED for the case of "no grammar", but IS NOT TO BE USED in the case of imposed grammars. The option to score errors involving synonyms as correct IS NOT to be used. The option to analyze splits and merges and to identify errors involving contractions IS NOT to be used.

At least TWO system configurations ARE to be run: one with no imposed grammar, and a second with the word-pair grammar [rm1/wp_gram.txt] developed at BBN from the Resource Management material. The test set perplexity of this grammar has been estimated at approximately 60, for a sufficiently large and random sampling of material. Other grammars, with different perplexities, are acceptable, but the perplexity should be stated for each.

Summary statistics on the percentages of words correctly recognized, substituted, deleted, and inserted (based on the number of words in the reference strings) ARE TO BE reported. The total percent error is to include substitutions, deletions, and insertions. These statistics are to be reported for each speaker under each test condition, as well as for the larger test subsets to which the test speakers belong. As mentioned previously, data are to be reported for tests conducted without use of an imposed grammar as well as with imposed grammar(s). System output data are to be made available to NBS for additional analysis.

ACKNOWLEDGEMENTS

The development of the NBS scoring software described in this paper involved contributions from a number of individuals at several organizations: Doug Paul at Lincoln Labs, Francis Kubala at BBN, George Doddington and Bill Fisher at TI, and Rich Stern at CMU. Alex Rudnicky at CMU contributed significantly in clarifying the need for consistent lexical representations. Bill Fisher and Patti Price (at BBN) collaborated with us in defining the SNOR lexicon and the files for the reference sentence strings to be used in scoring. Each of the proposed scoring packages had demonstrable merits: we sought to combine the best features of each. Discussions with Melvyn Hunt at the NRC in Ottawa have been very helpful in identifying limitations on the use of DP-based scoring algorithms, particularly for low-performance systems.

Selection of the test material used in the June '88 tests was done at NBS from material previously defined and recorded with the collaboration of Patti Price at BBN, Jared Bernstein at SRI, and Bill Fisher at TI. They bear no responsibility, however, for imperfections in the selection of this round of test material (e.g. unanticipated within-session effects): that responsibility is the author's.

At NBS, credit for coding the original scoring software goes to Stan Janet. Revisions and improvements were implemented by Mike Garris. John Garofolo, Stan and Mike deserve significant credit for producing copies of the speech corpus tapes and for distributing the scoring software and Benchmark Test material.

References

[1] W.M. Fisher, "The DARPA Task Domain Speech Recognition Database", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[2] D.S. Pallett, "Selected Test Material for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[3] D.S. Pallett, "Test Procedures for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[4] Private Communication with Alex Rudnicky, September 21, 1987.

[5] Private Communication with Francis Kubala, et al., August 13, 1987.

[6] R.D. Rodman, M.G. Joost and T.S. Moody, "Performance Evaluation of Connected Speech Recognition Systems", Proceedings of Speech Tech '87, New York, NY, April 28-30, 1987, pp. 269-274.

[7] F. Dreizin, R. Kittredge and D. Korelsky, "Semantic Support in Speech Recognition: An Application to Fire Control Dialogues", Proceedings of AVIOS '86 Voice I/O Systems Applications Conference, Alexandria, VA, September 16-18, 1986, pp. 339-354.

[8] S. Roucos, "Measuring Perplexity of Language Models Used in Speech Recognizers", unpublished manuscript circulated within the DARPA research community, September 1987.

[9] Private Communication with Doug Paul, April 1988.

[10] D.S. Pallett, "Selected Test Material, Test and Scoring Procedures for the October 1987 DARPA Benchmark Tests", unpublished manuscript distributed at the October 1987 DARPA Meeting.

[11] M.J. Hunt, "Evaluating the Performance of Connected-Word Speech Recognition Systems", Proceedings IEEE Int. Conf. Acoustics, Speech & Sig. Proc., ICASSP-88, New York, April 1988.
[12] Private Communication with Melvyn Hunt, March 22, 1988.