Editorial note: the following consists of a revised version of material prepared to document DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA Speech Recognition Workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note is an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories on this CD-ROM in which the material may be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are included within square brackets.

****************************************

TEST PROCEDURES FOR THE MARCH 1987 DARPA BENCHMARK TESTS

David S. Pallett
Institute for Computer Sciences and Technology
National Bureau of Standards
Gaithersburg, MD 20899

ABSTRACT

This paper describes test procedures that were to be used in conducting benchmark performance tests prior to the March 1987 DARPA meeting. These tests were to be conducted using selected speech database material and input from "live talkers", as described in a companion paper.

INTRODUCTION

At the Fall 1986 DARPA Speech Recognition meeting, plans were discussed for implementing benchmark tests using the Task Domain Speech Corpus. There was additional discussion of the desirability of developing and implementing "live tests" using speech material provided by speakers at the contractors' facilities, emulating in some sense the process of inputting speech material during a demonstration of real-time performance. Following the Fall meeting, the Task Domain Speech Database [Resource Management Corpus] was recorded at TI, and significant portions of it were made available through NBS to both CMU and BBN for system development and training purposes. Another portion was selected for use in implementing these benchmark tests [1], and this test material was distributed to CMU and BBN during the last week of February 1987. This paper outlines the test procedures to be used to implement these tests prior to the March 1987 meeting.

A number of informal documents have circulated within the DARPA Speech Recognition community that outline proposed test procedures. A Strategic Computing draft document dated Dec. 6, 1985 [2] identified key issues in some detail. Portions of this document were heavily annotated and distributed to several sites during June 1986, and were the subject of discussions involving the author and representatives of CMU, BBN, Dragon Systems, MIT and TI in visits during June and early July 1986. These discussions were valuable in developing an outline of benchmark test procedures [3] that was discussed at the Fall 1986 DARPA meeting, and which was structured after a model for performance assessment tests outlined in an earlier NBS publication [4]. The present proposed test procedure thus represents the most recent, and most specifically focused, of a series of documents outlining test procedures for the DARPA Speech Recognition Program.
EXPERIMENTAL DESIGN

There were to be two distinct types of tests conducted prior to the March 1987 DARPA meeting:

(1) Tests based on use of a subset of the Task Domain (Resource Management) Development Test Set Speech Database. This subset was to include 100 sentence utterances from either the Speaker Independent or the Speaker Dependent portion of the database. The process of selecting speakers and specific utterances is described in Reference [1]. In each case, there was considerable freedom to choose system-dependent factors such as the amount of training material for Speaker Dependent technology and the most appropriate grammar. All of the 100 specified test sentences were to be processed and reported on at the meeting. "Spell-mode" material (spelled-out representations of the letter strings for items in the lexicon) was available for use, but processing this material was not required. These sentence utterances were to be processed both with and without the use of an imposed grammar. In the case of using no grammar, the perplexity is nominally 1000 [an editorial note on this figure follows the live test protocol below]. Comparably detailed results were to be reported for both conditions, and no other parameters were to be changed for these comparative tests. Optionally, the same data could be processed using the "rapid adaptation" sentences for system adaptation. There was to be no use of adaptation during processing of the test material.

(2) Tests based on input provided by "live talkers". The test talkers visited both CMU and BBN prior to the March meeting. Each of the talkers spoke the "rapid adaptation" sentences and read a script containing 30 sentences drawn from the task domain sentence corpus. Data derived from the input from "live talkers" was to be analyzed and reported on at the March meeting.

LIVE TEST PROTOCOL

The microphone was to be the same as that used at TI for the Resource Management Corpus, the Sennheiser HMD 414-6. This is a headset-mounted noise-cancelling microphone similar to the Shure SM-10 family of microphones. The headset is a supra-aural headset that allows the subject to be aware of nearby conversation or instructions for prompting.

The test environment was to be a conference room or computer lab, with no background speech at the time the test material was provided. Test utterances could be rejected (and the subject asked to repeat the sentence) if, in the judgment of the person(s) administering the tests, there was some noise artifact (e.g. coughs or paper-shuffling noises) or severe mis-articulation of the test sentence. Evidence of this could be obtained by play-back of the digitized utterance.

For systems that require time to develop speaker-adaptive models, the subjects were to provide the 10 "rapid adaptation" sentences prior to the tests (e.g. the evening prior to the tests).

For one of the speakers, the 30 test sentences were to be read in, and processing (automatic recognition) could take place "off-line". For the other two speakers, the test sentences were to be read in one at a time, waiting for the system to recognize each sentence before proceeding to the next sentence. At the end of 30 minutes, if all 30 sentences had not been read in and recognized, the remaining sentences were to be read in for "off-line" processing. In practice, only three to five sentences were recognized interactively within the 30-minute period, and the remaining sentences were then read in. The elapsed time for each speaker providing the test material in this manner was typically 45 minutes.

If requested, each speaker was also to read in 10 words randomly chosen from the "spell-mode" vocabulary subset.
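[Editorial note, added for clarification: the statement in the experimental design above that the perplexity of the no-grammar condition is "nominally 1000" may be read as follows. For a language model that assigns probability P(w_i | h_i) to the i-th of N test words given its history h_i, the test-set perplexity is

    PP = 2^{ -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid h_i) }

With no grammar, every word of the roughly 1000-word task vocabulary is treated as equally likely at every point, so that P(w_i | h_i) = 1/1000 for each word and PP = 1000; an imposed grammar lowers this figure.]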
PROCESSING OF LIVE INPUT

The systems were to process the test material in a manner similar to that used for the Resource Management database test material. Statistics comparable to those for the 100-sentence subsets were to be prepared and reported on at the March meeting.

ADAPTATION

Although the use of the "rapid adaptation" sentences was to be permitted, it appears that the only use made of the rapid adaptation sentences was in adapting the Speaker Dependent system at BBN for the "live test" speakers. There was to be no use of any of the test material to enroll, adapt, or otherwise optimize system performance through repeated analyses and re-use of the test material. Intended allowable exceptions to this prohibition against re-use of the test material include demonstrating the effects of using different grammars, different strategies for enrollment, different algorithms for auditory modelling or acoustic-phonetic feature extraction, different HMM techniques, system architectures, etc. It is recognized that the breadth of these exceptions in effect limits the future use of this test material, since such extensive use of test material to demonstrate parametric effects constitutes training on test material. Since a finite set of task domain sentences was developed at BBN, and the entire corpus of task domain [Resource Management] sentences was made available to both CMU and BBN, in some cases the grammars used for these tests have been adapted to this finite set of sentences, including the test material.

VOCABULARY/LEXICON/OUTPUT CONVENTIONS

The task domain sentences in effect define the vocabulary. Internal representations (lexicon entries) may be at the system designer's choice, but for the purposes of implementing uniform scoring procedures, a convention was defined, drawing on material provided by CMU [5], BBN and TI. This convention includes the following considerations:

Case differences are not preserved. All input (reference) strings and output strings are in upper case.

There is no end-of-sentence punctuation. Nor is there any required special symbol to denote silences (either prepended, within the sentence utterance, or appended) or to indicate failure of a system to parse the reference string or input speech.

Apostrophes are represented by plusses. Words with apostrophes (embedded or appended) are represented as single words. Thus "it's" becomes "IT+S".

Abbreviations become single words. All periods indicating abbreviations are removed and the word is closed up (e.g. "U. S. A." becomes "USA").

Hyphenated items count as single words. In general, compound words that do not normally appear as separate words in the context of the assumed task domain model are entered as single, hyphenated items. The exception to this rule is compounds that include a geographic term, such as STRAIT, SEA or GULF. Thus entries such as the following count as single "words": HONG-KONG, SAN-DIEGO, ICE-NINE, PAC-ALERT, LAT-LON, PUGET-1, M-RATING, C-CODE, SQQ-23, etc. However, BERING STRAIT is to count as two words, since this compound includes the geographic term "STRAIT", and it is not to be hyphenated.

Acronyms count as single words, and the output representation is not a form of the acronym made easier to interpret or pronounce (e.g. "PACFLT", not "PAC-FLEET" or "PAC FLEET").

Mixed strings of alphanumerics are treated as acronyms. Thus "A42128" is treated as a one-word acronym, even though the prompt form indicates that it is to be pronounced as "A-4-2-1-2-8". Strings of the alpha set are also treated as acronyms (e.g. "USA").

Strings of digits are entered in a manner that takes into account the context in which they appear. Thus a date such as 1987 is represented as three words: "NINETEEN" "EIGHTY" and "SEVEN". If it is referred to as a cardinal number, it is represented as "ONE" "THOUSAND" "NINE" "HUNDRED" "EIGHTY" "SEVEN".
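[Editorial note, added for illustration: the following is a minimal sketch, in Python, of the mechanical parts of the output convention described above. The function name and the example sentence are the editor's own and are hypothetical; they are not part of the original convention or of any contractor's software. Hyphenation of compounds, acronym formation and digit-string expansion require task-specific lexical knowledge and are not attempted here.]

    import re

    def normalize_string(text):
        """Map a reference or recognizer output string toward the scoring
        convention: upper case, apostrophes as plusses, no sentence
        punctuation, spelled-out abbreviations closed up."""
        text = text.upper()                      # case differences are not preserved
        text = text.replace("'", "+")            # apostrophes are represented by plusses
        text = re.sub(r"[.?!]", "", text)        # strip end-of-sentence punctuation and
                                                 # the periods marking abbreviations
        # close up sequences of single letters, e.g. "U S A" -> "USA"
        text = re.sub(r"\b((?:[A-Z] )+[A-Z])\b",
                      lambda m: m.group(1).replace(" ", ""), text)
        return text.split()                      # hyphenated compounds remain single tokens

    # Example (hypothetical sentence):
    #   normalize_string("It's in U. S. A. waters near Hong-Kong.")
    #   -> ['IT+S', 'IN', 'USA', 'WATERS', 'NEAR', 'HONG-KONG']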
Thus, "A42128" is treated as a one-word acronym, even though the prompt form of this indicates that this is to be pronounced as "A-4-2-1-2-8". Strings of the alpha set are also treated as acronyms (e.g. "USA"). Strings of digits are entered in a manner that takes into account the context in which they appear. Thus for a date such as 1987, it is represented as three words: "NINETEEN" "EIGHTY" and "SEVEN". If it is referred to as a cardinal number it would be represented as "ONE" "THOUSAND" "NINE" "HUNDRED" "EIGHTY" "SEVEN". SCORING THE TEST MATERIAL For results to be reported at the March 1987 meeting, the use of different scoring software [at each contractor's site] was acceptable. Each contractor was free to use software consistent with the following general requirements: Data are to be reported at two levels: sentence level and word level. At the sentence level, a sentence is to be reported as correctly recognized only if all words are correctly recognized and there are no deletion or insertion errors (Other than insertions of a word or symbol for silence or a pause). The percent of sentences correctly recognized is to be reported, along with the percent of sentences that contain (at least one) insertion error(s), the percent of sentences that contain (at least one) deletion error(s) and the percent of sentences that contain (at least one) substitution error(s). The number to be used for the denominator in computing these percentages is the number of input sentences in the relevant test subset, without allowing for rejection of sentences or utterances that may not parse or for which poor scores result. At the word level, data that were to be reported included the percent of words in the reference string that have been correctly recognized. For these tests, "correct recognition" does not require that any criterion be satisfied with regard to word beginning or ending times. It was valuable, but not required, to report the percent of insertion, deletion, and substitution errors occurring in the system output. For those systems that provide sentence or word lattice output, scoring was to be based on the top-ranked sentence hypothesis. Additional scoring based on lower-ranked alternative hypotheses was acceptable, provided the data were compared with comparable data for the top-ranked hypothesis. System response timing statistics was to be reported. Data resulting from these tests was provided to NBS following the March [1987] meeting for detailed analysis and in evaluating alternative scoring software. DOCUMENTATION Documentation on the characteristics of the imposed grammar(s) must be provided. This information should describe any use of the material from which the test material was drawn (i.e. the set of 2200 task domain sentences developed at BBN and used by TI in recording the Resource Management Speech Database). The system architecture and hardware configuration used for these tests should be documented. REFERENCES: [1] D.S. Pallett, "Selected Test Material for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop. [2] (Anonymous) "Integration, Transition and Performance Evaluation of Generic Artificial Intelligence Technology", Strategic Computing Program draft document dated Dec. 6, 1985 (For Official Use Only). [3] D.S. Pallett, "Benchmark Test Procedures for Continuous Speech Recognition Systems", draft document dated August 29, 1986 distributed prior to the Fall 1986 DARPA meeting. [4] D.S. 
Pallett, "Performance Assessment of Automatic Speech Recognizers", Journal of Research of the National Bureau of Standards, Volume 90, Number 5, September-October 1985, pp. 371- 387. [5] A.I. Rudnicky, "Rules for Creating Lexicon Entries", note dated 11 February, 1987. **************************************************************************** SELECTED TEST MATERIAL FOR THE MARCH 1987 DARPA BENCHMARK TESTS David S. Pallett Institute for Computer Sciences and Technology National Bureau of Standards Gaithersburg, MD 20899 ABSTRACT This paper describes considerations in selecting test material for the March '87 DARPA Benchmark Tests. Using a subset of material available from the Task domain (Resource Management) Development Test Set, two sets of 100 sentence utterances were identified. For Speaker Independent technology, 10 speakers each provide 10 test sentences. For Speaker Dependent technology, 4 speakers each provide 25 test sentences. For "live talker" test purposes, three 30-sentence scripts were identified, using a total of 70 unique sentence texts. The texts of all of these test sentences were drawn from a set of 2200 sentences developed by BBN in modelling the (resource management) task domain. INTRODUCTION In order to implement benchmark tests of speech recognition systems to be reported at the March '87 DARPA Speech Recognition Meeting, it was necessary to specify selected test material. This test material is drawn from two sources: (a) the Task Domain speech corpus recorded at Texas Instruments (also referred to as the "Resource Management Corpus"), and (b) the use of "live talkers" in site visits. In each case, the texts of the sentences were drawn from a set of sentences developed by BBN. Selection of test material using the Resource Management Corpus includes two separate components, a Speaker Independent component and a Speaker Dependent component. This paper outlines the process of defining these subsets of speech material. At the time the Resource Management Speech Corpus was designed, it was intended that approximately equal volumes of material would be available for system development (research) purposes and for two rounds of benchmark tests. Consequently, approximately half of the available material is designated as "training" material, and the remaining portion is designated for test purposes. The test material is designated as "Development Test" or "Evaluation Test" sets, each including 1200 test sentence utterances in each portion (Speaker Independent or Speaker Dependent). The design and collection of this Task Domain (Resource Management Speech Corpus is described elsewhere in a paper by Fisher [1]. Thus, as originally intended, two sets of 1200 sentence utterances [i.e., one set of 1200 sentence utterances for Speaker Dependent technology, and another set of 1200 sentence utterances for Speaker Independent technology] were to be available for the March '87 tests. [However,] during January 1987, discussions involving representatives of CMU, BBN, MIT, NBS and the DARPA Program Manager determined that use of this large a volume of test material was not necessary to establish performance of current technology when pragmatic considerations of processing times and expected performance levels were made. Consequently, it was agreed that subsets of 100 sentence utterances were to be defined for these tests, and that NBS would specify the appropriate subset. 
To complement the use of the recorded speech database material, a test protocol was defined for the use of "live talkers", emulating in some sense procedures to be used in future demonstrations of these systems, and texts were selected for this purpose.

RESOURCE MANAGEMENT SPEECH DATABASE TEST MATERIAL

Speaker Independent Test Material
------- ----------- ---- --------

For the March '87 tests, a set of ten speakers was identified, drawn from material recorded at TI and made available to NBS in December '86 and January '87. Each speaker provided two "dialect" sentences [i.e., sentence utterance files ending in sa01.sph or sa02.sph] and the ten "rapid adaptation" sentences [i.e., sentence utterance files sb01.sph through sb10.sph], in addition to a total of thirty test sentence utterances. For each speaker, a unique subset of ten sentence utterances was specified for use in the March '87 tests, amounting to 100 sentence utterances in all (10 speakers times 10 sentence utterances per speaker). Seven male and three female test speakers were selected, reflecting the male/female balance throughout the Resource Management Speech Database.

To aid in the selection of individual speakers, a preliminary set of approximately 16 speakers was identified. SRI was asked for advice on whether any of these would be regarded as anomalous on the basis of the "dialect" sentences obtained in the acoustic-phonetic database. SRI performed a clustering analysis and advised us that most of the speakers clustered in three groups of similar speakers, with three other individuals categorized as exceptional in some sense (e.g. an unusually slow rate of speech) [2]. The ten speakers identified for inclusion in the test subset include one of these "exceptional" speakers, the others being drawn from the three clusters to provide some degree of coverage of regional effects. Table 1 provides detailed information on the regional background, race, year of birth and educational level of the ten selected speakers in the March '87 Test Subset.

Subject   Sex      Region          Race   Year of Birth   Education
-------   ---      ------          ----   -------------   ---------
DAB       MALE     NEW ENGLAND     WHT    '62             B.S.
GWT       MALE     NORTHERN        WHT    '21             B.S.
DLG       MALE     NORTH MIDLAND   WHT    '42             (?)
CTT       MALE     SOUTHERN        WHT    '62             B.S.
JFC       MALE     NEW YORK CITY   WHT    '59             B.S.
BTH       MALE     WESTERN         WHT    '62             B.S.
AWF       FEMALE   SOUTHERN        WHT    '58             B.S.
BCG       FEMALE   "ARMY BRAT"     (?)    '59             B.S.
SAH       FEMALE   NEW ENGLAND     WHT    '46             B.S.
JFR       MALE     WESTERN         WHT    '39             M.S.

Table 1. Speaker Independent Test Subset

Analysis, by TI, of the lexical coverage provided by this subset of the test material indicates that 348 words occur at least once in this test material, and the total number of words is 836, for a mean sentence length of 8.36 words.

Speaker Dependent Test Material
------- --------- ---- --------

For these tests, a set of four speakers was identified, also drawn from material recorded at TI and made available to NBS during December '86 and January '87. In this case, selection of the specific individuals was strongly influenced by the availability of training material. BBN expressed concern that the entire set of 600 sentence utterances intended for system training should be available for any test speakers. At the time of selection of the test material, not all of the 12 speakers for this portion of the database had completed recording their training material. Four speakers were identified with this constraint in mind.
Each speaker had previously recorded the ten "rapid adaptation" and "dialect" sentences, and the Development Test material included 100 sentence utterances for each speaker. From this, a unique set of 25 sentence utterances was identified for each of the four speakers, amounting to 100 sentence utterances in all for this portion of the test material. Three of the speakers were male and one was female. Table 2 provides additional data on these speakers.

Subject   Sex      Region          Race   Year of Birth   Education
-------   ---      ------          ----   -------------   ---------
CMR       FEMALE   NORTHERN        WHT    '51             M.S.
BEF       MALE     NORTH MIDLAND   WHT    '52             Ph.D.
JWS       MALE     SOUTH MIDLAND   WHT    '40             B.S.
RKM       MALE     SOUTHERN        BLK    '56             B.S.

Table 2. Speaker Dependent Test Subset

Analysis, by TI, of the lexical coverage provided by this subset of the test material indicates that 832 words occur at least once, with a total number of words of 832, for a mean sentence length of 8.32 words. This is quite similar to that for the Speaker Independent material, although the details of the distributions differ slightly.

LIVE TALKER TEST MATERIAL

For the "live tests", it was necessary to select sentence texts to be read by the test speakers. It was thought desirable to use three speakers, each speaker reading a total of 30 sentence texts in addition to the 10 "rapid adaptation" sentences. Ten of the thirty sentence texts were to be the same for all speakers, so that of the 90 sentence utterances to be used for testing, there would be three productions of each of the ten shared sentences, plus 60 other sentences (20 for each of the three speakers). A total of 70 unique sentence texts was thus required.

The sentence texts were selected from a subset of 2200 Resource Management sentences. CMU representatives had indicated a preference for sentence texts that could be read in less than 6 seconds. Accordingly, the essentially random process of sentence text selection was perturbed slightly to exclude longer sentences.

Lexical analysis, by TI, of the scripts developed from these sentences indicates that the three scripts are well balanced in terms of mean sentence length and number of lexical entries. Each of the three scripts has a mean sentence length of 7.93 words (258 words/30 sentences), reflecting the intentional bias in the sentence selection process toward slightly shorter sentences. The number of lexical entries in the three scripts is 153, 155 and 161. The prompt form of each of these scripts was made available to the "live talkers" in site visits conducted in March '87.

Each of the test speakers was to use the Sennheiser HMD 414-6, the same microphone used at TI for the Resource Management Speech Database, and the test environment was to be a computer lab or conference room with no competing conversation. A portion of the test material was to be provided in an interactive manner (i.e. while waiting for system processing of the data) and the remainder was to be processed off line.

GRAMMATICAL COVERAGE

At the time that BBN developed the set of approximately 2800 sentence texts [e.g., in rm1/doc/al_sents.txt] modelling this task domain, no explicit or formally defined grammar was used. Rather, a set of prototypical sentences was identified to provide coverage of the task, and the subset of vocabulary occurring in these sentence "patterns" was then expanded to approximately 1000 words. There were a total of approximately 950 sentence patterns [3]. By incorporating the expanded vocabulary, the 2800 sentences were generated, including approximately three exemplars of each pattern. From these, 600 were designated for use as speaker-dependent training material, leaving a remaining subset of 2200 sentences. All of the test material was randomly selected from this subset of 2200 sentences. No analysis to determine the representation of the basic sentence patterns in the test material has been conducted to date.
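[Editorial note, added for illustration: the lexical coverage figures quoted above (for example, 348 lexical entries and 836 total words, for a mean sentence length of 8.36 words, in the Speaker Independent subset) are of the kind produced by the following minimal sketch, in Python. The function name is the editor's own, and this is not the analysis software used by TI.]

    def lexical_summary(sentences):
        """sentences: a list of sentence texts already in the scoring
        convention (upper case, one whitespace-separated token per
        lexical entry).  Returns simple coverage statistics."""
        tokens = [s.split() for s in sentences]
        total_words = sum(len(t) for t in tokens)
        return {
            "lexical entries": len({w for t in tokens for w in t}),  # distinct words
            "total words": total_words,
            "mean sentence length": total_words / len(sentences),
        }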
REFERENCES

[1] W.A. Fisher, "A Task Domain Database", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[2] J. Bernstein, private communication, January 1987.

[3] P. Price et al., oral presentation at the September 1986 DARPA Speech Recognition Workshop.