Editorial note: the following is a revised version of material prepared to document DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA speech recognition workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note is an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories on this CD-ROM in which the material may be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are included within square brackets.

****************************************

SELECTED TEST MATERIAL, TEST AND SCORING PROCEDURES
FOR THE OCTOBER 1987 DARPA BENCHMARK TESTS

David S. Pallett
Institute for Computer Sciences and Technology
National Bureau of Standards
Gaithersburg, MD 20899

ABSTRACT

This paper describes the test material, procedures and scoring conventions used for the October '87 DARPA Benchmark Tests. Selected subsets of previously recorded speech corpus material were identified and used for tests of both Speaker Dependent and Speaker Independent technologies. A set of 70 sentence texts was identified and used by three designated "live talkers" in tests administered at BBN and CMU during September. A convention was defined for representing the output of the systems, and a standardized string matching and scoring convention was to be used in reporting the results of the Benchmark Tests.

INTRODUCTION

At the Fall 1986 DARPA Speech Recognition Meeting, plans were discussed for implementing Benchmark Tests of the Continuous Speech Recognition Systems being developed under DARPA sponsorship. These plans were further developed during the period October 1986 - March 1987, and a preliminary implementation of the test procedures was conducted prior to the March 1987 Meeting. At that meeting, BBN reported on the performance of their Speaker-Dependent technology using a set of 100 sentence utterances (25 sentence utterances from each of 4 designated speakers) chosen from the Speaker Dependent Development Test portion of the DARPA Resource Management Speech Corpus [rm1/dep/dev/]. Approximately one month later, CMU reported the results of similar tests involving another set of 100 sentence utterances (10 sentence utterances from each of 10 designated speakers) chosen from the Speaker Independent Development Test portion [rm1/ind/dev_aug/]. [Editorial Note: These March 1987 Test Set results were for the "ANGEL" system, not the SPHINX system.] Each site also reported on the results of tests conducted on sets of 90 sentence utterances provided by "live talkers" (30 utterances from each of three designated speakers). The results of these tests were analyzed at NBS and distributed within the DARPA community.

In preparation for the Fall 1987 Speech Recognition Meeting, NBS selected additional test material for similar tests.
NBS participated in the implementation of the "live talker" test procedures and developed and distributed standardized scoring software to be used in reporting performance results at the Fall 1987 and future Meetings. This paper serves to identify the test material used for these tests, to describe the test procedures, and to outline the characteristics of the scoring software.

RESOURCE MANAGEMENT SPEECH CORPUS TEST MATERIAL

Speaker Independent Test Material
------- ----------- ---- --------

For the March '87 tests, a set of ten speakers was identified, drawn from the Speaker Independent Development Test Subset of the Resource Management Corpus [rm1/ind/dev_aug/] [1]. This set of ten speakers included 7 male speakers and 3 female speakers. Data on these speakers are presented in papers from the March '87 Meeting [1,2]. Ten sentence utterances for each of these speakers were used for test purposes.

For the October '87 tests, a set of six speakers was identified, also from the Speaker Independent Development Test Subset [rm1/ind/dev_aug/], and 10 utterances were specified for each of these speakers, amounting to a total of 60 sentence utterances from this portion of the Resource Management Corpus. [Editorial Note: Of these six speakers, five were new speakers, and one (subject gwt0) had been included in the March '87 test material. The 10 test utterances for subject gwt0 were the same for both the March '87 and October '87 tests. This portion of the October '87 test material thus includes 50 previously unused test utterances and 10 "retest" utterances.]

An additional set of four new speakers was chosen from the Speaker Dependent Development Test Subset [rm1/dep/dev/], and 25 sentence utterances were specified for each of these speakers. An additional 100 sentence utterances are thus identified in this portion of the test material. This portion is of particular interest since the same material was to be used in tests of the Speaker Dependent technology, providing some overlap of a subset of test material.

Table 1 provides detailed information on the individual speakers' regional backgrounds, race, year of birth and educational level for these tests of Speaker Independent technology.

Subject  Sex  Region          Race  Year of Birth  Education
-------  ---  ------          ----  -------------  ---------
GWT      M    NORTHERN        WHT   '21            B.S.
CTM      M    NORTHERN        WHT   '55            H.S.
DPK      M    NEW ENGLAND     WHT   '60            B.S.
LJD      F    NORTH MIDLAND   WHT   '61            B.S.
LMK      F    SOUTHERN        WHT   '60            B.S.
SJK      M    NEW YORK CITY   WHT   '31            B.S.

Also used in tests of Speaker-Dependent technology:

DTB      M    NORTH MIDLAND   AMR*  '42            B.S.
DTD      F    SOUTHERN**      BLK   '54            B.S.
PGH      M    NEW ENGLAND     WHT   '63            B.S.
TAB      M    WESTERN         WHT   '60            B.S.
----------------------------------------------------------------
*  (American Indian)
** (Subject reportedly "tried to change Southern accent to fit Chicago".)

Table 1. Speaker Independent Test Speakers

A total of 160 sentence utterances by a total of ten speakers are thus specified for use in the October 1987 Benchmark Tests of Speaker Independent technology.

Speaker Dependent Test Material
------- --------- ---- --------

For the March '87 tests, a set of four speakers was identified, including 3 males and 1 female [2]. Twenty-five sentence utterances for each of these speakers [from rm1/dep/dev/] were specified and used for test purposes. In these tests, the sentence utterances selected were typically the first 25 sentences recorded on the data tape for each individual.
For the October '87 tests, another set of 25 sentence utterances (from each of the same four speakers) was specified [from rm1/dep/dev/] for test purposes. However, for each speaker, 10 of this set of 25 sentence utterances are the same as some of those used for the March '87 tests. These were selected as sentences for which at least one error occurred under at least one of the test conditions in BBN's March tests. These sentence utterances were selected for "retest" in order to permit demonstrations of incremental progress. The remaining 15 sentence utterances for these four speakers were randomly chosen [from the utterances for those speakers contained within rm1/dep/dev/]. It was recognized that, by selecting particularly difficult sentence utterances, the performance statistics for the composite set of 25 sentences for each speaker may be biased somewhat lower than if all of the sentences had been randomly chosen, and that separate reporting for the subsets of 10 difficult sentences and 15 randomly chosen sentences would be appropriate.

An additional set of four (new) speakers was identified from the Speaker Dependent Development Test Subset [rm1/dep/dev/], and a set of 25 sentence utterances was specified for each of these speakers. This set of 100 sentence utterances is the same as that to be used in tests of Speaker Independent technology.

Table 2 provides information on the speakers used for tests of the Speaker Dependent technology.

Subject  Sex  Region          Race  Year of Birth  Education
-------  ---  ------          ----  -------------  ---------
March '87 Set:
BEF      M    NORTH MIDLAND   WHT   '52            Ph.D.
CMR      F    NORTHERN        WHT   '51            M.S.
JWS      M    SOUTH MIDLAND   WHT   '40            B.S.
RKM      M    SOUTHERN        BLK   '56            B.S.

Fall '87 Set (also used for Speaker Independent technology):
DTB      M    NORTH MIDLAND   AMR*  '42            B.S.
DTD      F    SOUTHERN**      BLK   '54            B.S.
PGH      M    NEW ENGLAND     WHT   '63            B.S.
TAB      M    WESTERN         WHT   '60            B.S.
----------------------------------------------------------------
*  (American Indian)
** (Subject reportedly "tried to change southern accent to fit Chicago".)

Table 2. Speaker Dependent Test Speakers

A total of 200 sentence utterances by a total of eight speakers (40 difficult + 60 randomly chosen sentences by the four speakers used in the March '87 tests, plus 100 randomly chosen sentences for the four new speakers) was thus designated for use in the October 1987 DARPA Benchmark Tests of Speaker Dependent technology.

LIVE TALKER TEST MATERIAL

In the March '87 tests, each of three speakers (JAS, TDY and DSP) read 30 sentence texts drawn from the text corpus [rm1/doc/al_sents.txt] for Resource Management. Ten of the 30 texts were the same for each of the three scripts. Thus a total of 90 sentence utterances, drawn from a set of 70 unique sentence texts, were used for test purposes. Each of the three talkers also provided the set of 10 "Rapid Adaptation" sentences. Different productions of this test material were provided in tests occurring on different days at both BBN and CMU. Differences in vocal fatigue (and other factors) were noted when recording the utterances at the two sites.

In the March '87 tests conducted at BBN, use was made of the rapid adaptation material to "adapt" their system for use by the live talkers. However, for a number of reasons, it was agreed at the March '87 Meeting that future "live tests" of the Speaker Dependent technology would not use the rapid adaptation material, but would be conducted in a formally speaker-dependent manner.
During July of 1987, the three test subjects (JAS, TDY and DSP) visited BBN and recorded system training or enrollment material. At CMU, the live talker tests were to be conducted either in (at CMU's preference) a formally speaker-independent manner or with the use of rapid adaptation, so that no formal system training was necessary.

The protocol agreed upon for recording the enrollment material at BBN involved two half-hour sessions for each subject. The set of 600 sentences in the training sentence subset [see sentence numbers sr001 through sr600 in rm1/doc/al_sents.txt] was organized so as to present the longer, more difficult to read, sentences toward the end of the script. Each subject was instructed to read the sentences without undue concern if a sentence was misread, going immediately on to the next sentence. Following the sessions, BBN staff listened to the material and deleted those sentences involving reading errors. It proved possible to collect more than 300 acceptable training sentence utterances from each of the test speakers during the course of the two half-hour sessions. Because of the agreed-upon time limitation of one hour for each subject, the more difficult to read texts, which appeared toward the end of the script, were not recorded. The total duration of the speech material for each of the subjects was of the order of 17 minutes, all or any portion of which could be used for system training prior to the "live tests" conducted in September.

The experience of the "live talkers" in the March '87 tests suggested that the longer sentences used in those tests (randomly drawn from the set of [approximately] 2800 sentence texts [rm1/doc/al_sents.txt]) were in some cases difficult to read. [It was noted by JAS that they did not seem representative of those that might be used for interactive dialogue in an actual resource management application.] For the March '87 tests, the texts averaged 7.9 words in length. Accordingly, selection of "live talker" texts for the October '87 tests was biased somewhat toward shorter sentences, though in practice the effect was primarily to reduce the variation in sentence length, producing an average length of 6.8 words.

As in the March '87 tests, each "live" test speaker read scripts presenting the prompt forms of these sentence texts in site visits conducted during September 1987. A Sennheiser HMD 414-6 microphone (as used at TI in collecting the Resource Management Speech Corpus) was to be used, and the test environment was to be a computer lab or conference room with little or no competing (background) conversation. A portion of the test material was processed "live" or "on-line" (during the test), and the remainder was processed off-line following the visit, but was to be reported upon at the Fall '87 meeting.

TEST PROCEDURES

Experimental Design
------------ ------

The October 1987 Benchmark Tests were intended to be very similar to those implemented in March '87. The earlier procedures are described in an earlier paper [3]. The specified utterances for the designated speakers in the Resource Management Speech Corpus were to be processed with and without the use of imposed grammars. Comparably detailed results were to be reported for both conditions. No other parameters were to be changed for these comparative tests. Use of the material for "rapid adaptation" was optional.
In contrast to the earlier tests, BBN did not make use of the "rapid adaptation" mode in implementing the "live tests", but used the training material provided by the three specified "live talkers".

Live Test Protocol
---- ---- --------

These protocols were changed slightly from those used in March '87. Experience had shown that representative system response times did not permit processing 30 sentences within 30 minutes of elapsed time. After processing several sentences in real time, the remainder of the 30-minute subject time was devoted to collecting the remainder of the 30 sentences per live talker. "Off-line" processing was subsequently used for these sentence utterances. Every effort was to be made to process each of the three test speakers' data using identical processing.

At BBN in March '87, there were problems in obtaining optimal performance (in terms of both recognition accuracy and processing time) for one of the live speakers (TDY) using the speaker adaptive mode. Consequently, processing parameters were revised for this one speaker, and the performance statistics [reported for subject TDY in the March '87 BBN "live test" results] are believed to be in some sense "sub-optimal".

Vocabulary/Lexicon/Output Convention
------------------------- ----------

The conventions used for representing the system output, and, for comparison in scoring, the reference strings, are described in a previous paper [3]. TI provided an implementation of the rules for the Resource Management sentence texts in Standard Normalized Orthographic Representations (SNORs) [rm1/doc/al_sents.snr], and these were used for scoring purposes. In this representation, there are a total of 991 distinct lexical entries, as derived from the set of 2800 sentences developed at BBN [see rm1/doc/lexicon.snr for a list of these lexical entries]. It has been noted that this lexicon is not logically complete. However, it is all-inclusive in the sense that it covers all entries in the recorded corpus, and thus should be provisionally sufficient for the purposes of scoring when using test material derived from the set of 2800 sentences [rm1/doc/al_sents.snr] and/or from the recorded Resource Management Corpus.

Analysis of system responses provided by BBN and CMU for the March '87 tests disclosed that the different sites used different lexical conventions both for internal representations and for system output, complicating scoring. For example, at CMU there were instances of a lexical entry "CITRUS-1", presumed to represent one of the alternative pronunciations of "CITRUS". For such a system response to be scored as correct, there either has to be post-processing of the responses or special adaptation of the scoring software. At BBN, the city (place name) San Diego, represented in the SNOR lexicon as "SAN-DIEGO", was represented as two entries "SAN" and "DIEGO", giving rise to other scoring complications.

There is "a natural assumption that the units used for scoring should be as similar as possible to the lexical units used in a system" [4]. Given differences between systems and the differing lexical representations in different systems, there is a need for a standard representation for each word. For the Resource Management task, the SNOR convention and lexicon fill this role, and they were to be used for the October '87 tests. However, it has become evident that no consistency is to be expected with regard to internal representations.
To assist in understanding what is meant by an "N-Word System" (e.g. the 1000 Word systems presently under study in the DARPA program), it was proposed that the lexical words used by particular systems should always be specified (be they words, phrases, sentences, or combinations thereof). Mappings or postprocessing between the internal representations and the system output (used for evaluation by comparison with the reference strings) should be documented.

SCORING PROCEDURE

At the time the March '87 tests were implemented, no general agreement had been reached concerning the software to be used for scoring the system output. Scoring software was provided to NBS by BBN, CMU and TI for comparative usage with preliminary system output. Subsequently, additional software was provided by Lincoln Laboratories, and C-language code was written at NBS to implement what seemed to be the most attractive features of each software package, as well as to include some new capabilities. The intended purpose of developing this standardized scoring software is to provide a versatile and consistent set of scoring tools.

Dynamic Programming String Matching Algorithm
------- ----------- ------ -------- ---------

Scoring data are derived from comparisons of reference strings and system outputs, using a dynamic programming string alignment algorithm. The C-language string alignment procedure is adapted from code written by Doug Paul at Lincoln Laboratories (following discussions with Rich Schwartz and Francis Kubala at BBN). It is similar to the ERRCOM scoring utility written in Zetalisp at BBN. Both are "functionally identical dynamic programming algorithms for computing the lowest cost alignment between two strings (possibly not unique) given the following constraints on the cost function used to score the alignments:

(1) An exact match incurs no penalty.
(2) Deletion and insertion errors incur equal penalties.
(3) The sum of one deletion and one insertion error penalty is greater than one substitution error penalty.

For ties (multiple best alignments) an arbitrary choice is made. This decision cannot affect the alignment score but merely reorders adjacent substitution and deletion/insertion errors. It is worth mentioning that the NBS and BBN programs make this choice differently, therefore alignments may vary, but scores won't." [5]

Error Taxonomy and Statistics
----- -------- --- ----------

The standard error taxonomy resulting from use of the software includes data on the percentage of words (in the reference string) that are correctly recognized, the percentage of substitutions, the percentage of deletions, the percentage of insertions, and the total percent error (where this total includes substitutions, deletions and insertions). BBN's error taxonomy defines "Word Accuracy" as [100% - (total percent error)]. Note, however, that word accuracy is not in general equal to the percentage of correctly recognized words, because of insertions.
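To make these conventions concrete, the following C-language sketch illustrates a minimal alignment and error-counting routine of the kind described above. It is not the NBS, Lincoln, or BBN code; the penalty values (match = 0, insertion = deletion = 3, substitution = 4) are illustrative choices that satisfy the three constraints quoted above, and the word strings in main() are made-up examples.

    /*
     * Minimal word-string alignment sketch (not the NBS or BBN code).
     * Penalties: match = 0, insertion = deletion = 3, substitution = 4,
     * so one insertion plus one deletion (6) costs more than one
     * substitution (4), as required by the stated constraints.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAXW 128          /* illustrative limit on words per sentence */
    #define INS_PEN 3
    #define DEL_PEN 3
    #define SUB_PEN 4

    struct counts { int corr, sub, del, ins; };

    static struct counts align(char *ref[], int nref, char *hyp[], int nhyp)
    {
        static int cost[MAXW + 1][MAXW + 1];
        int i, j;

        /* Fill the DP matrix: cost[i][j] is the lowest-cost alignment of
           the first i reference words with the first j hypothesis words. */
        for (i = 0; i <= nref; i++) cost[i][0] = i * DEL_PEN;
        for (j = 0; j <= nhyp; j++) cost[0][j] = j * INS_PEN;
        for (i = 1; i <= nref; i++) {
            for (j = 1; j <= nhyp; j++) {
                int match = strcmp(ref[i-1], hyp[j-1]) == 0;
                int diag = cost[i-1][j-1] + (match ? 0 : SUB_PEN);
                int del  = cost[i-1][j] + DEL_PEN;   /* reference word dropped */
                int ins  = cost[i][j-1] + INS_PEN;   /* extra hypothesis word  */
                int best = diag;
                if (del < best) best = del;
                if (ins < best) best = ins;
                cost[i][j] = best;
            }
        }

        /* Back-trace to tally correct words, substitutions, deletions and
           insertions.  Ties are broken arbitrarily; this reorders adjacent
           errors but cannot change the alignment score. */
        struct counts c = {0, 0, 0, 0};
        i = nref; j = nhyp;
        while (i > 0 || j > 0) {
            int match = (i > 0 && j > 0) && strcmp(ref[i-1], hyp[j-1]) == 0;
            if (i > 0 && j > 0 &&
                cost[i][j] == cost[i-1][j-1] + (match ? 0 : SUB_PEN)) {
                if (match) c.corr++; else c.sub++;
                i--; j--;
            } else if (i > 0 && cost[i][j] == cost[i-1][j] + DEL_PEN) {
                c.del++; i--;
            } else {
                c.ins++; j--;
            }
        }
        return c;
    }

    int main(void)
    {
        /* hypothetical example: "SHOW THE CHART" recognized with an
           inserted second "THE" */
        char *ref[] = { "SHOW", "THE", "CHART" };
        char *hyp[] = { "SHOW", "THE", "THE", "CHART" };
        struct counts c = align(ref, 3, hyp, 4);
        double n = 3.0;   /* number of words in the reference string */

        printf("correct %.1f%%  sub %.1f%%  del %.1f%%  ins %.1f%%\n",
               100.0 * c.corr / n, 100.0 * c.sub / n,
               100.0 * c.del / n,  100.0 * c.ins / n);
        /* total percent error and BBN-style "Word Accuracy" */
        printf("total error %.1f%%  word accuracy %.1f%%\n",
               100.0 * (c.sub + c.del + c.ins) / n,
               100.0 - 100.0 * (c.sub + c.del + c.ins) / n);
        return 0;
    }

Compiled and run, this example reports 100% of the reference words as correctly recognized but a word accuracy of only 66.7%, illustrating why word accuracy is not in general equal to the percentage of correctly recognized words.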
Splits and Merges: Contractions
------ --- ------- ------------

Discussions of alternative error taxonomies that have appeared in the literature [6] include discussion of the occurrence of errors called "splits" and "merges". In our error taxonomy, a split is decomposed into consecutive substitution and insertion errors (or an insertion and a substitution). An example of such an error might involve a contraction such as "ISN'T" being reported as "IS" and "NOT". Similarly, a "merge" can be decomposed into consecutive substitution and deletion errors (or a deletion followed by a substitution). Correspondingly, a merge might involve the sequence "IS NOT" recognized as "ISN'T". In general, splits and merges should be reported as (unrelated) substitutions, deletions, and insertions; semantically acceptable splits and merges defy general definition. However, it has been observed that "orthographic contractions are both common and nearly always semantically acceptable" [3]. The NBS scoring software [used in the 1987 tests] contained a table of splits and merges that could be referred to when split or merge "candidates" have been detected following implementation of the dynamic programming algorithm. If the candidate split or merge is a member of the class of (presumed) semantically acceptable splits or merges, it could be listed as such and statistics compiled.

Special Classes of Insertions: Pre- and Post-Shadowing
------- ------- -- ----------- ---- --- --------------

Other references in the literature [6,7] cite the occurrence of "pre-shadowing" or "post-shadowing". Pre-shadowing might occur when an initial fragment of a poly-syllabic word such as "MAXIMUM" is reported as "MAX" and the system subsequently (correctly) responds with the correct answer, in this case "MAXIMUM". The NBS software does not detect the occurrence of these special cases of insertions, nor does our proposed error taxonomy include them.

Homophone Errors
--------- ------

A number of scoring software packages include special provision for scoring substitution errors involving homophones. This may be particularly appropriate for those cases in which there is no imposed grammar and the probability of homophone errors may be high. The NBS scoring software can refer to a table of homophones when classifying errors involving substitutions, to determine whether the errors involve homophones. This option is not ordinarily used, but is provided as a diagnostic tool. [Editorial Note: See "Standard Scoring Procedure" (below) for usage of this option when implementing the October 1987 DARPA Benchmark Tests.]

Synonyms
--------

Analysis of the March '87 results identified some errors involving substitutions of synonyms (e.g. "MAX" for "MAXIMUM"). It can be argued that these errors are semantically acceptable, and the NBS software can be set to classify substitution errors involving synonyms by reference to a table of acceptable synonyms. This option, like that for homophones, is not ordinarily used, but it is provided as a diagnostic tool.

Deletions of "THE"
--------- -- -----

"Deletions of the token "THE" are typically a large proportion of the errors observed in high performance systems and the majority of these deletions leave the semantic intention of the utterance intact" [5]. Thus it has been argued that it would be valuable to compute the proportion of errors of this type and, correspondingly, to score sentences whose only errors involve deletions of the word "THE" as "semantically OK". However, at present, the NBS scoring software does not specially account for this class of error.
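Although the NBS software [used in these tests] did not implement it, a supplemental check of this kind is straightforward once an alignment has been produced. The following C sketch is purely illustrative: the error-record arrays and the semantically_ok() function are hypothetical and are not part of any of the scoring packages described here.

    /*
     * Illustrative supplemental check (not part of the NBS software):
     * score a sentence as "semantically OK" if its only errors are
     * deletions of the word "THE".  The aligned errors for the sentence
     * are assumed to be available as parallel arrays of error types and
     * of the reference tokens involved; these structures are hypothetical.
     */
    #include <string.h>

    enum err_type { ERR_SUB, ERR_DEL, ERR_INS };

    int semantically_ok(const enum err_type type[], const char *ref_token[],
                        int nerrors)
    {
        int i;
        if (nerrors == 0)
            return 1;                  /* correctly recognized sentence */
        for (i = 0; i < nerrors; i++) {
            if (type[i] != ERR_DEL)
                return 0;              /* any sub or ins disqualifies   */
            if (strcmp(ref_token[i], "THE") != 0)
                return 0;              /* deletion of some other word   */
        }
        return 1;                      /* only deletions of "THE"       */
    }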
Time Criterion for Word Beginnings
---- --------- --- ---- ----------

It has been argued that the most appropriate criterion for scoring the output of a speech recognition system would include some measure of the accuracy of a system in reporting the identified word boundaries. One proposal to this effect suggests that the test material be (manually) marked with word boundary information (word beginning and end times) and that the scoring algorithm refer to this information, as well as the correctness of the word, in classifying the response as correct. According to this proposal, word beginning times should be within a specified time window (tentatively set at 70 msec) of the value identified in the manual labelling of the test material. Procedures of this sort have been used in studies of segmentation and in evaluating the output of word spotting systems. There are several different algorithmic approaches to implementation of such a proposal. The scoring software [used in the October '87 tests] allows for the incorporation of such a criterion following implementation of the dynamic programming string matching algorithm, and requires reference strings and system output that contain the (additional) timing information.

Imposition of such a criterion tends to increase the number of errors, and different systems err in different ways. In some cases the predominant additional errors are due to errors in detecting words beginning with weak fricatives preceded by words ending in stop consonants. In other systems, the occurrence of a deletion or insertion may skew successive reported word beginning times sufficiently to lead to multiple errors. The need for reliance on manually labelled word boundary information is an unattractive aspect of such a scoring procedure. For these reasons, although the NBS scoring software permits incorporating such a criterion, it is not ordinarily used.

Processing Times
---------- -----

In the Benchmark Tests, the processing times and system configuration are to be reported.

Sentence Level Scoring
-------- ----- -------

In the NBS scoring software, a sentence is scored as correctly recognized only if all words have been recognized and there are no insertion or deletion errors. Supplemental analyses (such as the fraction of sentences for which the only errors involve deletions of the word "THE") are permitted, but only to supplement the basic data.

Characterizing the Imposed Grammar
-------------- --- ------- -------

At present, no general agreement has been reached on a completely unambiguous procedure for characterizing the imposed grammars in all systems. A proposal for characterizing the complexity of a language model in terms of the "test-set perplexity" has been circulated [8], but it appears that different descriptions of imposed grammars may be used by different sites for the present tests. This matter needs to be actively addressed.
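For reference, test-set perplexity is commonly computed as two raised to the negative average log (base 2) probability assigned by the language model to the words of a test set. The following C sketch shows that computation under the assumption that per-word probabilities are available; it is illustrative only and is not necessarily the exact formulation circulated in [8].

    /*
     * Illustrative computation of test-set perplexity (a sketch only,
     * following the common definition in the literature).  The array of
     * per-word probabilities P(w_i | history), as assigned by the
     * language model over the test set, is assumed to be given.
     */
    #include <math.h>
    #include <stdio.h>

    double test_set_perplexity(const double word_prob[], int nwords)
    {
        double log2sum = 0.0;
        int i;
        for (i = 0; i < nwords; i++)
            log2sum += log(word_prob[i]) / log(2.0);   /* log base 2 */
        /* perplexity = 2 ^ ( - average log2 probability per word ) */
        return pow(2.0, -log2sum / nwords);
    }

    int main(void)
    {
        /* e.g., a uniform branching factor of 60 words gives perplexity 60 */
        double probs[] = { 1.0/60.0, 1.0/60.0, 1.0/60.0, 1.0/60.0 };
        printf("perplexity = %.1f\n", test_set_perplexity(probs, 4));
        return 0;
    }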
"Standard Scoring Procedure"
--------- ------- ----------

The reference strings to be used for scoring are to be the SNOR representations [rm1/doc/al_sents.snr] developed from the lexical/output convention described in [3]. The lexical convention for system output for the present tests is that for the 991-word SNOR lexicon [rm1/doc/lexicon.snr].

The option to score errors that involve substitutions of homophones as "correct" is NOT to be used for standard scoring purposes. It may be used for supplementary analysis, if so desired. [Editorial Note: This procedure was changed for the June 1988 and subsequent DARPA Benchmark Tests. When using the "no grammar" test condition, homophone substitution errors were counted as correct in the later tests. However, when using the "word-pair grammar" condition, homophone substitution errors are counted as errors in the later tests.]

The option to score errors involving synonyms as correct is NOT to be used.

The option to analyze splits and merges and to identify errors involving contractions is NOT to be used.

The option to impose constraints on the beginning time of each word is NOT to be used. It may be used for supplementary purposes, if so desired, but it is the responsibility of the organization choosing that option to mark the word boundaries on the reference strings (speech files) and to make that information available to other DARPA researchers on request.

Summary statistics on the percentage of words correctly recognized (based on the number of words in the reference strings), and on the percentages of substitutions, deletions, and insertions, are to be reported. The total percent error is to include substitutions, deletions, and insertions. These statistics are to be reported for each speaker under each test condition, as well as for the larger test subsets to which the test speakers belong. As mentioned previously, data are to be reported for tests conducted without the use of an imposed grammar as well as with the imposed grammar(s). System output data are to be made available to NBS for additional analysis.

ACKNOWLEDGEMENTS

The development of the NBS scoring software described in this paper involved contributions from a number of individuals at several organizations. Doug Paul is to be thanked for making available a C-language program that implements a dynamic programming algorithm essentially identical to that used at BBN. At BBN, Francis Kubala provided LISP code for the BBN ERRCOM utility and is to be thanked for his cooperation in implementing the performance evaluation tests. At CMU, Rich Stern is to be thanked for his cooperation in implementing the performance evaluation tests. At TI, George Doddington and Bill Fisher are to be thanked for providing another scoring utility, in FORTRAN. Each of these scoring packages has demonstrable merits: we sought to combine the most attractive features of each.

Patti Price at BBN, Jared Bernstein at SRI and Bill Fisher at TI deserve special thanks for cooperating in the selection of test material for the benchmark tests and in working toward consensus on the lexicon and output [SNOR] convention. Alex Rudnicky at CMU contributed significantly in clarifying the distinctions to be made between lexical representations that are internal to systems and the need for specifying the mappings between these representations and the system output used for evaluation purposes. At NBS, Lynn Cherny was responsible for analyzing the results of the March '87 tests and for distinguishing between "errors" that were real and those that were artifacts due to differences in lexical/output convention and string alignments. Credit for coding the scoring software goes to Stan Janet.

REFERENCES

[1] W.M. Fisher, "The DARPA Task Domain Speech Recognition Database", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[2] D.S. Pallett, "Selected Test Material for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[3] D.S. Pallett, "Test Procedures for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[4] Private communication with Alex Rudnicky, September 21, 1987.

[5] Private communication with Francis Kubala et al., August 13, 1987.
Moody, "Performance Evaluation of Connected Speech Recognition Systems", Proceedings of Speech Tech '87, New York, NY, April 28-30, 1987, pp. 269-274. [7] F. Dreizin, R. Kittredge, and D. Korelsky, "Semantic Support in Speech recognition: An Application to Fire Control Dialogues", Proceedings of AVIOS '86 Voice I/O Systems Applications Conference, Alexandria, VA, September 16-18, 1986, pp. 339-354. [8] S. Roucos, "Measuring Perplexity of Language Models Used in Speech Recognizers", unpublished manuscript circulated within DARPA research community, September, 1987.