Editorial note: the following is a revised version of material prepared to document the DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA speech recognition workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note is an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories on this CD-ROM in which the material may be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are enclosed in square brackets.

****************************************

FEBRUARY 1989 DARPA SPEECH RECOGNITION RESOURCE MANAGEMENT BENCHMARK TESTS

David S. Pallett
National Computer Systems Laboratory
National Institute of Standards and Technology
Gaithersburg, MD 20899
Telephone: 301/975-2935
Net: dave@ssi.icst.nbs.gov

[Original Draft: January 12, 1989]
[Amended: January 23, 1989 (See "More on Training", "Speaker Dependent Tests", "Speaker-Dependent vs. Speaker-Independent", and "About Processing Times")]
[Editorially Revised: January 4, 1990]

TEST MATERIAL

1. Test material has been selected for both Speaker-Dependent and Speaker-Independent technologies.

For the Speaker-Dependent systems, this material consists of 300 test sentence utterances selected from the task domain Speaker Dependent Evaluation Test set (tdde): 25 test utterances for each of the 12 speakers in this subset [rm1/dep/eval/]. Some effort has been taken to select these utterances so as not to concentrate selection in the early portion of the recording sessions, following Doug Paul's observation about the possibility of within-session effects.

For the Speaker-Independent systems, the material consists of a set of 300 test sentence utterances selected from the task domain Speaker Independent Evaluation Test Set (tdie): 30 test utterances for each of 10 speakers, selected from the 40 speakers in this subset [rm1/ind_eval]. Six of the ten speakers are male. All of the dialect regions are represented. None of the present set of Speaker Independent Benchmark Test Set speakers appears in any of the training material (i.e., in either [rm1/ind_trn] or [rm1/ind/dev_aug or rm1/ind/dev_excl] or [rm1/dep_trn]). Since the development of ("rapidly") speaker-adaptive technology had been anticipated at the time the corpus was designed, there are [in addition to the 30 test sentence utterances] an additional 12 files for each of the 10 speakers: the 2 dialect sentences ("sa1" and "sa2" files) and the 10 "rapid adaptation" sentences ("sb01" through "sb10" files). (At present I know of no site that intends to demonstrate rapidly adaptive technology [within the near future], but it makes sense to distribute the data at this time.) Thus the total number of files in the speaker-independent test [index file] is 420 (300 + 120).

It is important to note that none of this material has been previously released.
It has been at NIST (NBS) since it was recorded at TI, and no one has worked with it to date [i.e., prior to January 1989].

2. Distribution Format

Most sites will receive "tar" copies of the test material during the week of January 16, 1989. For most sites, these will consist of copies of the close-talking (Sennheiser) microphone data (".b" files) with a 16 kHz sample rate, as downsampled at TI and sent to NBS (now NIST).

3. Data Integrity Check

Each site is encouraged to load these tapes as soon as possible after receiving them. Check to make sure that all files are there. There should be exactly 300 sentence utterance files for the Speaker-Dependent set and 420 files for the Speaker-Independent set. Please notify NIST (John Garofolo at 301/975-3193) in the event that there is a problem with reading the tapes.

As a second check on data integrity, you are encouraged to listen to any or all of the files to ensure that there are no problems with truncation of the files. In the past, there were one or two instances of this, but we think those problems have been fixed. Please notify NIST (Dave Pallett at 301/975-2935 or John Garofolo at 301/975-3193) if you find any problems on listening to the test material. The purpose of this listening test is only to verify the data integrity, and no system parameters may be set on the basis of the results of this listening test. TI reviewed the spoken utterances at the time the data was collected, so it should not be necessary to check the spoken utterances against the reference strings.

USE OF THIS TEST MATERIAL

1. This test material is to be used in Benchmark Tests to be reported on at the February 1989 DARPA Spoken Language Meeting. Some sites have agreed to report on both the Speaker-Dependent and Speaker-Independent sets, others on only one set. In each case, tests are to be run using both the no-grammar ("all-word" or "full-branching") condition and the "word-pair" grammar [rm1/doc/wp_gram.txt] developed by BBN. The perplexities corresponding to these grammars are approximately 1000 and 60, respectively, depending on the specific test set and the lexicon used. The use of other grammars (e.g., the probabilistic "bigram" grammar similar to the "word-pair" grammar) is permitted, but only to complement results reported for the two required grammars.

2. The SNOR lexicon (as defined in the NBS scoring package [rm1/doc/lexicon.snr]) is to be used in reporting results to NBS and in official scoring. As Alex Rudnicky proposed some time ago, any post-processing used to convert results from a lexicon that differs from the SNOR lexicon is to be documented, and differences in the lexicons are to be noted.

3. The "official" NBS scoring package is to be used [/rm1/score]. Although other string alignment algorithms may be preferable in some senses, and different sites may have differing philosophies about how to count errors, we will use the NBS scoring package for the present. (A rough sketch of this style of alignment scoring appears at the end of this section.) The option to count homophone substitution errors as correct IS PERMITTED ONLY FOR the case of "no grammar".

4. The "official" SNOR reference strings that are to be used for scoring will be available from NIST early in February (following any feedback that may lead us to have to correct our present versions, to account for truncations, etc.) [they are in rm1/doc/al_sents.snr]. Availability of the reference strings would not seem to be a prerequisite to running speech recognition tests, in any event (though they are necessary for scoring).

5. It is not acceptable to perform any diagnostic tests (e.g., to look in detail at the output of the scoring package for any fraction of the test results) while the tests are in progress at any site. On completion of all of the Benchmark Tests at any one site, analysis is permitted, but once the analysis is underway, no further use of this test material is permitted.

6. The February '89 Test Material is to be used once and only once for each system configuration, and all system parameters are to have been established prior to running the tests and certainly prior to any analysis of the test results.
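As background for items 3 and 4 above, the sketch below illustrates the general style of scoring involved: a dynamic-programming alignment of each hypothesis word string against its SNOR reference string, with errors counted as substitutions, deletions, and insertions. This is only a minimal illustration, not the official NBS/NIST scoring package; the equal unit costs, the tie-breaking, and the example strings are assumptions. (The homophone option in item 3 would amount to treating words in the same homophone class as matching; it is not implemented here.)

    # Minimal sketch of word-level string alignment scoring (NOT the official
    # NBS/NIST scoring package).  Equal unit costs for substitutions,
    # deletions, and insertions are an illustrative assumption.

    def align_words(ref, hyp):
        """Return (substitutions, deletions, insertions) for hyp vs. ref."""
        n, m = len(ref), len(hyp)
        # dp[i][j] = (errors, subs, dels, ins) for ref[:i] aligned with hyp[:j]
        dp = [[None] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = (0, 0, 0, 0)
        for i in range(1, n + 1):
            dp[i][0] = (i, 0, i, 0)                 # leftover ref words: deletions
        for j in range(1, m + 1):
            dp[0][j] = (j, 0, 0, j)                 # leftover hyp words: insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                e, s, d, ins = dp[i - 1][j - 1]
                if ref[i - 1] == hyp[j - 1]:
                    best = (e, s, d, ins)           # correct word
                else:
                    best = (e + 1, s + 1, d, ins)   # substitution
                e, s, d, ins = dp[i - 1][j]
                if e + 1 < best[0]:
                    best = (e + 1, s, d + 1, ins)   # deletion
                e, s, d, ins = dp[i][j - 1]
                if e + 1 < best[0]:
                    best = (e + 1, s, d, ins + 1)   # insertion
                dp[i][j] = best
        _, s, d, ins = dp[n][m]
        return s, d, ins

    # Made-up example strings (not actual Resource Management sentences):
    ref = "WHAT IS THE SHIP'S CURRENT LOCATION".split()
    hyp = "WHAT IS SHIP'S CURRENT ALLOCATION".split()
    s, d, ins = align_words(ref, hyp)
    print("subs:", s, "dels:", d, "ins:", ins,
          "word error (%):", 100.0 * (s + d + ins) / len(ref))

For the made-up strings above, this reports one substitution, one deletion, and no insertions (2 errors against 6 reference words).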
REPORTING RESULTS TO NIST

1. After running the tests, results are to be reported to NIST as for previous tests, by sending NIST the hypothesis files that are the system output, in a format that is compatible with our standard scoring software. This is documented in the "readme" file in the scoring software package. Most sites have done this previously, and we have had a dry run with MIT/Zue.

2. This time [prior to the February 1989 meeting], it will be desirable to make sure that the data [as reported to NIST] is segmented or broken out in some way so that results can be reported for each speaker, if possible. Previously, some sites have reported results on an aggregate set of (say) 300 files, while others have reported on individual speakers. We want the speaker identities to be recoverable, should there be overlap in the spoken texts. Contact NIST prior to transmitting the data to obtain approval for the form of separation (e.g., separate files for each speaker, or easily recognized text markers such as speaker initials); one possible form of separation is sketched after this section.

3. NIST plans to process each site's reported results in a uniform manner and prepare summary reports for distribution at the February meeting. Your cooperation in getting the data to us will permit us to provide this summary data.
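The sketch below shows one way a site might break an aggregate hypothesis file out by speaker, as requested in item 2. It is an illustration only: the line layout assumed here (one hypothesis per line, with the utterance ID in the final parentheses and the speaker code as the leading field of that ID) and the output file names are assumptions, not the required format; the "readme" file in the scoring software package and prior contact with NIST govern the actual form.

    # Illustrative sketch: split an aggregate hypothesis file into one file
    # per speaker so that speaker identities remain recoverable.  The assumed
    # line layout is "HYPOTHESIS WORDS (speakercode_sentenceid)"; this is an
    # assumption for illustration only.

    import os
    from collections import defaultdict

    def split_by_speaker(aggregate_path, out_dir="by_speaker"):
        os.makedirs(out_dir, exist_ok=True)
        by_speaker = defaultdict(list)
        with open(aggregate_path) as f:
            for line in f:
                line = line.rstrip()
                if not line:
                    continue
                # utterance ID assumed to sit in the final parentheses
                utt_id = line[line.rfind("(") + 1 : line.rfind(")")]
                by_speaker[utt_id.split("_")[0]].append(line)
        for speaker, lines in sorted(by_speaker.items()):
            with open(os.path.join(out_dir, speaker + ".hyp"), "w") as out:
                out.write("\n".join(lines) + "\n")
            print(speaker, len(lines), "hypotheses")

    split_by_speaker("all_results.hyp")    # hypothetical aggregate file name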
SPEAKER INDEPENDENCE/ADAPTIVE/DEPENDENCE

A great deal of discussion has recently taken place about what constitutes a "speaker-independent" system. Much of this discussion has been a consequence of inherent limitations in the Resource Management and TIMIT corpora, and in our use of these. Most strictly, a speaker-independent system is one that has been developed with absolutely no access to speech data for potential test speakers. This is a tough criterion to meet.

There are a number of factors involved. Among them is potential knowledge of parameter ranges for the set of speakers in the speaker-dependent training and test sets. It has been noted that sites that have worked extensively with speaker-dependent systems may generalize these parameters when developing speaker-independent systems. If at some point [a speaker-independent] test set is drawn from the speaker-dependent set (as was the case for the June '88 test set), that generalization would be optimal in some sense. If the choice of parameters is deliberate, and the test set were known "a priori", then this would clearly violate the spirit of a description of such a system as "speaker-independent".

Because of the overlap between the TIMIT and Resource Management Corpora speaker populations, sites that have worked extensively with TIMIT have in some sense and to an unknown degree been influenced by the data for real or potential test speakers. It would be very difficult to ignore phenomena that might have been observed in the data of potential test speakers, particularly if those test speakers were unusual in any sense. So the presence of the [Resource Management Corpus] test speakers [in TIMIT] may have contaminated TIMIT as an allowable resource for the development of "pure" speaker-independent technology. This dependency possibly includes whatever influence may exist in VQ codebooks, phone models, etc. It is very difficult to develop systems without involving the use of information from potential test speakers to some (unknown) degree. It has always been clearly understood (at least from NBS's point of view) that use of the TIMIT Corpus [for system development] was allowable. But explicit reference to it in optimizing results for a test set would be inadmissible.

SYSTEM TRAINING

Training is a factor that is difficult to "standardize" across sites. Recognizing the overlap [of speakers] in the Resource Management Corpus subsets, several of us met at the [May 1988] Arden House Meeting and agreed on two conditions. Doug Paul summarized this agreement:

***************************************
[(Edited) Portion of a message from Doug Paul...]

"My nomination for speaker independent training data for the Feb 89 evaluation tests is to repeat the June 88 training sets:
(SD = spkr dependent: tddt and tddd; SI = spkr independent: tdit and tdid)

[1] "Standard condition": 72 tdit speakers (80 - 8 SD overlap)
    72 speakers x 40 sentences = 2880 total sentences
    [EDITORIAL NOTE: This corresponds to [rm1/ind_trn]]

[2] "Augmented (or extra training data) condition": Standard Condition plus:
    37 tdid speakers (40 - 3 SD overlap)
    [EDITORIAL NOTE: This corresponds to adding [rm1/ind/dev_aug]]
    37 speakers x 30 sentences = 1110 additional sentences
    (This does not include the dialect (2 per speaker) or rapid adaptation (10 per speaker) sentences.)

    Augmented condition totals:
    (80 - 8) + (40 - 3) = 109 training speakers
    (72 x 40) + (37 x 30) = 3990 sentence [utterances]

Comments:
1. tdit - SD overlap speakers: bef03 cmr02 dms04 dtb03 ers07 jws04 pgh01 rkm05
2. tdid - SD overlap speakers: dtd05 hxs06 tab07
3. non-overlapping SD speaker: das12
4. [Future] databases should not have any speaker overlap between SI and SD portions."
***************************************

It is fair to say that no two sites have used exactly the same "training" set with the same degree of consistency. The "standardization" of training conditions is probably unnecessary, except (as has been pointed out) to permit distinguishing between improvements due to improved algorithms and those due to more training. To track progress due to factors other than training, it is probably desirable to retain (or return to) the use of the "official" 72-speaker training set used in the June '88 Benchmark Tests. But it is clear that the best performance will be demonstrated with better-trained systems, and that encourages full use of as much data as is appropriate. Doug Paul has outlined the logic behind selection of the "standard" or "official" (as well as the augmented) training sets used in June. Hy Murveit points out the desirability of designating previous benchmark test sets as training or developmental test sets, [along with] the need to ensure that the test speakers are not in the training corpus:

***************************************
[Portion of a message from Hy Murveit...]

"... Since the best training set would allow the use of all previous test sets for development tests, I now suggest we remove the [June] '88 [test] speakers from this [best] training set. I still advocate as large a training set as possible.
If we removed the [June] '88 test speakers from this training set we'd still have about 3500 sentences."
**************************************

Hy Murveit and Mitch Weintraub brought to my attention an oversight on my (DSP) part involving the set of speakers designated for the "Augmented (or extra training data) condition". [See Appendix] The oversight is that the set of 37 additional speakers (added to the base of 72, resulting in the set of 109) includes the subset of 15 that were used (at CMU and at SRI, and possibly elsewhere) for Speaker Independent System tests in '87, and for Kai-Fu Lee's work with SPHINX prior to June '88. As the tables in the Appendix indicate, eliminating these speakers would leave a total of only 94 speakers available for an "augmented" training set, and the total number of sentence utterances would then be 3540 [40 for each of the 72 speakers, plus 30 for each of the 22 speakers that have not been used in any test sets; this is the "about 3500" Hy referred to]. I had initially overlooked the overlap with the '87 test sets when endorsing the "Augmented Condition" consisting of 109 speakers and 3990 sentence utterances.

I believe that an augmented training set of only 94 speakers is not sufficiently larger than the set of 72 to warrant its general use. The strongest argument for use [of a set of 94 speakers for training] that I have heard is that it does not include any of the '87 test speakers, so that tracking of progress on the '87 test set would continue to be possible [if a set of 94 speakers was standardized on for system training]. But the 72-speaker set is also free of this defect, and it can be used for this purpose as was outlined in the original version of this note. So I continue to advise use of the 72-speaker set for those cases in which progress is to be tracked on an algorithmic level (vs. the benefits of additional training, which can be demonstrated using the 109-speaker set outlined in the original version of this note and in more detail below).

As a cautionary note, by now [February 1989] the community will tend to discredit results cited against the '87 test set(s). Surely most sites will have run (and re-run) tests and diagnostics using these test sets, so results can be expected to reflect some degree of training on them. Reporting results on these tests should ONLY be for comparative purposes.

To summarize, here are the official listings of the two conditions that are outlined for future tests with the '88 and '89 test data:

"Standard Condition": 72 tdit speakers not in '87 or '88 test sets
(adg04 ahh05 aks01 apv03 bar07 bas04 bjk02 bma04 bmh05 bns04 bom07 bwm03 bwp04 cal03 cef03 ceg08 cft04 cke03 cmb05 crc05 csh03 cth07 cyl02 das05 daw18 dhs03 djh03 dlb02 dlh03 dlr07 dlr17 dmt02 drd06 dsc06 eeh04 ejs08 etb01 fwk04 gjd04 gmd05 gxp04 hbs07 hes05 hpg03 jcs05 jem01 jma02 jpg05 jrk06 jxm05 kes06 kkh05 lih05 ljc04 mah05 mcc05 mdm04 mgk02 mju06 mmh02 pgl02 rcg01 rgm04 rtk03 rws01 sdc05 tju06 tlb03 tpf01 utb05 vlo05 wem05)

To form the "Augmented (or extra training data)" condition (109 speakers), add to the set of 72 the following tdid speakers...
Additional 37 (tdid) speakers:
(ajp06 awf05 bcg18 bgt05 bpm05 bth07 cae06 chh07 cpm01 ctm02 ctt05 dab01 dlc03 dpk01 ejl06 esd06 esj06 grl01 gwt02 hjb03 jfc06 jfr07 jlm04 jln08 jmd02 jsa05 lag06 ljd03 lmk05 rav05 rdd01 rjm12 sah01 sds06 sjk06 tdp05 wbt01)

[This "Augmented" set is not to be used for training Speaker Independent Systems to be tested on the March '87 or October '87 Test Sets, since it includes material from 15 speakers included in those test sets. For the March '87 or October '87 Test Sets, the "Standard" 72-speaker training set should be used.]

[Note: this additional training material IS NOT TO INCLUDE the "dialect" sentence utterance files for these 37 speakers (sa1.sph and sa2.sph) or the "rapid adaptation" sentence utterances (sb01.sph through sb10.sph). A sketch of one way to assemble these two training indices appears at the end of this section.]

It is clear that conforming to a "standard" has a cost. The cost is often in system retraining or even re-running previous experiments using the new data. The bottom line is a recommendation to use either or both of the training sets outlined by Doug Paul above, as is most appropriate for your purposes. By using no more training speakers than these [e.g., the 72- and 109-speaker training sets], we preserve the possibility of using the speaker-dependent speakers [once again] for [Speaker Independent System] test purposes at some future date, except (of course) to the extent that people may have repetitively used test or training material for development tests. [One major problem with tests of speaker-independent technology is that once a speaker has been used in a development test, data from that speaker can never be used again. Speaker-Independent Systems are in some sense omnivorous!]

So the recommendation is:

EITHER... If the objective of your benchmark test data point is to demonstrate progress (at the algorithmic level) referenced to some previous performance data point, it will probably be preferable to use the 72-speaker training set that was used in the June '88 tests, so that the improvement will not be clouded by the issue of additional training.

OR... If the objective of your benchmark test data point is to demonstrate the "optimal" performance, and comparisons with the June '88 results are not so important, and you can pay the price of the additional training, it will probably be preferable to use the augmented 109-speaker training set described in Doug's note.

OR ELSE, BOTH... Comparisons of the benefits of increased training can of course be made by using systems trained with both sets. These comparisons are very desirable, but end up being labor- and machine-time-intensive. Whether it is necessary to re-run old experiments to define optimal parameter settings for the augmented training set is a personal decision.

IF AT ALL POSSIBLE, DO NOT REPORT BENCHMARK TEST RESULTS ON ANY OTHER TRAINING SETS. TWO DIFFERENT TRAINING SETS SHOULD BE ENOUGH, AND IT WILL NOT SERVE OUR COMMON INTEREST TO REPORT RESULTS ON SYSTEMS TRAINED WITH (SAY) 87 SPEAKERS, 105, 112 OR 120, OR SOME OTHER MAGIC NUMBER. 72 AND 109 MAY NOT BE ENOUGH, BUT ENOUGH IS ENOUGH!

[EDITORIAL NOTE: The 105-speaker set used by Kai-Fu Lee at CMU in conjunction with the combined (15-speaker) March '87 and October '87 test material does not include any overlap between training and test material. But it does overlap the June '88 test set, so Kai-Fu retrained his system with the 72-speaker "small training" set for the June '88 tests.]
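As a concrete illustration of the bookkeeping above, the sketch below assembles the two training indices and checks the expected counts (72 x 40 = 2880 sentence utterances for the standard condition; 2880 + 37 x 30 = 3990 for the augmented condition, after excluding the "sa" and "sb" files). The directory layout assumed here (one subdirectory per speaker containing ".sph" files under rm1/ind_trn and rm1/ind/dev_aug) and the variable names standard_72 and additional_37 are assumptions for illustration only; consult the corpus documentation for the actual organization.

    # Sketch of the training-index bookkeeping described above.  The assumed
    # layout (one subdirectory per speaker with ".sph" waveform files) is an
    # illustration only; the essential point is that the augmented condition
    # adds the 37 tdid speakers while excluding their dialect ("sa*") and
    # rapid-adaptation ("sb*") sentences.

    import glob
    import os

    def speaker_files(root, speaker):
        """All waveform files for one speaker under the assumed layout."""
        return sorted(glob.glob(os.path.join(root, speaker, "*.sph")))

    def build_index(standard_speakers, augmented_speakers=()):
        index = []
        for spkr in standard_speakers:                      # 72 tdit speakers
            index.extend(speaker_files("rm1/ind_trn", spkr))
        for spkr in augmented_speakers:                     # 37 tdid speakers
            for path in speaker_files("rm1/ind/dev_aug", spkr):
                name = os.path.basename(path)
                if name.startswith(("sa", "sb")):           # exclude sa1, sa2, sb01..sb10
                    continue
                index.append(path)
        return index

    # standard_72 and additional_37 would hold the speaker codes listed above:
    # standard  = build_index(standard_72)                  # expect 72 x 40 = 2880 files
    # augmented = build_index(standard_72, additional_37)   # expect 2880 + 37 x 30 = 3990 files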
[ABOUT "TUNING"]

"Tuning" is another issue that some people have wondered about as a consequence of a reference in Kai-Fu Lee's dissertation. I asked for clarification of what this meant, and Kai-Fu asked that we provide the following information...

*********************************
[From Kai-Fu Lee...]

... I'll tell you exactly what I did for "tuning SPHINX". A long time ago (Sep 87), I wanted to set a "language weight" parameter for the system. This parameter controls the number of insertions vs. deletions. Since I had used all of the training data for training, the only other source of data I had was the additional sentences for these 15 speakers [i.e., the test speakers] (actually, not all of them). So I tried about 5 different language weights, and chose the best one to evaluate the test set.

More recently, I have tried to train on a smaller training set and test on the remaining training data, to see what happened to the optimal choice of this parameter. It turned out that the same parameter was selected. So if I had the time, I could have tuned from the training set. Also, the choice of this parameter makes very little difference (usually fractions of a percent). Actually, there is another parameter that we tuned, but it only applied to the bigram, which nobody uses any more, so it doesn't matter.

The place where I did a lot more with those 30 "tuning" sentences was in speaker adaptation, as I described in my thesis. I mentioned "tuning" just to be completely honest. In reality, such tuning was unnecessary. I would appreciate it if you could forward the above to anyone who has raised the question about "tuning SPHINX".
************************************

It seems clear that the use of material from test speakers for "tuning" of SPHINX is a consequence of the limitations inherent in the data and our (chronic) need for more data. There is a continuing tension between setting aside data for future use in tests, on the one hand, and using as much data as possible to develop the best-performing systems, on the other. This tension will not go away. As Kai-Fu notes, "if I had the time, I could have tuned from the training set". Not many have the time to observe every possible restriction on the use of available data.

In the light of all of the foregoing, it is evident that it behooves every site to be very precise in explaining the use of data in training their systems. This is very important when reporting results to NIST and at the meeting.

---------------------------------------------------------------

SPEAKER DEPENDENT TESTS

Doug Paul has also brought to my attention the desirability of being more specific about the rules for testing Speaker-Dependent systems. It has been implicit that the standard condition for "enrollment" or training is to use the full subset of 600 sentence utterances per speaker in the tddt subset [rm1/dep_trn/]. This is what has been used in previous tests, and there is no reason for changing this now.

Doug also advises me that, in previous "Speaker-Dependent" tests, "parameters which are set by multiple tests on the development test set (tddd) data [rm1/dep/dev/] should be the same for all speakers (i.e., speaker independent)". This is a practice used in the BBN and MIT/LL comparisons in the past, and there seems to be no need to change this rule, either.
It is noted that this might result in slightly sub-optimal results for the speaker-dependent systems, but it minimizes the tinkering with system parameters for individual speakers, amounting to defining one speaker-dependent system configuration and enrolling that system for a new speaker with the 600-sentence-utterance training material. We might wish to change this in the future, but there seems to be no good reason to do so at present.

For systems that are "speaker-adaptive", comparisons should be referenced to the 600-sentence-utterance condition of full training. Please advise me of any plans to demonstrate "adaptation". Be sure to document or describe the process of adaptation, particularly the degree to which it is supervised or unsupervised, since this issue seems to attract a lot of attention.

SPEAKER-INDEPENDENT VS. SPEAKER-DEPENDENT

A further note from Rich Schwartz asks for clarification of the arrangements regarding comparisons of "speaker-dependent" and "speaker-independent" systems. Sites developing only speaker-dependent systems (i.e., BBN) were not expected to test on speaker-independent test material. Sites developing speaker-independent systems had the potential of testing on the speaker-dependent test material, but were not required to do so in all cases. Some sites objected to the sheer volume of processing two complete (300-sentence-utterance) sets of test material. Some noted that previous use of the speaker-dependent speakers for test purposes (as in the June '88 tests) renders the use of these speakers invalid for test purposes for speaker-independent systems.

Still, it is obviously of interest to be able to compare the performance of two systems (e.g., one speaker-dependent and another speaker-independent) on the same test material. In such a test, the additional set of test material intended for speaker-dependent systems should simply be regarded as additional speaker-independent test material, and tested without special enrollment. Obviously, this takes time, and not all sites have agreed to do this. Perhaps some sites will be able to report on these tests AFTER the deadline prior to the meeting, or even after the meeting.

ABOUT PROCESSING TIMES...

In earlier DARPA Benchmark Tests, we required that each site report processing times and system configuration (e.g., hardware). More recently, this information hasn't been reported. Now, once again, with the expansion of the test sets, I hear comments about limitations on the ability to run tests based on processing and/or set-up or training time. It is recommended that reported results include discussion of processing time per sentence utterance, at the least. At one point in the past (when processing "live" test utterances for speakers JAS, DSP and TDY) there were tales of processing times of the order of nearly an hour for some utterances, and reporting on this factor served to underscore the desirability of real-time systems. While there are no formal constraints on maximum processing time or allowable hardware, it will be useful to document processing times along with the hardware used in these tests.
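One minimal way to collect the per-utterance figures suggested above is sketched below. The recognize() call is a stand-in for whatever decoding routine a site actually runs, and the utterance durations are assumed to be available (e.g., from the waveform headers); both are assumptions for illustration, not part of any required reporting format.

    # Sketch of per-utterance processing-time logging.  "recognize" and the
    # (utterance_id, duration) list are hypothetical stand-ins.

    import time

    def timed_run(utterances, recognize):
        """utterances: list of (utterance_id, duration_in_seconds) pairs."""
        total_cpu = total_speech = 0.0
        for utt_id, duration in utterances:
            start = time.perf_counter()
            recognize(utt_id)                     # hypothetical decoder call
            elapsed = time.perf_counter() - start
            total_cpu += elapsed
            total_speech += duration
            print("%s: %.1f s for %.1f s of speech (%.1f x real time)"
                  % (utt_id, elapsed, duration, elapsed / duration))
        print("overall: %.1f s per utterance, %.1f x real time"
              % (total_cpu / len(utterances), total_cpu / total_speech))

Reporting the hardware used alongside such figures would cover both of the items requested above.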
APPENDIX: Analysis of Overlaps in Training and Test Material

[Edited portion of a message from Hy Murveit and Mitch Weintraub (received at NIST on January 19, 1989)]

[The set of 80 tdit speakers:]
(adg04 ahh05 aks01 apv03 bar07 bas04 bef03 bjk02 bma04 bmh05 bns04 bom07 bwm03 bwp04 cal03 cef03 ceg08 cft04 cke03 cmb05 cmr02 crc05 csh03 cth07 cyl02 das05 daw18 dhs03 djh03 dlb02 dlh03 dlr07 dlr17 dms04 dmt02 drd06 dsc06 dtb03 eeh04 ejs08 ers07 etb01 fwk04 gjd04 gmd05 gxp04 hbs07 hes05 hpg03 jcs05 jem01 jma02 jpg05 jrk06 jws04 jxm05 kes06 kkh05 lih05 ljc04 mah05 mcc05 mdm04 mgk02 mju06 mmh02 pgh01 pgl02 rcg01 rgm04 rkm05 rtk03 rws01 sdc05 tju06 tlb03 tpf01 utb05 vlo05 wem05)

[The set of 40 tdid speakers:]
(ajp06 awf05 bcg18 bgt05 bpm05 bth07 cae06 chh07 cpm01 ctm02 ctt05 dab01 dlc03 dpk01 dtd05 ejl06 esd06 esj06 grl01 gwt02 hjb03 hxs06 jfc06 jfr07 jlm04 jln08 jmd02 jsa05 lag06 ljd03 lmk05 rav05 rdd01 rjm12 sah01 sds06 sjk06 tab07 tdp05 wbt01)

[OLD SPEAKER-INDEPENDENT TEST SET SPEAKERS:]

Combined March '87 and October '87 test speaker list: 15 speakers [all from tdid]
(awf05 bcg18 bth07 ctm02 ctt05 dab01 dlc03 dpk01 gwt02 jfc06 jfr07 ljd03 lmk05 sah01 sjk06)

June '88 test speaker list: 12 speakers: 8 [from] tdit, 3 [from] tdid, 1 [from] neither [tdit nor tdid]
(bef03 cmr02 das12 dms04 dtb03 dtd05 ers07 hxs06 jws04 pgh01 rkm05 tab07)

------------------------------------------------------------

[OVERLAPS]

1. [15] '87 test speakers, all from tdid [rm1/ind/dev_aug]

2. [8] June '88 test speakers in tdit [rm1/excluded]
(bef03 cmr02 dms04 dtb03 ers07 jws04 pgh01 rkm05)

3. [3] June '88 test speakers in tdid [rm1/ind/dev_excl]
(dtd05 hxs06 tab07)

4. [1] June '88 test speaker in neither [tdit nor tdid]
(das12)

[POTENTIALLY VALID TRAINING SPEAKERS: NOT USED IN ANY PREVIOUS SPEAKER-INDEPENDENT TEST SET...]

1. 72 tdit speakers not in '87 or '88 test sets
(adg04 ahh05 aks01 apv03 bar07 bas04 bjk02 bma04 bmh05 bns04 bom07 bwm03 bwp04 cal03 cef03 ceg08 cft04 cke03 cmb05 crc05 csh03 cth07 cyl02 das05 daw18 dhs03 djh03 dlb02 dlh03 dlr07 dlr17 dmt02 drd06 dsc06 eeh04 ejs08 etb01 fwk04 gjd04 gmd05 gxp04 hbs07 hes05 hpg03 jcs05 jem01 jma02 jpg05 jrk06 jxm05 kes06 kkh05 lih05 ljc04 mah05 mcc05 mdm04 mgk02 mju06 mmh02 pgl02 rcg01 rgm04 rtk03 rws01 sdc05 tju06 tlb03 tpf01 utb05 vlo05 wem05)

2. 22 tdid speakers not in '87 or '88 test sets
[NOTE: THE ELIMINATION OF THE '87 AND '88 TEST SET SPEAKERS REDUCES THIS SET TO (ONLY) 22 SPEAKERS, FROM THE ORIGINAL 40, BY DELETING THE 3 OVERLAPPING WITH THE JUNE '88 TEST SET AND THE 15 USED IN TESTS IN '87]
(ajp06 bgt05 bpm05 cae06 chh07 cpm01 ejl06 esd06 esj06 grl01 hjb03 jlm04 jln08 jmd02 jsa05 lag06 rav05 rdd01 rjm12 sds06 tdp05 wbt01)

[RATIONALE FOR A SET OF 94 SPEAKERS AND 3540 SENTENCE UTTERANCES:]
72 + 22 = 94 speakers
(72 * 40) + (22 * 30) = 3540 sentences

[End of material from Hy and Mitch]
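As a cross-check on the bookkeeping above, the short sketch below reproduces the counts with simple set operations. The two pool variables are placeholders to be filled in with the 80 tdit and 40 tdid speaker codes listed at the top of this appendix; the '87 and June '88 test speaker lists are copied from above. The expected results (72, 22, and 94 speakers; 3540 sentence utterances) are those given in the rationale.

    # Placeholder sketch: paste the 80 tdit and 40 tdid speaker codes (listed
    # at the top of this appendix) into the two pool variables, then run.

    tdit_pool = set("PASTE THE 80 TDIT SPEAKER CODES HERE".split())
    tdid_pool = set("PASTE THE 40 TDID SPEAKER CODES HERE".split())

    # Combined March '87 and October '87 test speakers (15, all from tdid):
    test_87 = set("""awf05 bcg18 bth07 ctm02 ctt05 dab01 dlc03 dpk01 gwt02
                     jfc06 jfr07 ljd03 lmk05 sah01 sjk06""".split())

    # June '88 test speakers (12):
    test_88 = set("""bef03 cmr02 das12 dms04 dtb03 dtd05 ers07 hxs06 jws04
                     pgh01 rkm05 tab07""".split())

    tdit_clean = tdit_pool - test_87 - test_88     # expect 72 speakers
    tdid_clean = tdid_pool - test_87 - test_88     # expect 22 speakers

    print("tdit remaining:", len(tdit_clean))                          # 72
    print("tdid remaining:", len(tdid_clean))                          # 22
    print("total speakers:", len(tdit_clean) + len(tdid_clean))        # 94
    print("sentence utterances:",
          40 * len(tdit_clean) + 30 * len(tdid_clean))                 # 3540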