Editorial note: the following is a revised version of material prepared to document the DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA speech recognition workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note is an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories on this CD-ROM in which the material may be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are enclosed in square brackets.

****************************************

FEBRUARY 1989 DARPA SPEECH RECOGNITION RESOURCE MANAGEMENT BENCHMARK TESTS

David S. Pallett
National Computer Systems Laboratory
National Institute of Standards and Technology
Gaithersburg, MD 20899
Telephone: 301/975-2935
Net: dave@ssi.icst.nbs.gov

[Original Draft: January 12, 1989]
[Amended: January 23, 1989 (See "More on Training", "Speaker Dependent Tests", "Speaker-Dependent vs. Speaker-Independent", and "About Processing Times")]
[Editorially Revised: January 4, 1990]

TEST MATERIAL

1. Test material has been selected for both Speaker-Dependent and Speaker-Independent technologies.

For the Speaker-Dependent systems, this material consists of 300 test sentence utterances selected from the task domain Speaker Dependent Evaluation Test set (tdde): 25 test utterances for each of the 12 speakers in this subset [rm1/dep/eval/]. Some effort has been taken to select these utterances so as not to concentrate selection in the early portion of the recording sessions, following Doug Paul's observation about the possibility of within-session effects.

For the Speaker-Independent systems, the material consists of a set of 300 test sentence utterances selected from the task domain Speaker Independent Evaluation Test Set (tdie): 30 test utterances for each of 10 speakers, selected from the 40 speakers in this subset [rm1/ind_eval]. Six of the ten speakers are male. All of the dialect regions are represented. None of the present set of Speaker Independent Benchmark Test Set speakers appears in any of the training material (i.e., in either [rm1/ind_trn] or [rm1/ind/dev_aug or rm1/ind/dev_excl] or [rm1/dep_trn]). Since the development of ("rapidly") speaker-adaptive technology had been anticipated at the time the corpus was designed, there are [in addition to the 30 test sentence utterances] an additional 12 files for each of the 10 speakers: the 2 dialect sentences ("sa1" and "sa2" files) and the 10 "rapid adaptation" sentences ("sb01" through "sb10" files). (At present I know of no site that intends to demonstrate rapidly adaptive technology [within the near future], but it makes sense to distribute the data at this time.) Thus the total number of files in the speaker-independent test [index file] is 420 (300 + 120).

It is important to note that none of this material has been previously released.
It has been at NIST (NBS) since it was recorded at TI, and no one has worked with it to date [i.e., prior to January 1989].

2. Distribution Format

Most sites will receive "tar" copies of the test material during the week of January 16, 1989. For most sites, these will consist of copies of the close-talking (Sennheiser) microphone data (".b" files) with a 16 kHz sample rate, as downsampled at TI and sent to NBS (now NIST).

3. Data Integrity Check

Each site is encouraged to load these tapes as soon as possible after receiving them. Check to make sure that all files are there. There should be exactly 300 sentence utterance files for the Speaker-Dependent set and 420 files for the Speaker-Independent set. Please notify NIST (John Garofolo at 301/975-3193) in the event that there is a problem with reading the tapes.

As a second check on data integrity, you are encouraged to listen to any or all of the files to ensure that there are no problems with truncation of the files. In the past, there were one or two instances of this, but we think those problems have been fixed. Please notify NIST (Dave Pallett at 301/975-2935 or John Garofolo at 301/975-3193) if you find any problems on listening to the test material. The purpose of this listening test is only to verify the data integrity, and no system parameters may be set on the basis of the results of this listening test. TI reviewed the spoken utterances at the time the data was collected, so it should not be necessary to check the spoken utterances against the reference strings.

USE OF THIS TEST MATERIAL

1. This test material is to be used in Benchmark Tests to be reported on at the February 1989 DARPA Spoken Language Meeting. Some sites have agreed to report on both the Speaker-Dependent and Speaker-Independent sets, others on only one set. In each case, tests are to be run using both the no-grammar ("all-word" or "full-branching") condition and the "word-pair" grammar [rm1/doc/wp_gram.txt] developed by BBN. The perplexities corresponding to these grammars are approximately 1000 and 60, respectively, depending on the specific test set and the lexicon used. The use of other grammars (e.g., the probabilistic "bigram" grammar similar to the "word-pair" grammar) is permitted, but only to complement results reported for the two required grammars.

2. The SNOR lexicon (as defined in the NBS scoring package [rm1/doc/lexicon.snr]) is to be used in reporting results to NBS and in official scoring. As Alex Rudnicky proposed some time ago, any post-processing used to convert results from a lexicon that differs from the SNOR lexicon is to be documented, and differences in the lexicons are to be noted.

3. The "official" NBS scoring package is to be used [/rm1/score]. Although other string alignment algorithms may be preferable in some senses, and different sites may have differing philosophies about how to count errors, we will use the NBS scoring package for the present. (A rough sketch of this style of alignment scoring appears at the end of this section.) The option to count homophone substitution errors as correct IS PERMITTED ONLY FOR the case of "no grammar".

4. The "official" SNOR reference strings that are to be used for scoring will be available from NIST early in February (following any feedback that may lead us to have to correct our present versions, to account for truncations, etc.) [they are in rm1/doc/al_sents.snr]. Availability of the reference strings would not seem to be a prerequisite to running speech recognition tests, in any event (though they are necessary for scoring).

5. It is not acceptable to perform any diagnostic tests (e.g., to look in detail at the output of the scoring package for any fraction of the test results) while the tests are in progress at any site. On completion of all of the Benchmark Tests at any one site, analysis is permitted, but once the analysis is underway, no further use of this test material is permitted.

6. The February '89 Test Material is to be used once and only once for each system configuration, and all system parameters are to have been established prior to running the tests and certainly prior to any analysis of the test results.
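As background for items 3 and 4 above, the sketch below illustrates the general style of scoring involved: a dynamic-programming alignment of each hypothesis word string against its SNOR reference string, with errors counted as substitutions, deletions, and insertions. This is only a minimal illustration, not the official NBS/NIST scoring package; the equal unit costs, the tie-breaking, and the example strings are assumptions. (The homophone option in item 3 would amount to treating words in the same homophone class as matching; it is not implemented here.)

    # Minimal sketch of word-level string alignment scoring (NOT the official
    # NBS/NIST scoring package).  Equal unit costs for substitutions,
    # deletions, and insertions are an illustrative assumption.

    def align_words(ref, hyp):
        """Return (substitutions, deletions, insertions) for hyp vs. ref."""
        n, m = len(ref), len(hyp)
        # dp[i][j] = (errors, subs, dels, ins) for ref[:i] aligned with hyp[:j]
        dp = [[None] * (m + 1) for _ in range(n + 1)]
        dp[0][0] = (0, 0, 0, 0)
        for i in range(1, n + 1):
            dp[i][0] = (i, 0, i, 0)                 # leftover ref words: deletions
        for j in range(1, m + 1):
            dp[0][j] = (j, 0, 0, j)                 # leftover hyp words: insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                e, s, d, ins = dp[i - 1][j - 1]
                if ref[i - 1] == hyp[j - 1]:
                    best = (e, s, d, ins)           # correct word
                else:
                    best = (e + 1, s + 1, d, ins)   # substitution
                e, s, d, ins = dp[i - 1][j]
                if e + 1 < best[0]:
                    best = (e + 1, s, d + 1, ins)   # deletion
                e, s, d, ins = dp[i][j - 1]
                if e + 1 < best[0]:
                    best = (e + 1, s, d, ins + 1)   # insertion
                dp[i][j] = best
        _, s, d, ins = dp[n][m]
        return s, d, ins

    # Made-up example strings (not actual Resource Management sentences):
    ref = "WHAT IS THE SHIP'S CURRENT LOCATION".split()
    hyp = "WHAT IS SHIP'S CURRENT ALLOCATION".split()
    s, d, ins = align_words(ref, hyp)
    print("subs:", s, "dels:", d, "ins:", ins,
          "word error (%):", 100.0 * (s + d + ins) / len(ref))

For the made-up strings above, this reports one substitution, one deletion, and no insertions (2 errors against 6 reference words).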
REPORTING RESULTS TO NIST

1. After running the tests, results are to be reported to NIST as for previous tests, by sending NIST the hypothesis files that are the system output, in a format that is compatible with our standard scoring software. This is documented in the "readme" file in the scoring software package. Most sites have done this previously, and we have had a dry run with MIT/Zue.

2. This time [prior to the February 1989 meeting], it will be desirable to make sure that the data [as reported to NIST] is segmented or broken out in some way so that results can be reported for each speaker, if possible. Previously, some sites have reported results on an aggregate set of (say) 300 files, while others have reported on individual speakers. We want the speaker identities to be recoverable, should there be overlap in the spoken texts. Contact NIST prior to transmitting the data to obtain approval for the form of separation (e.g., separate files for each speaker, or easily recognized text markers such as speaker initials); one possible form of separation is sketched after this section.

3. NIST plans to process each site's reported results in a uniform manner and prepare summary reports for distribution at the February meeting. Your cooperation in getting the data to us will permit us to provide this summary data.
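The sketch below shows one way a site might break an aggregate hypothesis file out by speaker, as requested in item 2. It is an illustration only: the line layout assumed here (one hypothesis per line, with the utterance ID in the final parentheses and the speaker code as the leading field of that ID) and the output file names are assumptions, not the required format; the "readme" file in the scoring software package and prior contact with NIST govern the actual form.

    # Illustrative sketch: split an aggregate hypothesis file into one file
    # per speaker so that speaker identities remain recoverable.  The assumed
    # line layout is "HYPOTHESIS WORDS (speakercode_sentenceid)"; this is an
    # assumption for illustration only.

    import os
    from collections import defaultdict

    def split_by_speaker(aggregate_path, out_dir="by_speaker"):
        os.makedirs(out_dir, exist_ok=True)
        by_speaker = defaultdict(list)
        with open(aggregate_path) as f:
            for line in f:
                line = line.rstrip()
                if not line:
                    continue
                # utterance ID assumed to sit in the final parentheses
                utt_id = line[line.rfind("(") + 1 : line.rfind(")")]
                by_speaker[utt_id.split("_")[0]].append(line)
        for speaker, lines in sorted(by_speaker.items()):
            with open(os.path.join(out_dir, speaker + ".hyp"), "w") as out:
                out.write("\n".join(lines) + "\n")
            print(speaker, len(lines), "hypotheses")

    split_by_speaker("all_results.hyp")    # hypothetical aggregate file name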
SPEAKER INDEPENDENCE/ADAPTIVE/DEPENDENCE

A great deal of discussion has recently taken place about what constitutes a "speaker-independent" system. Much of this discussion has been a consequence of inherent limitations in the Resource Management and TIMIT corpora, and in our use of these. Most strictly, a speaker-independent system is one that has been developed with absolutely no access to speech data for potential test speakers. This is a tough criterion to meet.

There are a number of factors involved. Among them is potential knowledge of parameter ranges for the set of speakers in the speaker-dependent training and test sets. It has been noted that sites that have worked extensively with speaker-dependent systems may generalize these parameters when developing speaker-independent systems. If at some point [a speaker-independent] test set is drawn from the speaker-dependent set (as was the case for the June '88 test set), that generalization would be optimal in some sense. If the choice of parameters is deliberate, and the test set were known "a priori", then this would clearly violate the spirit of a description of such a system as "speaker-independent".

Because of the overlap between the TIMIT and Resource Management Corpora speaker populations, sites that have worked extensively with TIMIT have in some sense and to an unknown degree been influenced by the data for real or potential test speakers. It would be very difficult to ignore phenomena that might have been observed in the data of potential test speakers, particularly if those test speakers were unusual in any sense. So the presence of the [Resource Management Corpus] test speakers [in TIMIT] may have contaminated TIMIT as an allowable resource for the development of "pure" speaker-independent technology. This dependency possibly includes whatever influence may exist in VQ codebooks, phone models, etc. It is very difficult to develop systems without involving the use of information from potential test speakers to some (unknown) degree. It has always been clearly understood (at least from NBS's point of view) that use of the TIMIT Corpus [for system development] was allowable. But explicit reference to it in optimizing results for a test set would be inadmissible.

SYSTEM TRAINING

Training is a factor that is difficult to "standardize" across sites. Recognizing the overlap [of speakers] in the Resource Management Corpus subsets, several of us met at the [May 1988] Arden House Meeting and agreed on two conditions. Doug Paul summarized this agreement:

***************************************
[(Edited) Portion of a message from Doug Paul...]

"My nomination for speaker independent training data for the Feb 89 evaluation tests is to repeat the June 88 training sets:
(SD = spkr dependent: tddt and tddd; SI = spkr independent: tdit and tdid)

[1] "Standard condition": 72 tdit speakers (80 - 8 SD overlap)
    72 speakers x 40 sentences = 2880 total sentences
    [EDITORIAL NOTE: This corresponds to [rm1/ind_trn]]

[2] "Augmented (or extra training data) condition": Standard Condition plus:
    37 tdid speakers (40 - 3 SD overlap)
    [EDITORIAL NOTE: This corresponds to adding [rm1/ind/dev_aug]]
    37 speakers x 30 sentences = 1110 additional sentences
    (This does not include the dialect (2 per speaker) or rapid adaptation (10 per speaker) sentences.)

    Augmented condition totals:
    (80 - 8) + (40 - 3) = 109 training speakers
    (72 x 40) + (37 x 30) = 3990 sentence [utterances]

Comments:
1. tdit - SD overlap speakers: bef03 cmr02 dms04 dtb03 ers07 jws04 pgh01 rkm05
2. tdid - SD overlap speakers: dtd05 hxs06 tab07
3. non-overlapping SD speaker: das12
4. [Future] databases should not have any speaker overlap between SI and SD portions."
***************************************

It is fair to say that no two sites have used exactly the same "training" set with the same degree of consistency. The "standardization" of training conditions is probably unnecessary, except (as has been pointed out) to permit distinguishing between improvements due to improved algorithms and those due to more training. To track progress due to factors other than training, it is probably desirable to retain (or return to) the use of the "official" 72-speaker training set used in the June '88 Benchmark Tests. But it is clear that the best performance will be demonstrated with better-trained systems, and that encourages full use of as much data as is appropriate. Doug Paul has outlined the logic behind selection of the "standard" or "official" (as well as the augmented) training sets used in June. Hy Murveit points out the desirability of designating previous benchmark test sets as training or developmental test sets, [along with] the need to ensure that the test speakers are not in the training corpus:

***************************************
[Portion of a message from Hy Murveit...]

"... Since the best training set would allow the use of all previous test sets for development tests, I now suggest we remove the [June] '88 [test] speakers from this [best] training set. I still advocate as large a training set as possible.
If we removed the [June] '88 test speakers from this training set we'd still have about 3500 sentences."
**************************************

Hy Murveit and Mitch Weintraub brought to my attention an oversight on my (DSP) part involving the set of speakers designated for the "Augmented (or extra training data) condition". [See Appendix] The oversight is that the set of 37 additional speakers (added to the base of 72, resulting in the set of 109) includes the subset of 15 that were used (at CMU and at SRI, and possibly elsewhere) for Speaker Independent System tests in '87, and for Kai-Fu Lee's work with SPHINX prior to June '88. As the tables in the Appendix indicate, eliminating these speakers would leave a total of only 94 speakers available for an "augmented" training set, and the total number of sentence utterances would then be 3540 [40 for each of the 72 speakers, plus 30 for each of the 22 speakers that have not been used in any test sets; this is the "about 3500" Hy referred to]. I had initially overlooked the overlap with the '87 test sets when endorsing the "Augmented Condition" consisting of 109 speakers and 3990 sentence utterances.

I believe that an augmented training set of only 94 speakers is not sufficiently larger than the set of 72 to warrant its general use. The strongest argument for use [of a set of 94 speakers for training] that I have heard is that it does not include any of the '87 test speakers, so that tracking of progress on the '87 test set would continue to be possible [if a set of 94 speakers was standardized on for system training]. But the 72-speaker set is also free of this defect, and it can be used for this purpose as was outlined in the original version of this note. So I continue to advise use of the 72-speaker set for those cases in which progress is to be tracked on an algorithmic level (vs. the benefits of additional training, which can be demonstrated using the 109-speaker set outlined in the original version of this note and in more detail below).

As a cautionary note, by now [February 1989] the community will tend to discredit results cited against the '87 test set(s). Surely most sites will have run (and re-run) tests and diagnostics using these test sets, so results can be expected to reflect some degree of training on them. Reporting results on these tests should ONLY be for comparative purposes.

To summarize, here are the official listings of the two conditions that are outlined for future tests with the '88 and '89 test data:

"Standard Condition": 72 tdit speakers not in '87 or '88 test sets
(adg04 ahh05 aks01 apv03 bar07 bas04 bjk02 bma04 bmh05 bns04 bom07 bwm03 bwp04 cal03 cef03 ceg08 cft04 cke03 cmb05 crc05 csh03 cth07 cyl02 das05 daw18 dhs03 djh03 dlb02 dlh03 dlr07 dlr17 dmt02 drd06 dsc06 eeh04 ejs08 etb01 fwk04 gjd04 gmd05 gxp04 hbs07 hes05 hpg03 jcs05 jem01 jma02 jpg05 jrk06 jxm05 kes06 kkh05 lih05 ljc04 mah05 mcc05 mdm04 mgk02 mju06 mmh02 pgl02 rcg01 rgm04 rtk03 rws01 sdc05 tju06 tlb03 tpf01 utb05 vlo05 wem05)

To form the "Augmented (or extra training data)" condition (109 speakers), add to the set of 72 the following tdid speakers...
Additional 37 (tdid) speakers:
(ajp06 awf05 bcg18 bgt05 bpm05 bth07 cae06 chh07 cpm01 ctm02 ctt05 dab01 dlc03 dpk01 ejl06 esd06 esj06 grl01 gwt02 hjb03 jfc06 jfr07 jlm04 jln08 jmd02 jsa05 lag06 ljd03 lmk05 rav05 rdd01 rjm12 sah01 sds06 sjk06 tdp05 wbt01)

[This "Augmented" set is not to be used for training Speaker Independent Systems to be tested on the March '87 or October '87 Test Sets, since it includes material from 15 speakers included in those test sets. For the March '87 or October '87 Test Sets, the "Standard" 72-speaker training set should be used.]

[Note: this additional training material IS NOT TO INCLUDE the "dialect" sentence utterance files for these 37 speakers (sa1.sph and sa2.sph) or the "rapid adaptation" sentence utterances (sb01.sph through sb10.sph). A sketch of one way to assemble these two training indices appears at the end of this section.]

It is clear that conforming to a "standard" has a cost. The cost is often in system retraining or even re-running previous experiments using the new data. The bottom line is a recommendation to use either or both of the training sets outlined by Doug Paul above, as is most appropriate for your purposes. By using no more training speakers than these [e.g., the 72- and 109-speaker training sets], we preserve the possibility of using the speaker-dependent speakers [once again] for [Speaker Independent System] test purposes at some future date, except (of course) to the extent that people may have repetitively used test or training material for development tests. [One major problem with tests of speaker-independent technology is that once a speaker has been used in a development test, data from that speaker can never be used again. Speaker-Independent Systems are in some sense omnivorous!]

So the recommendation is:

EITHER... If the objective of your benchmark test data point is to demonstrate progress (at the algorithmic level) referenced to some previous performance data point, it will probably be preferable to use the 72-speaker training set that was used in the June '88 tests, so that the improvement will not be clouded by the issue of additional training.

OR... If the objective of your benchmark test data point is to demonstrate the "optimal" performance, and comparisons with the June '88 results are not so important, and you can pay the price of the additional training, it will probably be preferable to use the augmented 109-speaker training set described in Doug's note.

OR ELSE, BOTH... Comparisons of the benefits of increased training can of course be made by using systems trained with both sets. These comparisons are very desirable, but end up being labor- and machine-time-intensive. Whether it is necessary to re-run old experiments to define optimal parameter settings for the augmented training set is a personal decision.

IF AT ALL POSSIBLE, DO NOT REPORT BENCHMARK TEST RESULTS ON ANY OTHER TRAINING SETS. TWO DIFFERENT TRAINING SETS SHOULD BE ENOUGH, AND IT WILL NOT SERVE OUR COMMON INTEREST TO REPORT RESULTS ON SYSTEMS TRAINED WITH (SAY) 87 SPEAKERS, 105, 112 OR 120, OR SOME OTHER MAGIC NUMBER. 72 AND 109 MAY NOT BE ENOUGH, BUT ENOUGH IS ENOUGH!

[EDITORIAL NOTE: The 105-speaker set used by Kai-Fu Lee at CMU in conjunction with the combined (15-speaker) March '87 and October '87 test material does not include any overlap between training and test material. But it does overlap the June '88 test set, so Kai-Fu retrained his system with the 72-speaker "small training" set for the June '88 tests.]
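As a concrete illustration of the bookkeeping above, the sketch below assembles the two training indices and checks the expected counts (72 x 40 = 2880 sentence utterances for the standard condition; 2880 + 37 x 30 = 3990 for the augmented condition, after excluding the "sa" and "sb" files). The directory layout assumed here (one subdirectory per speaker containing ".sph" files under rm1/ind_trn and rm1/ind/dev_aug) and the variable names standard_72 and additional_37 are assumptions for illustration only; consult the corpus documentation for the actual organization.

    # Sketch of the training-index bookkeeping described above.  The assumed
    # layout (one subdirectory per speaker with ".sph" waveform files) is an
    # illustration only; the essential point is that the augmented condition
    # adds the 37 tdid speakers while excluding their dialect ("sa*") and
    # rapid-adaptation ("sb*") sentences.

    import glob
    import os

    def speaker_files(root, speaker):
        """All waveform files for one speaker under the assumed layout."""
        return sorted(glob.glob(os.path.join(root, speaker, "*.sph")))

    def build_index(standard_speakers, augmented_speakers=()):
        index = []
        for spkr in standard_speakers:                      # 72 tdit speakers
            index.extend(speaker_files("rm1/ind_trn", spkr))
        for spkr in augmented_speakers:                     # 37 tdid speakers
            for path in speaker_files("rm1/ind/dev_aug", spkr):
                name = os.path.basename(path)
                if name.startswith(("sa", "sb")):           # exclude sa1, sa2, sb01..sb10
                    continue
                index.append(path)
        return index

    # standard_72 and additional_37 would hold the speaker codes listed above:
    # standard  = build_index(standard_72)                  # expect 72 x 40 = 2880 files
    # augmented = build_index(standard_72, additional_37)   # expect 2880 + 37 x 30 = 3990 files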
[ABOUT "TUNING"]

"Tuning" is another issue that some people have wondered about as a consequence of a reference in Kai-Fu Lee's dissertation. I asked for clarification of what this meant, and Kai-Fu asked that we provide the following information...

*********************************
[From Kai-Fu Lee...]

... I'll tell you exactly what I did for "tuning SPHINX". A long time ago (Sep 87), I wanted to set a "language weight" parameter for the system. This parameter controls the number of insertions vs. deletions. Since I had used all of the training data for training, the only other source of data I had was the additional sentences for these 15 speakers [i.e., the test speakers] (actually, not all of them). So I tried about 5 different language weights, and chose the best one to evaluate the test set.

More recently, I have tried to train on a smaller training set and test on the remaining training data, to see what happened to the optimal choice of this parameter. It turned out that the same parameter was selected. So if I had the time, I could have tuned from the training set. Also, the choice of this parameter makes very little difference (usually fractions of a percent). Actually, there is another parameter that we tuned, but it only applied to the bigram, which nobody uses any more, so it doesn't matter.

The place where I did a lot more with those 30 "tuning" sentences was in speaker adaptation, as I described in my thesis. I mentioned "tuning" just to be completely honest. In reality, such tuning was unnecessary. I would appreciate it if you could forward the above to anyone who has raised the question about "tuning SPHINX".
************************************

It seems clear that the use of material from test speakers for "tuning" of SPHINX is a consequence of the limitations inherent in the data and our (chronic) need for more data. There is a continuing tension between setting aside data for future use in tests, on the one hand, and using as much data as possible to develop the best-performing systems, on the other. This tension will not go away. As Kai-Fu notes, "if I had the time, I could have tuned from the training set". Not many have the time to observe every possible restriction on the use of available data.

In the light of all of the foregoing, it is evident that it behooves every site to be very precise in explaining the use of data in training their systems. This is very important when reporting results to NIST and at the meeting.

---------------------------------------------------------------

SPEAKER DEPENDENT TESTS

Doug Paul has also brought to my attention the desirability of being more specific about the rules for testing Speaker-Dependent systems. It has been implicit that the standard condition for "enrollment" or training is to use the full subset of 600 sentence utterances per speaker in the tddt subset [rm1/dep_trn/]. This is what has been used in previous tests, and there is no reason for changing this now.

Doug also advises me that, in previous "Speaker-Dependent" tests, "parameters which are set by multiple tests on the development test set (tddd) data [rm1/dep/dev/] should be the same for all speakers (i.e., speaker independent)". This is a practice used in the BBN and MIT/LL comparisons in the past, and there seems to be no need to change this rule, either.
It is noted that this might result in slightly sub-optimal results for the speaker-dependent systems, but it minimizes the tinkering with system parameters for individual speakers, amounting to defining one speaker-dependent system configuration and enrolling that system for a new speaker with the 600-sentence-utterance training material. We might wish to change this in the future, but there seems to be no good reason to do so at present.

For systems that are "speaker-adaptive", comparisons should be referenced to the 600-sentence-utterance condition of full training. Please advise me of any plans to demonstrate "adaptation". Be sure to document or describe the process of adaptation, particularly the degree to which it is supervised or unsupervised, since this issue seems to attract a lot of attention.

SPEAKER-INDEPENDENT VS. SPEAKER-DEPENDENT

A further note from Rich Schwartz asks for clarification of the arrangements regarding comparisons of "speaker-dependent" and "speaker-independent" systems. Sites developing only speaker-dependent systems (i.e., BBN) were not expected to test on speaker-independent test material. Sites developing speaker-independent systems had the potential of testing on the speaker-dependent test material, but were not required to do so in all cases. Some sites objected to the sheer volume of processing two complete (300-sentence-utterance) sets of test material. Some noted that previous use of the speaker-dependent speakers for test purposes (as in the June '88 tests) renders the use of these speakers invalid for test purposes for speaker-independent systems.

Still, it is obviously of interest to be able to compare the performance of two systems (e.g., one speaker-dependent and another speaker-independent) on the same test material. In such a test, the additional set of test material intended for speaker-dependent systems should simply be regarded as additional speaker-independent test material, and tested without special enrollment. Obviously, this takes time, and not all sites have agreed to do this. Perhaps some sites will be able to report on these tests AFTER the deadline prior to the meeting, or even after the meeting.

ABOUT PROCESSING TIMES...

In earlier DARPA Benchmark Tests, we required that each site report processing times and system configuration (e.g., hardware). More recently, this information hasn't been reported. Now, once again, with the expansion of the test sets, I hear comments about limitations on the ability to run tests based on processing and/or set-up or training time. It is recommended that reported results include discussion of processing time per sentence utterance, at the least. At one point in the past (when processing "live" test utterances for speakers JAS, DSP and TDY) there were tales of processing times of the order of nearly an hour for some utterances, and reporting on this factor served to underscore the desirability of real-time systems. While there are no formal constraints on maximum processing time or allowable hardware, it will be useful to document processing times along with the hardware used in these tests.
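One minimal way to collect the per-utterance figures suggested above is sketched below. The recognize() call is a stand-in for whatever decoding routine a site actually runs, and the utterance durations are assumed to be available (e.g., from the waveform headers); both are assumptions for illustration, not part of any required reporting format.

    # Sketch of per-utterance processing-time logging.  "recognize" and the
    # (utterance_id, duration) list are hypothetical stand-ins.

    import time

    def timed_run(utterances, recognize):
        """utterances: list of (utterance_id, duration_in_seconds) pairs."""
        total_cpu = total_speech = 0.0
        for utt_id, duration in utterances:
            start = time.perf_counter()
            recognize(utt_id)                     # hypothetical decoder call
            elapsed = time.perf_counter() - start
            total_cpu += elapsed
            total_speech += duration
            print("%s: %.1f s for %.1f s of speech (%.1f x real time)"
                  % (utt_id, elapsed, duration, elapsed / duration))
        print("overall: %.1f s per utterance, %.1f x real time"
              % (total_cpu / len(utterances), total_cpu / total_speech))

Reporting the hardware used alongside such figures would cover both of the items requested above.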
APPENDIX: Analysis of Overlaps in Training and Test Material

[Edited portion of a message from Hy Murveit and Mitch Weintraub (received at NIST on January 19, 1989)]

[The set of 80 tdit speakers:]
(adg04 ahh05 aks01 apv03 bar07 bas04 bef03 bjk02 bma04 bmh05 bns04 bom07 bwm03 bwp04 cal03 cef03 ceg08 cft04 cke03 cmb05 cmr02 crc05 csh03 cth07 cyl02 das05 daw18 dhs03 djh03 dlb02 dlh03 dlr07 dlr17 dms04 dmt02 drd06 dsc06 dtb03 eeh04 ejs08 ers07 etb01 fwk04 gjd04 gmd05 gxp04 hbs07 hes05 hpg03 jcs05 jem01 jma02 jpg05 jrk06 jws04 jxm05 kes06 kkh05 lih05 ljc04 mah05 mcc05 mdm04 mgk02 mju06 mmh02 pgh01 pgl02 rcg01 rgm04 rkm05 rtk03 rws01 sdc05 tju06 tlb03 tpf01 utb05 vlo05 wem05)

[The set of 40 tdid speakers:]
(ajp06 awf05 bcg18 bgt05 bpm05 bth07 cae06 chh07 cpm01 ctm02 ctt05 dab01 dlc03 dpk01 dtd05 ejl06 esd06 esj06 grl01 gwt02 hjb03 hxs06 jfc06 jfr07 jlm04 jln08 jmd02 jsa05 lag06 ljd03 lmk05 rav05 rdd01 rjm12 sah01 sds06 sjk06 tab07 tdp05 wbt01)

[OLD SPEAKER-INDEPENDENT TEST SET SPEAKERS:]

Combined March '87 and October '87 test speaker list: 15 speakers [all from tdid]
(awf05 bcg18 bth07 ctm02 ctt05 dab01 dlc03 dpk01 gwt02 jfc06 jfr07 ljd03 lmk05 sah01 sjk06)

June '88 test speaker list: 12 speakers: 8 [from] tdit, 3 [from] tdid, 1 [from] neither [tdit nor tdid]
(bef03 cmr02 das12 dms04 dtb03 dtd05 ers07 hxs06 jws04 pgh01 rkm05 tab07)

------------------------------------------------------------

[OVERLAPS]

1. [15] '87 test speakers, all from tdid [rm1/ind/dev_aug]

2. [8] June '88 test speakers in tdit [rm1/excluded]
(bef03 cmr02 dms04 dtb03 ers07 jws04 pgh01 rkm05)

3. [3] June '88 test speakers in tdid [rm1/ind/dev_excl]
(dtd05 hxs06 tab07)

4. [1] June '88 test speaker in neither [tdit nor tdid]
(das12)

[POTENTIALLY VALID TRAINING SPEAKERS: NOT USED IN ANY PREVIOUS SPEAKER-INDEPENDENT TEST SET...]

1. 72 tdit speakers not in '87 or '88 test sets
(adg04 ahh05 aks01 apv03 bar07 bas04 bjk02 bma04 bmh05 bns04 bom07 bwm03 bwp04 cal03 cef03 ceg08 cft04 cke03 cmb05 crc05 csh03 cth07 cyl02 das05 daw18 dhs03 djh03 dlb02 dlh03 dlr07 dlr17 dmt02 drd06 dsc06 eeh04 ejs08 etb01 fwk04 gjd04 gmd05 gxp04 hbs07 hes05 hpg03 jcs05 jem01 jma02 jpg05 jrk06 jxm05 kes06 kkh05 lih05 ljc04 mah05 mcc05 mdm04 mgk02 mju06 mmh02 pgl02 rcg01 rgm04 rtk03 rws01 sdc05 tju06 tlb03 tpf01 utb05 vlo05 wem05)

2. 22 tdid speakers not in '87 or '88 test sets
[NOTE: THE ELIMINATION OF THE '87 AND '88 TEST SET SPEAKERS REDUCES THIS SET TO (ONLY) 22 SPEAKERS, FROM THE ORIGINAL 40, BY DELETING THE 3 OVERLAPPING WITH THE JUNE '88 TEST SET AND THE 15 USED IN TESTS IN '87]
(ajp06 bgt05 bpm05 cae06 chh07 cpm01 ejl06 esd06 esj06 grl01 hjb03 jlm04 jln08 jmd02 jsa05 lag06 rav05 rdd01 rjm12 sds06 tdp05 wbt01)

[RATIONALE FOR A SET OF 94 SPEAKERS AND 3540 SENTENCE UTTERANCES:]
72 + 22 = 94 speakers
(72 * 40) + (22 * 30) = 3540 sentences

[End of material from Hy and Mitch]
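As a cross-check on the bookkeeping above, the short sketch below reproduces the counts with simple set operations. The two pool variables are placeholders to be filled in with the 80 tdit and 40 tdid speaker codes listed at the top of this appendix; the '87 and June '88 test speaker lists are copied from above. The expected results (72, 22, and 94 speakers; 3540 sentence utterances) are those given in the rationale.

    # Placeholder sketch: paste the 80 tdit and 40 tdid speaker codes (listed
    # at the top of this appendix) into the two pool variables, then run.

    tdit_pool = set("PASTE THE 80 TDIT SPEAKER CODES HERE".split())
    tdid_pool = set("PASTE THE 40 TDID SPEAKER CODES HERE".split())

    # Combined March '87 and October '87 test speakers (15, all from tdid):
    test_87 = set("""awf05 bcg18 bth07 ctm02 ctt05 dab01 dlc03 dpk01 gwt02
                     jfc06 jfr07 ljd03 lmk05 sah01 sjk06""".split())

    # June '88 test speakers (12):
    test_88 = set("""bef03 cmr02 das12 dms04 dtb03 dtd05 ers07 hxs06 jws04
                     pgh01 rkm05 tab07""".split())

    tdit_clean = tdit_pool - test_87 - test_88     # expect 72 speakers
    tdid_clean = tdid_pool - test_87 - test_88     # expect 22 speakers

    print("tdit remaining:", len(tdit_clean))                          # 72
    print("tdid remaining:", len(tdid_clean))                          # 22
    print("total speakers:", len(tdit_clean) + len(tdid_clean))        # 94
    print("sentence utterances:",
          40 * len(tdit_clean) + 30 * len(tdid_clean))                 # 3540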