Editorial note: the following is a revised version of material prepared to document DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA speech recognition workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note is an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories on this CD-ROM in which the material may be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are included within square brackets.

****************************************

SELECTED TEST MATERIAL, TEST AND SCORING PROCEDURES
FOR THE OCTOBER 1987 DARPA BENCHMARK TESTS

David S. Pallett
Institute for Computer Sciences and Technology
National Bureau of Standards
Gaithersburg, MD 20899

ABSTRACT

This paper describes the test material, procedures and scoring conventions used for the October '87 DARPA Benchmark Tests. Selected subsets of previously recorded speech corpus material were identified and used for tests of both Speaker Dependent and Speaker Independent technologies. A set of 70 sentence texts was identified and used by three designated "live talkers" in tests administered at BBN and CMU during September. A convention was defined for representing the output of the systems, and a standardized string matching and scoring convention was to be used in reporting the results of the Benchmark Tests.

INTRODUCTION

At the Fall 1986 DARPA Speech Recognition Meeting, plans were discussed for implementing Benchmark Tests of the Continuous Speech Recognition Systems being developed under DARPA sponsorship. These plans were further developed during the period October 1986 - March 1987, and a preliminary implementation of the test procedures was conducted prior to the March 1987 Meeting. At that meeting, BBN reported on the performance of their Speaker-Dependent technology using a set of 100 sentence utterances (25 sentence utterances from each of 4 designated speakers) chosen from the Speaker Dependent Development Test portion of the DARPA Resource Management Speech Corpus [rm1/dep/dev/]. Approximately one month later, CMU reported the results of similar tests involving another set of 100 sentence utterances (10 sentence utterances from each of 10 designated speakers) chosen from the Speaker Independent Development Test portion [rm1/ind/dev_aug/]. [Editorial Note: These March 1987 Test Set results were for the "ANGEL" system, not the SPHINX system.] Each site also reported on the results of tests conducted on sets of 90 sentence utterances provided by "live talkers" (30 utterances from each of three designated speakers). The results of these tests were analyzed at NBS and distributed within the DARPA community.

In preparation for the Fall 1987 Speech Recognition Meeting, NBS selected additional test material for similar tests.
NBS participated in the implementation of the "live talker" test procedures and developed and distributed standardized scoring software to be used in reporting performance results at the Fall 1987 and future Meetings. This paper serves to identify the test material used for these tests, to describe the test procedures, and to outline the characteristics of the scoring software.

RESOURCE MANAGEMENT SPEECH CORPUS TEST MATERIAL

Speaker Independent Test Material
------- ----------- ---- --------

For the March '87 tests, a set of ten speakers was identified, drawn from the Speaker Independent Development Test Subset of the Resource Management Corpus [rm1/ind/dev_aug/] [1]. This set of ten speakers included 7 male speakers and 3 female speakers. Data on these speakers are presented in papers from the March '87 Meeting [1,2]. Ten sentence utterances for each of these speakers were used for test purposes.

For the October '87 tests, a set of six speakers was identified, also from the Speaker Independent Development Test Subset [rm1/ind/dev_aug/], and 10 utterances were specified for each of these speakers, amounting to a total of 60 sentence utterances from this portion of the Resource Management Corpus. [Editorial Note: Of these six speakers, five were new speakers, and one (subject gwt0) had been included in the March '87 test material. The 10 test utterances for subject gwt0 were the same for both the March '87 and October '87 tests. This portion of the October '87 test material thus includes 50 previously unused test utterances and 10 "retest" utterances.]

An additional set of four new speakers was chosen from the Speaker Dependent Development Test Subset [rm1/dep/dev/], and 25 sentence utterances were specified for each of these speakers. An additional 100 sentence utterances are thus identified in this portion of the test material. This portion is of particular interest since the same material was to be used in tests of the Speaker Dependent technology, providing some overlap of a subset of test material.

Table 1 provides detailed information on the individual speakers' regional backgrounds, race, year of birth and educational level for these tests of Speaker Independent technology.

Subject  Sex  Region          Race  Year of Birth  Education
-------  ---  ------          ----  -------------  ---------
GWT      M    NORTHERN        WHT   '21            B.S.
CTM      M    NORTHERN        WHT   '55            H.S.
DPK      M    NEW ENGLAND     WHT   '60            B.S.
LJD      F    NORTH MIDLAND   WHT   '61            B.S.
LMK      F    SOUTHERN        WHT   '60            B.S.
SJK      M    NEW YORK CITY   WHT   '31            B.S.

Also used in tests of Speaker-Dependent technology:

DTB      M    NORTH MIDLAND   AMR*  '42            B.S.
DTD      F    SOUTHERN**      BLK   '54            B.S.
PGH      M    NEW ENGLAND     WHT   '63            B.S.
TAB      M    WESTERN         WHT   '60            B.S.
----------------------------------------------------------------
*  (American Indian)
** (Subject reportedly "tried to change Southern accent to fit Chicago".)

Table 1. Speaker Independent Test Speakers

A total of 160 sentence utterances by a total of ten speakers are thus specified for use in the October 1987 Benchmark Tests of Speaker Independent technology.

Speaker Dependent Test Material
------- --------- ---- --------

For the March '87 tests, a set of four speakers was identified, including 3 males and 1 female [2]. Twenty-five sentence utterances for each of these speakers [from rm1/dep/dev/] were specified and used for test purposes. In these tests, the sentence utterances selected were typically the first 25 sentences recorded on the data tape for each individual.
For the October '87 tests, another set of 25 sentence utterances (from each of the same four speakers) was specified [from rm1/dep/dev/] for test purposes. However, for each speaker, 10 of this set of 25 sentence utterances are the same as some of those used for the March '87 tests. These were selected as sentences for which at least one error occurred under at least one of the test conditions in BBN's March tests. These sentence utterances were selected for "retest" in order to permit demonstrations of incremental progress. The remaining 15 sentence utterances for these four speakers were randomly chosen [from the utterances for those speakers contained within rm1/dep/dev/]. It was recognized that, by selecting particularly difficult sentence utterances, the performance statistics for the composite set of 25 sentences for each speaker may be biased somewhat lower than if all of the sentences had been randomly chosen, and that separate reporting for the subsets of 10 difficult sentences and 15 randomly chosen sentences would be appropriate.

An additional set of four (new) speakers was identified from the Speaker Dependent Development Test Subset [rm1/dep/dev/], and a set of 25 sentence utterances was specified for each of these speakers. This set of 100 sentence utterances is the same as that to be used in tests of Speaker Independent technology.

Table 2 provides information on the speakers used for tests of the Speaker Dependent technology.

Subject  Sex  Region          Race  Year of Birth  Education
-------  ---  ------          ----  -------------  ---------
March '87 Set:
BEF      M    NORTH MIDLAND   WHT   '52            Ph.D.
CMR      F    NORTHERN        WHT   '51            M.S.
JWS      M    SOUTH MIDLAND   WHT   '40            B.S.
RKM      M    SOUTHERN        BLK   '56            B.S.

Fall '87 Set (also used for Speaker Independent technology):
DTB      M    NORTH MIDLAND   AMR*  '42            B.S.
DTD      F    SOUTHERN**      BLK   '54            B.S.
PGH      M    NEW ENGLAND     WHT   '63            B.S.
TAB      M    WESTERN         WHT   '60            B.S.
----------------------------------------------------------------
*  (American Indian)
** (Subject reportedly "tried to change southern accent to fit Chicago".)

Table 2. Speaker Dependent Test Speakers

A total of 200 sentence utterances by a total of eight speakers (40 difficult + 60 randomly chosen sentences by the four speakers used in the March '87 tests, plus 100 randomly chosen sentences for the four new speakers) was thus designated for use in the October 1987 DARPA Benchmark Tests of Speaker Dependent technology.

LIVE TALKER TEST MATERIAL

In the March '87 tests, each of three speakers (JAS, TDY and DSP) read 30 sentence texts drawn from the text corpus [rm1/doc/al_sents.txt] for Resource Management. Ten of the 30 texts were the same for each of the three scripts. Thus a total of 90 sentence utterances, drawn from a set of 70 unique sentence texts, were used for test purposes. Each of the three talkers also provided the set of 10 "Rapid Adaptation" sentences. Different productions of this test material were provided in tests occurring on different days at both BBN and CMU. Differences in vocal fatigue (and other factors) were noted when recording the utterances at the two sites.

In the March '87 tests conducted at BBN, use was made of the rapid adaptation material to "adapt" their system for use by the live talkers. However, for a number of reasons, it was agreed at the March '87 Meeting that future "live tests" of the Speaker Dependent technology would not use the rapid adaptation material, but would be conducted in a formally speaker-dependent manner.
During July of 1987, the three test subjects (JAS, TDY and DSP) visited BBN and recorded system training or enrollment material. At CMU, the live talker tests were to be conducted either in (at CMU's preference) a formally speaker-independent manner or with the use of rapid adaptation, so that no formal system training was necessary.

The protocol agreed upon for recording the enrollment material at BBN involved two half-hour sessions for each subject. The set of 600 sentences in the training sentence subset [see sentence numbers sr001 through sr600 in rm1/doc/al_sents.txt] was organized so as to present the longer, more difficult to read, sentences toward the end of the script. Each subject was instructed to read the sentences without undue concern if a sentence was misread, going immediately on to the next sentence. Following the sessions, BBN staff listened to the material and deleted those sentences involving reading errors. It proved possible to collect more than 300 acceptable training sentence utterances from each of the test speakers during the course of the two half-hour sessions. Because of the agreed-upon time limitation of one hour for each subject, the more difficult to read texts, which appeared toward the end of the script, were not recorded. The total duration of the speech material for each of the subjects was of the order of 17 minutes, all or any portion of which could be used for system training prior to the "live tests" conducted in September.

The experience of the "live talkers" in the March '87 tests suggested that the longer sentences used in those tests (randomly drawn from the set of [approximately] 2800 sentence texts [rm1/doc/al_sents.txt]) were in some cases difficult to read. [It was noted by JAS that they did not seem representative of those that might be used for interactive dialogue in an actual resource management application.] For the March '87 tests, the texts averaged 7.9 words in length. Accordingly, selection of "live talker" texts for the October '87 tests was biased somewhat toward shorter sentences, though in practice the effect was primarily to reduce the variation in sentence length, producing an average length of 6.8 words.

As in the March '87 tests, each "live" test speaker read scripts presenting the prompt forms of these sentence texts in site visits conducted during September 1987. A Sennheiser HMD 414-6 microphone (as used at TI in collecting the Resource Management Speech Corpus) was to be used, and the test environment was to be a computer lab or conference room with little or no competing (background) conversation. A portion of the test material was processed "live" or "on-line" (during the test), and the remainder was processed off-line following the visit, but was to be reported upon at the Fall '87 meeting.

TEST PROCEDURES

Experimental Design
------------ ------

The October 1987 Benchmark Tests were intended to be very similar to those implemented in March '87. The earlier procedures are described in an earlier paper [3]. The specified utterances for the designated speakers in the Resource Management Speech Corpus were to be processed with and without the use of imposed grammars. Comparably detailed results were to be reported for both conditions. No other parameters were to be changed for these comparative tests. Use of the material for "rapid adaptation" was optional.
In contrast to the earlier tests, BBN did not make use of the "rapid adaptation" mode in implementing the "live tests", but used the training material provided by the three specified "live talkers".

Live Test Protocol
---- ---- --------

These protocols were changed slightly from those used in March '87. Experience had shown that representative system response times did not permit processing 30 sentences within 30 minutes of elapsed time. After processing several sentences in real time, the remainder of the 30-minute subject time was devoted to collecting the remainder of the 30 sentences per live talker. "Off-line" processing was subsequently used for these sentence utterances. Every effort was to be made to process each of the three test speakers' data using identical processing.

At BBN in March '87, there were problems in obtaining optimal performance (in terms of both recognition accuracy and processing time) for one of the live speakers (TDY) using the speaker adaptive mode. Consequently, processing parameters were revised for this one speaker, and the performance statistics [reported for subject TDY in the March '87 BBN "live test" results] are believed to be in some sense "sub-optimal".

Vocabulary/Lexicon/Output Convention
------------------------- ----------

The conventions used for representing the system output, and, for comparison in scoring, the reference strings, are described in a previous paper [3]. TI provided an implementation of the rules for the Resource Management sentence texts in Standard Normalized Orthographic Representations (SNORs) [rm1/doc/al_sents.snr], and these were used for scoring purposes. In this representation, there are a total of 991 distinct lexical entries, as derived from the set of 2800 sentences developed at BBN [see rm1/doc/lexicon.snr for a list of these lexical entries]. It has been noted that this lexicon is not logically complete. However, it is all-inclusive in the sense that it covers all entries in the recorded corpus, and thus should be provisionally sufficient for the purposes of scoring when using test material derived from the set of 2800 sentences [rm1/doc/al_sents.snr] and/or from the recorded Resource Management Corpus.

Analysis of system responses provided by BBN and CMU for the March '87 tests disclosed that the different sites used different lexical conventions both for internal representations and for system output, complicating scoring. For example, at CMU there were instances of a lexical entry "CITRUS-1", presumed to represent one of the alternative pronunciations of "CITRUS". For such a system response to be scored as correct, there either has to be post-processing of the responses or special adaptation of the scoring software. At BBN, the city (place name) San Diego, represented in the SNOR lexicon as "SAN-DIEGO", was represented as two entries "SAN" and "DIEGO", giving rise to other scoring complications.

There is "a natural assumption that the units used for scoring should be as similar as possible to the lexical units used in a system" [4]. Given differences between systems and the differing lexical representations in different systems, there is a need for a standard representation for each word. For the Resource Management task, the SNOR convention and lexicon fill this role, and they were to be used for the October '87 tests. However, it has become evident that no consistency is to be expected with regard to internal representations.
To assist in understanding what is meant by an "N-Word System" (e.g. the 1000 Word systems presently under study in the DARPA program), it was proposed that the lexical words used by particular systems should always be specified (be they words, phrases, sentences, or combinations thereof). Mappings or postprocessing between the internal representations and the system output (used for evaluation by comparison with the reference strings) should be documented.

SCORING PROCEDURE

At the time the March '87 tests were implemented, no general agreement had been reached concerning the software to be used for scoring the system output. Scoring software was provided to NBS by BBN, CMU and TI for comparative usage with preliminary system output. Subsequently, additional software was provided by Lincoln Laboratories, and C-language code was written at NBS to implement what seemed to be the most attractive features of each software package, as well as to include some new capabilities. The intended purpose of developing this standardized scoring software is to provide a versatile and consistent set of scoring tools.

Dynamic Programming String Matching Algorithm
------- ----------- ------ -------- ---------

Scoring data are derived from comparisons of reference strings and system outputs, using a dynamic programming string alignment algorithm. The C-language string alignment procedure is adapted from code written by Doug Paul at Lincoln Laboratories (following discussions with Rich Schwartz and Francis Kubala at BBN). It is similar to the ERRCOM scoring utility written in Zetalisp at BBN. Both are "functionally identical dynamic programming algorithms for computing the lowest cost alignment between two strings (possibly not unique) given the following constraints on the cost function used to score the alignments:

(1) An exact match incurs no penalty.
(2) Deletion and insertion errors incur equal penalties.
(3) The sum of one deletion and one insertion error penalty is greater than one substitution error penalty.

For ties (multiple best alignments) an arbitrary choice is made. This decision cannot affect the alignment score but merely reorders adjacent substitution and deletion/insertion errors. It is worth mentioning that the NBS and BBN programs make this choice differently, therefore alignments may vary, but scores won't." [5]

Error Taxonomy and Statistics
----- -------- --- ----------

The standard error taxonomy resulting from use of the software includes data on the percentage of words (in the reference string) that are correctly recognized, the percentage of substitutions, the percentage of deletions, the percentage of insertions, and the total percent error (where this total includes substitutions, deletions and insertions). BBN's error taxonomy defines "Word Accuracy" as [100% - (total percent error)]. Note, however, that word accuracy is not in general equal to the percentage of correctly recognized words, because of insertions.
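To make these conventions concrete, the following C-language sketch illustrates a minimal alignment and error-counting routine of the kind described above. It is not the NBS, Lincoln, or BBN code; the penalty values (match = 0, insertion = deletion = 3, substitution = 4) are illustrative choices that satisfy the three constraints quoted above, and the word strings in main() are made-up examples.

    /*
     * Minimal word-string alignment sketch (not the NBS or BBN code).
     * Penalties: match = 0, insertion = deletion = 3, substitution = 4,
     * so one insertion plus one deletion (6) costs more than one
     * substitution (4), as required by the stated constraints.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAXW 128          /* illustrative limit on words per sentence */
    #define INS_PEN 3
    #define DEL_PEN 3
    #define SUB_PEN 4

    struct counts { int corr, sub, del, ins; };

    static struct counts align(char *ref[], int nref, char *hyp[], int nhyp)
    {
        static int cost[MAXW + 1][MAXW + 1];
        int i, j;

        /* Fill the DP matrix: cost[i][j] is the lowest-cost alignment of
           the first i reference words with the first j hypothesis words. */
        for (i = 0; i <= nref; i++) cost[i][0] = i * DEL_PEN;
        for (j = 0; j <= nhyp; j++) cost[0][j] = j * INS_PEN;
        for (i = 1; i <= nref; i++) {
            for (j = 1; j <= nhyp; j++) {
                int match = strcmp(ref[i-1], hyp[j-1]) == 0;
                int diag = cost[i-1][j-1] + (match ? 0 : SUB_PEN);
                int del  = cost[i-1][j] + DEL_PEN;   /* reference word dropped */
                int ins  = cost[i][j-1] + INS_PEN;   /* extra hypothesis word  */
                int best = diag;
                if (del < best) best = del;
                if (ins < best) best = ins;
                cost[i][j] = best;
            }
        }

        /* Back-trace to tally correct words, substitutions, deletions and
           insertions.  Ties are broken arbitrarily; this reorders adjacent
           errors but cannot change the alignment score. */
        struct counts c = {0, 0, 0, 0};
        i = nref; j = nhyp;
        while (i > 0 || j > 0) {
            int match = (i > 0 && j > 0) && strcmp(ref[i-1], hyp[j-1]) == 0;
            if (i > 0 && j > 0 &&
                cost[i][j] == cost[i-1][j-1] + (match ? 0 : SUB_PEN)) {
                if (match) c.corr++; else c.sub++;
                i--; j--;
            } else if (i > 0 && cost[i][j] == cost[i-1][j] + DEL_PEN) {
                c.del++; i--;
            } else {
                c.ins++; j--;
            }
        }
        return c;
    }

    int main(void)
    {
        /* hypothetical example: "SHOW THE CHART" recognized with an
           inserted second "THE" */
        char *ref[] = { "SHOW", "THE", "CHART" };
        char *hyp[] = { "SHOW", "THE", "THE", "CHART" };
        struct counts c = align(ref, 3, hyp, 4);
        double n = 3.0;   /* number of words in the reference string */

        printf("correct %.1f%%  sub %.1f%%  del %.1f%%  ins %.1f%%\n",
               100.0 * c.corr / n, 100.0 * c.sub / n,
               100.0 * c.del / n,  100.0 * c.ins / n);
        /* total percent error and BBN-style "Word Accuracy" */
        printf("total error %.1f%%  word accuracy %.1f%%\n",
               100.0 * (c.sub + c.del + c.ins) / n,
               100.0 - 100.0 * (c.sub + c.del + c.ins) / n);
        return 0;
    }

Compiled and run, this example reports 100% of the reference words as correctly recognized but a word accuracy of only 66.7%, illustrating why word accuracy is not in general equal to the percentage of correctly recognized words.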
Splits and Merges: Contractions
------ --- ------- ------------

Discussions of alternative error taxonomies that have appeared in the literature [6] include discussion of the occurrence of errors called "splits" and "merges". In our error taxonomy, a split is decomposed into consecutive substitution and insertion errors (or an insertion and a substitution). An example of such an error might involve a contraction such as "ISN'T" being reported as "IS" and "NOT". Similarly, a "merge" can be decomposed into consecutive substitution and deletion errors (or a deletion followed by a substitution). Correspondingly, a merge might involve the sequence "IS NOT" recognized as "ISN'T". In general, splits and merges should be reported as (unrelated) substitutions, deletions, and insertions; semantically acceptable splits and merges defy general definition. However, it has been observed that "orthographic contractions are both common and nearly always semantically acceptable" [3]. The NBS scoring software [used in the 1987 tests] contained a table of splits and merges that could be referred to when split or merge "candidates" have been detected following implementation of the dynamic programming algorithm. If the candidate split or merge is a member of the class of (presumed) semantically acceptable splits or merges, it could be listed as such and statistics compiled.

Special Classes of Insertions: Pre- and Post-Shadowing
------- ------- -- ----------- ---- --- --------------

Other references in the literature [6,7] cite the occurrence of "pre-shadowing" or "post-shadowing". Pre-shadowing might occur when an initial fragment of a poly-syllabic word such as "MAXIMUM" is reported as "MAX" and the system subsequently (correctly) responds with the correct answer, in this case "MAXIMUM". The NBS software does not detect the occurrence of these special cases of insertions, nor does our proposed error taxonomy include them.

Homophone Errors
--------- ------

A number of scoring software packages include special provision for scoring substitution errors involving homophones. This may be particularly appropriate for those cases in which there is no imposed grammar and the probability of homophone errors may be high. The NBS scoring software can refer to a table of homophones when classifying errors involving substitutions, to determine whether the errors involve homophones. This option is not ordinarily used, but is provided as a diagnostic tool. [Editorial Note: See "Standard Scoring Procedure" (below) for usage of this option when implementing the October 1987 DARPA Benchmark Tests.]

Synonyms
--------

Analysis of the March '87 results identified some errors involving substitutions of synonyms (e.g. "MAX" for "MAXIMUM"). It can be argued that these errors are semantically acceptable, and the NBS software can be set to classify substitution errors involving synonyms by reference to a table of acceptable synonyms. This option, like that for homophones, is not ordinarily used, but it is provided as a diagnostic tool.

Deletions of "THE"
--------- -- -----

"Deletions of the token "THE" are typically a large proportion of the errors observed in high performance systems and the majority of these deletions leave the semantic intention of the utterance intact" [5]. Thus it has been argued that it would be valuable to compute the proportion of errors of this type and, correspondingly, to score sentences whose only errors involve deletions of the word "THE" as "semantically OK". However, at present, the NBS scoring software does not specially account for this class of error.
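Although the NBS software [used in these tests] did not implement it, a supplemental check of this kind is straightforward once an alignment has been produced. The following C sketch is purely illustrative: the error-record arrays and the semantically_ok() function are hypothetical and are not part of any of the scoring packages described here.

    /*
     * Illustrative supplemental check (not part of the NBS software):
     * score a sentence as "semantically OK" if its only errors are
     * deletions of the word "THE".  The aligned errors for the sentence
     * are assumed to be available as parallel arrays of error types and
     * of the reference tokens involved; these structures are hypothetical.
     */
    #include <string.h>

    enum err_type { ERR_SUB, ERR_DEL, ERR_INS };

    int semantically_ok(const enum err_type type[], const char *ref_token[],
                        int nerrors)
    {
        int i;
        if (nerrors == 0)
            return 1;                  /* correctly recognized sentence */
        for (i = 0; i < nerrors; i++) {
            if (type[i] != ERR_DEL)
                return 0;              /* any sub or ins disqualifies   */
            if (strcmp(ref_token[i], "THE") != 0)
                return 0;              /* deletion of some other word   */
        }
        return 1;                      /* only deletions of "THE"       */
    }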
Time Criterion for Word Beginnings
---- --------- --- ---- ----------

It has been argued that the most appropriate criterion for scoring the output of a speech recognition system would include some measure of the accuracy of a system in reporting the identified word boundaries. One proposal to this effect suggests that the test material be (manually) marked with word boundary information (word beginning and end times) and that the scoring algorithm refer to this information, as well as the correctness of the word, in classifying the response as correct. According to this proposal, word beginning times should be within a specified time window (tentatively set at 70 msec) of the value identified in the manual labelling of the test material. Procedures of this sort have been used in studies of segmentation and in evaluating the output of word spotting systems. There are several different algorithmic approaches to implementation of such a proposal. The scoring software [used in the October '87 tests] allows for the incorporation of such a criterion following implementation of the dynamic programming string matching algorithm, and requires reference strings and system output that contain the (additional) timing information.

Imposition of such a criterion tends to increase the number of errors, and different systems err in different ways. In some cases the predominant additional errors are due to errors in detecting words beginning with weak fricatives preceded by words ending in stop consonants. In other systems, the occurrence of a deletion or insertion may skew successive reported word beginning times sufficiently to lead to multiple errors. The need for reliance on manually labelled word boundary information is an unattractive aspect of such a scoring procedure. For these reasons, although the NBS scoring software permits incorporating such a criterion, it is not ordinarily used.

Processing Times
---------- -----

In the Benchmark Tests, the processing times and system configuration are to be reported.

Sentence Level Scoring
-------- ----- -------

In the NBS scoring software, a sentence is scored as correctly recognized only if all words have been recognized and there are no insertion or deletion errors. Supplemental analyses (such as the fraction of sentences for which the only errors involve deletions of the word "THE") are permitted, but only to supplement the basic data.

Characterizing the Imposed Grammar
-------------- --- ------- -------

At present, no general agreement has been reached on a completely unambiguous procedure for characterizing the imposed grammars in all systems. A proposal for characterizing the complexity of a language model in terms of the "test-set perplexity" has been circulated [8], but it appears that different descriptions of imposed grammars may be used by different sites for the present tests. This matter needs to be actively addressed.
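For reference, test-set perplexity is commonly computed as two raised to the negative average log (base 2) probability assigned by the language model to the words of a test set. The following C sketch shows that computation under the assumption that per-word probabilities are available; it is illustrative only and is not necessarily the exact formulation circulated in [8].

    /*
     * Illustrative computation of test-set perplexity (a sketch only,
     * following the common definition in the literature).  The array of
     * per-word probabilities P(w_i | history), as assigned by the
     * language model over the test set, is assumed to be given.
     */
    #include <math.h>
    #include <stdio.h>

    double test_set_perplexity(const double word_prob[], int nwords)
    {
        double log2sum = 0.0;
        int i;
        for (i = 0; i < nwords; i++)
            log2sum += log(word_prob[i]) / log(2.0);   /* log base 2 */
        /* perplexity = 2 ^ ( - average log2 probability per word ) */
        return pow(2.0, -log2sum / nwords);
    }

    int main(void)
    {
        /* e.g., a uniform branching factor of 60 words gives perplexity 60 */
        double probs[] = { 1.0/60.0, 1.0/60.0, 1.0/60.0, 1.0/60.0 };
        printf("perplexity = %.1f\n", test_set_perplexity(probs, 4));
        return 0;
    }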
"Standard Scoring Procedure"
--------- ------- ----------

The reference strings to be used for scoring are to be the SNOR representations [rm1/doc/al_sents.snr] developed from the lexical/output convention described in [3]. The lexical convention for system output for the present tests is that for the 991-word SNOR lexicon [rm1/doc/lexicon.snr].

The option to score errors that involve substitutions of homophones as "correct" is NOT to be used for standard scoring purposes. It may be used for supplementary analysis, if so desired. [Editorial Note: This procedure was changed for the June 1988 and subsequent DARPA Benchmark Tests. When using the "no grammar" test condition, homophone substitution errors were counted as correct in the later tests. However, when using the "word-pair grammar" condition, homophone substitution errors are counted as errors in the later tests.]

The option to score errors involving synonyms as correct is NOT to be used.

The option to analyze splits and merges and to identify errors involving contractions is NOT to be used.

The option to impose constraints on the beginning time of each word is NOT to be used. It may be used for supplementary purposes, if so desired, but it is the responsibility of the organization choosing that option to mark the word boundaries on the reference strings (speech files) and to make that information available to other DARPA researchers on request.

Summary statistics on the percentage of words correctly recognized (based on the number of words in the reference strings), and on the percentages of substitutions, deletions, and insertions, are to be reported. The total percent error is to include substitutions, deletions, and insertions. These statistics are to be reported for each speaker under each test condition, as well as for the larger test subsets to which the test speakers belong. As mentioned previously, data are to be reported for tests conducted without the use of an imposed grammar as well as with the imposed grammar(s). System output data are to be made available to NBS for additional analysis.

ACKNOWLEDGEMENTS

The development of the NBS scoring software described in this paper involved contributions from a number of individuals at several organizations. Doug Paul is to be thanked for making available a C-language program that implements a dynamic programming algorithm essentially identical to that used at BBN. At BBN, Francis Kubala provided LISP code for the BBN ERRCOM utility and is to be thanked for his cooperation in implementing the performance evaluation tests. At CMU, Rich Stern is to be thanked for his cooperation in implementing the performance evaluation tests. At TI, George Doddington and Bill Fisher are to be thanked for providing another scoring utility, in FORTRAN. Each of these scoring packages has demonstrable merits: we sought to combine the most attractive features of each.

Patti Price at BBN, Jared Bernstein at SRI and Bill Fisher at TI deserve special thanks for cooperating in the selection of test material for the benchmark tests and in working toward consensus on the lexicon and output [SNOR] convention. Alex Rudnicky at CMU contributed significantly in clarifying the distinctions to be made between lexical representations that are internal to systems and the need for specifying the mappings between these representations and the system output used for evaluation purposes. At NBS, Lynn Cherny was responsible for analyzing the results of the March '87 tests and for distinguishing between "errors" that were real and those that were artifacts due to differences in lexical/output convention and string alignments. Credit for coding the scoring software goes to Stan Janet.

REFERENCES

[1] W.M. Fisher, "The DARPA Task Domain Speech Recognition Database", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[2] D.S. Pallett, "Selected Test Material for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[3] D.S. Pallett, "Test Procedures for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[4] Private communication with Alex Rudnicky, September 21, 1987.

[5] Private communication with Francis Kubala et al., August 13, 1987.
Moody, "Performance Evaluation of Connected Speech Recognition Systems", Proceedings of Speech Tech '87, New York, NY, April 28-30, 1987, pp. 269-274. [7] F. Dreizin, R. Kittredge, and D. Korelsky, "Semantic Support in Speech recognition: An Application to Fire Control Dialogues", Proceedings of AVIOS '86 Voice I/O Systems Applications Conference, Alexandria, VA, September 16-18, 1986, pp. 339-354. [8] S. Roucos, "Measuring Perplexity of Language Models Used in Speech Recognizers", unpublished manuscript circulated within DARPA research community, September, 1987.