Editorial note: the following consists of a revised version of material prepared to document DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA speech recognition workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note consists of an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories in this CD-ROM in which the material might be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are included within square brackets.

****************************************

SELECTED TEST MATERIAL, TEST AND SCORING PROCEDURES
FOR THE JUNE 1988 DARPA BENCHMARK TESTS

David S. Pallett
Institute for Computer Sciences and Technology
National Bureau of Standards
Gaithersburg, MD 20899

ABSTRACT

This paper describes the test material, procedures and scoring conventions used for the June 1988 DARPA Benchmark Tests. [In contrast to other DARPA Benchmark Tests, only] one selected subset of recorded speech corpus material was identified, and this subset was used for tests of both Speaker Dependent and Speaker Independent technologies. [Further,] in contrast to previous DARPA Benchmark Tests, there were no "live" tests during this series of tests. The same [SNOR] convention used in previous tests was to be used for representing the output of the systems, and a standardized string matching and scoring convention was to be used in reporting the results of the Benchmark Tests.

INTRODUCTION

At the Fall 1986 DARPA Speech Recognition Meeting, plans were discussed for implementing Benchmark Tests of the Continuous Speech Recognition Systems being developed under DARPA sponsorship. These plans were further developed during the period October 1986 - March 1987. A preliminary implementation of the test procedures was conducted prior to the March 1987 Meeting. At that meeting, BBN reported on the performance of their Speaker-Dependent technology using a set of 100 sentence utterances (25 sentence utterances from each of 4 designated speakers) chosen from the Resource Management Development Test Set portion of the DARPA Resource Management Speech Corpus [rm1/dep/dev/]. Approximately one month later, CMU reported the results of similar tests [for the CMU ANGEL system] involving another set of 100 sentence utterances (10 sentence utterances from each of 10 designated speakers) [from rm1/ind/dev_aug]. Each site also reported on the results of tests conducted on a set of 90 sentence utterances provided by "live talkers" (30 utterances from each of three designated speakers). The results of these tests were analyzed at NBS and distributed within the DARPA community.

In preparation for the October 1987 Speech Recognition Meeting, NBS selected additional test material for similar tests.
NBS participated in the implementation of the "live talker" test procedures and developed and distributed standardized scoring software to be used in reporting performance results at the October 1987 and future Meetings. The results of these Benchmark Tests conducted by BBN and CMU were reported at the October 1987 Meeting. In the Spring of 1988, MIT's Lincoln Laboratories conducted Benchmark Tests using the specified test material and test protocol, and reported those results within the community [9].

In preparation for the June 1988 Speech Recognition Meeting, NBS selected additional previously unused test material [from rm1/dep/dev/], made minor modifications to the scoring software (primarily in providing a standardized format for data tabulation and minor revisions to the table of homophones), and distributed the test material and scoring software to several sites. This paper identifies the test material used for the Benchmark Tests conducted in June 1988, describes the test procedures, and outlines the characteristics of the scoring software.

RESOURCE MANAGEMENT SPEECH CORPUS TEST MATERIAL

Test Material
---- --------

Previous Tests
-------- -----

Speaker Independent Test Material
------- ----------- ---- --------

For the March '87 tests, a set of ten speakers was identified, drawn from the Speaker Independent Development Test Subset of the Resource Management Speech Corpus [rm1/ind/dev_aug/] recorded at TI [1]. Data on these speakers are given in papers presented at the March '87 Meeting [1,2]. Ten sentence utterances for each of these speakers were used for test purposes.

For the October '87 tests, an additional set of six speakers was identified from the Speaker Independent Development Test Subset, and 10 utterances were specified for each of these speakers, amounting to a total of 60 sentence utterances in this portion of the test material. [EDITORIAL NOTE: Of these six speakers, five were new, and one (subject gtw0) had been included in the March '87 tests.] An additional set of four new speakers was chosen from the Speaker Dependent Development Test Subset [rm1/dep/dev/], and 25 sentence utterances were specified for each of these speakers. An additional 100 sentence utterances were thus identified in this portion of the test material. A total of 160 sentence utterances from ten speakers was thus specified for use in the October '87 Benchmark Tests of Speaker Independent technology. [Editorial Note: Results for the CMU SPHINX Speaker-Independent System were reported in April of 1988, using a 15 speaker (150 sentence utterance) test set derived from the March and October 1987 DARPA Benchmark Test sets.]

Speaker Dependent Test Material
------- --------- ---- --------

For the March '87 tests, a set of four speakers was identified, including 3 males and 1 female [2]. Twenty-five sentence utterances for each of these speakers were specified [from rm1/dep/dev/] and used for test purposes. In previous descriptions of these tests [10], selection of particular sentence utterances was [originally] referred to as a "random process" because no effort was made to review or select specific sentence texts. This was not in fact a truly random selection process, since the selected sentence utterances were typically the first 25 sentences contained on the data tape for that individual. The selection of texts per se is believed to be random, but the selected sentence utterances were consecutively recorded within a recording session.
This proved to establish an unfortunate precedent, since there was no effort to randomize selection from within an individual recording session, and "within session effects" may have occurred. The most commonly observed within session effect seems to be relatively carefully read speech at the beginning of a session, and a tendency toward more casually read speech at the end of the session.

For the October '87 tests, another set of 25 sentence utterances from each of the same four speakers was specified [from rm1/dep/dev/] to be used for test purposes. For each speaker, 10 of these sentence utterances are the same as for the March '87 tests, and were selected as sentence utterances for which at least one error occurred under at least one of the test conditions in BBN's March '87 tests. The purpose of selecting these particular "difficult" sentence utterances was to facilitate demonstrations of incremental progress. (In fact, no analysis to identify such incremental progress was completed.) It is recognized that, by selecting particularly difficult sentence utterances, the performance statistics for the composite set of 25 sentences for each speaker would be biased somewhat lower than if all of the sentences had been randomly chosen, and that separate reporting for the subsets of 10 difficult sentences and 15 "randomly chosen" sentences would be appropriate.

An additional set of four (new) speakers was identified from the Speaker Dependent Development Test Subset [rm1/dep/dev/], and a set of 25 sentence utterances was specified for each of these speakers. This set of 100 sentence utterances is the same as that [also] used in tests of Speaker Independent technology in the October '87 tests. A total of 200 sentence utterances by eight speakers (40 "difficult" + 60 "randomly chosen" sentences by the four speakers used in the March '87 tests, plus 100 sentences for the four new speakers) was thus identified for use in the October '87 Benchmark Tests of Speaker Dependent technology.

June '88 Tests
--------------

For the June '88 tests, it was decided to select one set of test material drawn from the Speaker Dependent Development Test Set (designated tddd in the [original] corpus) [rm1/dep/dev/ in the CD-ROM version] and to use this one set of test material for both Speaker Independent and Speaker Dependent technologies. This was intended to permit comparisons of relative performance on identical test material. It was thought wise not to use the task domain Speaker Independent (and Dependent) Evaluation Test sets (tdie and tdde) materials [some of which are to be found in rm1/ind_eval and rm1/dep_eval] at this time, but to withhold them for future use. [EDITORIAL NOTE: Some of this material was released for the February '89 and October '89 Benchmark Tests: still more is presently unreleased and designated for future tests.]

It was also decided to increase the size of the designated test material to 300 test sentence utterances. Accordingly, following precedents established in the March and October '87 tests, blocks of 25 unused test sentence utterances were selected for each of the 12 speakers in the speaker dependent material [from rm1/dep/dev]. This resulted in the introduction of 4 new speakers, in addition to the 8 used in previous tests. Table 1 lists some information for the 12 speakers in this test material.
                                         Year of
Subject   Sex   Region           Race    Birth    Education
-------   ---   ------           ----    -----    ---------

(Speakers also used in previous Benchmark Tests)

BEF0      M     NORTH MIDLAND    WHT     1952     PHD
CMR0      F     NORTHERN         WHT     1951     MS
JWS0      M     SOUTH MIDLAND    WHT     1940     BS
RKM0      M     SOUTHERN         BLK     1956     BS
DTB0      M     NORTH MIDLAND    AMR*    1942     BS
DTD0      F     SOUTHERN*        BLK     1954     BS
PGH0      M     NEW ENGLAND      WHT     1963     BS
TAB0      M     WESTERN          WHT     1960     BS
-----------------------------------------------------------------

(Speakers not previously used for Benchmark Tests)

DAS1**    F     NORTHERN         WHT     1959     MS
DMS0      F     SOUTH MIDLAND    WHT     1954     BS
ERS0      M     WESTERN          WHT     1957     MS
HXS0      F     NEW YORK CITY    WHT     1941     BS
----------------------------------------------------------------

*  ("AMR" indicates American Indian)
** ("1" indicates second individual with initials DAS)

                          TABLE 1
         Test Speakers for June '88 Benchmark Tests

Perhaps unfortunately, no effort was made to randomize selection of the individual files, and the utterances [selected for use] in this test set may have been recorded later in the [original] recording session and may reflect "within session effects". In future selections of test material, efforts should be made to randomize all aspects of the test material (e.g. selection of texts and of utterances from within the recording session).

LIVE TALKER TESTS

In the March '87 and October '87 tests, a "live talker" test protocol was defined and implemented. This test protocol was described in material made available at the October '87 Meeting [10]. In general, these tests served the purpose of demonstrating the ability of systems at both BBN and CMU to accommodate direct input from a microphone and to return recognition results in about 10 times the duration of the utterance (typically 30 seconds or less). It was also shown that the results for the "live talkers" were comparable to those for the speakers in the corpus recorded at TI, but showed higher error rates than those for some of the highly experienced speakers used for demonstrations at some sites. It was noted that the effort to implement these tests was considerable, since dedicated facilities for real-time digitization and (in some cases) audio transmission lines had to be made available, and the live test speakers had to make site visits. For the June '88 tests, it was decided that it was not necessary to implement similar tests.

TEST PROCEDURES

Experimental Design
------------ ------

The June '88 Benchmark Tests are intended to be very similar to those implemented in March '87 and in October '87. The earlier procedures are described in a previous paper [3] and in the handout distributed at the October '87 Meeting [10]. Specified utterances for the designated speakers in the Resource Management Speech Corpus are to be processed with and without the use of imposed grammars. The case without the use of an imposed grammar has been termed the "all word" case or the "full branching" case. For the current test, the use of one specific imposed grammar is required: this is the word-pair grammar developed at BBN for the Resource Management Speech Corpus. It has a typical test set perplexity of about 60. Comparably detailed results are to be reported for both conditions. No other parameters are to be changed for these comparative tests. Use of the material for "rapid adaptation" is optional.

[On Speaker Independent System Training]
----------------------------------------

[At the May 1988 IEEE Arden House Speech Workshop, several DARPA researchers met to identify a consistent set of training material to be used for the June 1988 DARPA Benchmark Tests.
It was recognized that 8 of the 12 speakers in the June 1988 test material were also included in the Speaker Independent Development Test set (tdid), and that these speakers should not be included in the system training material. Accordingly, a 72 speaker "standard" training set was defined, excluding the speakers in the June '88 test set. This "standard 72 speaker" training set appears in the CD-ROM set in rm1/ind_trn/. Further discussion of the rationale for choice of these test sets is contained in the material describing the February 1989 Benchmark Tests.]

Vocabulary/Lexicon/Output Convention
------------------------- ----------

The conventions used for representing the system output, and, for comparison in scoring, for the reference strings, are described in a previous paper [3]. TI provided an implementation of these rules as Standard Normalized Orthographic Representations (SNORs) [rm1/doc/al_sents.snr], and these are to be used for scoring purposes. In this representation, there are a total of 991 distinct lexical entries [rm1/doc/lexicon.snr], as derived from the set of 2800 sentences developed at BBN. It has been noted that this lexicon is not logically complete, but it is all-inclusive in the sense that it covers all entries in the recorded corpus, and it should thus be provisionally sufficient for scoring purposes when the test material is derived from the set of 2800 sentences and/or from the recorded Resource Management Corpus.

Analysis of system responses provided by BBN and CMU for the March '87 tests disclosed that the different sites used different lexical conventions both for internal representations and for system output, complicating scoring. For example, at CMU there were instances of a lexical entry "CITRUS-1", presumed to represent one of the alternative pronunciations of "CITRUS". For such a system response to be scored as correct, there must either be post-processing of the responses or special adaptation of the scoring software. At BBN, the city (place name) San Diego, represented in the SNOR lexicon as "SAN-DIEGO", was represented as two entries, "SAN" and "DIEGO", giving rise to other scoring complications.

There is "a natural assumption that the units used for scoring should be as similar as possible to the lexical units used in a system" [4]. Given the differences among systems and the differing lexical representations used within them, there is a need for a standard representation of each word. For the Resource Management task, the SNOR convention and lexicon fill this role, and they were used for the October '87 tests and are to be used for the June '88 tests. However, it has become evident that no consistency is to be expected with regard to internal representations. To assist in understanding what is meant by an "N-Word System" (e.g. the 1000 Word systems presently under study in the DARPA program), it is proposed that the lexical units used by particular systems should always be specified (be they words, phrases, sentences, or combinations thereof). Mappings or postprocessing between the internal representations and the system output (used for evaluation by comparison with the reference strings) should be documented.
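To make the intent of such documentation concrete, the following is a minimal sketch, in C, of the kind of postprocessing a site might apply to its output before scoring. The suffix-stripping rule ("CITRUS-1" to "CITRUS") and the rejoining of "SAN" "DIEGO" into the single SNOR entry "SAN-DIEGO" are hypothetical illustrations based on the examples discussed above; they are not part of the SNOR convention itself, nor excerpts from any site's actual software.

    /*
     * Illustrative sketch only: maps hypothetical internal recognizer
     * tokens onto SNOR lexical entries before scoring.  The suffix rule
     * ("CITRUS-1" -> "CITRUS") and the multi-word join ("SAN" "DIEGO" ->
     * "SAN-DIEGO") correspond to the examples discussed in the text; they
     * are not part of the official SNOR convention.
     */
    #include <stdio.h>
    #include <string.h>
    #include <ctype.h>

    /* Strip a trailing "-<digit>" pronunciation-variant marker, in place. */
    static void strip_variant_suffix(char *tok)
    {
        size_t n = strlen(tok);
        if (n >= 2 && tok[n - 2] == '-' && isdigit((unsigned char)tok[n - 1]))
            tok[n - 2] = '\0';
    }

    int main(void)
    {
        /* A hypothetical system output string, one token per array entry. */
        const char *raw[] = { "SHOW", "CITRUS-1", "NEAR", "SAN", "DIEGO" };
        int n = sizeof raw / sizeof raw[0];
        char tok[64];
        int i;

        for (i = 0; i < n; i++) {
            strncpy(tok, raw[i], sizeof tok - 1);
            tok[sizeof tok - 1] = '\0';
            strip_variant_suffix(tok);

            /* Rejoin the place name into its single SNOR lexical entry. */
            if (strcmp(tok, "SAN") == 0 && i + 1 < n &&
                strcmp(raw[i + 1], "DIEGO") == 0) {
                printf("SAN-DIEGO ");
                i++;                    /* consume "DIEGO" as well */
                continue;
            }
            printf("%s ", tok);
        }
        printf("\n");                   /* prints: SHOW CITRUS NEAR SAN-DIEGO */
        return 0;
    }

The particular rules shown are only examples; the point is that whatever mapping a site applies between its internal representation and its reported output should be written down in this level of detail.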
SCORING PROCEDURE

At the time the March '87 tests were implemented, no general agreement had been reached concerning the software to be used for scoring the system output. Scoring software was provided to NBS by BBN, CMU and TI for comparative usage with preliminary system output. Subsequently, additional software was provided by Lincoln Laboratories, and C-language code was written at NBS to implement what seemed to be the most attractive features of each software package, as well as to include some new capabilities. The purpose of developing this standardized scoring software is to provide a versatile and consistent set of scoring tools.

Dynamic Programming String Matching Algorithm
------- ----------- ------ -------- ---------

Scoring data are derived from comparisons of reference strings and system outputs, using a dynamic programming string alignment algorithm. The C-language string alignment procedure was adapted from code written by Doug Paul at Lincoln Laboratories (following discussions with Rich Schwartz and Francis Kubala at BBN). It is similar to the ERRCOM scoring utility written in Zetalisp at BBN. Both are "functionally identical dynamic programming algorithms for computing the lowest cost alignment between two strings (possibly not unique) given the following constraints on the cost function used to score the alignments:

(1) An exact match incurs no penalty.

(2) Deletion and insertion errors incur equal penalties.

(3) The sum of one deletion and one insertion error penalty is greater than one substitution error penalty.

For ties (multiple best alignments) an arbitrary choice is made. This decision cannot affect the alignment score but merely reorders adjacent substitution and deletion/insertion errors. It is worth mentioning that the NBS and BBN programs make this choice differently, therefore alignments may vary, but scores won't." [5]
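As an illustration of the alignment just described, the following C sketch computes a lowest-cost alignment under the three constraints quoted above. The penalty values (3 for an insertion, 3 for a deletion, 4 for a substitution) are chosen only to satisfy those constraints and are not necessarily the values used in the NBS, BBN, or Lincoln software; the word strings are invented.

    /*
     * Minimal sketch of a dynamic programming string alignment with the
     * cost constraints described in the text: an exact match costs 0,
     * insertions and deletions cost the same, and one substitution costs
     * less than one insertion plus one deletion.  The penalty values
     * below are illustrative only.
     */
    #include <stdio.h>
    #include <string.h>

    #define MAXW    64
    #define INS_PEN 3
    #define DEL_PEN 3
    #define SUB_PEN 4

    int main(void)
    {
        /* Invented reference and hypothesis strings, one word per entry. */
        const char *ref[] = { "WHAT", "IS", "THE", "AVERAGE", "SPEED" };
        const char *hyp[] = { "WHAT", "IS", "AVERAGE", "SPEEDS", "NOW" };
        int nr = 5, nh = 5;
        int cost[MAXW + 1][MAXW + 1];
        int i, j;

        /* Fill the cumulative cost matrix. */
        for (i = 0; i <= nr; i++) cost[i][0] = i * DEL_PEN;
        for (j = 0; j <= nh; j++) cost[0][j] = j * INS_PEN;
        for (i = 1; i <= nr; i++) {
            for (j = 1; j <= nh; j++) {
                int match = (strcmp(ref[i - 1], hyp[j - 1]) == 0);
                int diag = cost[i - 1][j - 1] + (match ? 0 : SUB_PEN);
                int del  = cost[i - 1][j] + DEL_PEN;   /* reference word missed */
                int ins  = cost[i][j - 1] + INS_PEN;   /* extra hypothesis word */
                int best = diag;
                if (del < best) best = del;
                if (ins < best) best = ins;
                cost[i][j] = best;
            }
        }

        /* Trace back to count correct words and error types.  Ties are
           broken arbitrarily, as noted in the text; the counts and the
           total score are unaffected by that choice. */
        {
            int ncor = 0, nsub = 0, nins = 0, ndel = 0;
            i = nr; j = nh;
            while (i > 0 || j > 0) {
                if (i > 0 && j > 0 &&
                    cost[i][j] == cost[i - 1][j - 1] +
                        (strcmp(ref[i - 1], hyp[j - 1]) == 0 ? 0 : SUB_PEN)) {
                    if (strcmp(ref[i - 1], hyp[j - 1]) == 0) ncor++; else nsub++;
                    i--; j--;
                } else if (i > 0 && cost[i][j] == cost[i - 1][j] + DEL_PEN) {
                    ndel++; i--;
                } else {
                    nins++; j--;
                }
            }
            printf("correct %d  sub %d  del %d  ins %d  (total cost %d)\n",
                   ncor, nsub, ndel, nins, cost[nr][nh]);
        }
        return 0;
    }

For this example the sketch reports 3 correct words, 1 substitution, 1 deletion and 1 insertion; an equal-cost alignment exists that swaps which words are labelled the substitution and the insertion, which is exactly the kind of tie mentioned in the quotation above.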
Subsequent to the October '87 Meeting, Hunt has brought attention to certain inherent deficiencies in the use of dynamic programming string alignment for scoring [11, 12]. The imprecision or bias in dynamic programming string alignment arises because the string alignment process inevitably finds random sequences that can be lined up with correct sequences. Specifically, it has been noted that in the presence of a high rate of insertion errors, the dynamic programming algorithm will tend to seriously underestimate the true rate of insertion errors, and in general the error rates may be seriously underestimated.

Hunt has implemented an analysis to estimate the biases introduced by dynamic programming string alignment for the case of most direct relevance to these Benchmark Tests: string lengths (sentences) of 8 words and a vocabulary size of 1000 words. In Hunt's simulation of the DARPA case, using identical string alignment penalties, two cases of interest to the DARPA Benchmark Tests were considered: (1) 96% correctly recognized, with a substitution error rate of 2.5%, an insertion error rate of 2.5% and a deletion error rate of 1.5% (which is taken as representative of the best results obtained in the October '87 Benchmark Tests), and (2) 20% correctly recognized, with a substitution error rate of 78%, an insertion error rate of 75%, and a deletion error rate of 1.5%. This second case is taken as representative of results obtained without use of an imposed grammar, and is in some sense a worst case of the October tests.

For the first case, using 4000 strings, Hunt found that the dynamic programming based estimate of the proportion correct agreed perfectly with the true value. A DP estimate of insertion errors of 2.46% corresponded with a true value of 2.54%, and a DP deletion error estimate of 1.58% corresponded with a true value of 1.66%. These results confirm the generally valid observation that the proportions of deletion and insertion errors tend to be underestimated, even though the estimated proportion correct may be relatively more accurate. However, for high performance systems, and with short strings and large vocabularies, the bias is less severe than otherwise.

For the second (worst) case, however, there "is a different story. The difference between DP estimates of the insertion rate and the deletion rate is determined by the average difference between the true string lengths and the lengths of the recognizer output strings. If the recognizer output is mostly incorrect, and if an insertion plus a deletion costs more than a substitution [as is the case in the DARPA scoring software], the DP scoring will interpret as many errors as possible as substitutions. In the case of output strings that are longer than their correct counterparts, the estimated deletion error rate will be close to zero... The deletion and insertion error rates must be grossly underestimated" [12]. For the case of 400 strings with a true string length of 8 and a vocabulary size of 1000, with a true proportion correct of 19.0%, a deletion error rate of 26.9% and an insertion error rate of 100%, Hunt obtained DP estimates of 19.4% correct, a deletion error rate of 0.16% and an insertion error rate of 73.2%. Not only is the already high insertion error rate underestimated (73.2% vs. 100%), but the deletion error rate is underestimated by two orders of magnitude (0.16% vs. 26.9%). Note, however, that the proportion correct remains relatively accurate (19.4% vs. 19.0%). Hunt suggests that the proportion of words correctly recognized is a more reliable performance measure than the total number of errors expressed as a proportion of the total number of words.

Error Taxonomy and Statistics
----- -------- --- ----------

The standard error taxonomy resulting from use of the software includes data on the percentage of words (in the reference strings) that are correctly recognized, the percentage of substitutions, the percentage of deletions, the percentage of insertions, and the total percent error (where this total includes substitutions, deletions and insertions). In BBN's error taxonomy, "word accuracy" is taken to be [100 - (percent error)]. Note, however, that word accuracy is not in general equal to the percentage of correctly recognized words, because insertions add to the total error but do not reduce the number of reference words that are recognized correctly.
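For concreteness, the following sketch (using invented counts) shows how these summary statistics relate to one another. All percentages are computed relative to the number of words in the reference strings, which is why "word accuracy" can differ from the percentage of words correctly recognized whenever insertions occur.

    /*
     * Sketch of the summary statistics described above, with made-up
     * counts.  The reference words partition into correct, substituted
     * and deleted words (940 + 40 + 20 = 1000); insertions are counted
     * separately and contribute only to the total error.
     */
    #include <stdio.h>

    int main(void)
    {
        double nref = 1000.0;        /* words in the reference strings */
        double ncor = 940.0;         /* correctly recognized words     */
        double nsub = 40.0;          /* substitutions                  */
        double ndel = 20.0;          /* deletions                      */
        double nins = 25.0;          /* insertions                     */

        double pct_cor  = 100.0 * ncor / nref;                  /* 94.0 */
        double pct_err  = 100.0 * (nsub + ndel + nins) / nref;  /*  8.5 */
        double word_acc = 100.0 - pct_err;                      /* 91.5 */

        printf("correct %.1f%%  sub %.1f%%  del %.1f%%  ins %.1f%%\n",
               pct_cor, 100.0 * nsub / nref, 100.0 * ndel / nref,
               100.0 * nins / nref);
        printf("total error %.1f%%  word accuracy %.1f%%\n",
               pct_err, word_acc);
        return 0;
    }

With these invented counts, 94.0% of the reference words are correctly recognized while the word accuracy is only 91.5%, illustrating the distinction drawn above.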
Splits and Merges: Contractions
------ --- ------- ------------

Recent discussions of alternative error taxonomies that have appeared in the literature [6] include discussion of the occurrence of errors called "splits" and "merges". In our error taxonomy, a split is decomposed into consecutive substitution and insertion errors (or an insertion and a substitution). An example of such an error would be a contraction such as "ISN'T" being reported as "IS" and "NOT". Similarly, a "merge" would be decomposed into consecutive substitution and deletion errors (or a deletion followed by a substitution). Correspondingly, the sequence "IS NOT" might be reported as "ISN'T". In these DARPA Benchmark Tests, there is no explicit consideration of splits or merges, nor is there any special provision for reporting these errors, although the scoring software does permit analysis of the occurrence of these errors.

Special Classes of Insertions: Pre- and Post-Shadowing
------- ------- -- ----------- ---- --- --------------

Other references in the recent literature [6,7] cite the occurrence of "pre-shadowing" or "post-shadowing". Pre-shadowing might occur when an initial fragment of a poly-syllabic word such as "MAXIMUM" is reported as "MAX" and the system then (correctly) reports the full word, in this case "MAXIMUM". The NBS software does not detect the occurrence of these special cases of insertions, nor does our proposed error taxonomy include them.

Homophone Errors
--------- ------

A number of scoring software packages include special provision for scoring substitution errors involving homophones. This may be particularly appropriate for those cases in which there is no imposed grammar and the probability of homophone errors may be high. The NBS scoring software can refer to a table of homophones when classifying substitution errors, to determine whether the words involved are homophones. In previous DARPA Benchmark Tests, the option to score homophone substitution errors as correct was not implemented. There seems to be a growing consensus that this is unreasonable for the "no grammar" case, since the acoustic-phonetic evidence alone is not sufficient to disambiguate homophones. For the case of imposed grammars, however, since in general there is [or should be] additional information available to disambiguate homophones [e.g., probabilistic, syntactic or semantic information], it continues to seem reasonable to score substitution errors involving homophones as true errors. Thus for the current tests, homophone substitution errors are to be scored as correct for the "no grammar" case, but are not to be scored as correct for the case of imposed grammars. [EDITORIAL NOTE: This represents a change in scoring procedure from that used in the March and October '87 tests.] A revised table of acceptable homophones was included in the scoring software distributed prior to this meeting. The impact of this change of policy should be to slightly improve indicated performance in the "no grammar" case.

Synonyms
--------

Analysis of the March '87 results identified some errors involving substitutions of synonyms (e.g. "MAX" for "MAXIMUM"). It can be argued that these errors are semantically acceptable, and the NBS software can be set to classify substitution errors involving synonyms by reference to a table of acceptable synonyms. This option is not ordinarily used, but it is provided as a diagnostic tool.

Deletions of "THE"
--------- -- -----

"Deletions of the token "THE" are typically a large proportion of the errors observed in high performance systems and the majority of these deletions leave the semantic intention of the utterance intact" [5]. Thus it has been argued that it would be valuable to compute the proportion of errors of this type and, correspondingly, to score sentences whose only errors are deletions of the word "THE" as "semantically OK". The current NBS scoring software can account for this class of error.
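The following sketch illustrates, with an invented homophone table, the two optional classifications just described: treating a homophone substitution as correct (an option intended only for the "no grammar" case) and flagging a sentence as "semantically OK" when its only errors are deletions of "THE". It is an illustration of the logic only, not an excerpt from the NBS scoring software, and the homophone pairs shown are examples.

    /*
     * Illustrative sketch (not the NBS scoring software itself) of two
     * optional classifications: scoring a substitution as correct when
     * the reference and hypothesis words are listed as homophones, and
     * flagging a sentence as "semantically OK" when its only errors are
     * deletions of the word "THE".  The homophone pairs are examples.
     */
    #include <stdio.h>
    #include <string.h>

    static const char *homophones[][2] = {
        { "TWO", "TOO" }, { "FOUR", "FOR" }, { "THEIR", "THERE" }
    };

    static int is_homophone(const char *a, const char *b)
    {
        size_t k;
        for (k = 0; k < sizeof homophones / sizeof homophones[0]; k++) {
            if ((strcmp(a, homophones[k][0]) == 0 &&
                 strcmp(b, homophones[k][1]) == 0) ||
                (strcmp(a, homophones[k][1]) == 0 &&
                 strcmp(b, homophones[k][0]) == 0))
                return 1;
        }
        return 0;
    }

    int main(void)
    {
        /* Sentence-level counts as an aligner might report them: one
           deletion whose reference word was "THE", no other errors. */
        int ndel_the = 1, nsub = 0, nins = 0, ndel_other = 0;
        int semantically_ok;

        /* A substitution treated as correct only in the "no grammar" case. */
        printf("TWO/TOO homophones? %s\n",
               is_homophone("TWO", "TOO") ? "yes" : "no");

        semantically_ok = (nsub == 0 && nins == 0 && ndel_other == 0 &&
                           ndel_the > 0);
        printf("semantically OK (only \"THE\" deleted)? %s\n",
               semantically_ok ? "yes" : "no");
        return 0;
    }

Both classifications are reported separately from, and in addition to, the basic error counts; they do not alter the standard statistics.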
Time Criterion for Word Beginnings
---- --------- --- ---- ----------

It has been argued that the most appropriate criterion for scoring the output of a speech recognition system would include some measure of the accuracy with which the system reports word boundaries. One proposal to this effect suggests that the test material be (manually) marked with word boundary information (word beginning and end times) and that the scoring algorithm refer to this information, as well as to the identity of the word, in classifying a response as correct. Hunt's procedure for evaluating the performance of connected-word speech recognition systems makes use of end-point information, and defines a procedure for associating the boundaries found by the recognizer with the "actual start and end points of the words in the test data" [11]. In the DARPA Benchmark Tests, it was judged that reliance on manually labelled word boundary information is an unattractive aspect of such a scoring procedure, and that the costs associated with marking the speech corpus with word boundary information exceeded the benefits of increased precision and reduced bias, particularly for high performance systems.

Processing Times
---------- -----

In the Benchmark Tests, the processing times and system configuration are to be reported.

Sentence Level Scoring
-------- ----- -------

In the NBS scoring software, a sentence is scored as correctly recognized only if all words have been correctly recognized and there are no insertion or deletion errors. Supplemental analyses (such as the fraction of sentences for which the only errors involve deletions of the word "THE") are permitted, but only to supplement the basic data.

Characterizing the Imposed Grammar
-------------- --- ------- -------

At present, no general agreement has been reached on a completely unambiguous procedure for characterizing the imposed grammars in all systems. A proposal for characterizing the complexity of a language model in terms of the "test-set perplexity" has been circulated [8], but it appears that different descriptions of imposed grammars may be used by different sites for the present tests. This matter needs to be actively addressed.
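Although the precise formulation circulated in [8] is not reproduced here, test-set perplexity is generally computed as the geometric mean of the inverse conditional word probabilities assigned by the language model over a test set of N words:

    PP  =  P(w_1, ..., w_N)^(-1/N)
        =  2^{ -(1/N) \sum_{i=1}^{N} \log_2 P(w_i | w_1, ..., w_{i-1}) }

Under this definition, a grammar that allows roughly 60 equally probable word choices at each point in a sentence has a test-set perplexity of about 60, which is consistent with the figure quoted elsewhere in this paper for the BBN word-pair grammar.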
"Standard Scoring Procedure"
--------- ------- ----------

The standard scoring software distributed by NBS with the specified test files is to be used. The reference strings to be used for scoring are to be the SNOR representations [rm1/doc/al_sents.snr] developed from the lexical/output convention described in [3]. The lexical convention for system output for the present tests is thus that of the 991-word SNOR lexicon [rm1/doc/lexicon.snr].

The option to score homophone substitution errors as correct IS TO BE USED for the case of "no grammar", but IS NOT TO BE USED in the case of imposed grammars. The option to score errors involving synonyms as correct IS NOT to be used. The option to analyze splits and merges and to identify errors involving contractions IS NOT to be used.

At least TWO system configurations ARE to be run: one with no imposed grammar, and a second with the word-pair grammar [rm1/wp_gram.txt] developed at BBN from the Resource Management material. The test set perplexity of this grammar has been estimated at approximately 60, for a sufficiently large and random sampling of material. Other grammars, with different perplexities, are acceptable, but the perplexity should be stated for each.

Summary statistics on the percentages of words correctly recognized, substituted, deleted, and inserted (based on the number of words in the reference strings) ARE TO BE reported. The total percent error is to include substitutions, deletions, and insertions. These statistics are to be reported for each speaker under each test condition, as well as for the larger test subsets to which the test speakers belong. As mentioned previously, data are to be reported for tests conducted without use of an imposed grammar as well as with imposed grammar(s). System output data are to be made available to NBS for additional analysis.

ACKNOWLEDGEMENTS

The development of the NBS scoring software described in this paper involved contributions from a number of individuals at several organizations: Doug Paul at Lincoln Labs, Francis Kubala at BBN, George Doddington and Bill Fisher at TI, and Rich Stern at CMU. Alex Rudnicky at CMU contributed significantly in clarifying the need for consistent lexical representations. Bill Fisher and Patti Price (at BBN) collaborated with us in defining the SNOR lexicon and the files for the reference sentence strings to be used in scoring. Each of the proposed scoring packages had demonstrable merits: we sought to combine the best features of each. Discussions with Melvyn Hunt at the NRC in Ottawa have been very helpful in identifying limitations on the use of DP-based scoring algorithms, particularly for low-performance systems.

Selection of the test material used in the June '88 tests was done at NBS from material previously defined and recorded with the collaboration of Patti Price at BBN, Jared Bernstein at SRI, and Bill Fisher at TI. They bear no responsibility, however, for imperfections in the selection of this round of test material (e.g. unanticipated within-session effects): that responsibility is the author's.

At NBS, credit for coding the original scoring software goes to Stan Janet. Revisions and improvements were implemented by Mike Garris. John Garofolo, Stan and Mike deserve significant credit for producing copies of the speech corpus tapes and for distributing the scoring software and Benchmark Test material.

References

[1] W.M. Fisher, "The DARPA Task Domain Speech Recognition Database", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[2] D.S. Pallett, "Selected Test Material for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[3] D.S. Pallett, "Test Procedures for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[4] Private Communication with Alex Rudnicky, September 21, 1987.

[5] Private Communication with Francis Kubala, et al., August 13, 1987.

[6] R.D. Rodman, M.G. Joost and T.S. Moody, "Performance Evaluation of Connected Speech Recognition Systems", Proceedings of Speech Tech '87, New York, NY, April 28-30, 1987, pp. 269-274.

[7] F. Dreizin, R. Kittredge and D. Korelsky, "Semantic Support in Speech Recognition: An Application to Fire Control Dialogues", Proceedings of AVIOS '86 Voice I/O Systems Applications Conference, Alexandria, VA, September 16-18, 1986, pp. 339-354.

[8] S. Roucos, "Measuring Perplexity of Language Models Used in Speech Recognizers", unpublished manuscript circulated within the DARPA research community, September 1987.

[9] Private Communication with Doug Paul, April 1988.

[10] D.S. Pallett, "Selected Test Material, Test and Scoring Procedures for the October 1987 DARPA Benchmark Tests", unpublished manuscript distributed at the October 1987 DARPA Meeting.

[11] M.J. Hunt, "Evaluating the Performance of Connected-Word Speech Recognition Systems", Proceedings IEEE Int. Conf. Acoustics, Speech & Sig. Proc., ICASSP-88, New York, April 1988.
[12] Private Communication with Melvyn Hunt, March 22, 1988.