Editorial note: the following consists of a revised version of material prepared to document DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA Speech Recognition Workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note is an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories on this CD-ROM in which the material may be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are included within square brackets.

****************************************

TEST PROCEDURES FOR THE MARCH 1987 DARPA BENCHMARK TESTS

David S. Pallett
Institute for Computer Sciences and Technology
National Bureau of Standards
Gaithersburg, MD 20899

ABSTRACT

This paper describes test procedures that were to be used in conducting benchmark performance tests prior to the March 1987 DARPA meeting. These tests were to be conducted using selected speech database material and input from "live talkers", as described in a companion paper.

INTRODUCTION

At the Fall 1986 DARPA Speech Recognition meeting, plans were discussed for implementing benchmark tests using the Task Domain Speech Corpus. There was additional discussion of the desirability of developing and implementing "live tests" using speech material provided by speakers at the contractors' facilities, emulating in some sense the process of inputting speech material during a demonstration of real-time performance. Following the Fall meeting, the Task Domain Speech Database [Resource Management Corpus] was recorded at TI, and significant portions of it were made available through NBS to both CMU and BBN for system development and training purposes. Another portion was selected for use in implementing these benchmark tests [1], and this test material was distributed to CMU and BBN during the last week of February 1987. This paper outlines the test procedures to be used to implement these tests prior to the March 1987 meeting.

A number of informal documents have circulated within the DARPA Speech Recognition community that outline proposed test procedures. A Strategic Computing draft document dated Dec. 6, 1985 [2] identified key issues in some detail. Portions of this document were heavily annotated and distributed to several sites during June 1986, and were the subject of discussions involving the author and representatives of CMU, BBN, Dragon Systems, MIT and TI in visits during June and early July 1986. These discussions were valuable in developing an outline of benchmark test procedures [3] that was discussed at the Fall 1986 DARPA meeting, and which was structured after a model for performance assessment tests outlined in an earlier NBS publication [4]. The present proposed test procedure thus represents the most recent, and most specifically focused, of a series of documents outlining test procedures for the DARPA Speech Recognition Program.
EXPERIMENTAL DESIGN

There were to be two distinct types of tests conducted prior to the March 1987 DARPA meeting:

(1) Tests based on use of a subset of the Task Domain (Resource Management) Development Test Set Speech Database. This subset was to include 100 sentence utterances from either the Speaker Independent or the Speaker Dependent portion of the database. The process of selecting speakers and specific utterances is described in Reference [1]. In each case, there was considerable freedom to choose system-dependent factors such as the amount of training material for Speaker Dependent technology and the most appropriate grammar. All of the 100 specified test sentences were to be processed and reported on at the meeting. "Spell-mode" material (spelled-out representations of the letter strings for items in the lexicon) was available for use, but processing this material was not required. These sentence utterances were to be processed both with and without the use of an imposed grammar. In the case of using no grammar, the perplexity is nominally 1000 [an editorial note on this figure follows the live test protocol below]. Comparably detailed results were to be reported for both conditions, and no other parameters were to be changed for these comparative tests. Optionally, the same data could be processed using the "rapid adaptation" sentences for system adaptation. There was to be no use of adaptation during processing of the test material.

(2) Tests based on input provided by "live talkers". The test talkers visited both CMU and BBN prior to the March meeting. Each of the talkers spoke the "rapid adaptation" sentences and read a script containing 30 sentences drawn from the task domain sentence corpus. Data derived from the input from "live talkers" was to be analyzed and reported on at the March meeting.

LIVE TEST PROTOCOL

The microphone was to be the same as that used at TI for the Resource Management Corpus, the Sennheiser HMD 414-6. This is a headset-mounted noise-cancelling microphone similar to the Shure SM-10 family of microphones. The headset is a supra-aural headset that allows the subject to be aware of nearby conversation or instructions for prompting.

The test environment was to be a conference room or computer lab, with no background speech at the time the test material was provided. Test utterances could be rejected (and the subject asked to repeat the sentence) if, in the judgment of the person(s) administering the tests, there was some noise artifact (e.g. coughs or paper-shuffling noises) or severe mis-articulation of the test sentence. Evidence of this could be obtained by play-back of the digitized utterance.

For systems that require time to develop speaker-adaptive models, the subjects were to provide the 10 "rapid adaptation" sentences prior to the tests (e.g. the evening prior to the tests).

For one of the speakers, the 30 test sentences were to be read in, and processing (automatic recognition) could take place "off-line". For the other two speakers, the test sentences were to be read in one at a time, waiting for the system to recognize each sentence before proceeding to the next sentence. At the end of 30 minutes, if all 30 sentences had not been read in and recognized, the remaining sentences were to be read in for "off-line" processing. In practice, only three to five sentences were recognized interactively within the 30-minute period, and the remaining sentences were then read in. The elapsed time for each speaker providing the test material in this manner was typically 45 minutes.

If requested, each speaker was also to read in 10 words randomly chosen from the "spell-mode" vocabulary subset.
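[Editorial note, added for clarification: the statement in the experimental design above that the perplexity of the no-grammar condition is "nominally 1000" may be read as follows. For a language model that assigns probability P(w_i | h_i) to the i-th of N test words given its history h_i, the test-set perplexity is

    PP = 2^{ -\frac{1}{N} \sum_{i=1}^{N} \log_2 P(w_i \mid h_i) }

With no grammar, every word of the roughly 1000-word task vocabulary is treated as equally likely at every point, so that P(w_i | h_i) = 1/1000 for each word and PP = 1000; an imposed grammar lowers this figure.]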
PROCESSING OF LIVE INPUT

The systems were to process the test material in a manner similar to that used for the Resource Management database test material. Statistics comparable to those for the 100-sentence subsets were to be prepared and reported on at the March meeting.

ADAPTATION

Although the use of the "rapid adaptation" sentences was to be permitted, it appears that the only use made of the rapid adaptation sentences was in adapting the Speaker Dependent system at BBN for the "live test" speakers. There was to be no use of any of the test material to enroll, adapt, or otherwise optimize system performance through repeated analyses and re-use of the test material. Intended allowable exceptions to this prohibition against re-use of the test material include demonstrating the effects of using different grammars, different strategies for enrollment, different algorithms for auditory modelling or acoustic-phonetic feature extraction, different HMM techniques, system architectures, etc. It is recognized that the breadth of these exceptions in effect limits the future use of this test material, since such extensive use of test material to demonstrate parametric effects constitutes training on test material. Since a finite set of task domain sentences was developed at BBN, and the entire corpus of task domain [Resource Management] sentences was made available to both CMU and BBN, in some cases the grammars used for these tests have been adapted to this finite set of sentences, including the test material.

VOCABULARY/LEXICON/OUTPUT CONVENTIONS

The task domain sentences in effect define the vocabulary. Internal representations (lexicon entries) may be at the system designer's choice, but for the purposes of implementing uniform scoring procedures, a convention was defined, drawing on material provided by CMU [5], BBN and TI. This convention includes the following considerations:

Case differences are not preserved. All input (reference) strings and output strings are in upper case.

There is no end-of-sentence punctuation. Nor is there any required special symbol to denote silences (either prepended, within the sentence utterance, or appended) or to indicate failure of a system to parse the reference string or input speech.

Apostrophes are represented by plusses. Words with apostrophes (embedded or appended) are represented as single words. Thus "it's" becomes "IT+S".

Abbreviations become single words. All periods indicating abbreviations are removed and the word is closed up (e.g. "U. S. A." becomes "USA").

Hyphenated items count as single words. In general, compound words that do not normally appear as separate words in the context of the assumed task domain model are entered as single, hyphenated items. The exception to this rule is compounds that include a geographic term, such as STRAIT, SEA or GULF. Thus entries such as the following count as single "words": HONG-KONG, SAN-DIEGO, ICE-NINE, PAC-ALERT, LAT-LON, PUGET-1, M-RATING, C-CODE, SQQ-23, etc. However, BERING STRAIT is to count as two words, since this compound includes the geographic term "STRAIT", and it is not to be hyphenated.

Acronyms count as single words, and the output representation is not a form of the acronym made easier to interpret or pronounce (e.g. "PACFLT", not "PAC-FLEET" or "PAC FLEET").

Mixed strings of alphanumerics are treated as acronyms. Thus "A42128" is treated as a one-word acronym, even though the prompt form indicates that it is to be pronounced as "A-4-2-1-2-8". Strings of the alpha set are also treated as acronyms (e.g. "USA").

Strings of digits are entered in a manner that takes into account the context in which they appear. Thus a date such as 1987 is represented as three words: "NINETEEN" "EIGHTY" and "SEVEN". If it is referred to as a cardinal number, it is represented as "ONE" "THOUSAND" "NINE" "HUNDRED" "EIGHTY" "SEVEN".
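[Editorial note, added for illustration: the following is a minimal sketch, in Python, of the mechanical parts of the output convention described above. The function name and the example sentence are the editor's own and are hypothetical; they are not part of the original convention or of any contractor's software. Hyphenation of compounds, acronym formation and digit-string expansion require task-specific lexical knowledge and are not attempted here.]

    import re

    def normalize_string(text):
        """Map a reference or recognizer output string toward the scoring
        convention: upper case, apostrophes as plusses, no sentence
        punctuation, spelled-out abbreviations closed up."""
        text = text.upper()                      # case differences are not preserved
        text = text.replace("'", "+")            # apostrophes are represented by plusses
        text = re.sub(r"[.?!]", "", text)        # strip end-of-sentence punctuation and
                                                 # the periods marking abbreviations
        # close up sequences of single letters, e.g. "U S A" -> "USA"
        text = re.sub(r"\b((?:[A-Z] )+[A-Z])\b",
                      lambda m: m.group(1).replace(" ", ""), text)
        return text.split()                      # hyphenated compounds remain single tokens

    # Example (hypothetical sentence):
    #   normalize_string("It's in U. S. A. waters near Hong-Kong.")
    #   -> ['IT+S', 'IN', 'USA', 'WATERS', 'NEAR', 'HONG-KONG']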
Thus, "A42128" is treated as a one-word acronym, even though the prompt form of this indicates that this is to be pronounced as "A-4-2-1-2-8". Strings of the alpha set are also treated as acronyms (e.g. "USA"). Strings of digits are entered in a manner that takes into account the context in which they appear. Thus for a date such as 1987, it is represented as three words: "NINETEEN" "EIGHTY" and "SEVEN". If it is referred to as a cardinal number it would be represented as "ONE" "THOUSAND" "NINE" "HUNDRED" "EIGHTY" "SEVEN". SCORING THE TEST MATERIAL For results to be reported at the March 1987 meeting, the use of different scoring software [at each contractor's site] was acceptable. Each contractor was free to use software consistent with the following general requirements: Data are to be reported at two levels: sentence level and word level. At the sentence level, a sentence is to be reported as correctly recognized only if all words are correctly recognized and there are no deletion or insertion errors (Other than insertions of a word or symbol for silence or a pause). The percent of sentences correctly recognized is to be reported, along with the percent of sentences that contain (at least one) insertion error(s), the percent of sentences that contain (at least one) deletion error(s) and the percent of sentences that contain (at least one) substitution error(s). The number to be used for the denominator in computing these percentages is the number of input sentences in the relevant test subset, without allowing for rejection of sentences or utterances that may not parse or for which poor scores result. At the word level, data that were to be reported included the percent of words in the reference string that have been correctly recognized. For these tests, "correct recognition" does not require that any criterion be satisfied with regard to word beginning or ending times. It was valuable, but not required, to report the percent of insertion, deletion, and substitution errors occurring in the system output. For those systems that provide sentence or word lattice output, scoring was to be based on the top-ranked sentence hypothesis. Additional scoring based on lower-ranked alternative hypotheses was acceptable, provided the data were compared with comparable data for the top-ranked hypothesis. System response timing statistics was to be reported. Data resulting from these tests was provided to NBS following the March [1987] meeting for detailed analysis and in evaluating alternative scoring software. DOCUMENTATION Documentation on the characteristics of the imposed grammar(s) must be provided. This information should describe any use of the material from which the test material was drawn (i.e. the set of 2200 task domain sentences developed at BBN and used by TI in recording the Resource Management Speech Database). The system architecture and hardware configuration used for these tests should be documented. REFERENCES: [1] D.S. Pallett, "Selected Test Material for the March 1987 DARPA Benchmark Tests", Proceedings of the March 1987 DARPA Speech Recognition Workshop. [2] (Anonymous) "Integration, Transition and Performance Evaluation of Generic Artificial Intelligence Technology", Strategic Computing Program draft document dated Dec. 6, 1985 (For Official Use Only). [3] D.S. Pallett, "Benchmark Test Procedures for Continuous Speech Recognition Systems", draft document dated August 29, 1986 distributed prior to the Fall 1986 DARPA meeting. [4] D.S. 
Pallett, "Performance Assessment of Automatic Speech Recognizers", Journal of Research of the National Bureau of Standards, Volume 90, Number 5, September-October 1985, pp. 371- 387. [5] A.I. Rudnicky, "Rules for Creating Lexicon Entries", note dated 11 February, 1987. **************************************************************************** SELECTED TEST MATERIAL FOR THE MARCH 1987 DARPA BENCHMARK TESTS David S. Pallett Institute for Computer Sciences and Technology National Bureau of Standards Gaithersburg, MD 20899 ABSTRACT This paper describes considerations in selecting test material for the March '87 DARPA Benchmark Tests. Using a subset of material available from the Task domain (Resource Management) Development Test Set, two sets of 100 sentence utterances were identified. For Speaker Independent technology, 10 speakers each provide 10 test sentences. For Speaker Dependent technology, 4 speakers each provide 25 test sentences. For "live talker" test purposes, three 30-sentence scripts were identified, using a total of 70 unique sentence texts. The texts of all of these test sentences were drawn from a set of 2200 sentences developed by BBN in modelling the (resource management) task domain. INTRODUCTION In order to implement benchmark tests of speech recognition systems to be reported at the March '87 DARPA Speech Recognition Meeting, it was necessary to specify selected test material. This test material is drawn from two sources: (a) the Task Domain speech corpus recorded at Texas Instruments (also referred to as the "Resource Management Corpus"), and (b) the use of "live talkers" in site visits. In each case, the texts of the sentences were drawn from a set of sentences developed by BBN. Selection of test material using the Resource Management Corpus includes two separate components, a Speaker Independent component and a Speaker Dependent component. This paper outlines the process of defining these subsets of speech material. At the time the Resource Management Speech Corpus was designed, it was intended that approximately equal volumes of material would be available for system development (research) purposes and for two rounds of benchmark tests. Consequently, approximately half of the available material is designated as "training" material, and the remaining portion is designated for test purposes. The test material is designated as "Development Test" or "Evaluation Test" sets, each including 1200 test sentence utterances in each portion (Speaker Independent or Speaker Dependent). The design and collection of this Task Domain (Resource Management Speech Corpus is described elsewhere in a paper by Fisher [1]. Thus, as originally intended, two sets of 1200 sentence utterances [i.e., one set of 1200 sentence utterances for Speaker Dependent technology, and another set of 1200 sentence utterances for Speaker Independent technology] were to be available for the March '87 tests. [However,] during January 1987, discussions involving representatives of CMU, BBN, MIT, NBS and the DARPA Program Manager determined that use of this large a volume of test material was not necessary to establish performance of current technology when pragmatic considerations of processing times and expected performance levels were made. Consequently, it was agreed that subsets of 100 sentence utterances were to be defined for these tests, and that NBS would specify the appropriate subset. 
To complement the use of the recorded speech database material, a test protocol was defined for the use of "live talkers", emulating in some sense procedures to be used in future demonstrations of these systems, and texts were selected for this purpose.

RESOURCE MANAGEMENT SPEECH DATABASE TEST MATERIAL

Speaker Independent Test Material
------- ----------- ---- --------

For the March '87 tests, a set of ten speakers was identified, drawn from material recorded at TI and made available to NBS in December '86 and January '87. Each speaker provided two "dialect" sentences [i.e., sentence utterance files ending in sa01.sph or sa02.sph] and the ten "rapid adaptation" sentences [i.e., sentence utterance files sb01.sph through sb10.sph], in addition to a total of thirty test sentence utterances. For each speaker, a unique subset of ten sentence utterances was specified for use in the March '87 tests, amounting to 100 sentence utterances in all (10 speakers times 10 sentence utterances per speaker). Seven male and three female test speakers were selected, reflecting the male/female balance throughout the Resource Management Speech Database.

To aid in the selection of individual speakers, a preliminary set of approximately 16 speakers was identified. SRI was asked for advice on whether any of these would be regarded as anomalous on the basis of the "dialect" sentences obtained in the acoustic-phonetic database. SRI performed a clustering analysis and advised us that most of the speakers clustered in three groups of similar speakers, with three other individuals categorized as exceptional in some sense (e.g. an unusually slow rate of speech) [2]. The ten speakers identified for inclusion in the test subset include one of these "exceptional" speakers, the others being drawn from the three clusters to provide some degree of coverage of regional effects. Table 1 provides detailed information on the regional background, race, year of birth and educational level of the ten selected speakers in the March '87 Test Subset.

Subject   Sex      Region          Race   Year of Birth   Education
-------   ---      ------          ----   -------------   ---------
DAB       MALE     NEW ENGLAND     WHT    '62             B.S.
GWT       MALE     NORTHERN        WHT    '21             B.S.
DLG       MALE     NORTH MIDLAND   WHT    '42             (?)
CTT       MALE     SOUTHERN        WHT    '62             B.S.
JFC       MALE     NEW YORK CITY   WHT    '59             B.S.
BTH       MALE     WESTERN         WHT    '62             B.S.
AWF       FEMALE   SOUTHERN        WHT    '58             B.S.
BCG       FEMALE   "ARMY BRAT"     (?)    '59             B.S.
SAH       FEMALE   NEW ENGLAND     WHT    '46             B.S.
JFR       MALE     WESTERN         WHT    '39             M.S.

Table 1. Speaker Independent Test Subset

Analysis, by TI, of the lexical coverage provided by this subset of the test material indicates that 348 words occur at least once in this test material, and the total number of words is 836, for a mean sentence length of 8.36 words.

Speaker Dependent Test Material
------- --------- ---- --------

For these tests, a set of four speakers was identified, also drawn from material recorded at TI and made available to NBS during December '86 and January '87. In this case, selection of the specific individuals was strongly influenced by the availability of training material. BBN expressed concern that the entire set of 600 sentence utterances intended for system training should be available for any test speakers. At the time of selection of the test material, not all of the 12 speakers for this portion of the database had completed recording their training material. Four speakers were identified with this constraint in mind.
Each speaker had previously recorded the ten "rapid adaptation" and "dialect" sentences, and the Development Test material included 100 sentence utterances for each speaker. From this, a unique set of 25 sentence utterances was identified for each of the four speakers, amounting to 100 sentence utterances in all for this portion of the test material. Three of the speakers were male and one was female. Table 2 provides additional data on these speakers.

Subject   Sex      Region          Race   Year of Birth   Education
-------   ---      ------          ----   -------------   ---------
CMR       FEMALE   NORTHERN        WHT    '51             M.S.
BEF       MALE     NORTH MIDLAND   WHT    '52             Ph.D.
JWS       MALE     SOUTH MIDLAND   WHT    '40             B.S.
RKM       MALE     SOUTHERN        BLK    '56             B.S.

Table 2. Speaker Dependent Test Subset

Analysis, by TI, of the lexical coverage provided by this subset of the test material indicates that 832 words occur at least once, with a total number of words of 832, for a mean sentence length of 8.32 words. This is quite similar to that for the Speaker Independent material, although the details of the distributions differ slightly.

LIVE TALKER TEST MATERIAL

For the "live tests", it was necessary to select sentence texts to be read by the test speakers. It was thought desirable to use three speakers, each speaker reading a total of 30 sentence texts in addition to the 10 "rapid adaptation" sentences. Ten of the thirty sentence texts were to be the same for all speakers, so that of the 90 sentence utterances to be used for testing, there would be three productions of each of the ten shared sentences, plus 60 other sentences (20 for each of the three speakers). A total of 70 unique sentence texts was thus required.

The sentence texts were selected from a subset of 2200 Resource Management sentences. CMU representatives had indicated a preference for sentence texts that could be read in less than 6 seconds. Accordingly, the essentially random process of sentence text selection was perturbed slightly to exclude longer sentences.

Lexical analysis, by TI, of the scripts developed from these sentences indicates that the three scripts are well balanced in terms of mean sentence length and number of lexical entries. Each of the three scripts has a mean sentence length of 7.93 words (258 words/30 sentences), reflecting the intentional bias in the sentence selection process toward slightly shorter sentences. The number of lexical entries in the three scripts is 153, 155 and 161. The prompt form of each of these scripts was made available to the "live talkers" in site visits conducted in March '87.

Each of the test speakers was to use the Sennheiser HMD 414-6, the same microphone used at TI for the Resource Management Speech Database, and the test environment was to be a computer lab or conference room with no competing conversation. A portion of the test material was to be provided in an interactive manner (i.e. while waiting for system processing of the data) and the remainder was to be processed off line.

GRAMMATICAL COVERAGE

At the time that BBN developed the set of approximately 2800 sentence texts [e.g., in rm1/doc/al_sents.txt] modelling this task domain, no explicit or formally defined grammar was used. Rather, a set of prototypical sentences was identified to provide coverage of the task, and the subset of vocabulary occurring in these sentence "patterns" was then expanded to approximately 1000 words. There were a total of approximately 950 sentence patterns [3]. By incorporating the expanded vocabulary, the 2800 sentences were generated, including approximately three exemplars of each pattern. From these, 600 were designated for use as speaker-dependent training material, leaving a remaining subset of 2200 sentences. All of the test material was randomly selected from this subset of 2200 sentences. No analysis to determine the representation of the basic sentence patterns in the test material has been conducted to date.
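[Editorial note, added for illustration: the lexical coverage figures quoted above (for example, 348 lexical entries and 836 total words, for a mean sentence length of 8.36 words, in the Speaker Independent subset) are of the kind produced by the following minimal sketch, in Python. The function name is the editor's own, and this is not the analysis software used by TI.]

    def lexical_summary(sentences):
        """sentences: a list of sentence texts already in the scoring
        convention (upper case, one whitespace-separated token per
        lexical entry).  Returns simple coverage statistics."""
        tokens = [s.split() for s in sentences]
        total_words = sum(len(t) for t in tokens)
        return {
            "lexical entries": len({w for t in tokens for w in t}),  # distinct words
            "total words": total_words,
            "mean sentence length": total_words / len(sentences),
        }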
REFERENCES

[1] W.A. Fisher, "A Task Domain Database", Proceedings of the March 1987 DARPA Speech Recognition Workshop.

[2] J. Bernstein, private communication, January 1987.

[3] P. Price et al., oral presentation at the September 1986 DARPA Speech Recognition Workshop.