Editorial note: the following is a revised version of material prepared to document DARPA benchmark speech recognition tests. The two papers describing the first of these tests, in March 1987, are more formal than the others and were originally prepared for inclusion in the proceedings of the March 1987 DARPA speech recognition workshop. More informal notes were prepared for distribution within the DARPA speech research community to provide information on the tests conducted prior to the October 1987 and June 1988 meetings. The June 1988 note is an adaptation of the October 1987 note. Still more informal notes were prepared to outline the test procedures for the February 1989 and October 1989 benchmark tests. Editorial revisions primarily include some changes of tense, inclusion of references to directories on this CD-ROM in which the material may be found, and substitution of the term "corpus" or "corpora" for "database" or "databases". Other editorial changes include insertion of comments as appropriate for clarification. In most cases, these editorial revisions are enclosed within square brackets.

****************************************

[MEMO TO EXPECTED PARTICIPANTS IN]

October 1989 DARPA Resource Management Benchmark Speech Recognition Tests

[From: David S. Pallett, NIST, August 1989]

**********

What is the same as for previous tests?

Most of the details are the same. (See [FEBRUARY 1989 DARPA SPEECH RECOGNITION RESOURCE MANAGEMENT BENCHMARK TESTS ***AMENDED JANUARY 23, 1989***].) But...

What is different this time?

(1) New action dates (of course):
    Distribution of the test material: week of August 28.
    Required date for receipt of data at NIST: October 2, 1989.
(2) Protocols for testing/reporting results on speaker adaptive systems (see below).
(3) Change of filenaming convention for test material (see below).
(4) Changed format for reporting results to NIST for summary prior to the October meeting (see below).
(5) Optional use of the new "statistical class grammar" (see below).
(6) New test material (see below).

**********

[1] Action Dates:

NIST has been asked by Charles Wayne to provide a tabulation of results prior to the meeting, as we did for the February meeting. At that time, some sites did not send the data to us on time, nor in a convenient "standard" format that we could process efficiently through the scoring software. This time, we have asked for two weeks to process the data and prepare a summary, using some of the statistical tools we are developing (i.e., implementations of McNemar's test and the matched pairs test that Larry Gillick and Steve Cox advocate). Late submissions of data (at most a day or two late) are ONLY acceptable if the format convention described below is scrupulously adhered to.

[2] Protocols for Testing Speaker Adaptive/Short Training Systems:

Until recently, CMU (Kai-Fu Lee) had indicated that they (he) planned to report on performance of their speaker independent system using the 10 "rapid adaptation" sentences provided for each speaker. At present [August 1989] this appears unlikely. If other sites intend to report on "rapid adaptation" on this test set, please contact me (DSP) at NIST to describe the test protocol that you propose, and we can discuss this issue.

BBN has advised that they intend to report results on their speaker dependent system using a set of 40 sentences for "short training". Performance of systems fully trained on the set of 600 (570?) training sentences for each speaker is to be compared with that of systems trained only on the set of 40 sentences. If others intend to report on "short training", please contact me.

[3] Change of Test Filenaming Convention:

Some criticism of our benchmark tests has involved the fact that the identity of the filenames (and hence the reference transcriptions) and the speakers (for speaker independent systems) was known. One suggestion was to go to "blind filenames", revealing the text of the reference string and the identity of the speakers (for speaker independent systems) only after the tests had been run and the output submitted to NIST for "official scoring". Although we seriously entertained this idea, we have decided not to implement the suggestion at this time with this set of resource management data.

We have, however, chosen to modify the filenaming convention for this round of test material. The intent of the revision is to facilitate generation of test results in a standard format and processing of those results by NIST. The filename will contain the minimum amount of information required to uniquely identify the utterance and, as such, can be used as an unambiguous utterance identifier in the test result files. An original filename such as "tdde-pgh01-st0123-b.adc" would now be represented as "pgh0-st0123.adc".
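[Editorial comment: for sites generating the new names programmatically, the following shell fragment is a minimal sketch of the mapping. The variable names and the assumption that old-style names always carry four "-"-separated fields are illustrative, not part of the test specification. Note that the speaker field is truncated to four characters, consistent with the Appendix A scripts.]

# hypothetical example: derive a new-style name from an old-style name
old=tdde-pgh01-st0123-b.adc
new=`echo $old | awk -F- '{printf "%s-%s.adc\n", substr($2,1,4), $3}'`
echo $new          # prints "pgh0-st0123.adc"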
[4] Changed Format for Reporting Results to NIST:

(Note that these instructions override those in the January-February 1989 instructions.)

The format for Fall '89 test results submission is slightly different from the previous test format and should be easier to generate and score. The new format provides unambiguous utterance identification by adding a speaker identifier to the sentence identification field at the end of the hypothesis string (see below for an example). The new utterance identifier is intentionally identical to the utterance filename (excluding the ".adc" extension) and should be extracted from the filename when producing hypothesis strings.

The new format allows all of the speaker hypothesis strings to be stored in one submission file, easily split into separate test files, and then further parsed into "speaker" files for scoring. For your convenience, we have included "csh" and "sh" shell programs below [Appendix A] which will create multiple old-format speaker ".hyp" files from one input file containing hypothesis strings with the new utterance identifier. (Note: the input file must contain ONLY hypothesis strings from ONE test.) A future edition of the NIST scoring software will handle speaker parsing automatically.

Format for submission of Fall '89 recognition test results:

#
#
#
#
#-----
#1. [Test title 1][CR]
#2. [Test title 2][CR]
#.
#.
#.
#n. [Test title n][CR]
#-----
#1. [Test title 1][CR]
[hypothesis string] ([speaker id]-[sentence id])[CR]
[hypothesis string] ([speaker id]-[sentence id])[CR]
.
.
.
[hypothesis string] ([speaker id]-[sentence id])[CR]
#2. [Test title 2][CR]
.
.
.
etc.

***** Excerpt of an Example Submission *****

#The following data contains the results of two recognition tests.
#In the speaker-independent-no-grammar test please note that
# .....
#-----
#1. Speaker Independent/No Grammar
#2. Speaker Independent/Word-Pair Grammar
#-----
#1. Speaker Independent/No Grammar
IS JASON+S MAXIMUM SUSTAINED SPEED SLOWER THAN JUPITER+S (JDM2-ST0009)
TURN AREAS OFF AND REDRAW CURRENT AREA (JDM2-ST0029)
IS TRIPOLI IN THE HOOKED PORT (JDM2-ST0064)
.
.
.
HAS SWORDFISH REPORTED ANY TRAINING PROBLEMS (JDM2-ST0091)
WHAT SPEED IS EISENHOWER GOING (CMH1-ST0160)
HOW SOON CAN ESTEEM CHOP TO ATLANTIC FLEET (CMH1-ST0186)
.
.
.
DEFINE AREA ALERTS FOR GULF OF CALIFORNIA (CMH1-ST0203)
.
.
#2. Speaker Independent/Word-Pair Grammar
.
.
.
etc.
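[Editorial comment: since the Appendix A scripts require an input file containing hypothesis strings from only ONE test, a submission file in the format above must first be split at the "#n." section headers. The "sh"/awk sketch below is one way to do that and is offered for illustration only; the output filenames "test1.hyp", "test2.hyp", ... are an assumed convention, not part of the required procedure.]

#! /bin/sh
# split_tests.sh -- split a Fall '89 submission file ($1) into one file per
# test. Skips everything up through the second "#-----" separator (the
# leading comments and the list of test titles), then starts a new output
# file, test1.hyp, test2.hyp, ..., at each "#n." test header.
awk '
  /^#-----/           { seps++; next }
  seps < 2            { next }
  /^#[0-9][0-9]*\./   { out = "test" ++n ".hyp"; next }
  out != "" && NF > 0 { print > out }
' $1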
[5] Optional Use of New Grammar:

BBN has developed an alternative to the use of the "word-pair" grammar: a statistical class grammar. As you are undoubtedly aware, the grammar has been thoroughly documented and made available over the net. I think they are to be commended for developing and offering an alternative, one that may offer a number of advantages. Alan Derr has outlined the motivation for use of this grammar in a net note that some might not have received. We are advised that BBN will make use of this grammar (as a complement to the present "standard" use of the no-grammar and word-pair conditions) in the tests to be reported at this meeting. Thus its use is, for the time being, optional.

[6] Planned Test Material:

The test material to be distributed at the end of August consists of both new and previously used test material.

The new material consists of the following: (a) one set of 300 sentence utterances (25 sentence utterances/speaker X 12 speakers) chosen from the Speaker Dependent Evaluation subset, and (b) one set of 300 sentence utterances (30 sentence utterances/speaker X 10 speakers) chosen from the Speaker Independent Evaluation subset. This is comparable to what was distributed for the February 1989 tests. This test material will be new (previously unreleased).

The previously used test material has been drawn from the test set used in February. It consists of the following: (a) one set of 150 sentence utterances (25 sentence utterances/speaker X 6 speakers) chosen from the February '89 Speaker Dependent Evaluation subset, and (b) one set of 150 sentence utterances (30 sentence utterances/speaker X 5 speakers) chosen from the February '89 Speaker Independent Evaluation subset.

We have designated this a "Test/Retest Set", and are interested in using the results, in conjunction with the statistical tools, to see if we can identify significant progress since the last test. We recognize that the sample size may be too small, but hope that the increment of progress may be shown to be significant in the no-grammar case. Because it involves extra processing time, we request only that each site use it for one pass through what you believe to be your "best performing" system, in the "no grammar" condition. Since this amounts to processing 150 utterances once, it should not present a significant burden.

Note that in the February 1989 memo, in the section "USE OF THIS TEST MATERIAL", paragraph 6, it states that [the February 1989 material] "is to be used once and only once for each system configuration". We recognize that there may nonetheless have been some reuse of the test material, and we ask that you specify the use you may have made of it since February so that we can be alert to the likelihood of systems having been trained on the test material.
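[Editorial comment: for readers curious about the statistical tools mentioned in [1], the sketch below shows a textbook form of McNemar's test as it might be applied to paired per-sentence results from two systems. It is an illustration only, not the NIST scoring implementation, and the input format (one line per test utterance, with a 0 or 1 for each system) is assumed here.]

#! /bin/sh
# mcnemar.sh -- a minimal sketch of McNemar's test (not the NIST code).
# Input ($1): one line per test utterance, two columns, each 0 (sentence
# in error) or 1 (sentence correct); column 1 is system A, column 2 is
# system B.
awk '
  $1 == 0 && $2 == 1 { n01++ }               # A wrong, B right
  $1 == 1 && $2 == 0 { n10++ }               # A right, B wrong
  END {
    n = n01 + n10                            # discordant pairs
    if (n == 0) { print "no discordant pairs"; exit }
    d = n01 - n10; if (d < 0) d = -d
    chi = (d - 1) * (d - 1) / n              # continuity-corrected, 1 df
    printf "discordant pairs = %d, chi-square = %g\n", n, chi
    if (chi > 3.84) print "difference significant at the 0.05 level"
    else            print "difference not significant at the 0.05 level"
  }
' $1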
Appendix A

#
# make_spkr_hyp_files.csh
#
# Converts sentences from a file using the new "(speaker id-sentence id)"
# format into individual speaker .hyp files with the old "(sentence id)"
# format.
# Make sure that no .hyp files exist in the current directory prior to
# running this script, or they may be appended to.
#
if ($1 == "") then
    echo "Usage: make_spkr_hyp_files.csh input_file"
    exit
endif
set count = 1
set total_lines = `cat $1 | wc -l`
echo $total_lines                    # report the number of lines to process
while ($count <= $total_lines)
    # extract line $count, then derive the speaker file name from the
    # 4-character speaker id and rebuild the line with the old-format
    # (sentence id) identifier
    set line=`tail +$count $1 | head -1`
    set spkrfile=`echo $line | awk -F'(' '{print substr($2,1,4) ".hyp"}' -`
    set newline=`echo $line | awk -F'(' '{print $1 "(" substr($2,6,25)}' -`
    echo $newline >> $spkrfile
    echo $newline appended to $spkrfile
    @ count++
end

-----------------------------------------------------------------

#! /bin/sh
# make_spkr_hyp_files.sh
#
# Converts sentences from a file using the new "(speaker id-sentence id)"
# format into individual speaker .hyp files with the old "(sentence id)"
# format.
# Make sure that no .hyp files exist in the current directory prior to
# running this script, or they may be appended to.
#
if test $# -ne 1
then
    echo "Usage: make_spkr_hyp_files.sh input_file" > /dev/tty
    exit 0
fi
count=1
total_lines=`cat $1 | wc -l`
while test $count -le $total_lines
do
    # extract line $count, then derive the speaker file name from the
    # 4-character speaker id and rebuild the line with the old-format
    # (sentence id) identifier
    line=`tail +$count $1 | head -1`
    spkrfile=`echo $line | awk -F'(' '{print substr($2,1,4) ".hyp"}' -`
    newline=`echo $line | awk -F'(' '{print $1 "(" substr($2,6,25)}' -`
    echo $newline >> $spkrfile
    echo $newline appended to $spkrfile > /dev/tty
    count=`expr $count + 1`
done
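[Editorial comment: as a usage illustration, suppose the Speaker Independent/No Grammar hypothesis strings from the example submission above have been extracted into a file named si_ng.txt (a hypothetical name). Then

sh make_spkr_hyp_files.sh si_ng.txt

would create JDM2.hyp, CMH1.hyp, and so on, one old-format ".hyp" file per test speaker, in the current directory.]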