         DARPA Resource Management Continuous Speech Database (RM1)
               Development Test and Evaluation Test Data
                         and Scoring Software

                            NIST Disc 2-4.2

This CD-ROM includes part of a corpus of recorded speech for use in designing
and evaluating algorithms for continuous speech recognition, along with
scoring software that has been used in benchmark tests.  Speaker-dependent,
speaker-adaptive, and speaker-independent recognition modes are accommodated.
The corpus consists of oral readings of sentences taken from a (nominally)
1000-word language model of a naval resource management task built around
existing interactive database and graphics programs [1].  Speaker-dependent
and speaker-independent system training data for this corpus are also
available on CD-ROM from the National Technical Information Service (NTIS)
(NTIS accession numbers PB89-226666 and PB90-500539, respectively).

This disc contains the speaker-dependent and speaker-independent test
material used in previous DARPA benchmark recognition tests, along with
scoring and diagnostic software for those tests.  This version of the scoring
software includes implementations of the statistical significance tests
outlined by Gillick and Cox [2].  Prototype software implementing alternative
scoring and diagnostic tools based on a phonology-based string alignment
procedure is also included, as is a library of software for manipulating the
speech file header structure developed at the National Institute of Standards
and Technology (NIST).

Please note that this CD-ROM is a revision of NIST Speech Disc 2-4.1, dated
January 1990, and includes additional test data.

                            TABLE OF CONTENTS

   I.    Development-Test and Evaluation-Test Speech Material
   II.   NIST Header Structure
   III.  NIST Speech Header Resources (SPHERE)
   IV.   Prior Benchmark Tests Using This Material
   V.    Implementation of Scoring Software
   VI.   Implementation of Statistical Significance Tests
   VII.  Experimental Implementation of Phonology-based String Alignment
   VIII. Compatibility with European SAM Project Standards
   IX.   Acknowledgements
   X.    References
   XI.   Disclaimers

I. DEVELOPMENT-TEST AND EVALUATION-TEST SPEECH MATERIAL

The directory CD2-4.2:/rm1 comprises the Resource Management Development Test
and Evaluation Test corpora used by the DARPA speech community to date.  This
directory contains 5760 NIST-headered SPHERE speech files as well as several
documentation files and directories.  Sentence text prompts have been
included, but "official" transcriptions (orthographic, phonetic, etc.) do not
exist and have, therefore, not been included.  For the purpose of system
testing, it has been assumed that the prompts represent an accurate
orthographic transcription of the utterances.  In addition to speech corpus
documentation, information describing prior DARPA benchmark tests and test
results for two recognition systems have been included.  Detailed information
on this material and the structure of the speech directories may be found in
"CD2-4.2:/rm1/readme.txt".

II. NIST HEADER STRUCTURE

This series of CD-ROMs employs the NIST speech file header structure.  The
header is an object-oriented, 1024-byte fixed-length, entirely ASCII
structure [6].  It is composed of a fixed portion followed by an
object-oriented variable portion.  The fixed portion is as follows:

   NIST_1A
      1024

The first line specifies the header type and the second line specifies the
header length.  Each of these lines is 8 bytes long (including the new-line)
and is structured to identify the header as well as to allow programs that do
not wish to read the subsequent header information to skip over it.
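For illustration, a minimal C fragment that checks the fixed portion and
positions a file stream at the first byte of sample data might look as
follows.  (This sketch is not part of the software distributed on this disc;
the SPHERE library described in Section III provides complete header-handling
facilities, and its readme.txt documents the actual functions to use.)

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   /* Skip the NIST header of a .sph file by reading the two 8-byte
    * fixed-portion lines and seeking past the header.  Illustrative
    * only; returns NULL on any error. */
   FILE *open_past_header(const char *path)
   {
       char line[9];
       long header_len;
       FILE *fp = fopen(path, "rb");

       if (fp == NULL)
           return NULL;

       /* First 8-byte line: the header type, "NIST_1A\n". */
       if (fread(line, 1, 8, fp) != 8 || strncmp(line, "NIST_1A", 7) != 0) {
           fclose(fp);
           return NULL;
       }

       /* Second 8-byte line: the header length in ASCII ("   1024\n"). */
       if (fread(line, 1, 8, fp) != 8) {
           fclose(fp);
           return NULL;
       }
       line[8] = '\0';
       header_len = atol(line);   /* 1024 for the files on this disc */

       /* Skip the remainder of the header, including the variable
        * portion described below. */
       if (fseek(fp, header_len, SEEK_SET) != 0) {
           fclose(fp);
           return NULL;
       }
       return fp;
   }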
The remaining object-oriented variable portion is composed of
object-type-value "triple" lines, which have the following format:

   <LINE> ::= <TRIPLE><new-line> | <COMMENT><new-line> |
              <TRIPLE><COMMENT><new-line> | <new-line>

   <TRIPLE> ::= <OBJECT><space><TYPE><space><VALUE><OPT-SPACES>

   <OBJECT> ::= <PRIMARY-SUBOBJECT> |
                <PRIMARY-SUBOBJECT><SECONDARY-SUBOBJECTS>

   <PRIMARY-SUBOBJECT> ::= <ALPHA> | <ALPHA><ALPHA-NUM-STRING>

   <SECONDARY-SUBOBJECTS> ::= _<ALPHA-NUM-STRING> |
                              _<ALPHA-NUM-STRING><SECONDARY-SUBOBJECTS>

   <TYPE> ::= -<INTEGER-FLAG> | -<REAL-FLAG> | -<STRING-FLAG>

   <INTEGER-FLAG> ::= i

   <REAL-FLAG> ::= r

   <STRING-FLAG> ::= s<DIGIT-STRING>

   <VALUE> ::= <INTEGER> | <REAL> | <STRING>   (depending on object type)

   <INTEGER> ::= <SIGN><DIGIT-STRING>

   <REAL> ::= <SIGN><DIGIT-STRING>.<DIGIT-STRING>

   <OPT-SPACES> ::= <SPACES> | NULL

   <COMMENT> ::= ;<CHARACTER-STRING>   (excluding embedded new-lines)

   <ALPHA-NUM-STRING> ::= <ALPHA-NUM> | <ALPHA-NUM><ALPHA-NUM-STRING>

   <ALPHA-NUM> ::= <ALPHA> | <DIGIT>

   <ALPHA> ::= a | ... | z | A | ... | Z

   <DIGIT-STRING> ::= <DIGIT> | <DIGIT><DIGIT-STRING>

   <DIGIT> ::= 0 | ... | 9

   <SIGN> ::= + | - | NULL

   <SPACES> ::= <space> | <space><SPACES>

   <CHARACTER-STRING> ::= <CHARACTER> | <CHARACTER><CHARACTER-STRING>

   <CHARACTER> ::= char(0) | char(1) | ... | char(255)

The currently defined objects (used in this database) are listed in the file
rm1/doc/header.def.  (Note: The list of objects in header.def may be expanded
for other corpora, since no order or number of objects is imposed on this
header structure.  The file header.def is simply a repository for Resource
Management object definitions.)  The single object "end_head" marks the end
of the active header; the remaining unused header space is undefined.

The following is an example header from the Resource Management corpus:

   NIST_1A
      1024
   database_id -s3 RM1
   database_version -s3 1.0
   utterance_id -s11 ajp0_st2195
   channel_count -i 1
   sample_count -i 46183
   sample_rate -i 16000
   sample_min -i -2119
   sample_max -i 2921
   sample_n_bytes -i 2
   sample_byte_format -s2 01
   sample_sig_bits -i 16
   end_head

III. NIST SPEECH HEADER RESOURCES (SPHERE)

SPHERE is a library of C-language functions that facilitates NIST speech file
header manipulation.  The SPHERE library can be found in CD2-4.2:/sphere.  A
basic suite of command-line header utility programs built using the SPHERE
library is also included.  The file CD2-4.2:/sphere/readme.txt may be
consulted for more information on using SPHERE.

IV. PRIOR BENCHMARK TESTS USING THIS MATERIAL

A series of DARPA Benchmark Tests has been conducted using test material on
this CD-ROM with systems that were developed using the system training data
contained on other CD-ROMs in this series.  These tests were conducted prior
to the DARPA-sponsored speech research meetings in:

   (1) March 1987
   (2) October 1987
   (3) June 1988
   (4) February 1989
   (5) October 1989
   (6) February 1991
   (7) September 1992

Please also note that the material for the June 1990 DARPA Resource
Management benchmark tests is not contained on this CD-ROM.  It can be found
in the DARPA Extended Resource Management Continuous Speech Speaker-Dependent
Corpus (RM2) CD-ROM set (NIST Speech Discs 3-1.2/3-2.2, last revised
September 1990/NTIS Order No. PB90-501776).  The RM2 corpus is a longitudinal
speaker-dependent extension to the RM1 corpus and contains 2400 training
sentences for each of 4 speakers.  A development test set and an evaluation
test set (the June 1990 DARPA benchmark test material) for the 4 speakers are
included in the 2-disc set as well.  Since the training material for these
speakers had not been "seen" by the community prior to the test, the same
test material was used for both the speaker-dependent and speaker-independent
tests in the June 1990 DARPA benchmark tests.

In addition, research conducted at CMU during development of the SPHINX
system by Kai-Fu Lee made use of a portion of the 1987 test material [3].
The documentation included on this disc for each of these tests consists of:

   (a) a relatively concise text summary of the properties of the test
       (an "overview" file),
   (b) index files citing the specified test material (with an ".ndx"
       extension),
   (c) an outline which permits recreation of the applicable test and
       system-training conditions using the test data and scoring software
       contained on this disc in conjunction with data on other discs in
       this series (the "outlin" file), and
   (d) the text of a series of background memoranda (of varying degrees of
       formality) that describe the implementation of the DARPA Benchmark
       Tests to participants in the DARPA speech research community (the
       "bkgrnd" file).

These memoranda are provided primarily for background information.  This
material has not been updated for tests 6 and 7.

This material is provided in order to permit use of the test material and
scoring software to replicate the system training and benchmark test
conditions that were applicable in prior tests.  Since some of these tests
were conducted more than five years ago, users are encouraged to refer to the
most recent reported results when making comparisons.

It is advisable to designate one set of the test material for use in system
development, and to defer use of the other test sets until system development
is complete.  Detailed analysis of the results of development tests may be
used for the purposes of system "tuning", but in no case should detailed
analysis of the results of the benchmark tests be used for system "tuning" if
those test set results are to be reported.  Results reported in publications
should be limited to single-pass or "first-time" tests.

To permit comparison of system results with the state-of-the-art systems
included in the October 1989 DARPA benchmark tests, two directories have been
provided containing example results for speaker-independent ("ind_ex") and
speaker-dependent ("dep_ex") systems.  Each example directory contains
subdirectories for the "Word-Pair" and "no grammar" test conditions.  Within
each of these directories, the ".hyp" file contains the recognition system
output, which may be used as input to the scoring software.  The ".hyp" file
in each example directory was used to produce the corresponding "score.out"
file, which contains portions of four different reports for each test.

V. IMPLEMENTATION OF SCORING SOFTWARE

The scoring software contained on this CD-ROM is a version of a scoring and
diagnostic software package that has been developed and used at the National
Institute of Standards and Technology in conjunction with the DARPA Benchmark
Tests of speech recognition systems using the Resource Management Corpus [4].
It has been developed to provide a uniform reporting standard for the DARPA
contractors, and to make it possible to track incremental progress.
Contractors have reported results for a given benchmark test set by citing
(as a minimum) the data contained in the report "Summary of Accuracy for the
Test Condition".

Use is made of a number of corpus- and lexicon-specific files that must be
redefined if this software is to be adapted to other corpora.  Specific
examples include the tables of homophones, splits and merges, and
alpha-numerics, and the partitioning of the lexicon into mono- and
poly-syllabic categories.  The NIST "production" scoring programs are meant
to be run in batch mode, and general functions for evaluation of results are
still in research and development at NIST.
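The figures in these reports are derived from the counts of word
substitutions, deletions, and insertions produced by the
reference/hypothesis string alignment.  As a minimal sketch (the distributed
scoring programs produce considerably more detailed reports), the two
standard summary measures can be computed from those counts as follows:

   #include <stdio.h>

   /* Standard word-level summary measures, computed from alignment
    * counts.  n_ref is the total number of words in the reference
    * strings; n_sub, n_del and n_ins are the substitution, deletion
    * and insertion error counts. */
   void print_summary(int n_ref, int n_sub, int n_del, int n_ins)
   {
       double pct_correct = 100.0 * (n_ref - n_sub - n_del) / n_ref;
       double word_error  = 100.0 * (n_sub + n_del + n_ins) / n_ref;

       printf("Percent correct: %6.1f\n", pct_correct);
       printf("Word error:      %6.1f\n", word_error);
   }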
Flexible and interactive diagnostic tools based on speech science,
linguistics, and statistical considerations may offer greater insight into
system performance than the tools used to date within the DARPA speech
research community.  The set of scoring software tools included on this disc
is believed to be the first made available to the speech research community
at large that permits implementation of statistical significance tests for
continuous speech recognition systems, along with other diagnostic tools.  It
offers a broad base for implementing uniform reporting standards.

VI. IMPLEMENTATION OF STATISTICAL SIGNIFICANCE TESTS

Gillick and Cox [2] have suggested the use of two simple tests: McNemar's
test and a matched-pairs test.  Implementations of these tests are options in
the "stats" software package.

In the implementation of McNemar's test in the scoring software on this
CD-ROM, errors are scored at the sentence level (i.e., a sentence is either
recognized correctly or in error, and the differences that are most important
to the McNemar test are derived from comparisons of the number of
sentence-level errors that are unique to each system) [5].  The McNemar
sentence-level error test can be performed using the stats option
"-SENT_MCN".

In the implementation of the matched-pairs test, knowledge from the aligned
reference and hypothesized sentence strings (using the present "standard"
dynamic programming (DP) string matching algorithm) is used to locate
segments of the hypothesized sentence strings that contain errors.  These
segments are selected so as to ensure that the errors in one segment are
statistically independent of the errors in any other segment.  For a sentence
hypothesis to be segmented into two (or more) segments, there must be at
least one region of some number of correctly recognized "buffer" words.  For
the two systems, the matched-pairs test computes the difference in the number
of errors in corresponding segments.  It then tests the null hypothesis that
the mean difference (in the number of word errors per segment) is zero.

Following a suggestion of Gillick and Cox, segments where no errors have
occurred are identified from the aligned strings, and these 'good' segments
are used to separate the segments where errors have occurred ('bad'
segments).  The 'good' segments must be sufficiently long to ensure that,
after a good segment, the first error in a bad segment is independent of any
previous errors.  The segments upon which the matched-pairs test is based are
bounded: (a) on the left, by either the beginning of the sentence string or
two (or more) correctly recognized words, and (b) on the right, by either two
(or more) correctly recognized words or the end of the sentence string.  The
choice of the number of buffer words in the 'good' segments (in this case,
two) reflects a compromise between: (a) allowing for a long enough period of
time to ensure independence of errors in each segment, and (b) ensuring that
the sentence strings are subdivided into a large number of segments.  With
the number of buffer words set at 2, each sentence is typically segmented
into about 1.4 segments, while a shorter buffer length of 1 correctly
recognized word yields about 1.9 segments per sentence.  This matched-pair
sentence-segment word error test can be performed using the stats option
"-MTCH_PR".

In the implementation of both tests, a 95% confidence level is used for
rejecting the null hypothesis.  An assumed chi-square distribution with one
degree of freedom is used in implementing the McNemar test, and a normal
distribution is assumed for the matched-pair sentence-segment word error
test.
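For concreteness, the following sketch shows the two test statistics in
their conventional forms, with critical values corresponding to the 95%
confidence level; the "stats" package on this disc should be consulted for
the exact conventions it uses (for example, whether a continuity correction
is applied to the McNemar statistic).

   #include <math.h>

   /* McNemar's test on sentence-level errors.  n01 is the number of
    * sentences in error for system A only, n10 the number in error for
    * system B only; n01 + n10 is assumed to be > 0.  Returns 1 if the
    * null hypothesis of equal error rates is rejected at the 95% level
    * (chi-square, one degree of freedom; critical value 3.84). */
   int mcnemar_reject(int n01, int n10)
   {
       double d      = (double)n01 - (double)n10;
       double chi_sq = (d * d) / (double)(n01 + n10);

       return chi_sq > 3.84;
   }

   /* Matched-pairs test.  diff[i] is the difference in the number of
    * word errors between the two systems in segment i, for n (> 1)
    * segments.  Tests the null hypothesis that the mean difference is
    * zero, assuming a normal distribution (critical value 1.96 at the
    * 95% level). */
   int matched_pairs_reject(const double diff[], int n)
   {
       double sum = 0.0, ss = 0.0, mean, z;
       int    i;

       for (i = 0; i < n; i++)
           sum += diff[i];
       mean = sum / n;

       for (i = 0; i < n; i++)
           ss += (diff[i] - mean) * (diff[i] - mean);

       z = mean / sqrt((ss / (n - 1)) / n);   /* mean / standard error */
       return fabs(z) > 1.96;
   }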
Software is also provided to implement a Friedman two-way
analysis-of-variance by ranks.

VII. EXPERIMENTAL IMPLEMENTATION OF PHONOLOGY-BASED STRING ALIGNMENT

Also included on this disc are experimental software and tables for aligning
strings so as to minimize their indicated phonological distance.  More recent
versions of this software are in development at NIST.  For information
regarding the status of the revisions, contact Dr. William Fisher
(billf@jaguar.ncsl.nist.gov).  The directory /score/src/rdev contains
general-purpose C-language functions and prototype utility programs
implementing "phonological distance" computations, and this approach is an
alternative that can be selected for the alignment program in the main
scoring package.

In order to report word errors, the words in the REF and HYP strings must
first be aligned.  The usual approach is to find the alignment that minimizes
the weighted sum of the indicated word substitutions, insertions, and
deletions.  All insertions and deletions have the same weight, and each
substitution is weighted slightly less than the sum of an insertion and a
deletion.  An efficient algorithm exists to solve this problem.

Our new alignment procedure [5] is similar in several respects to that
reported by Picone et al. [7], whose procedure aligned strings of phones
based on an assumed table of phone-to-phone distances.  The procedure
developed at NIST uses a hierarchy of linguistic code sets.  There are both
compositional and basic code sets.  If a code set is basic, each element
consists of only an ASCII representation.  If the code set is compositional,
then each of its elements also has a list of elements in the next lower code
set.  For instance, the lexical code set is a compositional one consisting of
a set of words, each word having an ASCII representation and a composition in
terms of a list of phonemes.  Similarly, each member of the phoneme code set
has a list of the phonological features composing it.  The feature code set
is basic.  Our software reads in and uses arbitrary code sets from ASCII text
files.

Most of our experimental work has used a Resource Management lexicon
developed at SRI, which contains, for each word, the string of phones based
on the most frequent forms observed in a training set from the Resource
Management database.  SRI does not claim that these represent the most likely
pronunciations in general.  It is a crude approximation to take their lexicon
out of the context of their research and use it for more general purposes,
but it generally works.  Our procedure currently operates under the
constraint that only one (non-probabilistic) phonological representation can
be used for each word; we decided to use a lexicon of most-frequent phones
instead of a lexicon of base forms.  When only one (most-likely) word
representation can be used, a good deal of contextual variation and
probabilistic information must be lost.  Still, when they differ, the
alignments resulting from using this material are almost always more
plausible than the current "standard" alignments, and they seem comparable in
quality to the alignments achieved by the proprietary TI alignment software.
We are developing other, experimental lexicons which may give improved
results, using material kindly sent to us by SRI, TI, and others.
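To make the code-set hierarchy described above concrete, the following C
fragment sketches one possible rendering of the scheme.  The type and field
names here are invented for illustration; the actual representations are
defined by the software and the ASCII code-set tables in /score/src/rdev.

   /* A hierarchy of linguistic code sets: lexical -> phoneme -> feature.
    * Illustrative declarations only. */

   typedef struct code_element {
       char *ascii;        /* ASCII representation, e.g. "ships" or "sh" */
       int  *composition;  /* indices of elements in the next lower code
                            * set; unused for elements of a basic set    */
       int   comp_len;
   } CodeElement;

   typedef struct code_set {
       char            *name;       /* "lexical", "phoneme", "feature"   */
       struct code_set *lower;      /* next lower code set, or NULL if
                                     * this code set is basic            */
       CodeElement     *elements;   /* index 0 is reserved for the null
                                     * (insertion/deletion) case         */
       int              n_elements;
   } CodeSet;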
In the preliminary software release here, the alignment process is phrased as
a calculation of alignment distance; the particular alignment that is found
is returned as a side effect.  In doing an alignment, we start with REF and
HYP strings of words.  These strings are passed to a function
ALDIST(s1,s2,code), whose responsibility it is to calculate and return the
alignment distance between the two strings, using the usual DP algorithm.
Whenever the weight (or distance) between elements i and j of the (word) code
set, W(i,j), is needed, it is not looked up in a table; instead, another
function, WOD(i,j,code), is called.  This function is given the job of
computing and returning the weight or distance between elements i and j.  To
do this, it calls ALDIST, specifying the next lower code set and the two
strings of lower-code (phonemic) elements corresponding to (words) i and j.
The process is repeated at each lower level, until eventually a code set is
reached that is basic and has no composition in a next-lower code.  WOD then
ends the recursion by returning a value based only on a comparison of the
integers i and j (e.g., 0 if i=j, 1 otherwise).  In our case, this happens at
the feature level.

In order to make use of the same logic at the feature level as at the other
levels, we use a feature representation that is an adaptation of the
classical "privative" feature opposition of Trubetzkoy [8].  If a phone has a
certain feature, this feature will appear in the phone's string of lower-code
feature elements; if it does not have that feature, no such symbol will be
there.  The list of features is strictly ordered, so that interchanging
consecutive symbols to find a match is never needed.  When WOD works with a
feature code set, it returns the value 1 if either i=0 (insertion) or j=0
(deletion); otherwise, if i=j (a match), it returns 0, and if i!=j (a
substitution), it returns a very large arbitrary number, in order to
effectively suppress substitution hypotheses.  As a result, the unit of
distance, at every level from phone to utterance, is the number of
phonological features that must be changed to turn one string into the other.
For the source code, the phonological code set tables, and more
documentation, see the directory /score/src/rdev.
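The mutually recursive scheme can be sketched as follows, reusing the
illustrative CodeSet and CodeElement declarations from the fragment above.
This is a simplification: the function bodies and names are ours, and the
sketch omits the recovery of the particular alignment, which the real
program returns as a side effect; the code in /score/src/rdev is the
authoritative version.

   #include <stdlib.h>

   static double wod(int i, int j, const CodeSet *code);

   /* ALDIST: alignment distance between element-index strings s1
    * (length n1) and s2 (length n2) over a code set, computed with the
    * usual DP recurrence. */
   static double aldist(const int *s1, int n1, const int *s2, int n2,
                        const CodeSet *code)
   {
       double **d = malloc((n1 + 1) * sizeof *d);
       double   best, t, result;
       int      i, j;

       for (i = 0; i <= n1; i++)
           d[i] = malloc((n2 + 1) * sizeof **d);

       d[0][0] = 0.0;
       for (i = 1; i <= n1; i++)                        /* deletions  */
           d[i][0] = d[i-1][0] + wod(s1[i-1], 0, code);
       for (j = 1; j <= n2; j++)                        /* insertions */
           d[0][j] = d[0][j-1] + wod(0, s2[j-1], code);

       for (i = 1; i <= n1; i++)
           for (j = 1; j <= n2; j++) {
               best = d[i-1][j-1] + wod(s1[i-1], s2[j-1], code);
               t = d[i-1][j] + wod(s1[i-1], 0, code);
               if (t < best) best = t;
               t = d[i][j-1] + wod(0, s2[j-1], code);
               if (t < best) best = t;
               d[i][j] = best;
           }

       result = d[n1][n2];
       for (i = 0; i <= n1; i++)
           free(d[i]);
       free(d);
       return result;
   }

   /* WOD: weight (or distance) between elements i and j of a code set;
    * index 0 stands for "no element" (a pure insertion or deletion). */
   static double wod(int i, int j, const CodeSet *code)
   {
       const CodeElement *e1, *e2;

       if (code->lower == NULL) {
           /* Basic (feature) code set: end the recursion. */
           if (i == 0 || j == 0)
               return 1.0;                 /* insertion or deletion     */
           return (i == j) ? 0.0 : 1.0e6;  /* match, or a very large
                                            * number to suppress feature
                                            * substitution hypotheses   */
       }

       /* Compositional code set: recur on the compositions of i and j
        * in the next lower code set (an empty string for index 0). */
       e1 = (i != 0) ? &code->elements[i] : NULL;
       e2 = (j != 0) ? &code->elements[j] : NULL;
       return aldist(e1 ? e1->composition : NULL, e1 ? e1->comp_len : 0,
                     e2 ? e2->composition : NULL, e2 ? e2->comp_len : 0,
                     code->lower);
   }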
VIII. COMPATIBILITY WITH EUROPEAN SAM PROJECT STANDARDS

Within the European speech research community, the European multi-lingual
Speech input/output Assessment Methodology and standardization (SAM) Project
has developed conventions for SAM speech databases that differ in many
respects from the conventions used in this series of speech corpora on
CD-ROM.  Software has been developed at ICP in Grenoble, a participant in the
ESPRIT SAM project, to provide a bridge between these differing conventions.
This software deals with differences in file-naming and header conventions,
as well as with what are termed the "associated files".

Significant differences in the file-naming conventions involve the handling
of speaker (name) codes, corpus codes, and file numbering.  In SAM
terminology, the file-naming convention is of the form XXnnxxxx.SAS.  The
approach taken in the present prototype software is to map the speaker's
initials into a unique two-character speaker code (XX).  In the SAM
convention, the two-character "corpus code" (nn) is intended as a code for
the recording session; however, for the DARPA corpora this information is not
readily available.  The reconciliation of this difference in approach (used
in the present conversion software) is that, unless otherwise designated by
the user, the first two letters of the original file names are used for this
purpose (e.g., SA, SB, or ST).  [These letters correspond to different
portions of the text corpus, rather than to recording sessions: "SA"
signifies that the sentence texts are the "dialect" sentences, "SB" signifies
that the sentence texts are the "rapid adaptation" sentences, and "ST"
signifies other sentences from the Resource Management Corpus.]  Another
portion of the SAM file-naming convention (xxxx) contains a unique file
number "attributed by the SAM consortium" (which may differ from site to
site).  In using the conversion software, after the user has defined an
initial file number, this number is incremented for each new file.  An
associated label file will contain the original (DARPA) file name.  Finally,
in the filename extension (e.g., SAS), the "S" signifies that these are
sentence utterances, "A" signifies that the spoken language is American
(i.e., English as spoken in the United States), and the final "S" signifies
that the file contains sampled speech, while a final "O" signifies an
orthographic transcription file.  For each .sph file that is processed using
the "Convert" software, two associated SAM-convention files will be produced:
one with the extension .SAS and an associated file with the extension .SAO.
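As a worked illustration of the resulting names (the values here are
invented; the actual speaker codes and file numbers are assigned when the
conversion software is run):

   #include <stdio.h>

   /* Illustrative construction of a SAM-convention file name from the
    * components described above.  All values are invented for this
    * example. */
   int main(void)
   {
       const char *speaker_code = "AJ";  /* two-character speaker code (XX) */
       const char *corpus_code  = "ST";  /* first two letters of the
                                          * original file name (nn)         */
       int         file_number  = 1;     /* incremented for each new file   */
       char        sam_name[13];

       sprintf(sam_name, "%s%s%04d.SAS",
               speaker_code, corpus_code, file_number);
       printf("%s\n", sam_name);         /* prints "AJST0001.SAS"; the
                                          * associated orthographic file
                                          * would be "AJST0001.SAO"         */
       return 0;
   }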
IX. ACKNOWLEDGEMENTS

At NIST, John Garofolo has been responsible for the compilation and
premastering of the Resource Management Corpus for this series of CD-ROM
discs.  Jonathan Fiscus had responsibility for a thorough revision of the
"standard" scoring software and for the implementation of the statistical
significance tests.  William Fisher developed the phonology-based scoring
software and worked with Jon Fiscus to incorporate it into the revised
"standard" scoring software.  David Pallett coordinated the selection of test
material and the implementation of the DARPA benchmark tests.  John Garofolo,
Bill Fisher, and David Pallett designed the speech file header structure, and
Stan Janet developed the SPHERE software.  Comments are welcome and should be
addressed to: Dr. David S. Pallett, Room A216 Technology Building, National
Institute of Standards and Technology, Gaithersburg, MD 20899.

The cooperation of the following DARPA contractors in implementing the DARPA
Benchmark Tests is gratefully acknowledged: Francis Kubala at BBN, Kai-Fu Lee
(formerly) at CMU, Hy Murveit at SRI, Doug Paul at MIT Lincoln Laboratory,
and Victor Zue at MIT Laboratory for Computer Science.  Jay Wilpon and Larry
Rabiner at AT&T Bell Laboratories also deserve thanks for cooperation in the
use of the Resource Management Corpus for benchmark tests within their
organization.

Kai-Fu Lee (formerly of CMU) is to be particularly thanked for providing
clarification of the details of his use of the Resource Management Corpus and
the 1987 test material in the development of the SPHINX system.  He is also
to be thanked for agreeing to the release of the October 1989 CMU SPHINX
speaker-independent benchmark test results and for providing a concise
description of that SPHINX system.  Special thanks are also due to Francis
Kubala at BBN for agreeing to the release of the October 1989 BBN BYBLOS
speaker-dependent benchmark test results and for providing a description of
that BYBLOS system.

Discussions with Larry Gillick at Dragon Systems have been very helpful in
developing the present implementation of the statistical significance tests.
Gillick has suggested the development of more flexible, interactive (rather
than "batch mode") statistically based diagnostic tools; perhaps the
development of such tools should be the subject of future research.  NIST
staff, however, have been solely responsible for the design and
implementation of the statistical test software contained on this CD-ROM.

Mike Cohen at SRI is thanked for sending us the lexicons and phone feature
sets that were developed in his research there.  TI's Jack Godfrey has been
kind enough to send us an experimental RM phonemic lexicon of theirs.  In
addition, Joe Picone and George Doddington (formerly of TI) earlier lent us
their alignment software, for which we are grateful.

The cooperation of Jean-Marc Dolmazon and Jerome Zeiliger at the Institut de
la Communication Parlee (ICP) in Grenoble, in developing prototype conversion
software between the file format used in this series of discs and the ESPRIT
SAM format, is gratefully acknowledged.  Questions about the implementation
of this prototype software (and the availability of revisions) should be
directed to: Jerome Zeiliger, Institut de la Communication Parlee, I.N.P.G. -
E.N.S.E.R.G., 46 Avenue Felix-Viallet, 38031 Grenoble Cedex, France;
telephone: +33 76574538; FAX: +33 76574710.

X. REFERENCES

[1] Price, P. J., Fisher, W. M., and Bernstein, J., "The DARPA 1000-word
    Resource Management Database for Continuous Speech Recognition", Paper
    S.13.b.21 in Proceedings of ICASSP'88 (New York) (April 1988),
    pp. 651-654.

[2] Gillick, L. and Cox, S. J., "Some Statistical Issues in the Comparison
    of Speech Recognition Algorithms", Paper S10.b.5 in Proceedings of
    ICASSP'89 (Glasgow) (May 1989), pp. 532-535.

[3] Lee, K. F., "Large-Vocabulary Speaker-Independent Continuous Speech
    Recognition: The SPHINX System", Ph.D. Dissertation, Carnegie Mellon
    University Computer Science Department, Report No. CMU-CS-88-148
    (April 1988).

[4] Pallett, D. S., "Benchmark Tests for DARPA Resource Management Database
    Performance Evaluations", Paper S10.b.6 in Proceedings of ICASSP'89
    (Glasgow) (May 1989), pp. 536-539.

[5] Pallett, D. S., Fisher, W. M., and Fiscus, J. G., "Tools for the
    Analysis of Benchmark Speech Recognition Tests", Paper 7.S2.16 in
    Proceedings of ICASSP'90 (Albuquerque) (April 1990).

[6] Garofolo, J. S. and Pallett, D. S., "Use of CD-ROM for Speech Database
    Storage and Exchange", in Proceedings of Eurospeech 89 (European
    Conference on Speech Communication and Technology) (Paris) (September
    1989), Vol. 2, pp. 309-312.

[7] Picone, J., Goudie-Marshall, K. M., Doddington, G. R., and Fisher,
    W. M., "Automatic Text Alignment for Speech System Evaluation", IEEE
    Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-34,
    No. 4, pp. 1010-1011, August 1986.

[8] Anderson, S. R., Phonology in the Twentieth Century, University of
    Chicago Press, Chicago, 1985, pp. 99-100.

XI. DISCLAIMERS

(1) The scoring software package included on this CD-ROM was developed and
tested using the Berkeley 4.2 and 4.3 UNIX (TM) operating systems.  It has
been successfully implemented at other sites, and at one site modifications
have been successfully made to permit implementation in an MS-DOS
environment.  However, in implementing this software, it may be necessary to
make minor local modifications.
Little effort was expended in optimizing this software for memory allocation
or run time, since it was expected to be executed infrequently.

(2) The implementation of statistical significance tests incorporated in the
scoring software package represents a preliminary effort to introduce these
considerations into performance assessment for speech recognition
technology, and is intended to "encourage researchers who are reporting
empirical results to use statistical measures in summarizing their findings
and drawing conclusions".  Some of the assumptions required for these tests
to be strictly applicable (e.g., independence of errors and the availability
of sufficient errors to justify assumptions about distributions) may not be
satisfied for some of the benchmark test material.

(3) The phonology-based string alignment option incorporated in the scoring
software package represents an alternative approach to the word string
alignment procedure that has been employed to date in the DARPA Benchmark
Tests.  It appears to offer significant advantages over the traditional
approach.  However, it is the subject of ongoing research and has not yet
been adopted for "standard" usage within the DARPA research community.
Comments on this approach are welcome and should be directed to the attention
of Dr. William M. Fisher, Room A216 Technology Building, National Institute
of Standards and Technology, Gaithersburg, MD 20899.

(4) These speech corpora and software tools have been developed for use
within the DARPA speech research community.  Although the corpora and scoring
software have been adopted for widespread use within the DARPA speech
community, they are the subject of ongoing research.  Although care has been
taken to ensure that all of the CD-ROM-based data and software are complete
and error-free, they may not meet all users' requirements.  As such, this
material is made available to the speech research community at large, without
endorsement or express or implied warranties.  The results of tests conducted
with this test material and/or analyses of the performance of speech
recognition systems are not to be construed as official findings of the
National Institute of Standards and Technology, the Department of Defense, or
the United States Government.