         DARPA Resource Management Continuous Speech Database (RM1)
               Development Test and Evaluation Test Data
                         and Scoring Software

                            NIST Disc 2-4.2

This CD-ROM includes part of a corpus of recorded speech for use in designing
and evaluating algorithms for continuous speech recognition, along with
scoring software that has been used in benchmark tests.  Speaker-dependent,
speaker-adaptive, and speaker-independent recognition modes are accommodated.
The corpus consists of oral readings of sentences taken from a (nominally)
1000-word language model of a naval resource management task built around
existing interactive database and graphics programs [1].  Speaker-dependent
and speaker-independent system training data for this corpus are also
available on CD-ROM from the National Technical Information Service (NTIS)
(NTIS accession numbers PB89-226666 and PB90-500539, respectively).

This disc contains the speaker-dependent and speaker-independent test
material used in previous DARPA benchmark recognition tests, along with
scoring and diagnostic software for those tests.  This version of the scoring
software includes implementations of the statistical significance tests
outlined by Gillick and Cox [2].  Prototype software implementing alternative
scoring and diagnostic tools based on a phonology-based string alignment
procedure is also included, as is a library of software for manipulating the
speech file header structure developed at the National Institute of Standards
and Technology (NIST).

Please note that this CD-ROM is a revision of NIST Speech Disc 2-4.1, dated
January 1990, and includes additional test data.

                            TABLE OF CONTENTS

   I.    Development-Test and Evaluation-Test Speech Material
   II.   NIST Header Structure
   III.  NIST Speech Header Resources (SPHERE)
   IV.   Prior Benchmark Tests Using This Material
   V.    Implementation of Scoring Software
   VI.   Implementation of Statistical Significance Tests
   VII.  Experimental Implementation of Phonology-based String Alignment
   VIII. Compatibility with European SAM Project Standards
   IX.   Acknowledgements
   X.    References
   XI.   Disclaimers

I. DEVELOPMENT-TEST AND EVALUATION-TEST SPEECH MATERIAL

The directory CD2-4.2:/rm1 comprises the Resource Management Development Test
and Evaluation Test corpora used by the DARPA speech community to date.  This
directory contains 5760 NIST-headered SPHERE speech files as well as several
documentation files and directories.  Sentence text prompts have been
included, but "official" transcriptions (orthographic, phonetic, etc.) do not
exist and have, therefore, not been included.  For the purpose of system
testing, it has been assumed that the prompts represent an accurate
orthographic transcription of the utterances.  In addition to speech corpus
documentation, information describing prior DARPA benchmark tests and test
results for two recognition systems have been included.  Detailed information
on this material and the structure of the speech directories may be found in
"CD2-4.2:/rm1/readme.txt".

II. NIST HEADER STRUCTURE

This series of CD-ROMs employs the NIST speech file header structure.  The
header is an object-oriented, 1024-byte fixed-length, entirely ASCII
structure [6].  It is composed of a fixed portion followed by an
object-oriented variable portion.  The fixed portion is as follows:

   NIST_1A
      1024

The first line specifies the header type and the second line specifies the
header length.  Each of these lines is 8 bytes long (including the new-line)
and is structured to identify the header as well as to allow programs that do
not wish to read the subsequent header information to skip over it.
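For illustration, a minimal C fragment that checks the fixed portion and
positions a file stream at the first byte of sample data might look as
follows.  (This sketch is not part of the software distributed on this disc;
the SPHERE library described in Section III provides complete header-handling
facilities, and its readme.txt documents the actual functions to use.)

   #include <stdio.h>
   #include <stdlib.h>
   #include <string.h>

   /* Skip the NIST header of a .sph file by reading the two 8-byte
    * fixed-portion lines and seeking past the header.  Illustrative
    * only; returns NULL on any error. */
   FILE *open_past_header(const char *path)
   {
       char line[9];
       long header_len;
       FILE *fp = fopen(path, "rb");

       if (fp == NULL)
           return NULL;

       /* First 8-byte line: the header type, "NIST_1A\n". */
       if (fread(line, 1, 8, fp) != 8 || strncmp(line, "NIST_1A", 7) != 0) {
           fclose(fp);
           return NULL;
       }

       /* Second 8-byte line: the header length in ASCII ("   1024\n"). */
       if (fread(line, 1, 8, fp) != 8) {
           fclose(fp);
           return NULL;
       }
       line[8] = '\0';
       header_len = atol(line);   /* 1024 for the files on this disc */

       /* Skip the remainder of the header, including the variable
        * portion described below. */
       if (fseek(fp, header_len, SEEK_SET) != 0) {
           fclose(fp);
           return NULL;
       }
       return fp;
   }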
The remaining object-oriented variable portion is composed of
object-type-value "triple" lines, which have the following format:

   <LINE> ::= <TRIPLE><new-line> | <COMMENT><new-line> |
              <TRIPLE><COMMENT><new-line> | <new-line>

   <TRIPLE> ::= <OBJECT><space><TYPE><space><VALUE><OPT-SPACES>

   <OBJECT> ::= <PRIMARY-SUBOBJECT> |
                <PRIMARY-SUBOBJECT><SECONDARY-SUBOBJECTS>

   <PRIMARY-SUBOBJECT> ::= <ALPHA> | <ALPHA><ALPHA-NUM-STRING>

   <SECONDARY-SUBOBJECTS> ::= _<ALPHA-NUM-STRING> |
                              _<ALPHA-NUM-STRING><SECONDARY-SUBOBJECTS>

   <TYPE> ::= -<INTEGER-FLAG> | -<REAL-FLAG> | -<STRING-FLAG>

   <INTEGER-FLAG> ::= i

   <REAL-FLAG> ::= r

   <STRING-FLAG> ::= s<DIGIT-STRING>

   <VALUE> ::= <INTEGER> | <REAL> | <STRING>   (depending on object type)

   <INTEGER> ::= <SIGN><DIGIT-STRING>

   <REAL> ::= <SIGN><DIGIT-STRING>.<DIGIT-STRING>

   <OPT-SPACES> ::= <SPACES> | NULL

   <COMMENT> ::= ;<CHARACTER-STRING>   (excluding embedded new-lines)

   <ALPHA-NUM-STRING> ::= <ALPHA-NUM> | <ALPHA-NUM><ALPHA-NUM-STRING>

   <ALPHA-NUM> ::= <ALPHA> | <DIGIT>

   <ALPHA> ::= a | ... | z | A | ... | Z

   <DIGIT-STRING> ::= <DIGIT> | <DIGIT><DIGIT-STRING>

   <DIGIT> ::= 0 | ... | 9

   <SIGN> ::= + | - | NULL

   <SPACES> ::= <space> | <space><SPACES>

   <CHARACTER-STRING> ::= <CHARACTER> | <CHARACTER><CHARACTER-STRING>

   <CHARACTER> ::= char(0) | char(1) | ... | char(255)

The currently defined objects (used in this database) are listed in the file
rm1/doc/header.def.  (Note: The list of objects in header.def may be expanded
for other corpora, since no order or number of objects is imposed on this
header structure.  The file header.def is simply a repository for Resource
Management object definitions.)  The single object "end_head" marks the end
of the active header; the remaining unused header space is undefined.

The following is an example header from the Resource Management corpus:

   NIST_1A
      1024
   database_id -s3 RM1
   database_version -s3 1.0
   utterance_id -s11 ajp0_st2195
   channel_count -i 1
   sample_count -i 46183
   sample_rate -i 16000
   sample_min -i -2119
   sample_max -i 2921
   sample_n_bytes -i 2
   sample_byte_format -s2 01
   sample_sig_bits -i 16
   end_head

III. NIST SPEECH HEADER RESOURCES (SPHERE)

SPHERE is a library of C-language functions that facilitates NIST speech file
header manipulation.  The SPHERE library can be found in CD2-4.2:/sphere.  A
basic suite of command-line header utility programs built using the SPHERE
library is also included.  The file CD2-4.2:/sphere/readme.txt may be
consulted for more information on using SPHERE.

IV. PRIOR BENCHMARK TESTS USING THIS MATERIAL

A series of DARPA Benchmark Tests has been conducted using test material on
this CD-ROM with systems that were developed using the system training data
contained on other CD-ROMs in this series.  These tests were conducted prior
to the DARPA-sponsored speech research meetings in:

   (1) March 1987
   (2) October 1987
   (3) June 1988
   (4) February 1989
   (5) October 1989
   (6) February 1991
   (7) September 1992

Please also note that the material for the June 1990 DARPA Resource
Management benchmark tests is not contained on this CD-ROM.  It can be found
in the DARPA Extended Resource Management Continuous Speech Speaker-Dependent
Corpus (RM2) CD-ROM set (NIST Speech Discs 3-1.2/3-2.2, last revised
September 1990/NTIS Order No. PB90-501776).  The RM2 corpus is a longitudinal
speaker-dependent extension to the RM1 corpus and contains 2400 training
sentences for each of 4 speakers.  A development test set and an evaluation
test set (the June 1990 DARPA benchmark test material) for the 4 speakers are
included in the 2-disc set as well.  Since the training material for these
speakers had not been "seen" by the community prior to the test, the same
test material was used for both the speaker-dependent and speaker-independent
tests in the June 1990 DARPA benchmark tests.

In addition, research conducted at CMU during development of the SPHINX
system by Kai-Fu Lee made use of a portion of the 1987 test material [3].
The documentation included on this disc for each of these tests consists of:

   (a) a relatively concise text summary of the properties of the test
       (an "overview" file),
   (b) index files citing the specified test material (with an ".ndx"
       extension),
   (c) an outline which permits recreation of the applicable test and
       system-training conditions using the test data and scoring software
       contained on this disc in conjunction with data on other discs in
       this series (the "outlin" file), and
   (d) the text of a series of background memoranda (of varying degrees of
       formality) that describe the implementation of the DARPA Benchmark
       Tests to participants in the DARPA speech research community (the
       "bkgrnd" file).

These memoranda are provided primarily for background information.  This
material has not been updated for tests 6 and 7.

This material is provided in order to permit use of the test material and
scoring software to replicate the system training and benchmark test
conditions that were applicable in prior tests.  Since some of these tests
were conducted more than five years ago, users are encouraged to refer to the
most recent reported results when making comparisons.

It is advisable to designate one set of the test material for use in system
development, and to defer use of the other test sets until system development
is complete.  Detailed analysis of the results of development tests may be
used for the purposes of system "tuning", but in no case should detailed
analysis of the results of the benchmark tests be used for system "tuning" if
those test set results are to be reported.  Results reported in publications
should be limited to single-pass or "first-time" tests.

To permit comparison of system results with the state-of-the-art systems
included in the October 1989 DARPA benchmark tests, two directories have been
provided containing example results for speaker-independent ("ind_ex") and
speaker-dependent ("dep_ex") systems.  Each example directory contains
subdirectories for the "Word-Pair" and "no grammar" test conditions.  Within
each of these directories, the ".hyp" file contains the recognition system
output, which may be used as input to the scoring software.  The ".hyp" file
in each example directory was used to produce the corresponding "score.out"
file, which contains portions of four different reports for each test.

V. IMPLEMENTATION OF SCORING SOFTWARE

The scoring software contained on this CD-ROM is a version of a scoring and
diagnostic software package that has been developed and used at the National
Institute of Standards and Technology in conjunction with the DARPA Benchmark
Tests of speech recognition systems using the Resource Management Corpus [4].
It has been developed to provide a uniform reporting standard for the DARPA
contractors, and to make it possible to track incremental progress.
Contractors have reported results for a given benchmark test set by citing
(as a minimum) the data contained in the report "Summary of Accuracy for the
Test Condition".

Use is made of a number of corpus- and lexicon-specific files that must be
redefined if this software is to be adapted to other corpora.  Specific
examples include the tables of homophones, splits and merges, and
alpha-numerics, and the partitioning of the lexicon into mono- and
poly-syllabic categories.  The NIST "production" scoring programs are meant
to be run in batch mode, and general functions for evaluation of results are
still in research and development at NIST.
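The figures in these reports are derived from the counts of word
substitutions, deletions, and insertions produced by the
reference/hypothesis string alignment.  As a minimal sketch (the distributed
scoring programs produce considerably more detailed reports), the two
standard summary measures can be computed from those counts as follows:

   #include <stdio.h>

   /* Standard word-level summary measures, computed from alignment
    * counts.  n_ref is the total number of words in the reference
    * strings; n_sub, n_del and n_ins are the substitution, deletion
    * and insertion error counts. */
   void print_summary(int n_ref, int n_sub, int n_del, int n_ins)
   {
       double pct_correct = 100.0 * (n_ref - n_sub - n_del) / n_ref;
       double word_error  = 100.0 * (n_sub + n_del + n_ins) / n_ref;

       printf("Percent correct: %6.1f\n", pct_correct);
       printf("Word error:      %6.1f\n", word_error);
   }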
Flexible and interactive diagnostic tools based on speech science,
linguistics, and statistical considerations may offer greater insight into
system performance than the tools used to date within the DARPA speech
research community.  The set of scoring software tools included on this disc
is believed to be the first made available to the speech research community
at large that permits implementation of statistical significance tests for
continuous speech recognition systems, along with other diagnostic tools.  It
offers a broad base for implementing uniform reporting standards.

VI. IMPLEMENTATION OF STATISTICAL SIGNIFICANCE TESTS

Gillick and Cox [2] have suggested the use of two simple tests: McNemar's
test and a matched-pairs test.  Implementations of these tests are options in
the "stats" software package.

In the implementation of McNemar's test in the scoring software on this
CD-ROM, errors are scored at the sentence level (i.e., a sentence is either
recognized correctly or in error, and the differences that are most important
to the McNemar test are derived from comparisons of the number of
sentence-level errors that are unique to each system) [5].  The McNemar
sentence-level error test can be performed using the stats option
"-SENT_MCN".

In the implementation of the matched-pairs test, knowledge from the aligned
reference and hypothesized sentence strings (using the present "standard"
dynamic programming (DP) string matching algorithm) is used to locate
segments of the hypothesized sentence strings that contain errors.  These
segments are selected so as to ensure that the errors in one segment are
statistically independent of the errors in any other segment.  For a sentence
hypothesis to be segmented into two (or more) segments, there must be at
least one region of some number of correctly recognized "buffer" words.  For
the two systems, the matched-pairs test computes the difference in the number
of errors in corresponding segments.  It then tests the null hypothesis that
the mean difference (in the number of word errors per segment) is zero.

Following a suggestion of Gillick and Cox, segments where no errors have
occurred are identified from the aligned strings, and these 'good' segments
are used to separate the segments where errors have occurred ('bad'
segments).  The 'good' segments must be sufficiently long to ensure that,
after a good segment, the first error in a bad segment is independent of any
previous errors.  The segments upon which the matched-pairs test is based are
bounded: (a) on the left, by either the beginning of the sentence string or
two (or more) correctly recognized words, and (b) on the right, by either two
(or more) correctly recognized words or the end of the sentence string.  The
choice of the number of buffer words in the 'good' segments (in this case,
two) reflects a compromise between: (a) allowing for a long enough period of
time to ensure independence of errors in each segment, and (b) ensuring that
the sentence strings are subdivided into a large number of segments.  With
the number of buffer words set at 2, each sentence is typically segmented
into about 1.4 segments, while a shorter buffer length of 1 correctly
recognized word yields about 1.9 segments per sentence.  This matched-pair
sentence-segment word error test can be performed using the stats option
"-MTCH_PR".

In the implementation of both tests, a 95% confidence level is used for
rejecting the null hypothesis.  An assumed chi-square distribution with one
degree of freedom is used in implementing the McNemar test, and a normal
distribution is assumed for the matched-pair sentence-segment word error
test.
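For concreteness, the following sketch shows the two test statistics in
their conventional forms, with critical values corresponding to the 95%
confidence level; the "stats" package on this disc should be consulted for
the exact conventions it uses (for example, whether a continuity correction
is applied to the McNemar statistic).

   #include <math.h>

   /* McNemar's test on sentence-level errors.  n01 is the number of
    * sentences in error for system A only, n10 the number in error for
    * system B only; n01 + n10 is assumed to be > 0.  Returns 1 if the
    * null hypothesis of equal error rates is rejected at the 95% level
    * (chi-square, one degree of freedom; critical value 3.84). */
   int mcnemar_reject(int n01, int n10)
   {
       double d      = (double)n01 - (double)n10;
       double chi_sq = (d * d) / (double)(n01 + n10);

       return chi_sq > 3.84;
   }

   /* Matched-pairs test.  diff[i] is the difference in the number of
    * word errors between the two systems in segment i, for n (> 1)
    * segments.  Tests the null hypothesis that the mean difference is
    * zero, assuming a normal distribution (critical value 1.96 at the
    * 95% level). */
   int matched_pairs_reject(const double diff[], int n)
   {
       double sum = 0.0, ss = 0.0, mean, z;
       int    i;

       for (i = 0; i < n; i++)
           sum += diff[i];
       mean = sum / n;

       for (i = 0; i < n; i++)
           ss += (diff[i] - mean) * (diff[i] - mean);

       z = mean / sqrt((ss / (n - 1)) / n);   /* mean / standard error */
       return fabs(z) > 1.96;
   }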
Software is also provided to implement a Friedman two-way
analysis-of-variance by ranks.

VII. EXPERIMENTAL IMPLEMENTATION OF PHONOLOGY-BASED STRING ALIGNMENT

Also included on this disc are experimental software and tables for aligning
strings so as to minimize their indicated phonological distance.  More recent
versions of this software are in development at NIST.  For information
regarding the status of the revisions, contact Dr. William Fisher
(billf@jaguar.ncsl.nist.gov).  The directory /score/src/rdev contains
general-purpose C-language functions and prototype utility programs
implementing "phonological distance" computations, and this approach is an
alternative that can be selected for the alignment program in the main
scoring package.

In order to report word errors, the words in the REF and HYP strings must
first be aligned.  The usual approach is to find the alignment that minimizes
the weighted sum of the indicated word substitutions, insertions, and
deletions.  All insertions and deletions have the same weight, and each
substitution is weighted slightly less than the sum of an insertion and a
deletion.  An efficient algorithm exists to solve this problem.

Our new alignment procedure [5] is similar in several respects to that
reported by Picone et al. [7], whose procedure aligned strings of phones
based on an assumed table of phone-to-phone distances.  The procedure
developed at NIST uses a hierarchy of linguistic code sets.  There are both
compositional and basic code sets.  If a code set is basic, each element
consists of only an ASCII representation.  If the code set is compositional,
then each of its elements also has a list of elements in the next lower code
set.  For instance, the lexical code set is a compositional one consisting of
a set of words, each word having an ASCII representation and a composition in
terms of a list of phonemes.  Similarly, each member of the phoneme code set
has a list of the phonological features composing it.  The feature code set
is basic.  Our software reads in and uses arbitrary code sets from ASCII text
files.

Most of our experimental work has used a Resource Management lexicon
developed at SRI, which contains, for each word, the string of phones based
on the most frequent forms observed in a training set from the Resource
Management database.  SRI does not claim that these represent the most likely
pronunciations in general.  It is a crude approximation to take their lexicon
out of the context of their research and use it for more general purposes,
but it generally works.  Our procedure currently operates under the
constraint that only one (non-probabilistic) phonological representation can
be used for each word; we decided to use a lexicon of most-frequent phones
instead of a lexicon of base forms.  When only one (most-likely) word
representation can be used, a good deal of contextual variation and
probabilistic information must be lost.  Still, when they differ, the
alignments resulting from using this material are almost always more
plausible than the current "standard" alignments, and they seem comparable in
quality to the alignments achieved by the proprietary TI alignment software.
We are developing other, experimental lexicons which may give improved
results, using material kindly sent to us by SRI, TI, and others.
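To make the code-set hierarchy described above concrete, the following C
fragment sketches one possible rendering of the scheme.  The type and field
names here are invented for illustration; the actual representations are
defined by the software and the ASCII code-set tables in /score/src/rdev.

   /* A hierarchy of linguistic code sets: lexical -> phoneme -> feature.
    * Illustrative declarations only. */

   typedef struct code_element {
       char *ascii;        /* ASCII representation, e.g. "ships" or "sh" */
       int  *composition;  /* indices of elements in the next lower code
                            * set; unused for elements of a basic set    */
       int   comp_len;
   } CodeElement;

   typedef struct code_set {
       char            *name;       /* "lexical", "phoneme", "feature"   */
       struct code_set *lower;      /* next lower code set, or NULL if
                                     * this code set is basic            */
       CodeElement     *elements;   /* index 0 is reserved for the null
                                     * (insertion/deletion) case         */
       int              n_elements;
   } CodeSet;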
In the preliminary software release here, the alignment process is phrased as
a calculation of alignment distance; the particular alignment that is found
is returned as a side effect.  In doing an alignment, we start with REF and
HYP strings of words.  These strings are passed to a function
ALDIST(s1,s2,code), whose responsibility it is to calculate and return the
alignment distance between the two strings, using the usual DP algorithm.
Whenever the weight (or distance) between elements i and j of the (word) code
set, W(i,j), is needed, it is not looked up in a table; instead, another
function, WOD(i,j,code), is called.  This function is given the job of
computing and returning the weight or distance between elements i and j.  To
do this, it calls ALDIST, specifying the next lower code set and the two
strings of lower-code (phonemic) elements corresponding to (words) i and j.
The process is repeated at each lower level, until eventually a code set is
reached that is basic and has no composition in a next-lower code.  WOD then
ends the recursion by returning a value based only on a comparison of the
integers i and j (e.g., 0 if i=j, 1 otherwise).  In our case, this happens at
the feature level.

In order to make use of the same logic at the feature level as at the other
levels, we use a feature representation that is an adaptation of the
classical "privative" feature opposition of Trubetzkoy [8].  If a phone has a
certain feature, this feature will appear in the phone's string of lower-code
feature elements; if it does not have that feature, no such symbol will be
there.  The list of features is strictly ordered, so that interchanging
consecutive symbols to find a match is never needed.  When WOD works with a
feature code set, it returns the value 1 if either i=0 (insertion) or j=0
(deletion); otherwise, if i=j (a match), it returns 0, and if i!=j (a
substitution), it returns a very large arbitrary number, in order to
effectively suppress substitution hypotheses.  As a result, the unit of
distance, at every level from phone to utterance, is the number of
phonological features that must be changed to turn one string into the other.
For the source code, the phonological code set tables, and more
documentation, see the directory /score/src/rdev.
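The mutually recursive scheme can be sketched as follows, reusing the
illustrative CodeSet and CodeElement declarations from the fragment above.
This is a simplification: the function bodies and names are ours, and the
sketch omits the recovery of the particular alignment, which the real
program returns as a side effect; the code in /score/src/rdev is the
authoritative version.

   #include <stdlib.h>

   static double wod(int i, int j, const CodeSet *code);

   /* ALDIST: alignment distance between element-index strings s1
    * (length n1) and s2 (length n2) over a code set, computed with the
    * usual DP recurrence. */
   static double aldist(const int *s1, int n1, const int *s2, int n2,
                        const CodeSet *code)
   {
       double **d = malloc((n1 + 1) * sizeof *d);
       double   best, t, result;
       int      i, j;

       for (i = 0; i <= n1; i++)
           d[i] = malloc((n2 + 1) * sizeof **d);

       d[0][0] = 0.0;
       for (i = 1; i <= n1; i++)                        /* deletions  */
           d[i][0] = d[i-1][0] + wod(s1[i-1], 0, code);
       for (j = 1; j <= n2; j++)                        /* insertions */
           d[0][j] = d[0][j-1] + wod(0, s2[j-1], code);

       for (i = 1; i <= n1; i++)
           for (j = 1; j <= n2; j++) {
               best = d[i-1][j-1] + wod(s1[i-1], s2[j-1], code);
               t = d[i-1][j] + wod(s1[i-1], 0, code);
               if (t < best) best = t;
               t = d[i][j-1] + wod(0, s2[j-1], code);
               if (t < best) best = t;
               d[i][j] = best;
           }

       result = d[n1][n2];
       for (i = 0; i <= n1; i++)
           free(d[i]);
       free(d);
       return result;
   }

   /* WOD: weight (or distance) between elements i and j of a code set;
    * index 0 stands for "no element" (a pure insertion or deletion). */
   static double wod(int i, int j, const CodeSet *code)
   {
       const CodeElement *e1, *e2;

       if (code->lower == NULL) {
           /* Basic (feature) code set: end the recursion. */
           if (i == 0 || j == 0)
               return 1.0;                 /* insertion or deletion     */
           return (i == j) ? 0.0 : 1.0e6;  /* match, or a very large
                                            * number to suppress feature
                                            * substitution hypotheses   */
       }

       /* Compositional code set: recur on the compositions of i and j
        * in the next lower code set (an empty string for index 0). */
       e1 = (i != 0) ? &code->elements[i] : NULL;
       e2 = (j != 0) ? &code->elements[j] : NULL;
       return aldist(e1 ? e1->composition : NULL, e1 ? e1->comp_len : 0,
                     e2 ? e2->composition : NULL, e2 ? e2->comp_len : 0,
                     code->lower);
   }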
VIII. COMPATIBILITY WITH EUROPEAN SAM PROJECT STANDARDS

Within the European speech research community, the European multi-lingual
Speech input/output Assessment Methodology and standardization (SAM) Project
has developed conventions for SAM speech databases that differ in many
respects from the conventions used in this series of speech corpora on
CD-ROM.  Software has been developed at ICP in Grenoble, a participant in the
ESPRIT SAM project, to provide a bridge between these differing conventions.
This software deals with differences in file-naming and header conventions,
as well as with what are termed the "associated files".

Significant differences in the file-naming conventions involve the handling
of speaker (name) codes, corpus codes, and file numbering.  In SAM
terminology, the file-naming convention is of the form XXnnxxxx.SAS.  The
approach taken in the present prototype software is to map the speaker's
initials into a unique two-character speaker code (XX).  In the SAM
convention, the two-character "corpus code" (nn) is intended as a code for
the recording session; however, for the DARPA corpora this information is not
readily available.  The reconciliation of this difference in approach (used
in the present conversion software) is that, unless otherwise designated by
the user, the first two letters of the original file names are used for this
purpose (e.g., SA, SB, or ST).  [These letters correspond to different
portions of the text corpus, rather than to recording sessions: "SA"
signifies that the sentence texts are the "dialect" sentences, "SB" signifies
that the sentence texts are the "rapid adaptation" sentences, and "ST"
signifies other sentences from the Resource Management Corpus.]  Another
portion of the SAM file-naming convention (xxxx) contains a unique file
number "attributed by the SAM consortium" (which may differ from site to
site).  In using the conversion software, after the user has defined an
initial file number, this number is incremented for each new file.  An
associated label file will contain the original (DARPA) file name.  Finally,
in the filename extension (e.g., SAS), the "S" signifies that these are
sentence utterances, "A" signifies that the spoken language is American
(i.e., English as spoken in the United States), and the final "S" signifies
that the file contains sampled speech, while a final "O" signifies an
orthographic transcription file.  For each .sph file that is processed using
the "Convert" software, two associated SAM-convention files will be produced:
one with the extension .SAS and an associated file with the extension .SAO.
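As a worked illustration of the resulting names (the values here are
invented; the actual speaker codes and file numbers are assigned when the
conversion software is run):

   #include <stdio.h>

   /* Illustrative construction of a SAM-convention file name from the
    * components described above.  All values are invented for this
    * example. */
   int main(void)
   {
       const char *speaker_code = "AJ";  /* two-character speaker code (XX) */
       const char *corpus_code  = "ST";  /* first two letters of the
                                          * original file name (nn)         */
       int         file_number  = 1;     /* incremented for each new file   */
       char        sam_name[13];

       sprintf(sam_name, "%s%s%04d.SAS",
               speaker_code, corpus_code, file_number);
       printf("%s\n", sam_name);         /* prints "AJST0001.SAS"; the
                                          * associated orthographic file
                                          * would be "AJST0001.SAO"         */
       return 0;
   }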
IX. ACKNOWLEDGEMENTS

At NIST, John Garofolo has been responsible for the compilation and
premastering of the Resource Management Corpus for this series of CD-ROM
discs.  Jonathan Fiscus had responsibility for a thorough revision of the
"standard" scoring software and for the implementation of the statistical
significance tests.  William Fisher developed the phonology-based scoring
software and worked with Jon Fiscus to incorporate it into the revised
"standard" scoring software.  David Pallett coordinated the selection of test
material and the implementation of the DARPA benchmark tests.  John Garofolo,
Bill Fisher, and David Pallett designed the speech file header structure, and
Stan Janet developed the SPHERE software.  Comments are welcome and should be
addressed to: Dr. David S. Pallett, Room A216 Technology Building, National
Institute of Standards and Technology, Gaithersburg, MD 20899.

The cooperation of the following DARPA contractors in implementing the DARPA
Benchmark Tests is gratefully acknowledged: Francis Kubala at BBN, Kai-Fu Lee
(formerly) at CMU, Hy Murveit at SRI, Doug Paul at MIT Lincoln Laboratory,
and Victor Zue at MIT Laboratory for Computer Science.  Jay Wilpon and Larry
Rabiner at AT&T Bell Laboratories also deserve thanks for cooperation in the
use of the Resource Management Corpus for benchmark tests within their
organization.

Kai-Fu Lee (formerly of CMU) is to be particularly thanked for providing
clarification of the details of his use of the Resource Management Corpus and
the 1987 test material in the development of the SPHINX system.  He is also
to be thanked for agreeing to the release of the October 1989 CMU SPHINX
speaker-independent benchmark test results and for providing a concise
description of that SPHINX system.  Special thanks are also due to Francis
Kubala at BBN for agreeing to the release of the October 1989 BBN BYBLOS
speaker-dependent benchmark test results and for providing a description of
that BYBLOS system.

Discussions with Larry Gillick at Dragon Systems have been very helpful in
developing the present implementation of the statistical significance tests.
Gillick has suggested the development of more flexible, interactive (rather
than "batch mode") statistically based diagnostic tools; perhaps the
development of such tools should be the subject of future research.  NIST
staff, however, have been solely responsible for the design and
implementation of the statistical test software contained on this CD-ROM.

Mike Cohen at SRI is thanked for sending us the lexicons and phone feature
sets that were developed in his research there.  TI's Jack Godfrey has been
kind enough to send us an experimental RM phonemic lexicon of theirs.  In
addition, Joe Picone and George Doddington (formerly of TI) earlier lent us
their alignment software, for which we are grateful.

The cooperation of Jean-Marc Dolmazon and Jerome Zeiliger at the Institut de
la Communication Parlee (ICP) in Grenoble, in developing prototype conversion
software between the file format used in this series of discs and the ESPRIT
SAM format, is gratefully acknowledged.  Questions about the implementation
of this prototype software (and the availability of revisions) should be
directed to: Jerome Zeiliger, Institut de la Communication Parlee, I.N.P.G. -
E.N.S.E.R.G., 46 Avenue Felix-Viallet, 38031 Grenoble Cedex, France;
telephone: +33 76574538; FAX: +33 76574710.

X. REFERENCES

[1] Price, P. J., Fisher, W. M., and Bernstein, J., "The DARPA 1000-word
    Resource Management Database for Continuous Speech Recognition", Paper
    S.13.b.21 in Proceedings of ICASSP'88 (New York) (April 1988),
    pp. 651-654.

[2] Gillick, L. and Cox, S. J., "Some Statistical Issues in the Comparison
    of Speech Recognition Algorithms", Paper S10.b.5 in Proceedings of
    ICASSP'89 (Glasgow) (May 1989), pp. 532-535.

[3] Lee, K. F., "Large-Vocabulary Speaker-Independent Continuous Speech
    Recognition: The SPHINX System", Ph.D. Dissertation, Carnegie Mellon
    University Computer Science Department, Report No. CMU-CS-88-148
    (April 1988).

[4] Pallett, D. S., "Benchmark Tests for DARPA Resource Management Database
    Performance Evaluations", Paper S10.b.6 in Proceedings of ICASSP'89
    (Glasgow) (May 1989), pp. 536-539.

[5] Pallett, D. S., Fisher, W. M., and Fiscus, J. G., "Tools for the
    Analysis of Benchmark Speech Recognition Tests", Paper 7.S2.16 in
    Proceedings of ICASSP'90 (Albuquerque) (April 1990).

[6] Garofolo, J. S. and Pallett, D. S., "Use of CD-ROM for Speech Database
    Storage and Exchange", in Proceedings of Eurospeech 89 (European
    Conference on Speech Communication and Technology) (Paris) (September
    1989), Vol. 2, pp. 309-312.

[7] Picone, J., Goudie-Marshall, K. M., Doddington, G. R., and Fisher,
    W. M., "Automatic Text Alignment for Speech System Evaluation", IEEE
    Transactions on Acoustics, Speech and Signal Processing, Vol. ASSP-34,
    No. 4, pp. 1010-1011, August 1986.

[8] Anderson, S. R., Phonology in the Twentieth Century, University of
    Chicago Press, Chicago, 1985, pp. 99-100.

XI. DISCLAIMERS

(1) The scoring software package included on this CD-ROM was developed and
tested using the Berkeley 4.2 and 4.3 UNIX (TM) operating systems.  It has
been successfully implemented at other sites, and at one site modifications
have been successfully made to permit implementation in an MS-DOS
environment.  However, in implementing this software, it may be necessary to
make minor local modifications.
Little effort was expended in optimizing this software for memory allocation
or run time, since it was expected to be executed infrequently.

(2) The implementation of statistical significance tests incorporated in the
scoring software package represents a preliminary effort to introduce these
considerations into performance assessment for speech recognition
technology, and is intended to "encourage researchers who are reporting
empirical results to use statistical measures in summarizing their findings
and drawing conclusions".  Some of the assumptions required for these tests
to be strictly applicable (e.g., independence of errors and the availability
of sufficient errors to justify assumptions about distributions) may not be
satisfied for some of the benchmark test material.

(3) The phonology-based string alignment option incorporated in the scoring
software package represents an alternative approach to the word string
alignment procedure that has been employed to date in the DARPA Benchmark
Tests.  It appears to offer significant advantages over the traditional
approach.  However, it is the subject of ongoing research and has not yet
been adopted for "standard" usage within the DARPA research community.
Comments on this approach are welcome and should be directed to the attention
of Dr. William M. Fisher, Room A216 Technology Building, National Institute
of Standards and Technology, Gaithersburg, MD 20899.

(4) These speech corpora and software tools have been developed for use
within the DARPA speech research community.  Although the corpora and scoring
software have been adopted for widespread use within the DARPA speech
community, they are the subject of ongoing research.  Although care has been
taken to ensure that all of the CD-ROM-based data and software are complete
and error-free, they may not meet all users' requirements.  As such, this
material is made available to the speech research community at large, without
endorsement or express or implied warranties.  The results of tests conducted
with this test material and/or analyses of the performance of speech
recognition systems are not to be construed as official findings of the
National Institute of Standards and Technology, the Department of Defense, or
the United States Government.