*************************** A T T E N T I O N ********************************
*   This document provided the output preparation, self-scoring, and         *
*   submission instructions for the November 1996 DARPA/NIST CSR Hub-4       *
*   Evaluation.  You may use this document to implement the same protocols   *
*   used in the 1996 Evaluation, but please do NOT submit your results to    *
*   NIST.                                                                     *
*************************** A T T E N T I O N ********************************


        DARPA CSR 1996 Broadcast News Hub-4 Benchmark Test Evaluation
   System Output Preparation, Scoring Protocols, and Submission Instructions

                               Revised 11/27/96


Contents
--------

   1.0  Scoring the H4 Broadcast News Benchmark Tests
        1.1  Preparation of hypothesized transcripts
        1.2  Preparation of reference transcripts
        1.3  Transcription pre-filtering
             1.3.1  Word fragments
             1.3.2  Compound words
             1.3.3  Multiple representations
             1.3.4  Contractions
             1.3.5  Pause fillers
             1.3.6  Overlapping speech
        1.4  Running the NIST scoring software
   2.0  Scoring software output
   3.0  System descriptions
   4.0  Submission of test results to NIST
        4.1  Due dates
        4.2  Test results format
        4.3  File and directory formats

Sections 1.0 and 2.0 describe the process to be used in preparing Hub-4
system output for scoring and for implementing the NIST scoring software.
Section 3.0 describes the format for submission of system information.
Section 4.0 describes the protocol for submitting your recognition output to
NIST for official scoring.


1.0 Scoring the H4 Broadcast News Benchmark Test
-------------------------------------------------

This section describes the process for preparing system-generated hypothesis
and reference transcriptions for scoring and for implementing the NIST
scoring software on these files.

1.1 Preparation of Hypothesized Transcripts
--------------------------------------------

The hypothesis transcripts are to be formatted in the CTM format.  This
format is a concatenation of time mark records for each word in each channel
of a waveform.  Each record, separated by a newline, must have a waveform id,
channel id (1 for the Hub-4 data), start time, duration, word text, and,
optionally, a confidence score.  The waveform id for each CTM record will be
the file ID given in the PEM or UEM map file record used.

The file must be sorted according to the first three fields in each record:
the first and the second in ASCII order, and the third in numeric order.
This can be accomplished using the UNIX sort command:
"sort +0 -1 +1 -2 +2nb -3".  See the manual page for ctm(5) supplied in the
sclite distribution for a complete description of the file format.

1.2 Preparation of Reference Transcripts
-----------------------------------------

Prior to scoring, the Hub-4 reference transcripts must be converted to the
segment time marked (STM) format used by the sclite scoring software.  The
filter, 'bn_filt.pl' (version 1.6), located under the 'bn_filt.pl' directory
at the top level of this disc, produces a derivation of the original Hub-4
transcript suitable for scoring.

In order for the scoring software to be used, the reference transcripts must
be processed as follows:

   1. Only the excerpts used in the test may be input to the scoring
      software.

   2. Each of the reference excerpts must be run through bn_filt.pl
      individually.

   3. The filtered excerpts must then be concatenated into a single file in
      the same order as in the concatenated hypothesis transcript.

Example execution of bn_filt.pl using the Hub-4 devtest data:

   % bn_filt.pl -s h496_spkrdb_960917 -f stm,uem,pem -b 127 -e 1869 \
        i96071p.txt i96071p

      -s  speaker database   (filename of speaker database)
      -f  stm,uem,pem        (indicates output types (files) to be produced)
      -b  beginning time
      -e  ending time

The command must end with the transcript filename followed by the basename to
be used for the output files.
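Building on the example execution above, the following is a minimal sketch of
steps 2 and 3 for a test consisting of two excerpts.  The second excerpt name
and its begin/end times are hypothetical, and it is assumed here that the STM
output of bn_filt.pl is written to '<basename>.stm':

   # Step 2: filter each reference excerpt individually.
   bn_filt.pl -s h496_spkrdb_960917 -f stm,uem,pem -b 127 -e 1869 \
        i96071p.txt i96071p
   bn_filt.pl -s h496_spkrdb_960917 -f stm,uem,pem -b  96 -e 1750 \
        i96072p.txt i96072p

   # Step 3: concatenate the filtered .stm excerpts in the same order as the
   # concatenated hypothesis transcript.
   cat i96071p.stm i96072p.stm > ref.stm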
1.3 Transcription Pre-filtering
--------------------------------

The reference and hypothesis (system-generated) transcriptions will be
"pre-filtered" prior to scoring to remove certain ambiguities according to a
set of rules located in a pair of rule files.  It is known that variant and
erroneous spellings of words exist in the acoustic training data.  These
variants will be mapped to a single "canonical" representation using this
pre-filter.  The two rule files have been developed to cover the 1996 H4
acoustic training, development test, and evaluation test material.

The rule files and the utility to perform the pre-filtering operation are
located in the "/tranfilt" directory.  "Tranfilt" version 1.4 is required for
this evaluation.  The rule files for the evaluation test data are
"et96_1.glm" and "et96_1.utm".  The file, "et96_1.glm", contains rules for
global substitutions of lexical equivalents, and the file, "et96_1.utm",
contains rules for utterance-specific lexical equivalents.  Note that
"et96_1.utm" is empty, but is still required by the program.

*** NOTE: The word-mapping files will not be released until December 12. ***

The Bourne shell script, "csrfilt.sh", is used to apply the mapping rules in
the above files.  The "/tranfilt" directory contains compilation and
installation instructions for the script.  The script operates as a simple
UNIX filter that reads the reference and hypothesis transcriptions from
"stdin" and writes the filtered transcriptions to "stdout".  The format for
using the utility is as follows:

   csrfilt.sh -dh global-map-file utterance-map-file < filein > fileout

Once an STM file has been filtered using bn_filt.pl, it must be filtered
using csrfilt.sh with the above command.  The hypothesis file, in CTM format,
must be filtered slightly differently, via the additional option "-i ctm", so
that it is parsed correctly.

Note: The flag "-dh" replaces all hyphens with spaces so that hyphenated
words are scored as separate lexemes.  This option must be used on both the
reference and hypothesis transcripts for correct scoring.

Example use of csrfilt.sh:

   (reference transcripts)
   csrfilt.sh -i stm -dh et96_1.glm et96_1.utm < ref.stm > ref.stm-filt

   (hypothesis transcripts)
   csrfilt.sh -i ctm -dh et96_1.glm et96_1.utm < hyp.ctm > hyp.ctm-filt

The .stm-filt and .ctm-filt files will be used as input to the scoring
software.
1.3.1 Word Fragments
--------------------

Word fragments, i.e., partially-pronounced words, will be scored using the
protocol developed for the LVCSR tests.  When fragments occur in a reference
transcript, the following two additional rules are applied during scoring:

   1. fragment deletions are forgiven (i.e., treated as if they never
      occurred)

   2. substitutions where a transcribed fragment is included as a substring
      of the aligned word in the recognized string are counted as correct

All other insertions and substitutions will be scored as errors.

Examples:

   Ref:  the dollar rose shar- today

   Hyp:  the dollar rose today          (deletion ignored - no error)
   Hyp:  the dollar rose sharp today    (superstring substitution scored as
                                         correct)
   Hyp:  the dollar rose shape today    (substitution error)

Note that the above procedures have been used in recent LVCSR tests.

1.3.2 Compound Words
--------------------

Compound words which appear as single and multiple words in the acoustic
training data will be treated as equivalent.  New compound words will be
looked up in the American Heritage Dictionary, third edition, and on the
World Wide Web.  If the compound word exists in these sources only as a
single compound word, it will be scored as one word.  If, however, it is also
listed as two separate words or as a hyphenated word, it will be scored as
separate words.  The equivalences will be handled by csrfilt.sh and the above
word-mapping files.

*** NOTE: The word-mapping files will not be released until December 12. ***

1.3.3 Multiple Representations
------------------------------

Words in the acoustic training data which appear with multiple spellings
(including misspellings) will be treated as equivalent via csrfilt.sh and the
above word-mapping files.  Since the training data is also known to suffer
from certain homophone errors ("its" and "it's"), these will be treated as
equivalent as well.  Note that in the past, we have not allowed homophone
errors when a language model was present, but since this year's language
models could be confused by the acoustic training data, we will score
homophone errors which also occur in the acoustic training data as correct.
However, please note that we do not plan to continue this practice in the
future.

*** NOTE: The word-mapping files will not be released until December 12. ***

1.3.4 Contractions
------------------

Contractions in the recognition output will be expanded to an alternation
containing all possible expansions relative to context via csrfilt.sh and the
global word-mapping file.  E.g.,

   she's  ->  she {has/is}

The transcript will use a new SGML tag to indicate the proper expansion of
each contraction relative to context.  The syntax for the tag is described in
the revised annotation document (Ver. 3.8).  The transformation to the
expanded form will be accomplished by bn_filt.pl.  The alternated/expanded
hypothesis file will then be scored against the expanded reference file.

*** NOTE: The word-mapping files will not be released until December 12. ***
Additional versions of the word-mapping file may be released after the above
date to accommodate later submissions.

1.3.5 Pause Fillers
-------------------

Non-word pause fillers, such as um, uh, hmm, err, etc., will be filtered from
the reference transcripts prior to scoring.  Each site MUST remove such pause
fillers from their system output before submission for scoring.  These pause
fillers are removed from the STM file via bn_filt.pl.
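As a minimal sketch of one way to strip pause fillers from a CTM hypothesis
before submission (the filler inventory and the output file name here are
hypothetical; each site should filter whatever filler tokens its own system
produces):

   # Drop CTM records whose word field (field 5) is a pause filler;
   # all other records, comments, and blank lines pass through unchanged.
   awk 'toupper($5) != "UM" && toupper($5) != "UH" &&
        toupper($5) != "HMM" && toupper($5) != "ERR"' hyp.ctm > hyp-nofill.ctm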
1.3.6 Overlapping Speech
-------------------------

Areas of overlapping speech will not be scored in this evaluation.  The
transcript will contain a new SGML tag to indicate overlapping speech.  The
syntax for the tag is described in the revised annotation document to be
released next week.  Words recognized during the tagged overlap times will
not be scored.  This is accomplished by separating the overlapping text into
a specially tagged STM record indicating that scoring should not be
performed.

1.4 Running the NIST scoring software on Hub-4 data
----------------------------------------------------

In scoring the Hub-4 tests, NIST will use word alignments produced by the
NIST SCLITE version 1.4 scoring package, which is included on this disc.  The
scoring package has been included in the top-level directory, "/sclite1.4",
of this release.  The directory contains a "readme" file with compilation and
installation instructions.

The alignment and scoring process can be performed with a single command.  Be
sure to use the pre-filtered, concatenated reference and hypothesis
transcriptions as described in Sections 1.2 and 1.3 above.  To score a
reference transcription against a corresponding system-generated hypothesis
transcription, use the "sclite" program as follows:

   sclite -F -r ref.stm-filt stm -h hyp.ctm-filt ctm -o all lur

More detailed documentation for using sclite is located in the man page,
"/sclite1.4/doc/sclite.1", or the HTML file, "/sclite1.4/doc/sclite.htm".  On
UNIX, after installation, the man page may be accessed via "man sclite".
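Putting Sections 1.1 through 1.4 together, a complete scoring run can be
sketched as follows.  The commands simply restate those given in the sections
above; 'ref.stm' denotes the concatenated output of bn_filt.pl from Section
1.2, 'hyp.ctm' the system's concatenated hypothesis transcript, and the
intermediate file name 'hyp.ctm.sort' is hypothetical:

   # Sort the hypothesis CTM file (Section 1.1).
   sort +0 -1 +1 -2 +2nb -3 hyp.ctm > hyp.ctm.sort

   # Apply the word-mapping rules to the reference and the hypothesis
   # (Section 1.3).
   csrfilt.sh -i stm -dh et96_1.glm et96_1.utm < ref.stm      > ref.stm-filt
   csrfilt.sh -i ctm -dh et96_1.glm et96_1.utm < hyp.ctm.sort > hyp.ctm-filt

   # Align and score (Section 1.4).
   sclite -F -r ref.stm-filt stm -h hyp.ctm-filt ctm -o all lur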
2.0 Scoring Software Output
----------------------------

The sclite program not only aligns the reference and hypothesis texts, but
also generates scoring reports for each hypothesis input file.  The scoring
report file names are created by appending various extensions to the
hypothesis file name.  The following set of output files is generated via the
"-o all" command line option:

   .sys:  A summary of speaker performance in terms of Percent Correct,
          Substitutions, Deletions, Insertions, Word Errors, and Sentence (or
          Utterance) Errors.  Speaker averages, means, medians, and standard
          deviations are computed for each percentage.

   .raw:  A summary similar to the '.sys' report, except that the output is
          word counts instead of percentages.

   .pra:  A text copy of all the string alignments.

For the Hub-4 evaluation, an additional report will be produced via the
"-o lur" option:

   .lur:  A report containing a scoring summary of the system broken down
          into sub-categories for each speaker.

3.0 System Descriptions
------------------------

As part of the November 1996 CSR tests, each test site is required to
generate a description of the systems used in each Hub-4 test configuration
according to a prescribed format.  The format for the system description is
as follows:

   SITE/SYSTEM NAME
   HUB-4 {CORE/CONTRAST} TEST

   1) PRIMARY TEST SYSTEM DESCRIPTION:

   2) ACOUSTIC TRAINING:

   3) GRAMMAR TRAINING:

   4) RECOGNITION LEXICON DESCRIPTION:

   5) DIFFERENCES FOR EACH CONTRASTIVE TEST:

   6) NEW CONDITIONS FOR THIS EVALUATION:

   7) REFERENCES:

4.0 Submission of Test Results to NIST
---------------------------------------

The following describes the formats and protocols for submitting your results
to NIST for scoring.

4.1 Due Dates
--------------

ALL results for the Hub 4 Core Tests MUST be received at NIST by 0700 (EST)
Thursday, December 12 to be scored as "official".

ALL results for the Hub 4 Contrast Tests MUST be received at NIST by 0700
(EST) Thursday, December 19 to be scored as "official".

RESULTS RECEIVED AFTER THE ABOVE DEADLINES WILL BE SCORED AND INCLUDED IN THE
SUMMARY TABULATION TO BE PREPARED BY NIST FOR THE FEBRUARY WORKSHOP.
HOWEVER, THESE RESULTS WILL BE MARKED WITH THE LABEL, "LATE - (DATE OF
RECEIPT)", AND THEY WILL NOT BE CONSIDERED "OFFICIAL".

Full CSR Test Schedule/Deadlines:

   October 28               Last day to "enter or withdraw"
   November 8               Deadline for optional submission of devtest
                            results
   November 11              Distribution of evaluation test data
   December 12 (0700 EST)   Deadline for submission of core evaluation
                            results
   December 16 (0500 EST)   Post scored core test results
   December 19 (0700 EST)   Deadline for submission of contrast results
   December 23 (0500 EST)   Post scored contrast results
   February 2-5             DARPA Speech Recognition Workshop, Westfields
                            Conference Center, Chantilly, VA

4.2 Test Results Format
------------------------

The steps and format for submitting results will be the same as last year.
The submission process consists of three steps: 1) directory structure
creation, 2) system documentation and inclusion of hypothesis recognition
output, and 3) transmission of the results to NIST.  An example system
description template is given in Section 3.0, and an example of the steps
taken to create the submission directory structure is shown below, following
Step 2.

Step 1: Directory Structure Creation

Create a directory identifying your site ('SITE') from the following list,
which will serve as the root directory for all your submissions:

   att  bbn  bu  cmu  cu-con  cu-htk  dra  dragon  ibm  limsi  lucent  nyu
   ogi  philips  rutgers  sri

You should place all of your recognition test results in this directory.
When scored results are sent back to you and subsequently published, this
directory name will be used to identify your organization.

For each test system, create a sub-directory under your 'SITE' directory
identifying the system's name or key attribute.  The sub-directory name is to
consist of a free-form system identification string 'SYSID' chosen by you.
Place all files pertaining to hub/spoke tests run using a particular system
in the same SYSID directory.

Step 2: System Description and Recognition Hypothesis Output

For each hub or spoke test you run, you will need to create a system
description file, as outlined in Section 3.0, and several system output
files.  The output derived from each primary or contrastive experiment must
be placed in a file by itself.

The following file must be generated for each system used in the tests.  Only
one copy of the file need be generated if the system is used for multiple
tests/conditions:

   sys-desc.txt   (system description file)

Place your system description in the file, 'sys-desc.txt'.

The following file must be generated for each Hub 4 test condition:

   <TEST_SET>.ctm   (system output hypothesis time-marked words)

Create a system output file, '<TEST_SET>.ctm', for each primary or
contrastive test, where <TEST_SET> corresponds to the root portion of the
index file name.  The list of names is included below in Section 4.3.
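As a hedged illustration of Steps 1 and 2, the commands below create a
submission directory for a hypothetical SYSID 'sys1' under the example site
code 'sri', assuming the PEM test set (so that <TEST_SET>.ctm becomes
'et96h4.pem.ctm'):

   # Step 1: create the SITE/SYSID directory structure.
   mkdir -p sri/sys1

   # Step 2: place the system description and hypothesis file in it.
   cp sys-desc.txt    sri/sys1/
   cp et96h4.pem.ctm  sri/sys1/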
Step 3: Test Results Submission Protocol

Once you have structured all of your recognition results according to the
above format, you can then submit them to NIST.  Due to international e-mail
file size restrictions, test sites are permitted to submit results to NIST
using either e-mail or anonymous ftp.  Continental US sites may use either
method, but international sites must use the 'ftp' method.  The following
instructions assume that you are using the UNIX operating system.  If you do
not have access to UNIX utilities or ftp, please contact NIST to make
alternate arrangements.

E-mail method:

First change directory to the directory immediately above the <SITE>
directory.  Next, type the following:

   tar -cvf - ./<SITE> | compress | uuencode <SITE>-<SUB#>.tar.Z | \
   mail -s "Nov96 CSR H4 test results - <SITE>-<SUB#>" \
   jon@jaguar.ncsl.nist.gov

where <SITE> is the name of the directory created in Step 1 to identify your
site and <SUB#> is the submission number (e.g., your first submission would
be numbered '1', your second, '2', etc.).

Ftp method:

First change directory to the directory immediately above the <SITE>
directory.  Next, type the following command:

   tar -cvf - ./<SITE> | compress > <SITE>-<SUB#>.tar.Z

where <SITE> is the name of the directory created in Step 1 to identify your
site and <SUB#> is the submission number (e.g., your first submission would
be numbered '1', your second, '2', etc.).  This command creates a single file
containing all of your results.

Next, ftp to jaguar.ncsl.nist.gov giving the username 'anonymous' and your
e-mail address as the password.  After you are logged in, issue the following
set of commands (the prompt will be 'ftp>'):

   ftp> cd /pub/benchmark/nov96_csr
   ftp> binary
   ftp> put <SITE>-<SUB#>.tar.Z
   ftp> quit

You have now submitted your recognition results to NIST.  The last thing you
need to do is send an e-mail message to Jon Fiscus at
'jon@jaguar.ncsl.nist.gov' notifying NIST of your submission.  Please include
the name of your submission file in the message.

Note: If you choose to submit your results in multiple shipments, please
submit ONLY one set of results for a given test system/condition unless you
have made other arrangements with NIST.  Otherwise, NIST will programmatically
ignore duplicate files.

4.3 File and Directory Formats
-------------------------------

The following is the BNF directory structure format for CSR hypothesis
recognition results:

   <SITE>/<SYSID>/<FILES>

   where,

      SITE  ::= att | bbn | bu | cmu | ...   (use above site codes)

      SYSID ::= (short system description ID, preferably <= 8 characters)

      FILES ::= sys-desc.txt     (system description, including a reference
                                  to a paper if applicable)
              | <TEST_SET>.ctm   (file containing hypothesized words with
                                  time marks for the H4 tests)

      where,

         TEST_SET ::= et96h4.pem | et96h4.uem

The time-marked hypothesis words for the H4 tests will be placed in a single
file, called "<TEST_SET>.ctm".  The CTM file format is a concatenation of
time marks for each word in each broadcast.  Each word token must have a
broadcast id, channel identifier (1 in the case of Hub-4), start time,
duration, and case-insensitive word text.  Optionally, a confidence score can
be appended for each word.  The start time must be in seconds and relative to
the beginning of the waveform file.  The broadcast id's for the Hub-4 corpus
will be the basename of the waveform file.

The file must be sorted by the first three columns: the first and the second
in ASCII order, and the third in numeric order.  The UNIX sort command:
"sort +0 -1 +1 -2 +2nb -3" will sort the words into the appropriate order.
Lines beginning with ';;' are considered comments and are ignored.  Blank
lines are also ignored.  Included below is an example:

   ;;
   ;;  Comments follow ';;'
   ;;
   ;;  Blank lines are ignored
   ;;

   940401 1 11.34 0.2  YES -6.763
   940401 1 12.00 0.34 YOU -12.384530
   940401 1 13.30 0.5  CAN 2.806418
   940401 1 17.50 0.2  AS  0.537922

================================ END OF FILE =================================