DARPA CSR 1997 Broadcast News Hub-4NE (Spanish & Mandarin) Benchmark Test Evaluation

System Output Preparation, Scoring Protocols, and Submission Instructions

Revised 11/12/97

Sections 1.0 and 2.0 describe the process to be used in preparing Hub-4NE system output for scoring and for implementing the NIST scoring software. Section 3.0 describes the format for submission of system information. Section 4.0 describes the protocol for submitting your recognition output to NIST for official scoring.

1.0 Scoring the H4NE Broadcast News Benchmark Test

This section describes the process for preparing system-generated hypothesis and reference transcriptions for scoring and for implementing the NIST scoring software on these files.

1.1 Scoring Protocols for Transcribed Events

The reference transcription will be transformed before it is compared with the output of a recognizer. These transformations should be taken into account in the design of a recognition system so that the system performs well under the scoring measure. The following transformations will be applied to the reference:

Word fragments
Word fragments are represented in the transcription by appending a "-" to the (partial) spelling of the fragmented word. Fragments are included in the total word count and scored as follows:
1. If the fragment is deleted in the time alignment process, no error is counted
2. If the fragment matches the recognizer output up to the "-", no error is counted
3. Otherwise, there is a substitution error
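
The three fragment rules above can be sketched as a small classification function. This is only an illustration, not the NIST implementation: the function name, the return labels, and the reading of "matches up to the '-'" as a case-insensitive prefix test are assumptions.

```python
from typing import Optional


def score_fragment(ref_fragment: str, hyp_word: Optional[str]) -> str:
    """Classify one aligned (reference fragment, hypothesis word) pair.

    ref_fragment is a reference token ending in "-". hyp_word is the aligned
    hypothesis word, or None when the alignment produced a deletion.
    """
    if not ref_fragment.endswith("-"):
        raise ValueError("not a fragment: %r" % ref_fragment)
    if hyp_word is None:
        return "no_error"        # rule 1: the fragment was deleted
    stem = ref_fragment[:-1].lower()
    # Assumption: "matches up to the '-'" means the hypothesis word
    # starts with the partial spelling, ignoring case.
    if hyp_word.lower().startswith(stem):
        return "no_error"        # rule 2: match up to the "-"
    return "substitution"        # rule 3: anything else


print(score_fragment("habl-", None))
print(score_fragment("habl-", "hablo"))
print(score_fragment("habl-", "no"))
```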

Unintelligible and Doubtful Words

The reference transcripts may describe some speech as unintelligible (indicated by "(( ))"), and then may or may not also provide a "best guess" as to what words it consists of. Such "best guess" doubtful words will be included in the total word count, with scoring as follows:

If the alignment produces a deletion, no error is counted
If the alignment produces a matching word, no error is counted
Otherwise, there is a substitution error

Foreign Words

The reference transcripts may describe words as foreign, that is, as words not in the language under test. This description will not be applied to words of foreign origin that have been widely incorporated into the speech of the given language. Such foreign words will be included in the total word count, with scoring as follows:

If the alignment produces a deletion, no error is counted
If the alignment produces a matching word, no error is counted
Otherwise there is a substitution error

Pause fillers

For scoring purposes, all hesitation sounds, referred to as "non-lexemes", will be considered equivalent and will be scored the same way as fragments, doubtful words, and foreign words. Although these sounds are transcribed in a variety of ways due to their highly variable phonetic quality, they are all considered functionally equivalent from a linguistic perspective. Thus, all reference transcription words and hypothesized words in the conventional set of hesitation sounds will be mapped to "%hesitation". When a hesitation is hypothesized, the system output transcription should use either the "%hesitation" token or any of the hesitation sounds, or omit it altogether. Again:

If the alignment produces a deletion, no error is counted
If the alignment produces a matching word, no error is counted
Otherwise there is a substitution error

The set of Spanish hesitation sounds is:

eh, ey, oh, uh, uy, uf, oy, ha, ah, hm, mmh, pss, shh, pff, aaa, eee, iii, mmm, emm, amm, imm
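
The mapping of the Spanish hesitation sounds to "%hesitation" could look roughly like the sketch below. The set is copied from the list above. The function name and the case-insensitive comparison are assumptions made for illustration; the actual mapping is performed by the NIST filtering tools.

```python
# Spanish hesitation sounds, copied from the list above.
SPANISH_HESITATIONS = frozenset(
    "eh ey oh uh uy uf oy ha ah hm mmh pss shh pff "
    "aaa eee iii mmm emm amm imm".split()
)


def map_hesitations(words):
    """Replace every hesitation sound with the single token %hesitation."""
    return [
        "%hesitation" if w.lower() in SPANISH_HESITATIONS else w
        for w in words
    ]


print(map_hesitations(["eh", "no", "mmm", "hablo"]))
```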

The set of Mandarin hesitation sounds is:

Homophones

Homophones will not be treated as equivalent. Homophones must be correctly spelled in order to be counted as correct.

Multiple spellings

Words in the acoustic training data which appear with multiple spellings (including misspellings) will be treated as equivalent via csrfilt.sh and the above word-mapping files. Since the training data is also known to suffer from certain homophone errors ("its" and "it's"), these will be treated as equivalent as well. Note that in the past, we have not allowed homophone errors when a language model was present, but since this year's language models could be confused by the acoustic training data, we will score homophone errors which also occur in the acoustic training data as correct. However, please note that we do not plan to continue this practice in the future.

*** NOTE: Not Applicable to the Spanish & Mandarin Hub-4NE test ***

Overlapping speech

Periods of overlapping speech will not be scored. Any words hypothesized by the recognizer during these periods will not be counted as errors.
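
As a rough illustration of excluding overlapping speech, the sketch below drops hypothesis words that fall inside overlap regions. The text above does not say how a word is assigned to a region; using the word's temporal midpoint is an assumption, as are the function name and the tuple layouts.

```python
from typing import List, Tuple

Word = Tuple[float, float, str]   # (start time, duration, word text)
Region = Tuple[float, float]      # (overlap start, overlap end), in seconds


def drop_overlapped(words: List[Word], overlaps: List[Region]) -> List[Word]:
    """Keep only the words whose midpoint lies outside every overlap region."""
    kept = []
    for start, duration, text in words:
        midpoint = start + duration / 2.0   # assumption: midpoint decides
        if not any(lo <= midpoint < hi for lo, hi in overlaps):
            kept.append((start, duration, text))
    return kept


words = [(11.34, 0.2, "NO"), (12.00, 0.34, "HABLO"), (13.35, 0.5, "ENGLISH")]
print(drop_overlapped(words, [(11.9, 12.5)]))
```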

Compound Words

Compound words which appear as single and multiple words in the acoustic training data will be treated as equivalent. New compound words will be looked up in the American Heritage Dictionary, third edition and on the World Wide Web. If the compound word exists in these sources as only a single compound word, it will only be scored as one word. If, however, it is also listed as two separate words or as a hyphenated word, it will be scored as separate words. The equivalences will be handled by csrfilt.sh and the above word-mapping files.

*** NOTE: Not Applicable to the Spanish & Mandarin Hub-4NE test ***

Contractions

For languages where contracted words are commonly used (e.g., English), contractions will be expanded to their underlying forms in the reference transcriptions. Manual auditing will be used to ensure correct expansion. Contractions in the recognizer output will be expanded based on default expansions for standard contractions in the language. Thus the recognizer need not expand contractions, but it may be preferable for it to do so.

*** NOTE: Not Applicable to the Spanish & Mandarin Hub-4NE test ***

1.2 Preparation of Hypothesized Transcripts

The hypothesis transcripts are to be formatted in the CTM format. This format is a concatenation of time mark records for each word in each channel of a waveform. Each record, separated by a newline, must have a waveform id, channel id (1 for the Hub-4NE data), start time, duration, word text, and optionally, a confidence score. The waveform id for each CTM record will be the file ID given in the test map file(s).

The file must be sorted according to the first three fields in each record: the first and the second in ASCII order, and the third in numeric order. This can be accomplished using the UNIX sort command: "sort +0 -1 +1 -2 +2nb -3".

See the manual page for ctm(5) supplied in the sctk distribution for a complete description of the file format.
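
The record layout and sort order described in this section can be sketched in Python as follows. This is a minimal illustration under the field order given above; the class and function names are assumptions, and ctm(5) remains the authoritative description.

```python
from typing import List, NamedTuple, Optional


class CtmRecord(NamedTuple):
    waveform: str                 # waveform id (file ID from the test map)
    channel: str                  # channel id, "1" for Hub-4NE
    start: float                  # start time
    duration: float
    word: str
    confidence: Optional[float]   # optional sixth field


def parse_ctm_line(line: str) -> CtmRecord:
    """Split one whitespace-separated CTM record into typed fields."""
    fields = line.split()
    if len(fields) not in (5, 6):
        raise ValueError("expected 5 or 6 fields: %r" % line)
    conf = float(fields[5]) if len(fields) == 6 else None
    return CtmRecord(fields[0], fields[1], float(fields[2]),
                     float(fields[3]), fields[4], conf)


def sort_ctm(records: List[CtmRecord]) -> List[CtmRecord]:
    """Sort by waveform id and channel (string order), then start time."""
    # For ASCII ids, Python string comparison matches ASCII order.
    return sorted(records, key=lambda r: (r.waveform, r.channel, r.start))


lines = ["h4ne97sp 1 13.35 0.5 ENGLISH 0.806418", "h4ne97sp 1 11.34 0.2 NO"]
for rec in sort_ctm([parse_ctm_line(l) for l in lines]):
    print(rec)
```

On systems whose sort command no longer accepts the older "+POS" key syntax, the POSIX key form "sort -k1,1 -k2,2 -k3,3n" should give the same ordering.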

1.3 Preparation of Reference Transcripts

Prior to scoring, the Hub-4NE reference transcripts must be converted to the segment time marked (STM) format used by the sclite scoring software. The filter, bn_filt.pl (version 1.12), located under the 'bn_filt' directory in the top level of this disc, produces a derivation of the original Hub-4NE transcript suitable for scoring.

*** NOTE: The bn_filt.pl script MAY NOT be used for this evaluation. Whatever tool is used to prepare the reference transcripts will be released by December 2. ***

In order for the scoring software to be used, the reference transcripts must be processed as follows:

Only the excerpts used in the test may be input to the scoring software
Each of the reference excerpts must be run through bn_filt.pl individually
The filtered excerpts must then be concatenated into a single file in the same order as in the concatenated hypothesis transcript.

Example execution of bn_filt.pl using the Hub-4NE devtest data:

The command must end with the transcript filename followed by the basename to be used for the output files.

1.4 Transcription Pre-filtering

The reference and hypothesis (system-generated) transcriptions will be "pre-filtered" prior to scoring to remove certain ambiguities according to a set of rules. The rule file and utility to perform the pre-filtering operation are located in the "/tranfilt" directory. "Tranfilt" version 1.6 is required for this evaluation.

The rule files for the evaluation test data sets are:

For Spanish: sp971029.glm. The file contains rules for global substitutions of lexical equivalents.
For Mandarin: ma970904.glm. The file contains rules for global substitutions of lexical equivalents.

For this evaluation, a scoring wrapper script, described below, has been supplied which will pre-filter both the reference and hypothesis transcripts before scoring. If you are scoring the default Hub-4NE evaluation, no manual pre-filtering is required.

1.5 Running the NIST scoring software on Hub-4NE data

The top-level directory ./scripts contains the PERL script, hubscr02.pl, used by NIST to score submissions for this benchmark test. It is provided as a template for how one should use the NIST scoring tools when attempting to duplicate the scoring methodology used by NIST to produce the published scores. The script requires 2 NIST software packages:

SCLITE scoring software available in the SCTK V1.1 NIST scoring Toolkit located in the top-level directory ./sctk of this disc. Version 1.0 or greater is required for this evaluation.
Transcription Filtering software Tranfilt V1.6 is located in the ./tranfilt directory of this disc.

Both packages require compilation. See the readme files in their respective directories for installation instructions.

The pre-filtering, alignment and scoring process can be performed with a single execution of hubscr02.pl as follows:

% hubscr02.pl -v -g sp971029.glm -l spanish -h hub4 -r h4ne97sp1.stm hyp1.ctm

2.0 Scoring Software Output

The hubscr02.pl program not only pre-filters the transcripts and aligns the reference and hypothesis texts, but also generates scoring reports for each hypothesis input file. The scoring report file names are created by appending various extensions to the hypothesis file name. (Additional reports may be available by modifying SCLITE's "-o" option in the script.)

<HYP>.sys:	A summary of speaker performance in terms of Percent: Correct, Substitutions, Deletions, Insertions, Word Errors and Sentence (or Utterance) errors. Speaker averages, means, medians and standard deviations are computed for each percentage.
<HYP>.raw:	A summary similar to '<HYP>.sys' except the output is word counts instead of percentages.
<HYP>.pra:	A text copy of all the string alignments.

For the Hub-4NE evaluation, an additional report will be produced via the "-o lur" option.

<HYP>.lur:	A report containing a scoring summary of the system broken down into sub-categories for each speaker.
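
The relationship between the word counts in the .raw report and the percentages in the .sys report can be sketched as below. This uses the standard word error rate definition, (substitutions + deletions + insertions) divided by the number of reference words; it is not taken from the SCLITE source, and the function name, the dictionary keys, and the counts in the example are made up for illustration.

```python
def percentages(n_ref, n_sub, n_del, n_ins):
    """Convert raw word counts into percentages of the reference word count."""
    if n_ref <= 0:
        raise ValueError("reference word count must be positive")
    n_corr = n_ref - n_sub - n_del   # reference words matched correctly

    def pct(count):
        return 100.0 * count / n_ref

    return {
        "Corr": pct(n_corr),
        "Sub": pct(n_sub),
        "Del": pct(n_del),
        "Ins": pct(n_ins),
        "Err": pct(n_sub + n_del + n_ins),   # word error rate
    }


# Illustrative counts only, not evaluation results.
print(percentages(n_ref=200, n_sub=10, n_del=6, n_ins=4))
```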

3.0 System Descriptions

As part of the November 1997 Hub-4NE Tests, each test site is required to generate a description of the systems used in each Hub-4NE test configuration according to a prescribed format. The format for the system description is as follows:

SITE/SYSTEM NAME
HUB-4NE {CORE/CONTRAST} TEST

PRIMARY TEST SYSTEM DESCRIPTION:
ACOUSTIC TRAINING:
GRAMMAR TRAINING:
RECOGNITION LEXICON DESCRIPTION:
DIFFERENCES FOR EACH CONTRASTIVE TEST:
NEW CONDITIONS FOR THIS EVALUATION:
REFERENCES:
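
A blank system description in the format above could be generated with a sketch like the following. The heading strings are copied from the format. The function name, the reading of the first line as "<site>/<system name>", and the example system name "sys1" are assumptions.

```python
# Section headings copied from the prescribed format.
FIELDS = [
    "PRIMARY TEST SYSTEM DESCRIPTION:",
    "ACOUSTIC TRAINING:",
    "GRAMMAR TRAINING:",
    "RECOGNITION LEXICON DESCRIPTION:",
    "DIFFERENCES FOR EACH CONTRASTIVE TEST:",
    "NEW CONDITIONS FOR THIS EVALUATION:",
    "REFERENCES:",
]


def system_description_template(site, system, test_kind):
    """Return an empty system description to be filled in by hand."""
    if test_kind not in ("CORE", "CONTRAST"):
        raise ValueError("test_kind must be CORE or CONTRAST")
    lines = ["%s/%s" % (site, system), "HUB-4NE %s TEST" % test_kind, ""]
    lines.extend(FIELDS)
    return "\n".join(lines) + "\n"


# "sys1" is a hypothetical system name.
print(system_description_template("bbn", "sys1", "CORE"))
```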

4.0 Submission of Test Results to NIST

The following describes the formats and protocols for submitting your results to NIST for scoring.

4.1 Test Results Format

The steps and format for submitting results will be the same as last year.

The submission process consists of 3 steps:

directory structure creation,
system documentation and inclusion of hypothesis recognition output,
transmission protocol to NIST.

Attached is an example system description template and an example of the steps taken to create the submission directory structure.

Step 1: Directory Structure Creation

Create a directory identifying your site ('SITE') from the following list which will serve as the root directory for all your submissions:

att, bbn, cmu ...

You should place all of your recognition test results in this directory. When scored results are sent back to you and subsequently published, this directory name will be used to identify your organization.

For each test system, create a sub-directory under your 'SITE' directory identifying the system's name or key attribute. The sub-directory name is to consist of a free form system identification string 'SYSID' chosen by you. Place all files pertaining to test runs using a particular system in the same SYSID directory.
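
The SITE/SYSID layout described in Step 1 could be created with a short script such as the sketch below. The function name and the SYSID values "sys1" and "sys2" are hypothetical; "bbn" is one of the site names listed above. The demo writes into a temporary directory so that it leaves nothing behind.

```python
import tempfile
from pathlib import Path


def make_submission_dirs(root, site, sysids):
    """Create <root>/<site>/<sysid> for each system and return the paths."""
    created = []
    for sysid in sysids:
        path = Path(root) / site / sysid
        path.mkdir(parents=True, exist_ok=True)
        created.append(path)
    return created


with tempfile.TemporaryDirectory() as root:
    # "sys1" and "sys2" are hypothetical SYSID strings.
    for p in make_submission_dirs(root, "bbn", ["sys1", "sys2"]):
        print(p.relative_to(root))
```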

Step 2: System Description and Recognition Hypothesis Output

For each test you run, you'll need to create a system description file as outlined in Section 3.0, and several system output files. The output derived from each primary or contrastive experiment must be placed in a file by itself.

The following file must be generated for each system used in the tests. Only one copy of the file need be generated if the system is used for multiple tests/conditions:

Place your system description in the file, 'sys-desc.txt'.

The following file must be generated for each Hub-4NE test condition:

Create a system output file, '<TEST_SET>.ctm', for each primary or contrastive test (where <TEST_SET> corresponds to the root portion of the index file name). The list of <TEST_SET> names is included below.

Step 3: Test Results Submission Protocol

Once you have structured all of your recognition results according to the above format, you can then submit them to NIST. Because of limitations of international e-mail file sizes, international test sites must submit results to NIST using anonymous ftp. Continental US sites may use either email or anonymous ftp. The following instructions assume that you are using the UNIX operating system. If you do not have access to UNIX utilities or ftp, please contact Jonathan Fiscus at NIST to make alternate arrangements.

E-mail method:

tar -cvf - ./<SITE> | compress | \
uuencode <SITE>-<SUBM_ID>.tar.Z | \
mail -s "Nov97 Hub-4NE test results <SITE>-<SUBM_ID>" \
jonathan.fiscus@nist.gov

where,

Ftp method:

where,

This command creates a single file containing all of your results. Next, ftp to jaguar.ncsl.nist.gov, giving the username 'anonymous' and your e-mail address as the password. After you are logged in, issue the following set of commands (the prompt will be 'ftp>'):

You've now submitted your recognition results to NIST. The last thing you need to do is send an e-mail message to Jon Fiscus at jonathan.fiscus@nist.gov notifying NIST of your submission. Please include the name of your submission file in the message.

Note: If you choose to submit your results in multiple shipments, please submit ONLY one set of results for a given test system/condition unless you've made other arrangements with NIST. Otherwise, NIST will programmatically ignore duplicate files.

4.2 File and Directory Formats

The following is the BNF directory structure format for HUB-4NE hypothesis recognition results:

SYSID ::= (short system description ID, preferably <= 8 characters)

FILES ::= sys-desc.txt | <TEST_SET>.ctm

TEST_SET ::= h4ne97sp | h4ne97ma

The time-marked hypothesis words for the Hub-4NE test will be placed in a single file called "<TEST_SET>.ctm". The CTM file format is a concatenation of time marks for each word in each broadcast. Each word token must have a broadcast id, channel identifier (1 in the case of Hub-4NE), start time, duration, and case-insensitive word text. Optionally, a confidence score can be appended for each word. The start time must be in seconds, relative to the beginning of the waveform file. The broadcast ids for the Hub-4NE tests are "h4ne97sp" and "h4ne97ma".

The file must be sorted by the first three columns: the first and the second in ASCII order, and the third in numeric order. The UNIX sort command "sort +0 -1 +1 -2 +2nb -3" will sort the words into the appropriate order.

Lines beginning with ';;' are considered comments and are ignored. Blank lines are also ignored.

Included below is an example:

;;
;;  Comments follow ';;'
;;
;;  The Blank lines are ignored

;;
h4ne97sp 1 11.34 0.2  NO 0.763
h4ne97sp 1 12.00 0.34 HABLO 0.384530
h4ne97sp 1 13.35 0.5  ENGLISH 0.806418
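
Before submission, a CTM file like the example above could be checked with a sketch along these lines. It skips ';;' comments and blank lines, verifies the field count, and tests the required sort order. The function names are assumptions, and leading whitespace before ';;' is tolerated here, which may be more lenient than the NIST tools.

```python
EXAMPLE = """\
;;
;;  Comments follow ';;'

h4ne97sp 1 11.34 0.2  NO 0.763
h4ne97sp 1 12.00 0.34 HABLO 0.384530
h4ne97sp 1 13.35 0.5  ENGLISH 0.806418
"""


def read_ctm(text):
    """Return (broadcast, channel, start, duration, word, confidence) tuples."""
    records = []
    for lineno, line in enumerate(text.splitlines(), 1):
        stripped = line.strip()
        if not stripped or stripped.startswith(";;"):
            continue   # blank lines and comments are ignored
        fields = stripped.split()
        if len(fields) not in (5, 6):
            raise ValueError("line %d: expected 5 or 6 fields" % lineno)
        conf = float(fields[5]) if len(fields) == 6 else None
        records.append((fields[0], fields[1], float(fields[2]),
                        float(fields[3]), fields[4], conf))
    return records


def is_sorted(records):
    """True if records are ordered by broadcast id, channel, start time."""
    keys = [(r[0], r[1], r[2]) for r in records]
    return keys == sorted(keys)


records = read_ctm(EXAMPLE)
print(len(records), is_sorted(records))
```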
