The 1997 Hub-4NE Evaluation Plan for Recognition of Broadcast News in Spanish and Mandarin

Last Modification: September 4th, 1997
Version: 2.3

Introduction

This document specifies the 1997 evaluation of speech recognition technology on broadcast news in Spanish and Mandarin. The purpose of this evaluation is to foster research on the problem of accurately transcribing broadcast news speech and to measure objectively the state of the art. The evaluation will deal with the television and radio shows listed below under "Summary of Show Sources".

This program material includes a challenging combination of read speech and spontaneous speech, as well as a combination of recording environments in broadcast studios and in the field. It is expected that this material will provide an impetus to improve core speech recognition capability; to improve the adaptability of recognition systems to new speakers, dialects, and recording environments; and to improve the systems' abilities to cope with the difficult problems of unknown words, spontaneous speech, and unconstrained syntax and semantics.

The style of the 1997 Spanish and Mandarin evaluations will be similar to the "unpartitioned evaluation" component (UE) of the 1996 Hub-4 English evaluation. The evaluation task will consist of the recognition of whole shows, substantial segments of shows, and/or excerpts from long speeches.

Definitions and Terminology

A "show" is a particular television or radio broadcast production, encompassing all of the dates and times of broadcast. Examples in English include "CNN Headline News" and "NPR All things Considered".

An "episode" is an instance of a show on a particular date (and possibly time). Examples in English include "All Things Considered on July 5, 1997" and "CNN Headline News at 1000 EDT on July 5, 1997".

A "story" is a continuous portion of an episode that discusses a single topic or event.

Evaluation Test Data

The evaluation test data will be obtained from the audio component of a variety of television and radio broadcast news sources. These will include some or all of the sources listed below in "Summary of Show Sources".

The evaluation test data will consist of approximately one hour of speech in each language. This data will be taken from shows broadcast later in time than all of the training data (described below).

The actual data to be used for the evaluation will be selected by NIST, in consultation with the LDC. The data will be selected according to the following guidelines:

Training Data

Training data for this evaluation will include approximately 30 hours of data in each language, with 10 hours of each language to be released by the LDC by June 30, 1997, and the remainder to be released by August 31, 1997. All of this data was recorded from shows that are similar in form and content to shows used for the evaluation test data and will be selected subject to the following additional guidelines:

Acoustic Training Data

In addition to the 30 hours of data in each language described above, sites may also make use of other acoustic data that they acquire privately or from outside sources, provided that they also supply as a contrast condition the evaluation results obtained from the same system trained only on the baseline acoustic training data. Such material must have a recording date of no later than June 30, 1997. Privately acquired data that is not otherwise available must be provided to the LDC in a form suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the government elects to have the data published and an implied statement that the data is legally unencumbered. Delivery of the data to LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1998.

Language Model Training Data

The LDC has available a CD-ROM of newswire text in Spanish and another of newswire text in Mandarin. These will be made available to sites upon request to use for language model training for this evaluation. It should be noted, however, that this material may contain digit strings and abbreviations of a type not appropriate for the evaluation language models, and sites will be responsible for performing whatever editing is needed to mitigate their effects.
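For illustration only, one crude way a site might screen such text before language model training is to drop sentences containing digit strings, handling abbreviations according to its own normalization conventions; the sketch below is a hypothetical starting point, not a prescribed procedure.

    # Illustrative pre-filter for newswire text prior to LM training; not prescribed.
    import re

    def keep_for_lm(line):
        # Discard any line containing a digit string; abbreviation handling
        # would be added according to each site's own normalization conventions.
        return re.search(r"\d", line) is None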

In addition to the supplemental texts for language model training provided by LDC, sites may make use of language model data that they acquire privately or from commercial sources. Such data must have been originally produced no later than June 30, 1997. Privately acquired data must be made available to the LDC in a form suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the government elects to have the data published and an implied statement that the data is legally unencumbered. Delivery of the data to LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1998.

Sites may not make use of any material dated after June 30, 1997 for language model training. Nor may they make use of shows (from any date) that are identified by NIST as reserved for testing only.

Development Test Data

A separate development test set is not being collected for this evaluation. However, for the convenience of participants, NIST will specify a development subset from the training data released August 31, 1997. NIST will try to select development data that is broadly representative of the types of shows included in the evaluation test data, but the specific shows included may differ. Note that all development test data will have been recorded earlier in time than all evaluation test data.

Summary of Show Sources

The following lists represent the television and radio programs for which the LDC has negotiated redistribution rights, and which the LDC has recorded for use in Hub-4NE training and test sets.

Spanish:

VOA Programming – four original news programs a day, five days a week

ECO – Mexican news show with two reporters in the studio, broadcast on the Galavision network

Noticiero Univision – half hour weekday news program originating in Miami

Mandarin:

VOA Programming – five main programs plus 5-10 minute news slots

CCTV International – evening news broadcast from Beijing, dominated by anchor reading news

KAZN 1030 AM – all-news, Los Angeles-based Mandarin station

Annotation of Training Data

The LDC, in conjunction with NIST, has developed a transcription and annotation system to aid in the development and evaluation of speech recognition technology. To this end, speech is annotated with time marks and uniquely named speaker identifiers.

More information on annotation of Spanish and Mandarin Hub-4 data may be obtained from the LDC’s ftp site. Annotation will be provided for the training data set, including the part specified as development data, but annotation will not be provided for the evaluation test set until after the recognition results have been submitted. This evaluation set annotation may be somewhat different in form from that of the training data.

Because the transcription accuracy is expected to be sufficiently high relative to the recognition performance, there will not be a formal adjudication process after the evaluation takes place.

Evaluation Conditions

Sites may participate in either the Spanish or the Mandarin evaluation, or in both. Participating sites are required to conduct a single evaluation over all of the evaluation data in the language(s) in which they are participating. Beginning and ending times will be supplied for those portions of each episode to be evaluated. NIST will also supply speech segmentation and classification information for the data within each of these major portions, using an automatic segmentation and classification utility. (NIST will provide this utility to participating sites that request it.) Sites may use the NIST-supplied segmentation information, or they may perform their own segmentation and classification. Sites that perform their own segmentation are urged to provide supplemental evaluation results that contrast the performance of their segmentation with that of the NIST segmentation.

Any recognition approach is allowed, including running a decoder in unsupervised transcription mode. Any audio segment in the evaluation test data may be used to help decode any other segment of audio. (In other words, adaptation techniques may make use of audio across episode boundaries and show boundaries.)

Scoring

Each system will be evaluated by comparing its recognition output with a reference transcription. This reference will be a conventional orthographic transcription produced according to standard transcription practice. Verification by multiple transcribers should ensure that the reference is highly accurate relative to automatic system performance. Thus there will be no formal adjudication process after the evaluation.

Each recognition system will produce an output file of time-ordered records, with one word per record. Detailed format information for the recognition output file will be supplied by NIST.
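For illustration only, a time-marked word record of the kind consumed by SCLITE (the CTM format) carries a waveform identifier, a channel, a begin time, a duration, and the word itself; the show identifier below is hypothetical, and the authoritative field layout is the one in NIST's format specification.

    ;; hypothetical CTM-style records (illustration only)
    eco_970705  1  12.34  0.27  noticias
    eco_970705  1  12.61  0.15  de
    eco_970705  1  12.76  0.33  hoy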

The WER (CER) Metric

Word error rate is defined as the total number of words in error divided by the number of words in the reference transcription. The words in error are of three types: substitution errors, deletion errors, and insertion errors, so that WER = (S + D + I) / N, where S, D, and I are the numbers of substitutions, deletions, and insertions and N is the number of words in the reference. These errors are identified by mapping the words in the reference transcription onto the words in the system output transcription. This mapping is performed using NIST's SCLITE software package.

Scoring will be performed by aligning the system output transcription with the reference transcription and then computing the word error rate. Alignment will be performed independently for each turn, using NIST's SCLITE scoring software. The system output transcription will be processed to match the form of the reference transcription.

Word error rate will be the scoring metric for Spanish. For Mandarin, alignment and scoring will be performed similarly but at the character level, yielding a character error rate (CER).
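The official alignment and scoring will be performed with NIST's SCLITE package. Purely as an illustration of the metric, the sketch below computes an error rate from a uniform-cost Levenshtein alignment; the function name is hypothetical, the edit costs are simplified relative to SCLITE's, and the same routine applied to character sequences yields the character error rate for Mandarin.

    # Minimal sketch of word (or character) error rate; illustration only.
    # The official scoring uses NIST's SCLITE package, not this code.
    def error_rate(reference, hypothesis):
        n, m = len(reference), len(hypothesis)
        # d[i][j] = minimum number of edits aligning reference[:i] with hypothesis[:j]
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(1, n + 1):
            d[i][0] = i                               # i deletions
        for j in range(1, m + 1):
            d[0][j] = j                               # j insertions
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
                d[i][j] = min(d[i - 1][j - 1] + sub,  # match or substitution
                              d[i - 1][j] + 1,        # deletion
                              d[i][j - 1] + 1)        # insertion
        # total errors = substitutions + deletions + insertions
        return d[n][m] / n if n else 0.0

    # WER (Spanish): pass lists of words.
    # CER (Mandarin): pass lists of characters, e.g. list(reference_text).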

Transcription Transformations

The reference transcription will be transformed prior to comparing it with the output from a recognizer. It is important that these transformations be properly taken into account in the design of a recognition system, so that the system will perform well under the scoring measure. The following transformations will be applied to the reference:

Word fragments

Word fragments are represented in the transcription by appending a "-" to the (partial) spelling of the fragmented word. Fragments are included in the total word count and scored as follows:

    1. If the fragment is deleted in the time alignment process, no error is counted
    2. If the fragment matches the recognizer output up to the "-", no error is counted
    3. Otherwise, there is a substitution error
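As an illustration of the fragment rule only (not NIST's scoring code), the decision for an aligned reference fragment can be written as below; the function name and its inputs are hypothetical, with hyp_word set to None when the alignment deletes the fragment.

    # Illustrative sketch of the fragment scoring rule above; not NIST code.
    def score_fragment(ref_fragment, hyp_word):
        prefix = ref_fragment.rstrip("-")        # partial spelling before the "-"
        if hyp_word is None:                     # 1. deleted in alignment: no error
            return "correct"
        if hyp_word.startswith(prefix):          # 2. matches up to the "-": no error
            return "correct"
        return "substitution"                    # 3. otherwise a substitution error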

Unintelligible and Doubtful Words

The reference transcripts may describe some speech as unintelligible (indicated by "(( … ))"), and then may or may not also provide a "best guess" as to what words it consists of. Such "best guess" doubtful words will be included in the total word count, with scoring as follows:

    1. If the alignment produces a deletion, no error is counted
    2. If the alignment produces a matching word, no error is counted
    3. Otherwise, there is a substitution error

Foreign Words

The reference transcripts may mark words as foreign, i.e., words not in the language under test. This marking will not be applied to words of foreign origin that have been widely incorporated into the speech of the given language. Such foreign words will be included in the total word count, with scoring as follows:

    1. If the alignment produces a deletion, no error is counted
    2. If the alignment produces a matching word, no error is counted
    3. Otherwise there is a substitution error

Pause fillers

For scoring purposes, all hesitation sounds, referred to as "non-lexemes", will be considered equivalent and will be scored in the same way as fragments, doubtful words, and foreign words. Although these sounds are transcribed in a variety of ways because of their highly variable phonetic quality, they are all considered functionally equivalent from a linguistic perspective. Thus, all reference transcription words beginning with "%", the hesitation sound flag, along with the conventional set of hesitation sounds, will be mapped to "%hesitation". The system output transcription should use one of the hesitation sounds (without the "%") when a hesitation is hypothesized, or omit it altogether. Again:

    1. If the alignment produces a deletion, no error is counted
    2. If the alignment produces a matching word, no error is counted
    3. Otherwise there is a substitution error

The evaluation data distributed for each language will contain the list of recognized hesitation sounds for the language.
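For illustration only, the mapping described above might be applied as in the sketch below; HESITATIONS is a placeholder for the per-language list distributed with the evaluation data, and the sample entries are hypothetical.

    # Illustrative normalization of hesitation sounds; not NIST code.
    HESITATIONS = {"eh", "mm", "este"}           # placeholder entries only

    def map_hesitations(tokens):
        out = []
        for tok in tokens:
            # Reference words flagged with "%" and the recognized hesitation
            # sounds all collapse to the single class "%hesitation".
            if tok.startswith("%") or tok.lower() in HESITATIONS:
                out.append("%hesitation")
            else:
                out.append(tok)
        return out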

Multiple spellings

Some words appear in the training corpus with multiple spellings, including misspellings. For scoring, however, a single standardized spelling will generally be required, and the recognizer must output this standard spelling in order to be scored as correct. The evaluation data distributed for each language will list the allowed alternate word spellings, if any, for the language.

Homophones

Homophones will not be treated as equivalent. Homophones must be correctly spelled in order to be counted as correct.

Overlapping speech

Periods of overlapping speech will not be scored. Any words hypothesized by the recognizer during these periods will not be counted as errors.

Compound words

For languages where compound words are commonly used, a compound word will be treated as multiple separate words if it also commonly appears in separated form (in the training data or official lexica). Only if a compound word exists solely in compound form will it be treated as a single word.

Contractions

For languages where contractions are commonly used, contractions will be expanded to their underlying forms in the reference transcriptions. Manual auditing will be used to ensure correct expansion. Contractions in the recognizer output will be expanded based on default expansions for standard contractions in the language. Thus the recognizer need not expand contractions, but it may be preferable for it to do so.
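As a sketch of the output-side expansion only, a hypothetical default-expansion table might be applied as below; the Spanish entries shown are the standard contractions "del" (de + el) and "al" (a + el), and the reference-side expansion is, as stated above, verified by manual auditing.

    # Illustrative expansion of contractions in recognizer output; not NIST code.
    DEFAULT_EXPANSIONS = {"del": ["de", "el"],   # hypothetical table; the entries
                          "al":  ["a", "el"]}    # shown are standard Spanish contractions

    def expand_contractions(tokens):
        out = []
        for tok in tokens:
            out.extend(DEFAULT_EXPANSIONS.get(tok.lower(), [tok]))
        return out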

Language specific scoring issues

SPANISH: The word error rate (WER) will be the primary scoring metric. NIST will provide a text-encoding standard for recognizer output (which will follow LDC practices) and a detailed format specification for Spanish.

MANDARIN: Mandarin will be scored at the character level rather than the word level, and the primary scoring metric will thus be a character error rate (CER) rather than a word error rate (WER). NIST will provide a text-encoding standard for recognizer output (which will follow LDC practices) and a detailed format specification for Mandarin.

Dry Run

Participating sites are encouraged to submit the output of their system for a portion of the development test set to NIST prior to the formal evaluation, to verify that system output is correct and is being processed properly.

NIST will make its scoring software available to participants, on request.

NIST Reports

NIST will tabulate and report error rates over the entire dataset for each language. NIST will also tabulate and report error rates for various subsets of test material, in order to examine performance for different conditions. Special attention will be given to the performance on high fidelity speech from native speakers in clean background conditions. This condition is of particular interest because the absence of other complicating factors such as background noise, music, and non-native dialects focuses attention on basic speech recognition issues common to all conditions.

Immediately after the evaluation, NIST will make available the complete annotation record for the evaluation test material, to facilitate the analysis of performance by individual sites.

Multiple Systems Running a Single Test

In order to discourage the running of several systems on a single test to improve one's chances of scoring well, sites must designate one system as the primary system if more than one system is run on a single test. This designation must be made before the test is begun. Results must be reported for all systems run on any test.

System Descriptions

Sites are required to submit a standard system description to NIST along with the results for each system run on any test. The format for these system descriptions is given in documentation that NIST supplies with the test data.

Evaluating sites will be required to provide at the Workshop a written description of the computational resources (including processor speed and storage) used to produce the evaluation results, and to publish information about the complexity of new algorithms.

Site Commitments

Sites interested in participating in the 1997 Hub-4NE evaluation should notify NIST. NIST will ensure that sites considering participation receive the appropriate training and devtest material in a timely fashion, once authorized by the LDC to distribute it.

Site commitments are used to control evaluation and to manage evaluation resources. It is imperative that sites honor their commitments in order for the evaluation to have beneficial impact. Sites must notify NIST as soon as possible, prior to the distribution of the evaluation data, if it appears that a commitment may not be honored. Defaulting on a commitment may jeopardize permission to participate, and to obtain early distributions of future test data, in subsequent evaluations.

Workshop

A workshop will be held in early 1998 for presenting evaluation results and discussing the technology used in the Hub-4 evaluations. Evaluation results will be reported by NIST, and invited and contributed presentations will be made by evaluation participants. Presentations and results at the Workshop will be published in publicly available written Proceedings. N.B. Participants will be required to deliver camera-ready copies of their papers (plus release approvals) at least one week prior to the workshop.

Schedule

October 13 – Deadline for site commitment to participate

November 14 – Deadline for sites to submit Devtest results (optional)

November 17 – NIST distributes the evaluation test data to the sites

December 9 (0700 EST) – Deadline for submission of primary test results

December 12 – NIST releases scores for the primary test results

December 19 (0700 EST) – Deadline for submission of contrast test results

December 23 – NIST releases scores for the contrast test results

January/February 1998 – Hub-4 workshop for participating sites