The 1997 Hub-4E Evaluation Plan
for Recognition of Broadcast News, in English

Introduction

This document specifies the 1997 evaluation of speech recognition technology on broadcast news in English. The purpose of this evaluation is to foster research on the problem of accurately transcribing broadcast news speech and to measure objectively the state of the art. The evaluation will deal with the following types of television and radio shows:

This program material includes a challenging combination of read speech and spontaneous speech, as well as a combination of recording environments in broadcast studios and in the field. It is expected that this material will provide an impetus to improve core speech recognition capability; to improve the adaptability of recognition systems to new speakers, dialects, and recording environments; and to improve the systems' abilities to cope with the difficult problems of unknown words, spontaneous speech, and unconstrained syntax and semantics.

The style of the 1997 evaluation will be similar to the "unpartitioned evaluation" component (UE) of the 1996 Hub 4 evaluation. The evaluation task will consist of the recognition of whole shows, substantial segments of shows, and/or excerpts from long speeches.

Definitions and Terminology

A "show" is a particular television or radio broadcast production, encompassing all of the dates and times of broadcast. Examples include "CNN Headline News" and "NPR All things Considered".

An "episode" is an instance of a show on a particular date (and possibly time), such as "All Things Considered on July 5, 1997" or "CNN Headline News at 1000 EDT on July 5, 1997".

A "story" is a continuous portion of an episode that discusses a single topic or event.

Evaluation Test Data

The evaluation test data will be obtained from the audio component of a variety of television and radio news broadcasts. Sources include television news programs from CNN, ABC, and C-SPAN, as well as radio news broadcasts from NPR and PRI.

The evaluation test data will consist of approximately three hours of speech. These data will be taken primarily from shows broadcast between October 15 and November 14, 1996.

The actual data to be used for the evaluation will be selected by NIST, in consultation with the LDC. The data will be selected according to the following guidelines:

Training Data

Training data for this evaluation include the approximately 50 hours of standard acoustical training data released by the LDC to sites participating in the 1996 Hub 4 evaluation, along with the additional 50 hours of similar material released by LDC in early 1997. These data were recorded from shows that are similar in form and content to the shows used for the evaluation test data, and were selected subject to the following additional guidelines:

Acoustic Training Data

Baseline acoustic training data consist of the approximately 100 hours of annotated training data and developmental test data provided by NIST and the LDC for this and previous Hub 4 evaluations, plus any acoustical training data, development test data, and evaluation test data developed for previous speech recognition evaluations administered by NIST (including the various Resource Management, ATIS, Wall Street Journal, North American Business News, Marketplace, Switchboard, Macrophone, and Call Home corpora).

Sites may also make use of other acoustic data that they acquire privately or from outside sources, including additional untranscribed audio training data distributed by LDC for this evaluation, provided that they also supply, as a contrast condition, the evaluation results obtained from the same system trained only on the baseline acoustic training data. Privately acquired data that is not otherwise available must be provided to the LDC in a form suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the government elects to have the data published, and constitutes an implicit assertion that the data is legally unencumbered. Delivery of the data to LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1998.

Sites may not make use of any material dated after June 30, 1996 for acoustical training. They also may not make use of shows (from any date) that are identified below as reserved for testing only.

Language Model Training Data

Additional training data for statistical language models are available from the large corpus of commercial text transcripts of broadcast news shows that were prepared in 1996 by Alex Rudnicky and the LDC. Text conditioning tools for these texts are available from NIST.
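The NIST tools are not reproduced here; the short Python sketch below only illustrates the kind of conditioning such transcripts typically need before n-gram language model training (case folding, punctuation removal, spelling out of isolated digits). The specific rules are assumptions made for this illustration and do not describe the behavior of the NIST tools.

    # Illustrative transcript conditioning for language model training.
    # The normalization rules below are assumed for illustration only.
    import re

    DIGIT_WORDS = {"0": "ZERO", "1": "ONE", "2": "TWO", "3": "THREE",
                   "4": "FOUR", "5": "FIVE", "6": "SIX", "7": "SEVEN",
                   "8": "EIGHT", "9": "NINE"}

    def condition_line(line):
        line = line.upper()
        line = re.sub(r"[^A-Z0-9' ]+", " ", line)   # drop punctuation
        tokens = [DIGIT_WORDS.get(tok, tok) for tok in line.split()]
        return " ".join(tokens)

    print(condition_line("The Dow rose 9 points today, analysts said."))
    # THE DOW ROSE NINE POINTS TODAY ANALYSTS SAID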

In addition to the supplemental texts for language model training provided by LDC and NIST, sites may make use of language model data that they acquire privately or from commercial sources. Privately acquired data must be made available to the LDC in a form suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the government elects to have the data published, and constitutes an implicit assertion that the data is legally unencumbered. Delivery of the data to LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1998.

Sites may not make use of any material dated after June 30, 1996 for language model training. Nor may they make use of shows (from any date) that are identified by NIST as reserved for testing only.

Development Test Data

The development test sets for this evaluation are the development and evaluation test sets from the 1996 Hub 4 evaluation, comprising approximately three hours and two hours of speech, respectively. These data are similar in form and content to the 1997 evaluation test data, subject to the following additional guidelines:

Summary of Show Sources

The following lists represent the television and radio programs for which the LDC has negotiated redistribution rights, and which the LDC has recorded for use in Hub 4 training and test sets.

Shows used only for training:

ABC Nightline

ABC World Nightly News

ABC World News Tonight

CNN Early Edition

CNN Early Primetime News

CNN Headline News

CNN Primetime News

CNN The World Today

NPR All Things Considered

Shows used for both training and testing:

C-SPAN Washington Journal

PRI Marketplace

Shows used only for testing: (N.B. Sites may not use these shows to develop their systems. This includes a prohibition against using data from these shows for acoustic model training or language model training.)

ABC Prime Time

CNN Morning News

CNN World View

NPR Morning Edition

NPR The World

Annotation of Data

The LDC, in conjunction with NIST, has developed a transcription and annotation system to aid in the development and evaluation of speech recognition technology. To this end, speech is annotated with time marks and classifications that include the following factors:

More information on annotation may be obtained from the Hub-4 annotation specification, which is available at ftp://jaguar.ncsl.nist.gov/csr96/h4/h4annot.ps. Annotation will be provided for the training data set and development test set, but annotation will not be provided for the evaluation test set until after the recognition results have been submitted.

NIST will provide highly accurate reference transcriptions and annotations for the evaluation test set. This will be achieved by producing three independent versions and then reconciling the differences. This will preclude the need for a formal adjudication process after the evaluation takes place.

Evaluation Conditions

Participating sites are required to conduct a single evaluation over all of the evaluation data. Beginning/ending times will be supplied for those portions of each episode to be evaluated. NIST will also supply speech segmentation and classification information for data within each of these major portions, using an automatic segmentation and classification utility. (NIST will provide this utility to participating sites that request it.) Sites may use the NIST-supplied segmentation information, or they may perform their own segmentation and classification. Sites that perform their own segmentation are urged to provide supplemental evaluation results that contrast the performance of their segmentation with that of the NIST-supplied segmentation.
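As a point of reference, the Python sketch below shows one way a site might consume segment-level timing and classification information. The record layout and comment convention are assumptions made for this illustration; the actual format of the NIST-supplied segmentation information is defined in the documentation distributed with the test data.

    # Minimal sketch: read a hypothetical whitespace-delimited segment listing
    # and pass each region to a decoder. The record layout (episode id,
    # segment label, begin time, end time in seconds) and the ";;" comment
    # convention are assumptions for illustration only.
    from collections import namedtuple

    Segment = namedtuple("Segment", "episode label begin end")

    def load_segments(path):
        segments = []
        with open(path) as f:
            for line in f:
                line = line.strip()
                if not line or line.startswith(";;"):
                    continue
                episode, label, begin, end = line.split()[:4]
                segments.append(Segment(episode, label, float(begin), float(end)))
        return segments

    # Hypothetical use: decode each segment independently, keeping the label
    # so that results can later be broken out by condition.
    # for seg in load_segments("h4e97.seg"):
    #     hyp = decoder.recognize(seg.episode, seg.begin, seg.end)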

Any recognition approach is allowed, including running a decoder in unsupervised transcription mode. Any audio segment in the evaluation test data may be used to help decode any other segment of audio. (In other words, adaptation techniques may make use of audio across episode boundaries and show boundaries.)

Scoring

Sites will generate decodings that include word time alignments. Scoring will use the same algorithm as the 1996 Hub 4 evaluation, and word error will be the primary metric.
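For concreteness, the Python sketch below computes word error in the standard way: a minimum-edit-distance alignment of hypothesis words against reference words, with word error equal to (substitutions + deletions + insertions) divided by the number of reference words. It illustrates the metric only and is not the NIST scoring software, which also makes use of the word time alignments and the annotation described above.

    # Illustrative word error computation: dynamic-programming alignment of
    # hypothesis words against reference words, unit cost per edit operation.
    def word_error_rate(reference, hypothesis):
        ref, hyp = reference.split(), hypothesis.split()
        # dist[i][j] = edit distance between the first i reference words
        # and the first j hypothesis words.
        dist = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dist[i][0] = i                                  # i deletions
        for j in range(len(hyp) + 1):
            dist[0][j] = j                                  # j insertions
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                sub = 0 if ref[i - 1] == hyp[j - 1] else 1
                dist[i][j] = min(dist[i - 1][j - 1] + sub,  # match / substitution
                                 dist[i - 1][j] + 1,        # deletion
                                 dist[i][j - 1] + 1)        # insertion
        return dist[len(ref)][len(hyp)] / len(ref)

    # One substitution plus one insertion against a four-word reference:
    # (S + D + I) / N = 2 / 4 = 0.5
    print(word_error_rate("the market closed higher",
                          "the market close higher today"))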

NIST will tabulate and report word error rates over the entire dataset, as well as for various subsets of the test material, to examine performance under different conditions. Special attention will be given to performance on high-fidelity speech from native speakers in clean background conditions. This condition is of particular interest because the absence of complicating factors such as background noise, music, and non-native dialects focuses attention on basic speech recognition issues common to all conditions.

Immediately after the evaluation, NIST will provide the complete annotation record for the evaluation test material, to facilitate the analysis of performance by individual sites.

Evaluating sites are encouraged to submit the output of their system for a portion of the development test to NIST prior to the formal evaluation, to verify that the system output is processed properly by the NIST scoring software.

NIST will make its scoring software available in a timely manner to participants on request.

Multiple Systems Running a Single Test

To discourage running several systems on a single test in order to improve one's chances of scoring well, sites that run more than one system on a test must designate one of those systems as the primary system. This designation must be made before the test is begun. Results must be reported for all systems run on any test.

System Descriptions

Sites are required to submit a standard system description to NIST along with the results for each system run on any test. The format for these system descriptions is given in documentation that NIST supplies with the test data.

Evaluating sites will be required to provide, at the workshop, a written description of the computational resources (including processor speed and storage) used to produce the evaluation results, and to publish information about the complexity of new algorithms.

Site Commitments

Sites interested in participating in the 1997 Hub 4 evaluation should notify NIST. NIST will ensure that sites considering participation receive the appropriate training and devtest material in a timely fashion, once authorized to do so by the LDC.

Site commitments are used to control the evaluation and to manage evaluation resources. It is imperative that sites honor their commitments in order for the evaluation to have beneficial impact. Sites must notify NIST as soon as possible, and prior to the distribution of the evaluation data, if it appears that a commitment may not be honored. Defaulting on a commitment may jeopardize permission to participate in subsequent evaluations and to obtain early distributions of future test data.

Workshop

A workshop will be held in early 1998 for presenting evaluation results and discussing the technology used in the Hub 4 evaluation. Evaluation results will be reported by NIST, and invited and contributed presentations will be made by evaluation participants. Presentations and results from the workshop will be published in a publicly available proceedings. N.B. Participants will be required to deliver camera-ready copies of their papers (plus release approvals) at least one week prior to the workshop.

Schedule

October 13 - Deadline for site commitment to participate

October 24 - Deadline for sites to submit Devtest results (optional)

October 27 - NIST distributes the evaluation test data to the sites

November 25 (0700 EST) - Deadline for submission of primary test results

December 2 - NIST releases scores for the primary test results

December 4 (0700 EST) - Deadline for submission of contrast test results

December 9 - NIST releases scores for the contrast test results

January/February 1998 - Hub 4 workshop for participating sites