SPECIFICATION FOR THE ARPA NOVEMBER 1996 HUB 4 EVALUATION

Approved by the SRCC July 26, 1996 and October 2, 1996; this revision November 1, 1996.

(Developed by the Hub 4 working group, Richard Stern, Chair, with major input from George Doddington, Dave Pallett, and Charles Wayne.)

INTRODUCTION

This document specifies the Hub 4 test set for the November 1996 ARPA CSR Evaluation. The purpose of the 1996 Hub 4 evaluation is to improve the basic performance of speaker-independent, unlimited-vocabulary recognition systems. By drawing on broadcast news sources, the evaluation includes a challenging combination of read and spontaneous speech, recorded both in broadcast studios and in the field. This combination is intended to provide impetus to improve core speech recognition capability, to improve the adaptability of recognition systems to new speakers, dialects, and recording environments, and to improve the systems' ability to cope with the difficult problems of unknown words, spontaneous speech, and unconstrained syntax and semantics.

The 1996 evaluation will consist of two components, referred to as the "unpartitioned evaluation" (UE) and the "partitioned evaluation" (PE). Sites are required to evaluate on the PE; the UE is optional. The UE will be similar to the 1995 Hub 4 evaluation in that it contains relatively complete portions of television and radio news broadcasts, but it will use a wider variety of source material than was employed in the 1995 evaluation. The PE will contain all of the same material as the UE, manually segmented into homogeneous regions, plus some additional material. The PE provides a set of controlled contrastive conditions, referred to as "evaluation focus conditions", that are intended to supersede the hubs and spokes of previous years' CSR evaluations, which were based on read speech from the Wall Street Journal and other North American business news sources.

DEFINITIONS AND TERMINOLOGY

For the purposes of this document, a "show" refers to a particular television or radio broadcast production, encompassing all of its dates and times of broadcast. Examples include "CNN Headline News" and "NPR All Things Considered". An "episode" refers to an instance of a show on a particular date (and possibly time), such as "All Things Considered on July 5, 1996" or "CNN Headline News at 1000 EDT on July 5, 1996". A "portion" is a subset of an episode; it generally contains multiple speech styles or recording environments. A "story" is a contiguous subset of a portion that describes or discusses a single topic. A "segment" refers to a contiguous section of audio throughout which the focus conditions (defined below under "Focus Conditions for the Partitioned Evaluation") remain unchanged.
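These units nest: a show contains episodes, an episode contains portions, a portion contains stories, and a story is composed of segments. Purely as an illustration of that containment (this sketch, and its class and field names, are the author's and are not part of this specification), in Python:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Segment:
        start_sec: float          # segment boundaries within the audio
        end_sec: float
        focus_condition: str      # "F0" .. "F5", or "FX" (see Evaluation Conditions)

    @dataclass
    class Story:                  # contiguous audio discussing a single topic
        topic: str
        segments: List[Segment] = field(default_factory=list)

    @dataclass
    class Portion:                # subset of an episode; may mix styles and environments
        stories: List[Story] = field(default_factory=list)

    @dataclass
    class Episode:                # e.g., "All Things Considered on July 5, 1996"
        show: str                 # e.g., "NPR All Things Considered"
        broadcast_date: str
        portions: List[Portion] = field(default_factory=list)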
EVALUATION TEST DATA

The evaluation test data will be obtained from the audio component of a variety of television and radio news sources. Broadcast sources for which the Linguistic Data Consortium (LDC) has obtained legal clearances for use in this evaluation include television news programs from CNN, ABC, and C-SPAN, as well as news radio broadcasts from NPR.

* Evaluation Test Data Goals

The evaluation test data will consist of approximately 2.5 hours of speech. The actual data to be used for the UE and PE components of the evaluation will be selected by NIST, in consultation with members of the LDC. The data will be selected according to the following guidelines, which should be followed to the extent that is practical:

* The evaluation test content should be 60% from television shows and 40% from radio shows.

* Half of the test material will be taken from shows that are considered to be "anchored news broadcasts" and half from shows that are considered to be "news magazines", as identified by the working group on program content chaired by Long Nguyen.

* Labelled focus conditions for the PE are specified below. Beginnings and endings of stories are also marked for the PE.

* The PE will use the same acoustic data as the UE, plus some additional data.

* Neither the UE nor the PE should include commercials or sports results, because of significant differences in syntax and semantics. (These portions of programs will be distributed as part of the training data for possible use in future evaluations, but they will not be included in this year's evaluation test or development test data.) The UE, however, may include segments consisting only of background music.

* The data will consist of a single monophonic channel of audio, even if the original program material is distributed in stereo.

* Focus Conditions for the Partitioned Evaluation

The partitioned evaluation will include speech that is segmented and labelled according to recording environment and speech style in a fashion that supports the focus conditions below. The PE will include segments in each of the following focus conditions, plus additional segments. (An illustrative sketch of how the A-weighted SNR figures cited below might be estimated appears after this list.)

F0: BASELINE BROADCAST SPEECH
This condition describes speech that is directed to the general broadcast audience, and that is recorded in a quiet studio environment presumed to have a signal-to-noise ratio (SNR) of greater than 20 dB, A-weighted. This speech is assumed to be mostly read from prepared text. This is the default baseline condition for the PE.

F1: SPONTANEOUS BROADCAST SPEECH
This condition describes speech that is directed to one or more human conversational partners, either in the studio or at a remote site, and that is recorded in a quiet studio environment presumed to have an SNR of greater than 20 dB, A-weighted. This speech is assumed to be spontaneous.

F2: SPEECH OVER TELEPHONE CHANNELS
This condition describes speech that is collected over reduced-bandwidth channels, such as local or long-distance telephony, cellular telephony, or similar media, using either a conventional handset or another input device such as a speakerphone.

F3: SPEECH IN THE PRESENCE OF BACKGROUND MUSIC
This condition describes speech that satisfies the attributes of Baseline Broadcast Speech or Spontaneous Broadcast Speech, except that it is broadcast with additive background music. The signal-to-music power ratio is such that the speech is intelligible to the normal listener, presumably in a range of about 10 to 20 dB, A-weighted.

F4: SPEECH UNDER DEGRADED ACOUSTICAL CONDITIONS
This condition describes speech that is acoustically degraded for reasons other than the use of telephone-bandwidth channels or the presence of background music. Sources of degradation could include additive noise, environmental noise, or nonlinear distortions. The SNR is presumed to be about 10 to 20 dB, A-weighted.

F5: SPEECH FROM NON-NATIVE SPEAKERS
This condition describes speech that satisfies the attributes of Baseline Broadcast Speech, except that it is spoken by non-native speakers of American English. The speech is assumed to be produced by fluent speakers of English with a foreign accent, and to be sufficiently intelligible that it is intended to be understood by the broadcast audience. British speakers are considered to be non-native speakers of American English.
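The SNR figures used in these definitions are A-weighted. As a rough, purely illustrative sketch of how such a figure might be estimated from the waveform, the Python/numpy fragment below applies the standard analog A-weighting curve in the frequency domain and compares a speech-plus-noise region against a noise-only region. This is a back-of-the-envelope estimate under the author's assumptions (the function names and the simple power-subtraction approach are not part of this specification):

    import numpy as np

    def a_weight(freqs_hz):
        # Linear A-weighting gains (standard analog A-curve); gain is 1 at 1 kHz.
        f2 = np.asarray(freqs_hz, dtype=float) ** 2
        ra = (12194.0**2 * f2**2) / (
            (f2 + 20.6**2)
            * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
            * (f2 + 12194.0**2))
        return ra * 10.0 ** (2.0 / 20.0)  # +2.00 dB normalization at 1 kHz

    def a_weighted_power(x, sample_rate):
        # Approximate mean A-weighted power per sample. Constant Parseval
        # factors are ignored: they (approximately) cancel in the SNR ratio.
        spectrum = np.fft.rfft(x)
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
        return float(np.sum(np.abs(spectrum * a_weight(freqs)) ** 2)) / len(x) ** 2

    def a_weighted_snr_db(speech_region, noise_region, sample_rate):
        # Rough SNR estimate from a speech-plus-noise region and a noise-only
        # region, assuming the background noise is stationary across both.
        ps = a_weighted_power(speech_region, sample_rate)
        pn = a_weighted_power(noise_region, sample_rate)
        return 10.0 * np.log10(max(ps - pn, 1e-12) / max(pn, 1e-12))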
* Specification of the Evaluation Test Set

The UE will consist of approximately 4 episodes, each from a different show and each approximately 20 minutes in duration, to be selected by NIST according to the Evaluation Test Data Goals described above. The core speech material for the PE will consist of the audio that is used for the UE. The core PE data will be supplemented by additional speech samples selected by NIST, to provide enough speech data that speech recognition results in each of the focus conditions can reasonably be expected to be statistically significant. (Because recognition error rates will vary across the focus conditions, it is not necessary that equal durations of speech data be provided for each focus condition.) To the extent possible, these portions will be selected on a story-by-story basis, to provide support for experimentation in adaptive language modeling.

TRAINING DATA

The LDC, in coordination with NIST, will release to participating sites approximately 50 hours of standard acoustical training data. These data will be recorded from shows that are similar in form and content to the shows used for the evaluation test data, subject to the following additional guidelines:

* There should be partial overlap between the shows used in the training set and those used in the evaluation test set, but (to the extent possible) the training and evaluation test sets will make use of material from different announcers. While there will inevitably be some overlap in speakers, particularly in the case of newsworthy individuals, the goal is to minimize the utility of building speaker-specific models for the speakers in the training and development test sets.

* Episodes used in the training set will not overlap in time with episodes used in the evaluation test set.

DEVELOPMENT TEST DATA

NIST, in coordination with the LDC, will release to participating sites approximately 3 hours of development test data. These data will be similar in form and content to the evaluation test data, subject to the following additional guidelines:

* As with the training data, there should be partial overlap between the shows used in the development test set and those used in the evaluation test set, but (to the extent possible) the development and evaluation test sets will make use of material from different announcers. Again, the goal is to minimize the utility of building speaker-specific models for the speakers in the training and development test sets.

* Episodes used in the development test set will not overlap in time with episodes used in the evaluation test set.

ANNOTATION OF DATA

The LDC, in conjunction with NIST, will be responsible for developing a transcription and annotation system for the data. The annotation scheme must be sufficiently detailed and flexible to identify sections of speech that are usable for each of the focus conditions of the PE. A description of the annotation specification will be distributed to the evaluating sites. The annotation of the reference transcriptions to be used in scoring the evaluation data will be performed extremely carefully, using multiple transcribers, in order to preclude the need for a formal adjudication process.
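Because multiple transcribers are used in place of a formal adjudication step, disagreements between independent transcriptions of the same audio must be located and resolved before a reference is finalized. A minimal sketch of how such disagreements might be flagged, using Python's difflib (the function and its case normalization are hypothetical, not the LDC's procedure):

    import difflib

    def transcript_disagreements(ref_a, ref_b):
        # Word-level differences between two independent transcriptions
        # of the same segment, for review before the reference is finalized.
        a, b = ref_a.lower().split(), ref_b.lower().split()
        matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
        diffs = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                diffs.append((op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
        return diffs

    # Example: one transcriber wrote "nineteen ninety six", the other "1996";
    # a text-normalization issue to resolve before scoring.
    print(transcript_disagreements(
        "the nineteen ninety six evaluation",
        "the 1996 evaluation"))
    # [('replace', 'nineteen ninety six', '1996')]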
SUPPLEMENTAL TEXTS FOR LANGUAGE MODEL TRAINING

A separate working group chaired by Alex Rudnicky, working together with the LDC, will provide a large corpus of commercial text transcripts of broadcast news shows similar to those to be used in the evaluation. This text is provided for the development of statistical language models, and the necessary text conditioning tools will be provided as well.

EVALUATION CONDITIONS

Sites are required to participate in the PE; participation in the UE is optional. For both the UE and the PE, any recognition approach is allowed, including running a decoder in unsupervised transcription mode. Any audio segment in the evaluation test data may be used by adaptation modules when decoding any other segment of audio. (In other words, adaptation modules may make use of audio across episode boundaries and show boundaries.)

The only side information available for the UE is the locations of endpoints of temporally contiguous portions of audio, plus the beginnings and endings of commercials and sports results. Side information provided in the PE is limited to the above plus segment boundaries, story boundaries, and labels according to the named focus conditions F0 through F5. Side information for PE segments that do not fall within the specific definition of any of the named focus conditions (currently labelled FX) will also include the status of all annotation condition labels used to characterize focus conditions F0 through F5.

SCORING

Sites will generate decodings that include word time alignments, so that an updated version of the scoring algorithm used for the 1995 Hub 4 Dry Run can be used for this evaluation. Word error will be the primary metric. Evaluating sites are encouraged to submit the output of their system for a portion of the development test data to NIST by November 1, to verify that the system output is processed properly by the NIST scoring software. NIST will make its scoring software available in a timely manner to participants on request.
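Word error counts the substitutions, insertions, and deletions in a minimum-cost alignment of the system hypothesis against the reference transcription, divided by the number of reference words. The NIST scoring software (which, per the above, also exploits word time alignments) is the authority for this evaluation; the sketch below is only a minimal dynamic-programming illustration of the metric itself, with uniform edit costs assumed:

    def word_error_rate(ref_words, hyp_words):
        # Levenshtein word alignment: (substitutions + insertions + deletions)
        # divided by the number of reference words.
        n, m = len(ref_words), len(hyp_words)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[n][m] / max(n, 1)

    ref = "speech recognition systems for broadcast news".split()
    hyp = "speech recognition system for the broadcast news".split()
    print(word_error_rate(ref, hyp))  # 2 errors / 6 reference words = 0.333...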
SITE COMMITMENTS

October 28, 1996, will be the last day to enter or withdraw from the evaluation. Site commitments are used to control the evaluation and to manage evaluation resources. It is imperative that sites honor their commitments in order for the evaluation to have beneficial impact. Sites must notify NIST (Attn: Dave Pallett) as soon as possible, and prior to the distribution of the evaluation data, if it appears that a commitment may not be honored. Defaulting on a commitment may jeopardize permission to participate in subsequent evaluations and to obtain early distributions of future test data.

SCHEDULE

July 15, 1996 - distribution of 50 hours of acoustic training data
July 25 - distribution of development test data
July 30 - distribution of LM text data and tools
August 15 - distribution of second 50 hours of acoustic training data
October 5 - release of annotated transcripts for the 50 hours of training data
October 28 - final site commitments
November 8 - deadline for optional submission of development test results
November 11 - distribution of evaluation test data
December 12 (0700 EST) - deadline for core evaluation results
December 16 - release of core results
December 19 (0700 EST) - deadline for contrast results
December 23 - release of contrast results
February 2-5, 1997 - ARPA Speech Workshop, Westfields Conference Center, Chantilly, VA

Evaluation results will be reported by NIST, along with invited and contributed presentations by participants, at the ARPA Speech Workshop in early 1997. Presentations and results from the Workshop will be published in a written, publicly available Proceedings.

SYSTEM DESCRIPTIONS

Sites are required to submit a standard system description to NIST along with the results for each system run on any test. The format for these system descriptions is given in the documentation that NIST supplies with the test data. Evaluating sites will also be required to give an oral description of their computational resource requirements, including processor speed and storage, at the Workshop at which the evaluation results are described, and to publish information about the complexity of new algorithms.

MULTIPLE SYSTEMS RUNNING A SINGLE TEST

To discourage running several systems on a single test in order to improve one's chances of scoring well, sites that run more than one system on a single test must designate one system as the preferred system. This designation is to be made before looking at any results. Results must be reported for all systems run on any test.

CAVEATS AND RESTRICTIONS

1. ACOUSTIC TRAINING DATA

Baseline acoustic training data consist of the approximately 50 hours of annotated training data and development test data provided by NIST and the LDC for this evaluation, plus any acoustical training data, development test data, and evaluation test data developed for previous ARPA speech recognition evaluations (including the various Resource Management, ATIS, Wall Street Journal, North American Business News, Marketplace, Switchboard, Macrophone, and Call Home databases).

Sites may also make use of other acoustic data that they acquire privately or from outside sources, including the additional untranscribed audio training data distributed by the LDC for this evaluation, provided that they also supply, as a contrast condition, the evaluation results obtained from the same system trained only on the baseline acoustic training data. In addition, privately acquired data may be used provided that it can be made available to the LDC in a form that is suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the government (e.g., NIST or ARPA) elects to have the data published, and an implied statement that the data is legally unencumbered. Delivery of the data to the LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1997.

Sites may not make use of any material dated after June 30, 1996 for acoustical training.
They also may not make any use of shows (from any date) that are identified by NIST as reserved for testing purposes.

2. LANGUAGE MODEL TRAINING DATA

Sites may make use of any corpora for language model training dated on or before June 30, 1996. They may not, however, make any use of shows that are identified by NIST as reserved for testing purposes. In addition to the supplemental texts for language model training provided by the LDC and NIST, sites may make use of language model data that they acquire privately or from commercial sources. Privately acquired data must be made available to the LDC in a form that is suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the government (e.g., NIST or ARPA) elects to have the data published, and an implied statement that the data is legally unencumbered. Delivery of the data to the LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1997.

APPENDIX. SHOWS RESERVED BY NIST FOR TESTING PURPOSES

The following shows (of any date) may not be used for acoustic or language model training:

ABC Primetime (ABC_PRT): TV news magazine
CNN Morning News (CNN_MNE): TV anchored news
CNN World View (CNN_WVW): TV anchored news
NPR Morning Edition (NPR_MED): radio anchored news
NPR The World (NPR_TWD): radio anchored news
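Purely for illustration, the sketch below shows how a site might screen privately acquired material against the restrictions above, i.e., the June 30, 1996 training cutoff and the reserved show codes. The helper and the record format are hypothetical; only the show codes and the cutoff date come from this specification.

    from datetime import date

    RESERVED_SHOWS = {"ABC_PRT", "CNN_MNE", "CNN_WVW", "NPR_MED", "NPR_TWD"}
    TRAINING_CUTOFF = date(1996, 6, 30)

    def usable_for_training(show_code, broadcast_date):
        # True if material from this show and date may be used for acoustic
        # or language model training under the Caveats and Restrictions.
        if show_code in RESERVED_SHOWS:           # reserved shows excluded at any date
            return False
        return broadcast_date <= TRAINING_CUTOFF  # on or before June 30, 1996

    # A reserved show is excluded regardless of date; other shows ("CNN_HLN"
    # here is a hypothetical code) are admissible only up to the cutoff.
    print(usable_for_training("CNN_MNE", date(1996, 5, 1)))   # False
    print(usable_for_training("CNN_HLN", date(1996, 5, 1)))   # True
    print(usable_for_training("CNN_HLN", date(1996, 7, 15)))  # False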