SPECIFICATION FOR THE ARPA NOVEMBER 1996 HUB 4 EVALUATION

Approved by the SRCC July 26, 1996 and October 2, 1996; this revision November 1, 1996.

(Developed by the Hub 4 working group, Richard Stern, Chair, with major input from George Doddington, Dave Pallett, and Charles Wayne.)

INTRODUCTION

This document specifies the Hub 4 test set for the November 1996 ARPA CSR Evaluation. The purpose of the 1996 Hub 4 evaluation is to improve the basic performance of speaker-independent, unlimited-vocabulary recognition systems. By drawing on broadcast news sources, the evaluation includes a challenging combination of read and spontaneous speech, recorded both in broadcast studios and in the field. This combination is intended to provide impetus to improve core speech recognition capability, to improve the adaptability of recognition systems to new speakers, dialects, and recording environments, and to improve the systems' ability to cope with the difficult problems of unknown words, spontaneous speech, and unconstrained syntax and semantics.

The 1996 evaluation will consist of two components, referred to as the "unpartitioned evaluation" (UE) and the "partitioned evaluation" (PE). Sites are required to evaluate on the PE; the UE is optional. The UE will be similar to the 1995 Hub 4 evaluation in that it contains relatively complete portions of television and radio news broadcasts, but it will use a wider variety of source material than was employed in the 1995 evaluation. The PE will contain all of the same material as the UE, manually segmented into homogeneous regions, plus some additional material. The PE provides a set of controlled contrastive conditions, referred to as "evaluation focus conditions", that are intended to supersede the hubs and spokes of previous years' CSR evaluations, which were based on read speech from the Wall Street Journal and other North American business news sources.

DEFINITIONS AND TERMINOLOGY

For the purposes of this document, a "show" refers to a particular television or radio broadcast production, encompassing all of its dates and times of broadcast. Examples include "CNN Headline News" and "NPR All Things Considered". An "episode" refers to an instance of a show on a particular date (and possibly time), such as "All Things Considered on July 5, 1996" or "CNN Headline News at 1000 EDT on July 5, 1996". A "portion" is a subset of an episode; it generally contains multiple speech styles or recording environments. A "story" is a contiguous subset of a portion that describes or discusses a single topic. A "segment" refers to a contiguous section of audio throughout which the focus conditions (defined below under "Focus Conditions for the Partitioned Evaluation") remain unchanged.
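These units nest: a show contains episodes, an episode contains portions, a portion contains stories, and a story is composed of segments. Purely as an illustration of that containment (this sketch, and its class and field names, are the author's and are not part of this specification), in Python:

    from dataclasses import dataclass, field
    from typing import List

    @dataclass
    class Segment:
        start_sec: float          # segment boundaries within the audio
        end_sec: float
        focus_condition: str      # "F0" .. "F5", or "FX" (see Evaluation Conditions)

    @dataclass
    class Story:                  # contiguous audio discussing a single topic
        topic: str
        segments: List[Segment] = field(default_factory=list)

    @dataclass
    class Portion:                # subset of an episode; may mix styles and environments
        stories: List[Story] = field(default_factory=list)

    @dataclass
    class Episode:                # e.g., "All Things Considered on July 5, 1996"
        show: str                 # e.g., "NPR All Things Considered"
        broadcast_date: str
        portions: List[Portion] = field(default_factory=list)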
EVALUATION TEST DATA

The evaluation test data will be obtained from the audio component of a variety of television and radio news sources. Broadcast sources for which the Linguistic Data Consortium (LDC) has obtained legal clearances for use in this evaluation include television news programs from CNN, ABC, and C-SPAN, as well as news radio broadcasts from NPR.

* Evaluation Test Data Goals

The evaluation test data will consist of approximately 2.5 hours of speech. The actual data to be used for the UE and PE components of the evaluation will be selected by NIST, in consultation with members of the LDC. The data will be selected according to the following guidelines, which should be followed to the extent that is practical:

* The evaluation test content should be 60% from television shows and 40% from radio shows.

* Half of the test material will be taken from shows that are considered to be "anchored news broadcasts" and half from shows that are considered to be "news magazines", as identified by the working group on program content chaired by Long Nguyen.

* Labelled focus conditions for the PE are specified below. Beginnings and endings of stories are also marked for the PE.

* The PE will use the same acoustic data as the UE, plus some additional data.

* Neither the UE nor the PE should include commercials or sports results, because of significant differences in syntax and semantics. (These portions of programs will be distributed as part of the training data for possible use in future evaluations, but they will not be included in this year's evaluation test or development test data.) The UE, however, may include segments consisting only of background music.

* The data will consist of a single monophonic channel of audio, even if the original program material is distributed in stereo.

* Focus Conditions for the Partitioned Evaluation

The partitioned evaluation will include speech that is segmented and labelled according to recording environment and speech style in a fashion that supports the focus conditions below. The PE will include segments in each of the following focus conditions, plus additional segments. (An illustrative sketch of how the A-weighted SNR figures cited below might be estimated appears after this list.)

F0: BASELINE BROADCAST SPEECH
This condition describes speech that is directed to the general broadcast audience, and that is recorded in a quiet studio environment presumed to have a signal-to-noise ratio (SNR) of greater than 20 dB, A-weighted. This speech is assumed to be mostly read from prepared text. This is the default baseline condition for the PE.

F1: SPONTANEOUS BROADCAST SPEECH
This condition describes speech that is directed to one or more human conversational partners, either in the studio or at a remote site, and that is recorded in a quiet studio environment presumed to have an SNR of greater than 20 dB, A-weighted. This speech is assumed to be spontaneous.

F2: SPEECH OVER TELEPHONE CHANNELS
This condition describes speech that is collected over reduced-bandwidth channels, such as local or long-distance telephony, cellular telephony, or similar media, using either a conventional handset or another input device such as a speakerphone.

F3: SPEECH IN THE PRESENCE OF BACKGROUND MUSIC
This condition describes speech that satisfies the attributes of Baseline Broadcast Speech or Spontaneous Broadcast Speech, except that it is broadcast with additive background music. The signal-to-music power ratio is such that the speech is intelligible to the normal listener, presumably in a range of about 10 to 20 dB, A-weighted.

F4: SPEECH UNDER DEGRADED ACOUSTICAL CONDITIONS
This condition describes speech that is acoustically degraded for reasons other than the use of telephone-bandwidth channels or the presence of background music. Sources of degradation could include additive noise, environmental noise, or nonlinear distortions. The SNR is presumed to be about 10 to 20 dB, A-weighted.

F5: SPEECH FROM NON-NATIVE SPEAKERS
This condition describes speech that satisfies the attributes of Baseline Broadcast Speech, except that it is spoken by non-native speakers of American English. The speech is assumed to be produced by fluent speakers of English with a foreign accent, and to be sufficiently intelligible that it is intended to be understood by the broadcast audience. British speakers are considered to be non-native speakers of American English.
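The SNR figures used in these definitions are A-weighted. As a rough, purely illustrative sketch of how such a figure might be estimated from the waveform, the Python/numpy fragment below applies the standard analog A-weighting curve in the frequency domain and compares a speech-plus-noise region against a noise-only region. This is a back-of-the-envelope estimate under the author's assumptions (the function names and the simple power-subtraction approach are not part of this specification):

    import numpy as np

    def a_weight(freqs_hz):
        # Linear A-weighting gains (standard analog A-curve); gain is 1 at 1 kHz.
        f2 = np.asarray(freqs_hz, dtype=float) ** 2
        ra = (12194.0**2 * f2**2) / (
            (f2 + 20.6**2)
            * np.sqrt((f2 + 107.7**2) * (f2 + 737.9**2))
            * (f2 + 12194.0**2))
        return ra * 10.0 ** (2.0 / 20.0)  # +2.00 dB normalization at 1 kHz

    def a_weighted_power(x, sample_rate):
        # Approximate mean A-weighted power per sample. Constant Parseval
        # factors are ignored: they (approximately) cancel in the SNR ratio.
        spectrum = np.fft.rfft(x)
        freqs = np.fft.rfftfreq(len(x), d=1.0 / sample_rate)
        return float(np.sum(np.abs(spectrum * a_weight(freqs)) ** 2)) / len(x) ** 2

    def a_weighted_snr_db(speech_region, noise_region, sample_rate):
        # Rough SNR estimate from a speech-plus-noise region and a noise-only
        # region, assuming the background noise is stationary across both.
        ps = a_weighted_power(speech_region, sample_rate)
        pn = a_weighted_power(noise_region, sample_rate)
        return 10.0 * np.log10(max(ps - pn, 1e-12) / max(pn, 1e-12))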
* Specification of the Evaluation Test Set

The UE will consist of approximately 4 episodes, each from a different show and each approximately 20 minutes in duration, to be selected by NIST according to the Evaluation Test Data Goals described above. The core speech material for the PE will consist of the audio that is used for the UE. The core PE data will be supplemented by additional speech samples selected by NIST, to provide enough speech data that speech recognition results in each of the focus conditions can reasonably be expected to be statistically significant. (Because recognition error rates will vary across the focus conditions, it is not necessary that equal durations of speech data be provided for each focus condition.) To the extent possible, these portions will be selected on a story-by-story basis, to provide support for experimentation in adaptive language modeling.

TRAINING DATA

The LDC, in coordination with NIST, will release to participating sites approximately 50 hours of standard acoustical training data. These data will be recorded from shows that are similar in form and content to the shows used for the evaluation test data, subject to the following additional guidelines:

* There should be partial overlap between the shows used in the training set and those used in the evaluation test set, but (to the extent possible) the training and evaluation test sets will make use of material from different announcers. While there will inevitably be some overlap in speakers, particularly in the case of newsworthy individuals, the goal is to minimize the utility of building speaker-specific models for the speakers in the training and development test sets.

* Episodes used in the training set will not overlap in time with episodes used in the evaluation test set.

DEVELOPMENT TEST DATA

NIST, in coordination with the LDC, will release to participating sites approximately 3 hours of development test data. These data will be similar in form and content to the evaluation test data, subject to the following additional guidelines:

* As with the training data, there should be partial overlap between the shows used in the development test set and those used in the evaluation test set, but (to the extent possible) the development and evaluation test sets will make use of material from different announcers. Again, the goal is to minimize the utility of building speaker-specific models for the speakers in the training and development test sets.

* Episodes used in the development test set will not overlap in time with episodes used in the evaluation test set.

ANNOTATION OF DATA

The LDC, in conjunction with NIST, will be responsible for developing a transcription and annotation system for the data. The annotation scheme must be sufficiently detailed and flexible to identify sections of speech that are usable for each of the focus conditions of the PE. A description of the annotation specification will be distributed to the evaluating sites. The annotation of the reference transcriptions to be used in scoring the evaluation data will be performed extremely carefully, using multiple transcribers, in order to preclude the need for a formal adjudication process.
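Because multiple transcribers are used in place of a formal adjudication step, disagreements between independent transcriptions of the same audio must be located and resolved before a reference is finalized. A minimal sketch of how such disagreements might be flagged, using Python's difflib (the function and its case normalization are hypothetical, not the LDC's procedure):

    import difflib

    def transcript_disagreements(ref_a, ref_b):
        # Word-level differences between two independent transcriptions
        # of the same segment, for review before the reference is finalized.
        a, b = ref_a.lower().split(), ref_b.lower().split()
        matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
        diffs = []
        for op, i1, i2, j1, j2 in matcher.get_opcodes():
            if op != "equal":
                diffs.append((op, " ".join(a[i1:i2]), " ".join(b[j1:j2])))
        return diffs

    # Example: one transcriber wrote "nineteen ninety six", the other "1996";
    # a text-normalization issue to resolve before scoring.
    print(transcript_disagreements(
        "the nineteen ninety six evaluation",
        "the 1996 evaluation"))
    # [('replace', 'nineteen ninety six', '1996')]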
SUPPLEMENTAL TEXTS FOR LANGUAGE MODEL TRAINING

A separate working group chaired by Alex Rudnicky, working together with the LDC, will provide a large corpus of commercial text transcripts of broadcast news shows similar to those to be used in the evaluation. This text is provided for the development of statistical language models, and the necessary text conditioning tools will be provided as well.

EVALUATION CONDITIONS

Sites are required to participate in the PE; participation in the UE is optional. For both the UE and the PE, any recognition approach is allowed, including running a decoder in unsupervised transcription mode. Any audio segment in the evaluation test data may be used by adaptation modules when decoding any other segment of audio. (In other words, adaptation modules may make use of audio across episode boundaries and show boundaries.)

The only side information available for the UE is the locations of endpoints of temporally contiguous portions of audio, plus the beginnings and endings of commercials and sports results. Side information provided in the PE is limited to the above plus segment boundaries, story boundaries, and labels according to the named focus conditions F0 through F5. Side information for PE segments that do not fall within the specific definition of any of the named focus conditions (currently labelled FX) will also include the status of all annotation condition labels used to characterize focus conditions F0 through F5.

SCORING

Sites will generate decodings that include word time alignments, so that an updated version of the scoring algorithm used for the 1995 Hub 4 Dry Run can be used for this evaluation. Word error will be the primary metric. Evaluating sites are encouraged to submit the output of their system for a portion of the development test data to NIST by November 1, to verify that the system output is processed properly by the NIST scoring software. NIST will make its scoring software available in a timely manner to participants on request.
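Word error counts the substitutions, insertions, and deletions in a minimum-cost alignment of the system hypothesis against the reference transcription, divided by the number of reference words. The NIST scoring software (which, per the above, also exploits word time alignments) is the authority for this evaluation; the sketch below is only a minimal dynamic-programming illustration of the metric itself, with uniform edit costs assumed:

    def word_error_rate(ref_words, hyp_words):
        # Levenshtein word alignment: (substitutions + insertions + deletions)
        # divided by the number of reference words.
        n, m = len(ref_words), len(hyp_words)
        d = [[0] * (m + 1) for _ in range(n + 1)]
        for i in range(n + 1):
            d[i][0] = i
        for j in range(m + 1):
            d[0][j] = j
        for i in range(1, n + 1):
            for j in range(1, m + 1):
                cost = 0 if ref_words[i - 1] == hyp_words[j - 1] else 1
                d[i][j] = min(d[i - 1][j] + 1,         # deletion
                              d[i][j - 1] + 1,         # insertion
                              d[i - 1][j - 1] + cost)  # substitution or match
        return d[n][m] / max(n, 1)

    ref = "speech recognition systems for broadcast news".split()
    hyp = "speech recognition system for the broadcast news".split()
    print(word_error_rate(ref, hyp))  # 2 errors / 6 reference words = 0.333...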
SITE COMMITMENTS

October 28, 1996, will be the last day to enter or withdraw from the evaluation. Site commitments are used to control the evaluation and to manage evaluation resources. It is imperative that sites honor their commitments in order for the evaluation to have beneficial impact. Sites must notify NIST (Attn: Dave Pallett) as soon as possible, and prior to the distribution of the evaluation data, if it appears that a commitment may not be honored. Defaulting on a commitment may jeopardize permission to participate in subsequent evaluations and to obtain early distributions of future test data.

SCHEDULE

July 15, 1996 - distribution of 50 hours of acoustic training data
July 25 - distribution of development test data
July 30 - distribution of LM text data and tools
August 15 - distribution of second 50 hours of acoustic training data
October 5 - release of annotated transcripts for the 50 hours of training data
October 28 - final site commitments
November 8 - deadline for optional submission of development test results
November 11 - distribution of evaluation test data
December 12 (0700 EST) - deadline for core evaluation results
December 16 - release of core results
December 19 (0700 EST) - deadline for contrast results
December 23 - release of contrast results
February 2-5, 1997 - ARPA Speech Workshop, Westfields Conference Center, Chantilly, VA

Evaluation results will be reported by NIST, along with invited and contributed presentations by participants, at the ARPA Speech Workshop in early 1997. Presentations and results from the Workshop will be published in a written, publicly available Proceedings.

SYSTEM DESCRIPTIONS

Sites are required to submit a standard system description to NIST along with the results for each system run on any test. The format for these system descriptions is given in the documentation that NIST supplies with the test data. Evaluating sites will also be required to give an oral description of their computational resource requirements, including processor speed and storage, at the Workshop at which the evaluation results are described, and to publish information about the complexity of new algorithms.

MULTIPLE SYSTEMS RUNNING A SINGLE TEST

To discourage running several systems on a single test in order to improve one's chances of scoring well, sites that run more than one system on a single test must designate one system as the preferred system. This designation is to be made before looking at any results. Results must be reported for all systems run on any test.

CAVEATS AND RESTRICTIONS

1. ACOUSTIC TRAINING DATA

Baseline acoustic training data consist of the approximately 50 hours of annotated training data and development test data provided by NIST and the LDC for this evaluation, plus any acoustical training data, development test data, and evaluation test data developed for previous ARPA speech recognition evaluations (including the various Resource Management, ATIS, Wall Street Journal, North American Business News, Marketplace, Switchboard, Macrophone, and Call Home databases).

Sites may also make use of other acoustic data that they acquire privately or from outside sources, including the additional untranscribed audio training data distributed by the LDC for this evaluation, provided that they also supply, as a contrast condition, the evaluation results obtained from the same system trained only on the baseline acoustic training data. In addition, privately acquired data may be used provided that it can be made available to the LDC in a form that is suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the government (e.g., NIST or ARPA) elects to have the data published, and an implied statement that the data is legally unencumbered. Delivery of the data to the LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1997.

Sites may not make use of any material dated after June 30, 1996 for acoustical training.
They also may not make any use of shows (from any date) that are identified by NIST as reserved for testing purposes.

2. LANGUAGE MODEL TRAINING DATA

Sites may make use of any corpora for language model training dated on or before June 30, 1996. They may not, however, make any use of shows that are identified by NIST as reserved for testing purposes. In addition to the supplemental texts for language model training provided by the LDC and NIST, sites may make use of language model data that they acquire privately or from commercial sources. Privately acquired data must be made available to the LDC in a form that is suitable for publication and unencumbered by intellectual property rights, such that it could be released as an LDC-supported corpus. Use of such data implies a willingness to cooperate with the LDC if the government (e.g., NIST or ARPA) elects to have the data published, and an implied statement that the data is legally unencumbered. Delivery of the data to the LDC may be done after the evaluation, provided that it is accomplished no later than March 31, 1997.

APPENDIX. SHOWS RESERVED BY NIST FOR TESTING PURPOSES

The following shows (of any date) may not be used for acoustic or language model training:

ABC Primetime (ABC_PRT): TV news magazine
CNN Morning News (CNN_MNE): TV anchored news
CNN World View (CNN_WVW): TV anchored news
NPR Morning Edition (NPR_MED): radio anchored news
NPR The World (NPR_TWD): radio anchored news
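Purely for illustration, the sketch below shows how a site might screen privately acquired material against the restrictions above, i.e., the June 30, 1996 training cutoff and the reserved show codes. The helper and the record format are hypothetical; only the show codes and the cutoff date come from this specification.

    from datetime import date

    RESERVED_SHOWS = {"ABC_PRT", "CNN_MNE", "CNN_WVW", "NPR_MED", "NPR_TWD"}
    TRAINING_CUTOFF = date(1996, 6, 30)

    def usable_for_training(show_code, broadcast_date):
        # True if material from this show and date may be used for acoustic
        # or language model training under the Caveats and Restrictions.
        if show_code in RESERVED_SHOWS:           # reserved shows excluded at any date
            return False
        return broadcast_date <= TRAINING_CUTOFF  # on or before June 30, 1996

    # A reserved show is excluded regardless of date; other shows ("CNN_HLN"
    # here is a hypothetical code) are admissible only up to the cutoff.
    print(usable_for_training("CNN_MNE", date(1996, 5, 1)))   # False
    print(usable_for_training("CNN_HLN", date(1996, 5, 1)))   # True
    print(usable_for_training("CNN_HLN", date(1996, 7, 15)))  # False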