1998 DARPA/NIST Continuous Speech Recognition 
Broadcast News Hub-4 English Evaluation Test Material

This single CD contains the evaluation test material used in the 1998 DARPA/NIST Continuous Speech Recognition Broadcast News Hub-4 English Benchmark Test, administered by the NIST Spoken Natural Language Processing Group and produced by the Linguistic Data Consortium (LDC) under catalog number LDC2000S86, ISBN 1-58563-172-8. In addition, the transcripts from the evaluation and the utf.dtd used to validate them are now included. For more complete information, see the 1998 Hub-4 Website.

Note: This CD-ROM does not contain the Human Reference and Baseline Recognizer transcripts for the Information Extraction - Named Entity (IE-NE) Spoke. This material was released separately prior to the start of the IE-NE Spoke.

Note: This CD-ROM does not contain the material for the Hub-4 Non-English evaluation. It will be released separately.

Please read this document in its entirety before beginning a test. 

Note that the waveform and transcript data on this disc are licensed through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact the LDC for license agreement information.


DOCUMENTATION


Instructions

The Hub-4 English Evaluation Specification contains the rules and conditions for implementing the Hub-4 English tests. Note that the new IE-NE Spoke is still somewhat in a state of flux and the dates/metrics for IE-NE are likely to change. See the most recent version of the evaluation specification on the 1998 Hub-4 Website for updates.

Instructions for self-scoring and preparing and submitting your results to NIST for official scoring will be distributed in email separately.

This year the Hub-4 test includes two distinct 1.5-hour test sets. The set1 test set contains material from 1996 news broadcasts and is meant to be comparable in terms of epoch and selection process with the 1997 test set. The set2 test set contains material from 1998 news broadcasts and is meant to provide more contemporary data to exercise systems' abilities to handle new words, speakers, etc. The waveforms and related files have been named to correspond to these two test sets.

Evaluation Schedule

Other Documentation

The Universal Transcription Format (UTF) used to annotate/transcribe the 1998 Hub-4 reference transcripts is documented in utf1_0v2.ps (PostScript Version), and a copy of the utf.dtd is included in the doc/ subdirectory.
 


TEST MATERIAL


The 1998 Hub-4 Broadcast News test material is to be used in accordance with the Hub-4 English Evaluation Specification.

Evaluation Map Files

As in 1997, the 1998 Hub-4E Benchmark Test supports only one CSR evaluation mode in which no partitioning information is provided. The basic timing information required to implement the evaluation is given in the map files, h4e_98_1.uem (set1) and h4e_98_2.uem (set2). These files contain only the pointers to the beginning and end of the complete test sets. No side information is provided.
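As an illustration of how a site might read the evaluation map files, the following Python sketch parses UEM-style text and sums the duration of the regions to be recognized. It assumes each non-comment line holds four whitespace-separated fields (file identifier, channel, begin time in seconds, end time in seconds) and that comment lines begin with a semicolon. That layout is an assumption and should be checked against the files on this disc; the function names and the sample line are hypothetical.

```python
def parse_uem(text):
    """Parse UEM-style text into (file_id, channel, begin, end) tuples.

    Assumed layout (verify against the actual .uem files):
        <file_id> <channel> <begin_seconds> <end_seconds>
    Lines that are blank or start with ';' are skipped as comments.
    """
    regions = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith(";"):
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError("expected 4 fields, got: %r" % line)
        file_id, channel, begin, end = fields[:4]
        regions.append((file_id, channel, float(begin), float(end)))
    return regions


def total_seconds(regions):
    """Sum the duration of all evaluated regions, in seconds."""
    return sum(end - begin for _, _, begin, end in regions)


if __name__ == "__main__":
    # Hypothetical sample line, not copied from the disc.
    sample = ";; illustrative only\nh4e_98_1 1 0.0 5400.0\n"
    print(total_seconds(parse_uem(sample)) / 3600.0, "hours")
```

For a real run, the text would be read from h4e_98_1.uem or h4e_98_2.uem instead of the inline sample.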
 

Segmentation Files

Automatically generated segmentation information for each of the two test sets is provided in the files h4e_98_1.seg (set1) and h4e_98_2.seg (set2). Although sites are free to use any segmentation scheme of their choice, these files are included for the convenience of sites without access to segmentation algorithms and were generated using the CMUseg Version 0.5 (compressed tar archive) automatic segmentation and classification utility. The CMUseg utility has been graciously supplied to the DARPA community by Carnegie Mellon University for use as a common acoustic segmentation utility.

Participants are not required to use this segmentation, or the CMUseg utility. They have been supplied to facilitate participation in the test.

Waveform Files

This year, the test material is contained in two SPHERE-formatted waveform files. The file h4e_98_1.sph (set1) contains 1.5 hours of Broadcast News excerpts from 1996. The file h4e_98_2.sph (set2) contains 1.5 hours of Broadcast News excerpts from 1998. Each file should be separately recognized per the Hub-4 English Evaluation Specification.

Copyright

Portions Copyright 1996 by PRI-Public Radio International

Portions Copyright 1996 by ABC News

Portions Copyright 1996 Cable News Network, Inc. All Rights Reserved.

Information from the USC program 'Marketplace' contained herein is the property of USC Radio and the University of Southern California and is protected by copyright. Use, duplication or disclosure by you is subject to the restrictions set forth in the user agreement and attached to the computer readable media provided to you by the Linguistic Data Consortium of the University of Pennsylvania. Copyright 1996 University of Southern California. All Rights Reserved. Marketplace is produced by USC Radio at the University of Southern California, and is distributed to public Radio stations nationwide by PRI-Public Radio International. Marketplace is made possible by GE, the Corporation for Public Radio, and Public Radio Stations nationwide.
 
 

Transcript Files

The UTF-formatted reference transcriptions for the test material are included in this publication in transcripts.

Reference STM Files

The reference STM file for the test material, which was used in scoring the test results with SCLITE, is included in this publication in h4e_98.stm.
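For sites that want to inspect the reference segments outside of SCLITE, a minimal Python sketch of an STM line reader follows. It assumes the usual SCLITE segment time mark layout: waveform name, channel, speaker, begin time, end time, an optional label field in angle brackets, then the transcript words, with comment lines starting with two semicolons. The function name, the returned dictionary keys, and the sample line are hypothetical and are not taken from h4e_98.stm.

```python
import re

# An optional label field such as <o,f0,male> may precede the transcript.
LABEL_RE = re.compile(r"^<[^>]*>$")


def parse_stm_line(line):
    """Parse one STM record; return None for blank or comment lines.

    Assumed layout:
        <waveform> <channel> <speaker> <begin> <end> [<labels>] <words...>
    """
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    fields = line.split()
    if len(fields) < 5:
        raise ValueError("expected at least 5 fields, got: %r" % line)
    words = fields[5:]
    label = None
    if words and LABEL_RE.match(words[0]):
        label = words.pop(0)
    return {
        "waveform": fields[0],
        "channel": fields[1],
        "speaker": fields[2],
        "begin": float(fields[3]),
        "end": float(fields[4]),
        "label": label,
        "words": words,
    }


if __name__ == "__main__":
    # Hypothetical record for illustration only.
    print(parse_stm_line("h4e_98_1 1 spkr_a 10.5 12.25 <o,f0,male> hello world"))
```

A record with no transcript words after the times (for example, an excluded region) yields an empty word list rather than an error.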

Transcript Orthography Mapping File and Software

The orthography mapping file for the test material which is used in pre-processing the reference and system-generated transcripts using tranfilt Version 1.9 (compressed tar archive) prior to scoring is h4e_98.glm. For your convenience, the orthography mapping file used in last year's evaluation is available in the file en971128.glm.


SOFTWARE


SCLITE Speech Recognition Scoring Software

The NIST SCLITE Speech Recognition Scoring Toolkit Version 1.2 (compressed tar archive) will be used to score the results of the Hub-4 CSR tests.

Speech Waveform Manipulation Utilities

The Hub-4 Benchmark Test waveform files are encoded using the NIST SPeech HEader REsources (SPHERE) format and may be manipulated using the SPHERE Version 2.6a (compressed tar archive) utilities and libraries. If you have questions about installing or using SPHERE, you may send email to jonathan.fiscus@nist.gov.

Note that SPHERE is currently available only for UNIX platforms.
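Where the SPHERE utilities cannot be installed, the ASCII header of a SPHERE file can still be inspected directly. The Python sketch below is not part of the SPHERE distribution. It relies on the documented header layout: a first line reading NIST_1A, a second line giving the header size in bytes, then "name -type value" fields terminated by end_head. The function name is hypothetical, and the sketch does not decode the sample data itself, which may be compressed.

```python
import io


def read_sphere_header(stream):
    """Read the ASCII header from a seekable binary stream.

    Returns (header_size_in_bytes, dict_of_fields). Integer (-i) and
    real (-r) fields are converted; string (-sN) fields stay as text.
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a SPHERE file: missing NIST_1A marker")
    header_size = int(stream.readline().strip())
    stream.seek(0)
    text = stream.read(header_size).decode("ascii", errors="replace")
    fields = {}
    for line in text.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_code, value = parts
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value
    return header_size, fields


if __name__ == "__main__":
    # Synthetic header for illustration; the field values are made up.
    raw = (b"NIST_1A\n   1024\nsample_rate -i 16000\n"
           b"sample_coding -s3 pcm\nend_head\n").ljust(1024, b" ")
    print(read_sphere_header(io.BytesIO(raw)))
```

To inspect a file from this disc, open it in binary mode, for example open("h4e_98_1.sph", "rb"), and pass the file object to the function.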

Software Updates

Current versions of NIST software are available via the NIST Speech Software Website.


CONTACT INFORMATION


If you have questions regarding the Hub-4 data and protocols listed in this document, NIST software, data filtering, or scoring your recognizer output, contact jonathan.fiscus@nist.gov.

If you are interested in participating in future NIST speech recognition tests, contact david.pallett@nist.gov.


CAVEAT


Certain commercial equipment, instruments, software, and materials are identified on this CD-ROM in order to adequately specify experimental procedures used. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor does it imply that the equipment, instruments, software, or materials identified are necessarily the best available for the purpose.