1999 NIST Broadcast News Transcription Evaluation
English Test Material

This CD-ROM contains the English evaluation test material used in the 1999 NIST Broadcast News Transcription Evaluation administered by the NIST Spoken Natural Language Processing Group and produced by the Linguistic Data Consortium (LDC), catalog number LDC2000S88, ISBN 1-58563-176-0. For more complete information, see the 1999 Hub-4 Website. Please read this document in its entirety before beginning a test.

Note that the waveform and transcript data on this disc are licensed through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact the LDC for license agreement information.


DOCUMENTATION


Instructions

The 1999 Broadcast News Evaluation Specification contains the rules and conditions for implementing the Broadcast News (BN) Transcription tests.

Instructions for self-scoring, and for preparing and submitting your results to NIST for official scoring, will be distributed separately by email.

This year the BN transcription test includes two distinct 1.5-hour test sets. The set1 test set contains material from the same test epoch as was used to create set2 of the 1998 Broadcast News evaluation and is meant to provide a means for year-to-year comparisons. The set2 test set contains news broadcast material from the late summer of 1998 and is meant to provide more contemporary data to exercise systems' abilities to handle new words, speakers, etc. The waveforms and related files have been named to correspond to these two test sets.

Evaluation Schedule

Other Documentation

The Universal Transcription Format (UTF) used to annotate/transcribe the 1999 Broadcast News reference transcripts is documented in utf1_0v2.ps (PostScript Version).




TEST MATERIAL


Evaluation Map Files

As in the 1998 Hub-4 evaluation, the 1999 Broadcast News Benchmark Test supports only one CSR evaluation mode, in which no partitioning information is provided. The basic timing information required to implement the evaluation is given in the map files, bn99en_1.uem (set1) and bn99en_2.uem (set2). These files contain only the pointers to the beginning and end of the complete test sets. No side information is provided.
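
As an illustration of how the timing pointers in a map file might be read, here is a minimal sketch. It assumes the conventional UEM layout of four whitespace-separated fields per line (file name, channel, begin time, end time in seconds) and `;;` comment lines; the evaluation specification is the authority on the actual format. The names `UemRegion` and `parse_uem` and the sample line are hypothetical and are not taken from the disc.

```python
from typing import List, NamedTuple


class UemRegion(NamedTuple):
    """One region to be evaluated: a waveform name, a channel, a time span."""
    filename: str
    channel: str
    begin: float  # seconds from the start of the waveform
    end: float    # seconds from the start of the waveform


def parse_uem(text: str) -> List[UemRegion]:
    """Parse UEM text, assuming 'file channel begin end' on each line."""
    regions = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and ';;' comment lines (an assumed convention).
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        name, channel, begin, end = fields
        regions.append(UemRegion(name, channel, float(begin), float(end)))
    return regions


# Made-up sample spanning 1.5 hours (5400 seconds); not copied from the disc.
sample = ";; hypothetical map file\nbn99en_1 1 0.0 5400.0\n"
regions = parse_uem(sample)
```

With a real map file, the text would come from reading bn99en_1.uem or bn99en_2.uem rather than from the inline sample.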

Segmentation Files

Automatically generated segmentation information for each of the two test sets is provided in the files bn99en_1.seg (set1) and bn99en_2.seg (set2). Although sites are free to use any segmentation scheme of their choice, these files are included for the convenience of sites without access to segmentation algorithms and were generated using the CMUseg Version 0.5 (compressed tar archive) automatic segmentation and classification utility. The CMUseg utility has been graciously supplied to the DARPA community by Carnegie Mellon University for use as a common acoustic segmentation utility.

Participants are not required to use this segmentation or the CMUseg utility. Both have been supplied to facilitate participation in the test.

Waveform Files

This year, the test material is contained in two SPHERE-formatted waveform files. The file bn99en_1.sph (set1) contains 1.5 hours of Broadcast News excerpts from last year's set2 epoch. The file bn99en_2.sph (set2) contains 1.5 hours of Broadcast News excerpts from the summer of 1998. Each file should be separately recognized per the Broadcast News English Evaluation Specification.

Transcript Files

The UTF-formatted reference transcriptions for the test material are included in this publication in bn99en_1.utf and bn99en_2.utf.

Reference STM Files

The reference STM files for the test material that were used in scoring the test results with SCLITE are included in this publication in bn99en_1.stm and bn99en_2.stm.
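
For readers who want to inspect the reference segments programmatically, the following is a minimal sketch of an STM line parser. It assumes the usual SCLITE segment layout: file name, channel, speaker, begin time, end time, an optional label in angle brackets, then the transcript, with `;;` marking comment lines. The names `StmSegment` and `parse_stm_line`, and the speaker, label, and words in the sample line, are made up for illustration.

```python
from typing import NamedTuple, Optional


class StmSegment(NamedTuple):
    """One reference segment from an STM file."""
    filename: str
    channel: str
    speaker: str
    begin: float           # seconds
    end: float             # seconds
    label: Optional[str]   # e.g. '<o,f0,male>', or None when absent
    transcript: str


def parse_stm_line(line: str) -> Optional[StmSegment]:
    """Parse one STM line; return None for blank or ';;' comment lines."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    # Five fixed fields, then everything else (label and transcript).
    fields = line.split(None, 5)
    if len(fields) < 5:
        raise ValueError(f"too few fields in STM line: {line!r}")
    name, channel, speaker, begin, end = fields[:5]
    rest = fields[5] if len(fields) > 5 else ""
    label = None
    if rest.startswith("<"):
        close = rest.find(">")
        if close != -1:
            # Split the bracketed label from the transcript text.
            label = rest[:close + 1]
            rest = rest[close + 1:].strip()
    return StmSegment(name, channel, speaker, float(begin), float(end),
                      label, rest)


# Hypothetical line; the speaker name, label, and words are invented.
segment = parse_stm_line(
    "bn99en_1 1 spkr_a 10.5 14.25 <o,f0,male> good evening")
```

This simple split treats any leading `<...>` token as the label, which is an assumption that may not hold for every line in the reference files.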

Transcript Orthography Mapping File and Software

The orthography mapping file for the test material, which is used in pre-processing the reference and system-generated transcripts using tranfilt Version 1.9 (compressed tar archive) prior to scoring, will be made available after the primary test results are due. For your convenience, the orthography mapping file used in the 1997 evaluation is available in en991231.glm and en981118.glm.


SOFTWARE


SCLITE Speech Recognition Scoring Software

The NIST SCLITE Speech Recognition Scoring Toolkit Version 1.2 (compressed tar archive) will be used to score the results of the Broadcast News CSR tests.
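
To give a sense of the quantity being scored, here is a minimal sketch of word error rate computed with a plain edit-distance alignment: the number of substitutions, deletions, and insertions divided by the number of reference words. This is not SCLITE's implementation; SCLITE's alignment, text filtering, and reporting are more elaborate, and official results should come from the toolkit itself. The function names and the sample sentences are hypothetical.

```python
from typing import Sequence


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Minimum substitutions + deletions + insertions turning ref into hyp."""
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        cur = [i]  # deleting all i reference words so far
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def word_error_rate(ref_text: str, hyp_text: str) -> float:
    """Word errors divided by the number of reference words."""
    ref = ref_text.split()
    hyp = hyp_text.split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return word_errors(ref, hyp) / len(ref)


# Invented example: one substitution and one deletion over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions count as errors, this ratio can exceed 1.0 when the hypothesis is much longer than the reference.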

Speech Waveform Manipulation Utilities

The Broadcast News Benchmark Transcription Test waveform files are encoded using the NIST SPeech HEader REsources (SPHERE) format and may be manipulated using the SPHERE Version 2.6a (compressed tar archive) utilities and libraries. If you have questions about installing or using SPHERE, you may send email to jonathan.fiscus@nist.gov.

Note that SPHERE is currently available only for UNIX platforms.
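
For a quick look at a waveform's header without the SPHERE utilities, the following is a minimal sketch of a header reader. It assumes the standard SPHERE header layout: a first line `NIST_1A`, a second line giving the header size in bytes, then `name -type value` lines (`-i` integer, `-r` real, `-sN` string of length N) terminated by `end_head`. It reads only the header and does not decode or decompress sample data. The function name and the field values in the sample are illustrative, not read from the disc.

```python
from typing import Dict, Union

HeaderValue = Union[int, float, str]


def parse_sphere_header(data: bytes) -> Dict[str, HeaderValue]:
    """Parse the ASCII header at the start of SPHERE file contents."""
    # The first 16 bytes hold the magic line and the header-size line.
    first_lines = data[:16].decode("ascii", errors="replace").split("\n")
    if len(first_lines) < 2 or first_lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE file: missing NIST_1A marker")
    header_size = int(first_lines[1].strip())
    header = data[:header_size].decode("ascii", errors="replace")

    fields: Dict[str, HeaderValue] = {}
    for line in header.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_tag, value = parts
        if type_tag == "-i":
            fields[name] = int(value)
        elif type_tag == "-r":
            fields[name] = float(value)
        elif type_tag.startswith("-s"):
            fields[name] = value
    return fields


# Synthetic 1024-byte header with invented values, padded with spaces.
sample = (b"NIST_1A\n   1024\n"
          b"sample_rate -i 16000\n"
          b"channel_count -i 1\n"
          b"sample_coding -s3 pcm\n"
          b"end_head\n").ljust(1024)
fields = parse_sphere_header(sample)
```

For a file on disk, passing the first few kilobytes read in binary mode would be enough, provided the read covers the declared header size.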

Software Updates

Current versions of NIST software are available via the NIST Speech Software Website.


CONTACT INFORMATION


If you have questions regarding the Broadcast News data and protocols listed in this document, NIST software, data filtering, or scoring your recognizer output, contact jonathan.fiscus@nist.gov.

If you are interested in participating in future NIST speech recognition tests, contact david.pallett@nist.gov.


CAVEAT


Certain commercial equipment, instruments, software, and materials are identified on this CD-ROM in order to adequately specify experimental procedures used. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor does it imply that the equipment, instruments, software, or materials identified are necessarily the best available for the purpose.


CONTENT COPYRIGHT


Portions Copyright 1998 PRI-Public Radio International

Portions Copyright 1997-1998 ABC News

Portions Copyright 1998 NBC News

Portions Copyright 1997-1998 Cable News Network, Inc. All Rights Reserved.