1999 NIST Broadcast News Transcription
This CD-ROM contains the English evaluation test material used in the 1999 NIST
Broadcast News Transcription Evaluation administered by the NIST Spoken Natural Language Processing
Group and produced by the Linguistic
Data Consortium (LDC); catalog number LDC2000S88, ISBN
1-58563-176-0. For more complete information, see the 1999 Hub-4 Website. Please read
this document in its entirety before beginning a test.
English Test Material
Note that the
waveform and transcript data on this disc are licensed through the Linguistic Data Consortium (LDC) and are
subject to usage restrictions. Contact the
LDC for license agreement information.
The 1999 Broadcast News Evaluation
Specification contains the rules and conditions for implementing the
Broadcast News (BN) Transcription tests.
Instructions for self-scoring, and for preparing and submitting your results
to NIST for official scoring, will be distributed separately by email.
This year the BN transcription test includes two distinct 1.5-hour test
sets. The set1 test set contains material from the same test epoch
as was used to create set2 of the 1998 Broadcast News evaluation
and is meant to provide a means for year-to-year comparisons. This year's
set2 contains news broadcast material from the
late summer of 1998 and is meant to provide more contemporary data to exercise
systems' abilities to handle new words, speakers, etc. The waveforms and
related files have been named to correspond to these two test sets.
The following schedule pertains to all Broadcast News Transcription tests:
| October 25, 1999 | Deadline for site commitment to participate |
| November 1, 1999 | Evaluation test data to be at participating sites, test begins |
| December 30, 1999 (0700 EST) | Results to be submitted for ALL test conditions (baseline, contrast, and <10X systems) |
| January 12, 2000 | NIST releases scores for the Broadcast News evaluation tasks |
| | Tentative date for a "Speech Transcription Workshop" |
The Universal Transcription Format (UTF) used to annotate/transcribe the
1999 Broadcast News reference transcripts is documented in utf1_0v2.ps.
Evaluation Map Files
As in the 1998 Hub-4 evaluation, the 1999 Broadcast News Benchmark Test supports
only one CSR evaluation mode, in which no side information
is provided. The basic timing information required to implement the evaluation
is given in the map files, bn99en_1.uem (set1)
and bn99en_2.uem (set2). These files
contain only the pointers to the beginning and end of the complete test
sets. No side information is provided.
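As a rough illustration of how a map file might be read, the Python sketch below assumes that each non-comment line holds four whitespace-separated fields (file name, channel, start time, end time, with times in seconds) and that comment lines begin with ";;". That layout is an assumption, not something this document specifies, and the sample line is hypothetical rather than copied from the disc, so check bn99en_1.uem itself before relying on it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UemRegion:
    filename: str
    channel: str
    start: float  # seconds (assumed unit)
    end: float    # seconds (assumed unit)


def parse_uem(text: str) -> List[UemRegion]:
    """Parse map-file text, assuming 'file channel start end' per line."""
    regions = []
    for raw in text.splitlines():
        line = raw.strip()
        # Skip blank lines and ';;' comment lines (assumed convention).
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"expected 4 fields, got: {raw!r}")
        regions.append(
            UemRegion(fields[0], fields[1], float(fields[2]), float(fields[3]))
        )
    return regions


# Hypothetical sample content; the real file may differ.
sample = """;; illustrative only
bn99en_1 1 0.000 5400.000
"""
regions = parse_uem(sample)
total_seconds = sum(r.end - r.start for r in regions)
```

Summing the region lengths gives the amount of audio a system is expected to recognize, which can be compared against the stated 1.5 hours per test set.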
Automatically generated segmentation information for each of the two test
sets is provided in the files bn99en_1.seg
(set1) and bn99en_2.seg (set2).
Although sites are free to use any segmentation scheme of their choice,
these files are included for the convenience of sites without access to
segmentation algorithms and were generated using the CMUseg
Version 0.5 (compressed tar archive) automatic segmentation and classification
utility. The CMUseg utility has been graciously supplied to the DARPA community
by Carnegie Mellon University for use as a common means of acoustic segmentation.
Participants are not required to use this segmentation or the CMUseg
utility. Both have been supplied to facilitate participation in the test.
This year, the test material is contained in two SPHERE-formatted waveform
files. The file bn99en_1.sph (set1)
contains 1.5 hours of Broadcast News excerpts from last year's set2 epoch.
The file bn99en_2.sph (set2) contains
1.5 hours of Broadcast News excerpts from the summer of 1998. Each file
should be separately recognized per the Broadcast
News English Evaluation Specification.
The UTF-formatted reference transcriptions for the test material are included
in this publication in bn99en_1.utf
Reference STM Files
The reference STM file for the test material that were used in scoring
the test results with SCLITE is included in this publication in bn99en_1.stm,
Transcript Orthography Mapping File and Software
The orthography mapping file for the test material, which is used to pre-process
the reference and system-generated transcripts with tranfilt
Version 1.9 (compressed tar archive) prior to scoring, will be made
available after the primary test results are due. For your convenience,
the orthography mapping file used in the 1997 evaluation is available in
en991231.glm and en981118.glm.
SCLITE Speech Recognition Scoring Software
The NIST SCLITE Speech Recognition Scoring
Toolkit Version 1.2 (compressed tar archive) will be used to score
the results of the Broadcast News CSR tests.
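For sites that want a quick sanity check before running the official tools, the Python sketch below computes a word error rate from a word-level edit distance (substitutions, insertions, and deletions divided by the number of reference words). It is not SCLITE and will not reproduce its alignments or reports; the function names are hypothetical, and the lowercasing and whitespace tokenization are simplifying assumptions that ignore the orthography mapping step.

```python
from typing import List


def word_errors(ref: List[str], hyp: List[str]) -> int:
    """Minimum number of word substitutions, insertions, and deletions."""
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]  # distance from ref[:i] to an empty hypothesis
        for j, h in enumerate(hyp, 1):
            cost = 0 if r == h else 1
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + cost))  # match or substitution
        prev = cur
    return prev[-1]


def wer(ref_text: str, hyp_text: str) -> float:
    """Word error rate of hyp_text against ref_text (simplified)."""
    ref = ref_text.lower().split()
    hyp = hyp_text.lower().split()
    if not ref:
        raise ValueError("reference transcript is empty")
    return word_errors(ref, hyp) / len(ref)


# One deleted word out of six reference words.
example = wer("the cat sat on the mat", "the cat sat on mat")
```

Official results should always come from SCLITE run on filtered transcripts, since scoring conventions can shift the error counts.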
Speech Waveform Manipulation Utilities
The Broadcast News Benchmark Transcription Test waveform files are encoded
using the NIST SPeech HEader REsources (SPHERE) format and may be manipulated
using the SPHERE Version 2.6a (compressed
tar archive) utilities and libraries. If you have questions about installing
or using SPHERE, you may send email to firstname.lastname@example.org.
Note that SPHERE is currently available only for UNIX platforms.
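Where the SPHERE utilities cannot be installed, the header of a waveform file can still be inspected directly. The Python sketch below assumes the usual SPHERE header layout: an ASCII header that begins with the line "NIST_1A", followed by a line giving the header size in bytes, then "name -type value" fields terminated by "end_head". The function name is hypothetical and the sample field values are illustrative, not taken from bn99en_1.sph. It reads the header only and does not decode sample data.

```python
from typing import Dict, Union

HeaderValue = Union[int, float, str]


def parse_sphere_header(data: bytes) -> Dict[str, HeaderValue]:
    """Parse the ASCII header at the start of a SPHERE file."""
    first = data.split(b"\n", 2)
    if len(first) < 3 or first[0].strip() != b"NIST_1A":
        raise ValueError("data does not start with a SPHERE header")
    header_size = int(first[1].strip())  # typically 1024
    text = data[:header_size].decode("ascii", errors="replace")

    fields: Dict[str, HeaderValue] = {}
    for raw in text.split("\n")[2:]:
        line = raw.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore lines that are not 'name -type value'
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # '-sN' string field of N characters
            fields[name] = value
    return fields


# Illustrative header padded to 1024 bytes; values are hypothetical.
sample = (b"NIST_1A\n   1024\n"
          b"channel_count -i 1\n"
          b"sample_rate -i 16000\n"
          b"sample_coding -s3 pcm\n"
          b"end_head\n").ljust(1024, b" ")
info = parse_sphere_header(sample)
```

For a real file, read the first few kilobytes in binary mode, for example `open(path, "rb").read(4096)`, and pass the bytes to `parse_sphere_header`.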
Current versions of NIST software are available via the
Speech Software Website.
If you have questions regarding the Broadcast News data and protocols
listed in this document, NIST software, data filtering, or scoring your
recognizer output, contact
If you are interested in participating in future NIST speech recognition
Certain commercial equipment, instruments, software, and materials are
identified on this CD-ROM in order to adequately specify experimental procedures
used. Such identification does not imply recommendation or endorsement
by the National Institute of Standards and Technology (NIST), nor does
it imply that the equipment, instruments, software, or materials identified
are necessarily the best available for the purpose.
Portions Copyright 1998 PRI-Public Radio International
Portions Copyright 1997-1998 ABC News
Portions Copyright 1998 NBC News
Portions Copyright 1997-1998 Cable News Network, Inc. All Rights Reserved.