1996 DARPA/NIST Continuous Speech Recognition
Broadcast News Hub-4 Development Test Material
This CD-ROM contains the development test material used in conjunction
with the
1996 DARPA/NIST Continuous Speech Recognition Broadcast News Hub-4
Benchmark Test administered by the
NIST Spoken Natural Language Processing Group.
The 1996 Hub-4 Benchmark Test is documented in the proceedings
of the
1997 DARPA Speech Recognition Workshop.
To ease access, most of the links on this page point to files on this
CD-ROM. A few external links are also included for convenience.
Note: This CD-ROM has been made available for archival purposes
only. Since the 1996 Hub-4 Benchmark Test is past, please do NOT submit
results for this material to NIST for scoring unless you have made prior
explicit arrangements with NIST.
Note that the waveform and transcript data on this disc are licensed
through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact
the LDC for license agreement information.
HUB-4 DATA AND DOCUMENTATION
Instructions
Sites participating in the Hub-4 Benchmark Test were provided with a
detailed test specification document. The test specification document,
h496evsp.txt,
defined by the Hub-4 Working Group, contains the rules and conditions for
implementing the test.
The test implementation document,
h496evsc.txt,
contains instructions for preparing and scoring your test results
using the NIST scoring protocol and software.
The 1996 Hub-4 Benchmark Test supported Partitioned and Unpartitioned
Evaluation modes. In the Partitioned Evaluation, side information about
recording conditions, speaker changes, etc. was provided to the
recognizers, essentially performing the job of segmentation for them.
In the Unpartitioned Evaluation, no side information was provided to
the recognizers, except that long stretches without speech were skipped.
Evaluation Map files were created for each mode to provide pointers to
the precise waveform data to be recognized, along with the pertinent
side information for the Partitioned Evaluation.
Evaluation Map Files
To implement a Partitioned Evaluation development test, you will need to
use the
Partitioned Evaluation Map (PEM) file,
h496dt.pem.
This file contains pointers into the waveforms on the CD-ROM for
the "Partitions" to be evaluated. The Partitions are determined by
changes in any of several attributes which have been collated into
7 Focus Conditions. The Focus Conditions and determining attributes
are included for each Partition.
The Hub-4 focus condition
definitions are provided in the
Hub-4 Annotation Specification.
To implement the Unpartitioned
Evaluation test, you will need to use the Unpartitioned Evaluation Map
(UEM) file,
h496dt.uem.
This file contains pointers into the waveforms on this CD-ROM for
the excerpts to be evaluated.
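As a rough illustration of how a map file might be consumed, the sketch
below reads UEM-style records into a list of excerpts. It ASSUMES that
each record is a whitespace-separated line holding a waveform name, a
channel, a begin time, and an end time in seconds, and that lines
starting with ";;" are comments. The authoritative layout is the one
given in the Hub-4 Annotation Specification, so check h496dt.uem against
that document before relying on this. The names Excerpt and parse_uem and
the sample records are hypothetical.

```python
from typing import List, NamedTuple


class Excerpt(NamedTuple):
    """One stretch of a waveform file that should be recognized."""
    waveform: str
    channel: str
    begin: float  # seconds from the start of the waveform (assumed unit)
    end: float    # seconds from the start of the waveform (assumed unit)


def parse_uem(text: str) -> List[Excerpt]:
    """Parse UEM-style text under the ASSUMED four-field layout."""
    excerpts = []
    for line_no, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        # Skip blank lines and (assumed) ";;" comment lines.
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) < 4:
            raise ValueError(f"line {line_no}: expected 4 fields, got {len(fields)}")
        name, channel, begin, end = fields[:4]
        excerpts.append(Excerpt(name, channel, float(begin), float(end)))
    return excerpts


# Hypothetical records, not taken from the actual h496dt.uem file.
sample = """\
;; waveform channel begin end
example_a 1 0.00 120.50
example_b 1 35.25 610.00
"""
excerpts = parse_uem(sample)
total_seconds = sum(e.end - e.begin for e in excerpts)
```

Each Excerpt can then be used to cut the matching span out of the
corresponding waveform file before running recognition.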
Waveform Files
The following SPHERE-formatted waveform files, which contain
the Hub-4 Development Test material, are located in the "devdata"
directory of this CD-ROM. Each waveform file contains an
excerpt of a radio or television broadcast. Do NOT perform
recognition on all of the data in each waveform file. Perform recognition
on only the excerpts specified in the map file (PEM or UEM) for the
test you are implementing.
Other Documentation
The conventions and formats used to annotate and transcribe the Hub-4
data are documented in the 1996 Hub-4 Annotation Specification
for Evaluation of Speech Recognition
on Broadcast News, which is provided in three formats:
h4annot.doc (Microsoft Word 6.0 Version)
h4annot.ps (PostScript Version)
h4annot.pdf (Adobe Acrobat Version)
This document contains the Hub-4 Partitioning definitions as well as
the formats for the Partitioned Evaluation Map
(PEM) file, Unpartitioned Evaluation Map (UEM) file, and Segment Time Marked
(STM) file used in scoring recognition output.
Transcript Files
The transcription files corresponding to the above waveform files are
also located in the "devdata" directory of this CD-ROM and are
as follows:
Reference STM File
Reference Segment Time Mark (STM) files are used by the NIST scoring
software to score recognition results and may be derived from the
above transcript files using the Perl utility bn_filt.pl in conjunction
with the 1996 Hub-4 speaker information database spkr_rdb.sgm. The file
bn_filt.doc contains documentation for bn_filt.pl. For your convenience,
STM files for both Partitioned and Unpartitioned evaluation modes have
been generated for this development test data. The STM file for the
Partitioned Evaluation is located in h496dtpe.stm and the STM file for
the Unpartitioned Evaluation is located in h496dtue.stm.
Transcript Orthography Mapping Files
The NIST scoring software uses a set of rules files which transform
multiple representations of the same words into a predetermined single
representation prior to scoring. The file et96_1.glm contains rules for
global substitutions, lexical equivalents, and contractions, and the file
et96_1.utm contains rules for utterance-specific lexical equivalents
(this file is empty, but is still required by the program). Note that
these rules files have
been developed especially for the 1996 Hub-4 evaluation test data but
may be applied to the development test data as well. However, please
note that some word variants in the development test data may not be covered
in the rule set.
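The idea of mapping several spellings of a word onto one representation
before scoring can be sketched as follows. This is only an illustration
of the concept: the rule table is a hypothetical Python dictionary, NOT
the syntax or contents of et96_1.glm, and the case folding is an
assumption of the sketch rather than a documented behavior of the NIST
software.

```python
from typing import Dict, List

# Hypothetical rules for illustration only; not taken from et96_1.glm.
RULES: Dict[str, str] = {
    "ok": "okay",
    "o.k.": "okay",
    "gonna": "going to",
}


def normalize(words: List[str], rules: Dict[str, str]) -> List[str]:
    """Map each word to its single agreed representation.

    A rule may expand one word into several, so the replacement is
    split on whitespace before being appended.
    """
    normalized = []
    for word in words:
        key = word.lower()  # case folding is an assumption of this sketch
        normalized.extend(rules.get(key, key).split())
    return normalized


reference = normalize("I'm gonna say OK".split(), RULES)
hypothesis = normalize("I'm going to say okay".split(), RULES)
```

Applying the same mapping to both the reference and the recognizer
output keeps spelling variants from being counted as word errors.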
Speaker Information Database
The file spkr_rdb.sgm is an SGML-tagged
database file which contains information about each of the
speakers in the 1996 Hub-4 corpus.
NIST SOFTWARE
Scoring Software
You can score the output of your recognition system against the
reference transcriptions using the NIST SCLITE (Version 1.4a)
speech recognition scoring software located in the sclite directory of
this disc. Please refer to the files readme.txt and install.txt to get
started with SCLITE.
The documentation for using SCLITE is located in the file, sclite.htm. If you have questions
about installing or using SCLITE, you may send email to jonathan.fiscus@nist.gov.
If you would like to score your
results using the same protocol used in the November 1996 Hub-4
Benchmark Test, please follow the instructions in the file, h496evsc.txt.
Note that SCLITE is currently available only for UNIX platforms.
The file, h4_score.sh, contains a
sample script for scoring Hub-4 results using SCLITE.
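For readers who want a feel for what scoring measures, the sketch below
computes a word error rate from a word-level edit distance
(substitutions, deletions, and insertions divided by the number of
reference words). It is a simplified illustration, not SCLITE's
implementation: it ignores segment times, speaker information, and the
orthographic mapping rules, and the function name word_error_rate is
hypothetical. Use SCLITE and h4_score.sh for any results meant to be
comparable with the benchmark.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp, 1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution (sat -> sit) and one deletion (the) over six words.
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
```

Because insertions are counted, the rate can exceed 1.0 when the
hypothesis is much longer than the reference.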
Speech Waveform Manipulation Utilities
The Hub-4 Benchmark Test waveform files are encoded using the NIST SPeech
HEader REsources (SPHERE) format and may be manipulated using
the SPHERE (Version 2.6a) utilities and libraries located in the
sphere
directory of this disc. Please refer to the file
readme.txt to get started with SPHERE.
If you have questions about installing or using SPHERE, you may send
email to
jonathan.fiscus@nist.gov.
Note that SPHERE is currently available only for UNIX platforms.
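Where the SPHERE utilities cannot be built, the header of a waveform
file can still be inspected with a few lines of code. The sketch below
relies on the commonly documented SPHERE header layout: an ASCII header
that begins with the line "NIST_1A", followed by a line giving the header
size in bytes, then "name -type value" lines terminated by "end_head".
It is a minimal reader written for illustration, not part of the SPHERE
library; the function name parse_sphere_header is hypothetical, and the
header bytes in the example are synthetic rather than read from a file
on this disc.

```python
from typing import Dict, Union

HeaderValue = Union[int, float, str]


def parse_sphere_header(data: bytes) -> Dict[str, HeaderValue]:
    """Parse the ASCII header at the start of SPHERE-formatted data."""
    if not data.startswith(b"NIST_1A\n"):
        raise ValueError("missing NIST_1A magic line; not a SPHERE header")
    # The second 8-byte line holds the total header size, e.g. "   1024\n".
    header_size = int(data[8:16].decode("ascii").strip())
    text = data[:header_size].decode("ascii", errors="replace")

    fields: Dict[str, HeaderValue] = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, field_type, value = parts
        if field_type == "-i":      # integer field
            fields[name] = int(value)
        elif field_type == "-r":    # real-valued field
            fields[name] = float(value)
        else:                       # string field, written as -sN
            fields[name] = value
    return fields


# Synthetic header for demonstration, padded to the declared 1024 bytes.
raw = (b"NIST_1A\n   1024\n"
       b"sample_rate -i 16000\n"
       b"channel_count -i 1\n"
       b"sample_coding -s3 pcm\n"
       b"end_head\n").ljust(1024, b" ")
header = parse_sphere_header(raw)
```

With a real file, the same function can be applied to the first bytes
read in binary mode; the sample data begins immediately after the
declared header size.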
Software Updates
NIST software updates are made periodically and are available via
anonymous ftp to
jaguar.ncsl.nist.gov/pub.
CONTACT INFORMATION
If you have questions regarding the HUB-4 data and protocols listed
in this document, contact john.garofolo@nist.gov.
If you have questions regarding NIST software, data filtering, or
scoring your recognizer output, contact
jonathan.fiscus@nist.gov.
If you are interested in participating in future NIST speech recognition
tests, contact
david.pallett@nist.gov.
CAVEAT
Certain commercial equipment, instruments, software, and materials are
identified on this CD-ROM in order to adequately specify experimental
procedures used. Such identification does not imply recommendation or
endorsement by the National Institute of Standards and Technology
(NIST), nor does it imply that the equipment, instruments, software,
or materials identified are necessarily the best available for the
purpose.