1996 DARPA/NIST Continuous Speech Recognition
Broadcast News Hub-4 Development Test Material

This CD-ROM contains the development test material used in conjunction with the 1996 DARPA/NIST Continuous Speech Recognition Broadcast News Hub-4 Benchmark Test administered by the NIST Spoken Natural Language Processing Group. The 1996 Hub-4 Benchmark Test is documented in the proceedings of the 1997 DARPA Speech Recognition Workshop. Most of the links on this page point to files on this CD-ROM; a few external links are also included for convenience.

Note: This CD-ROM has been made available for archival purposes only. Since the 1996 Hub-4 Benchmark Test is past, please do NOT submit results for this material to NIST for scoring unless you have made prior explicit arrangements with NIST.

Note that the waveform and transcript data on this disc are licensed through the Linguistic Data Consortium (LDC) and are subject to usage restrictions. Contact the LDC for license agreement information.


HUB-4 DATA AND DOCUMENTATION


Instructions

Sites participating in the Hub-4 Benchmark Test were provided with a detailed test specification document. The test specification document, h496evsp.txt, defined by the Hub-4 Working Group, contains the rules and conditions for implementing the test.

The test implementation document, h496evsc.txt, contains instructions for preparing and scoring your test results using the NIST scoring protocol and software.

The 1996 Hub-4 Benchmark Test supported Partitioned and Unpartitioned Evaluation modes. In the Partitioned Evaluation, certain side information about recording conditions, speaker changes, etc. was provided to the recognizers, essentially performing the job of segmentation for them. In the Unpartitioned Evaluation, no side information was provided to the recognizers except that long stretches without speech were skipped over. Evaluation Map files were created for each mode to provide pointers to the precise waveform data to be recognized, along with the pertinent side information for the Partitioned Evaluation.


Evaluation Map Files

To implement a Partitioned Evaluation development test, you will need to use the Partitioned Evaluation Map (PEM) file, h496dt.pem. This file contains pointers into the waveforms on the CD-ROM for the "Partitions" to be evaluated. The Partitions are determined by changes in any of several attributes which have been collated into 7 Focus Conditions. The Focus Conditions and determining attributes are included for each Partition. The Hub-4 focus condition definitions are provided in the Hub-4 Annotation Specification.

To implement the Unpartitioned Evaluation test, you will need to use the Unpartitioned Evaluation Map (UEM) file, h496dt.uem. This file contains pointers into the waveforms on the test CD-ROM for the excerpts to be evaluated.


Waveform Files

The following SPHERE-formatted waveform files, which contain the Hub-4 Development Test material, are located in the "devdata" directory of this CD-ROM. Each waveform file contains an excerpt of a radio or television broadcast. Do NOT perform recognition on all of the data in each waveform file. Perform recognition only on the excerpts specified in the map file (PEM or UEM) for the test you are implementing.
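As an illustration of restricting recognition to the mapped excerpts, the following minimal Python sketch converts excerpt boundaries given in seconds into sample index ranges. It assumes you have already read the begin and end times from the PEM or UEM file (their formats are defined in the Hub-4 Annotation Specification and are not parsed here). The function name and the 16000 Hz sample rate in the usage note are assumptions for illustration only; take the actual rate from the SPHERE header of each waveform file.

```python
from typing import List, Tuple


def excerpt_sample_ranges(
    excerpts: List[Tuple[float, float]], sample_rate: int
) -> List[Tuple[int, int]]:
    """Convert (begin_sec, end_sec) pairs into half-open sample ranges.

    Hypothetical helper: the excerpt times are assumed to come from the
    map file for the test being implemented.
    """
    ranges = []
    for begin, end in excerpts:
        if begin < 0 or end <= begin:
            raise ValueError(f"invalid excerpt: {begin} .. {end}")
        first = int(round(begin * sample_rate))
        last = int(round(end * sample_rate))
        ranges.append((first, last))
    return ranges
```

For example, excerpt_sample_ranges([(0.0, 1.5)], 16000) yields one range covering samples 0 up to (but not including) 24000. Only the samples inside these ranges would be passed to the recognizer.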

Other Documentation

The conventions and formats used to annotate/transcribe the Hub-4 data are documented in the 1996 Hub-4 Annotation Specification for Evaluation of Speech Recognition on Broadcast News, which is provided in three formats:
h4annot.doc (Microsoft Word 6.0 Version)
h4annot.ps (PostScript Version)
h4annot.pdf (Adobe Acrobat Version)

This document contains the Hub-4 Partitioning definitions as well as the formats for the Partitioned Evaluation Map (PEM) file, Unpartitioned Evaluation Map (UEM) file, and Segment Time Marked (STM) file used in scoring recognition output.


Transcript Files

The transcription files corresponding to the above waveform files are also located in the "devdata" directory of this CD-ROM and are as follows:

Reference STM File

Reference Segment Time Mark (STM) files are used by the NIST scoring software to score recognition results and may be derived from the above transcript files using the Perl utility bn_filt.pl in conjunction with the 1996 Hub-4 speaker information database, spkr_rdb.sgm. The file bn_filt.doc contains documentation for bn_filt.pl. For your convenience, STM files for both Partitioned and Unpartitioned evaluation modes have been generated for this development test data. The STM file for the Partitioned Evaluation (PE) is h496dtpe.stm, and the STM file for the Unpartitioned Evaluation (UE) is h496dtue.stm.
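If you want to inspect an STM file programmatically, the following Python sketch may serve as a starting point. It assumes the commonly used STM record layout (waveform name, channel, speaker, begin time, end time, an optional label in angle brackets, then the transcript, with lines beginning with ";;" treated as comments). The authoritative format is the one given in the Hub-4 Annotation Specification and the SCLITE documentation, so verify against those before relying on it. The class and function names, and the records in the usage note, are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass
class StmSegment:
    waveform: str
    channel: str
    speaker: str
    begin: float           # seconds
    end: float             # seconds
    label: Optional[str]   # e.g. "<...>" if present, else None
    transcript: str


def parse_stm(lines: Iterable[str]) -> List[StmSegment]:
    """Parse STM records under the assumed layout described above."""
    segments = []
    for raw in lines:
        line = raw.strip()
        if not line or line.startswith(";;"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if len(fields) < 5:
            raise ValueError(f"too few fields in STM line: {line!r}")
        waveform, channel, speaker = fields[:3]
        begin, end = float(fields[3]), float(fields[4])
        rest = fields[5:]
        label = None
        # An optional label is a single token wrapped in angle brackets.
        if rest and rest[0].startswith("<") and rest[0].endswith(">"):
            label = rest[0]
            rest = rest[1:]
        segments.append(
            StmSegment(waveform, channel, speaker, begin, end, label, " ".join(rest))
        )
    return segments
```

A call such as parse_stm(open("h496dtpe.stm")) would return one StmSegment per record, which can then be grouped by waveform or by label.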


Transcript Orthography Mapping Files

The NIST scoring software uses a set of rules files which transform multiple representations of the same word into a single predetermined representation prior to scoring. The file et96_1.glm contains rules for global substitutions, lexical equivalents, and contractions. The file et96_1.utm contains rules for utterance-specific lexical equivalents (this file is empty but is still required by the program). These rules files were developed specifically for the 1996 Hub-4 evaluation test data but may be applied to the development test data as well. However, some word variants in the development test data may not be covered by the rule set.
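The idea behind these mappings can be illustrated with a short Python sketch. This is not the NIST implementation and does not read the .glm or .utm file formats. It only shows the general technique of mapping word variants to one canonical form before comparison. The rule table, the lowercasing step, and the function name are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical rules, NOT taken from et96_1.glm.
EXAMPLE_RULES: Dict[str, str] = {
    "o.k.": "okay",
    "ok": "okay",
    "gonna": "going to",
}


def normalize(words: List[str], rules: Dict[str, str]) -> List[str]:
    """Map each word to its canonical form; unknown words pass through.

    A rule may expand one word into several, so the replacement is split
    back into individual words.
    """
    out: List[str] = []
    for word in words:
        key = word.lower()  # assumption: comparison ignores case
        out.extend(rules.get(key, key).split())
    return out
```

Applying the same normalization to both the reference and the hypothesis keeps a recognizer from being penalized for a spelling variant. For instance, "O.K." in one transcript and "okay" in the other compare as equal after normalization.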


Speaker Information Database

The file, spkr_rdb.sgm, is an SGML tagged database file which contains information about each of the speakers in the 1996 Hub-4 corpus.


NIST SOFTWARE


Scoring Software

You can score the output of your recognition system against the reference transcriptions using the NIST SCLITE (Version 1.4a) speech recognition scoring software located in the sclite directory of this disc. Please refer to the files readme.txt and install.txt to get started with SCLITE. The documentation for using SCLITE is located in the file sclite.htm. If you have questions about installing or using SCLITE, you may send email to jonathan.fiscus@nist.gov.

If you would like to score your results using the same protocol used in the November 1996 Hub-4 Benchmark Test, please follow the instructions in the file, h496evsc.txt.

Note that SCLITE is currently available only for UNIX platforms.

The file, h4_score.sh, contains a sample script for scoring Hub-4 results using SCLITE.


Speech Waveform Manipulation Utilities

The Hub-4 Benchmark Test waveform files are encoded using the NIST SPeech HEader REsources (SPHERE) format and may be manipulated using the SPHERE (Version 2.6a) utilities and libraries located in the sphere directory of this disc. Please refer to the file readme.txt to get started with SPHERE. If you have questions about installing or using SPHERE, you may send email to jonathan.fiscus@nist.gov.

Note that SPHERE is currently available only for UNIX platforms.
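If the SPHERE utilities are not available on your platform, the header of a waveform file can still be inspected directly. The Python sketch below assumes the standard SPHERE header layout: a first line reading "NIST_1A", a second line giving the header size in bytes, then one "name -type value" line per field (type -i for integer, -r for real, -sN for a string of N characters), terminated by "end_head". It reads header fields only and does not decode the sample data, which may be compressed. The function name is hypothetical.

```python
from typing import Dict, Union


def read_sphere_header(data: bytes) -> Dict[str, Union[int, float, str]]:
    """Parse the ASCII header at the start of a SPHERE file's bytes."""
    parts = data.split(b"\n", 2)
    if len(parts) < 3 or parts[0] != b"NIST_1A":
        raise ValueError("not a SPHERE file: missing NIST_1A signature")
    header_size = int(parts[1].strip())
    header = data[:header_size].decode("ascii", errors="replace")

    fields: Dict[str, Union[int, float, str]] = {}
    for line in header.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        tokens = line.split(None, 2)
        if len(tokens) != 3:
            continue  # skip blank or unrecognized lines
        name, ftype, value = tokens
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": string value
            fields[name] = value
    return fields
```

Reading the first 1024 bytes of a file is usually enough to obtain the header size from the second line. If that size is larger, read that many bytes and parse again. Fields such as sample_rate and channel_count can then be used when locating the excerpts named in the map files.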


Software Updates

NIST software updates are made periodically and are available via anonymous FTP from jaguar.ncsl.nist.gov/pub.


CONTACT INFORMATION


If you have questions regarding the HUB-4 data and protocols listed in this document, contact john.garofolo@nist.gov.

If you have questions regarding NIST software, data filtering, or scoring your recognizer output, contact jonathan.fiscus@nist.gov.

If you are interested in participating in future NIST speech recognition tests, contact david.pallett@nist.gov.


CAVEAT


Certain commercial equipment, instruments, software, and materials are identified on this CD-ROM in order to adequately specify experimental procedures used. Such identification does not imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor does it imply that the equipment, instruments, software, or materials identified are necessarily the best available for the purpose.