1996 English Broadcast News Dev and Eval (HUB4)

Item Name: 1996 English Broadcast News Dev and Eval (HUB4)
Author(s): David Graff, Jennifer Alabiso, Jonathan G. Fiscus, John S. Garofolo, William Fisher, David Pallett
LDC Catalog No.: LDC97S66
ISBN: 1-58563-108-6
ISLRN: 827-422-903-193-6
DOI: https://doi.org/10.35111/gxsc-gf19
Member Year(s): 1997, 1998
DCMI Type(s): Sound
Sample Type: 1-channel pcm
Sample Rate: 16000
Data Source(s): broadcast news
Project(s): Hub4, GALE, EARS
Application(s): speech recognition
Language(s): English
Language ID(s): eng
License(s): NPR and USC Archive User Agreement
Online Documentation: LDC97S66 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Graff, David, et al. 1996 English Broadcast News Dev and Eval (HUB4) LDC97S66. Web Download. Philadelphia: Linguistic Data Consortium, 1997.
Related Works:

  • LDC97S44 - Speech data
  • LDC97S66 - Dev and eval
  • LDC97T22 - Transcripts


The 1996 Broadcast News Speech Corpus contains a total of 104 hours of broadcasts from ABC, CNN and CSPAN television networks and NPR and PRI radio networks with corresponding transcripts. The primary motivation for this collection is to provide training data for the DARPA "HUB4" Project on continuous speech recognition in the broadcast domain.


The speech files are available as a 19-disc training data set, with one additional disc of development data and one additional disc of evaluation data. The following programs are represented in this corpus:

  • ABC Nightline
  • ABC World Nightly News
  • ABC World News Tonight
  • CNN Early Edition
  • CNN Early Prime News
  • CNN Headline News
  • CNN Prime Time News
  • CNN The World Today
  • CSPAN Washington Journal
  • NPR All Things Considered
  • NPR Marketplace

    All recordings in this publication have been transcribed. The transcripts are manually time-aligned at the phrasal level and annotated with boundaries between news stories, speaker turn boundaries, and the gender of each speaker. The released transcripts are in SGML format; accompanying documentation and an SGML DTD file are included with the transcription release. The transcripts are available via FTP.
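As a rough illustration of how time-aligned, turn-annotated SGML transcripts might be consumed, the sketch below extracts speaker turns and converts their time stamps to sample offsets at the 16000 Hz sample rate listed above. The tag name `turn` and the attribute names `speaker`, `startTime`, and `endTime` are assumptions made for this example, as is the sample text; the actual markup should be taken from the DTD and documentation shipped with the transcription release.

```python
import re
from typing import List, NamedTuple

SAMPLE_RATE = 16000  # from the catalog entry: 1-channel PCM at 16000 Hz


class Turn(NamedTuple):
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str


# Hypothetical markup: <turn speaker=... startTime=... endTime=...> ... </turn>
# Check the real element and attribute names against the released DTD.
TURN_RE = re.compile(r"<turn\s+([^>]*)>(.*?)</turn>", re.IGNORECASE | re.DOTALL)
# Matches name=value pairs; values may be quoted but must not contain spaces.
ATTR_RE = re.compile(r'(\w+)\s*=\s*"?([^"\s>]+)"?')


def parse_turns(sgml: str) -> List[Turn]:
    """Collect speaker turns, dropping any tags nested inside the turn text."""
    turns = []
    for match in TURN_RE.finditer(sgml):
        attrs = {k.lower(): v for k, v in ATTR_RE.findall(match.group(1))}
        text = " ".join(re.sub(r"<[^>]+>", " ", match.group(2)).split())
        turns.append(Turn(
            speaker=attrs.get("speaker", ""),
            start=float(attrs.get("starttime", "nan")),
            end=float(attrs.get("endtime", "nan")),
            text=text,
        ))
    return turns


def to_samples(seconds: float, rate: int = SAMPLE_RATE) -> int:
    """Convert a time stamp in seconds to a sample index at the given rate."""
    return int(round(seconds * rate))


# Invented sample text, used only to exercise the parser.
SAMPLE = """
<section type=report startTime=0.0 endTime=9.5>
<turn speaker="Anchor_1" startTime=0.0 endTime=4.25>
Good evening. <time sec=2.0> Here is the news.
</turn>
<turn speaker="Reporter_1" startTime=4.25 endTime=9.5>
Thank you.
</turn>
</section>
"""

if __name__ == "__main__":
    for turn in parse_turns(SAMPLE):
        print(turn.speaker, to_samples(turn.start), to_samples(turn.end), turn.text)
```

A regular expression is used here only because SGML allows unquoted attribute values and omitted end tags that XML parsers reject; for real use, a DTD-aware SGML parser would be a more robust choice.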


    There are no updates at this time.

    Additional Licensing Instructions

    This 'members-only' corpus is available to current members, who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.
