RT-04 MDE Training Data Speech

Item Name: RT-04 MDE Training Data Speech
Author(s): Haejoong Lee, Stephanie Strassel
LDC Catalog No.: LDC2005S16
ISBN: 1-58563-355-0
ISLRN: 514-959-558-272-6
DOI: https://doi.org/10.35111/27r9-h809
Release Date: August 17, 2005
Member Year(s): 2005
DCMI Type(s): Sound
Sample Type: varied
Sample Rate: varied
Data Source(s): broadcast news, telephone conversations
Project(s): EARS, GALE
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2005S16 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Lee, Haejoong, and Stephanie Strassel. RT-04 MDE Training Data Speech LDC2005S16. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
RT-04 MDE Training Data Speech was developed by the Linguistic Data Consortium (LDC) and contains approximately 63 hours of English broadcast news and conversational telephone speech (CTS).

This corpus was created to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The data was created and is distributed by LDC, and was previously released to the EARS MDE community as LDC2004E31.

The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words like filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries at natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation, and standardized spelling, plus sensible conventions for representing speaker turns and identity, are further elements of the readable transcript. LDC has defined a SimpleMDE annotation task specification and has annotated English telephone and broadcast news data to provide training data for MDE.

The transcript and annotation files corresponding to this release are available as RT-04 MDE Training Data Text/Annotations (LDC2005T24).


The corpus contains 419 files, with 22.6 hours of broadcast news (BN) and 40.4 hours of CTS. The CTS data was drawn from Switchboard-1 Release 2 (LDC97S62).

The BN speech data was drawn from the 1997 English Broadcast News Speech (Hub-4) corpus, from four distinct sources:

Name                            Abbreviation  Years Collected
American Broadcasting Company   ABC           1998, 2001
National Broadcasting Company   NBC           1998, 2001
Public Radio International      PRI           1998
Cable News Network              CNN           2001

The audio data in this corpus conforms to the following technical specifications:

Type  Format  Encoding    Channels  Sample Rate
CTS   WAVE    u-Law       2         8000 Hz
BN    WAVE    16-bit PCM  1         16000 Hz
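
The specifications in the table above can be checked against a file header. The following is a minimal sketch, not part of the corpus documentation. It assumes the files are standard RIFF/WAVE as the table states, that u-Law audio uses the standard WAVE format tag 7 with 8 bits per sample (the table does not give the bit depth), and that PCM uses format tag 1. The names read_wave_format, classify, and make_header are hypothetical helpers introduced for illustration. The header is parsed with struct because Python's built-in wave module handles PCM only.

```python
import io
import struct

# Expected (format_tag, channels, sample_rate, bits_per_sample) per data type.
# Tags 1 (PCM) and 7 (mu-law) are standard WAVE format codes.
# ASSUMPTION: 8 bits per sample for u-Law; the table does not state it.
EXPECTED = {
    "CTS": (7, 2, 8000, 8),
    "BN": (1, 1, 16000, 16),
}


def read_wave_format(stream):
    """Return (format_tag, channels, sample_rate, bits_per_sample)."""
    riff, _size, wave_id = struct.unpack("<4sI4s", stream.read(12))
    if riff != b"RIFF" or wave_id != b"WAVE":
        raise ValueError("not a RIFF/WAVE stream")
    while True:
        header = stream.read(8)
        if len(header) < 8:
            raise ValueError("no fmt chunk found")
        chunk_id, chunk_size = struct.unpack("<4sI", header)
        if chunk_id == b"fmt ":
            fmt = stream.read(chunk_size)
            if len(fmt) < 16:
                raise ValueError("fmt chunk too short")
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                "<HHIIHH", fmt[:16]
            )
            return tag, channels, rate, bits
        # Skip any other chunk; chunks are padded to an even byte count.
        stream.seek(chunk_size + (chunk_size & 1), 1)


def classify(fmt):
    """Return "CTS" or "BN" if fmt matches the table, else None."""
    for name, expected in EXPECTED.items():
        if fmt == expected:
            return name
    return None


def make_header(tag, channels, rate, bits):
    """Build a synthetic WAVE header with an empty data chunk (demo only)."""
    align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate, rate * align, align, bits)
    body = b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
    body += b"data" + struct.pack("<I", 0)
    return b"RIFF" + struct.pack("<I", len(body)) + body


# Demo on a synthetic in-memory header, so no corpus file is needed.
demo = io.BytesIO(make_header(7, 2, 8000, 8))
print(classify(read_wave_format(demo)))  # prints CTS
```

To check a real file, open it in binary mode and pass the file object to read_wave_format; a result of None from classify means the header does not match either row of the table under the assumptions above.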


For an example of the data in this publication, please listen to this broadcast news (WAV) sample and this telephone conversation (WAV) sample.


