RT-03 MDE Training Data Speech

Item Name: RT-03 MDE Training Data Speech
Author(s): Stephanie Strassel, Christopher Walker, Haejoong Lee
LDC Catalog No.: LDC2004S08
ISBN: 1-58563-300-3
ISLRN: 111-911-583-428-2
DOI: https://doi.org/10.35111/2jrd-8f06
Release Date: June 15, 2004
Member Year(s): 2004
DCMI Type(s): Sound
Sample Type: u-law, pcm
Data Source(s): telephone speech, broadcast news
Project(s): GALE, EARS
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2004S08 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Strassel, Stephanie, Christopher Walker, and Haejoong Lee. RT-03 MDE Training Data Speech LDC2004S08. Web Download. Philadelphia: Linguistic Data Consortium, 2004.

Introduction

The RT-03 MDE Training Data Speech corpus was produced by the Linguistic Data Consortium (LDC) and is distributed under catalog number LDC2004S08 and ISBN 1-58563-300-3.

This data was originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes.

The data in this release consists of English Conversational Telephone Speech (CTS) and Broadcast News (BN) audio data. The corresponding transcripts and annotations are available as MDE RT-03 Training Data Text and Annotations.

Data

The corpus contains 633 files totalling approximately 5.39 GB (uncompressed) and representing over 60 hours of recorded speech: approximately 20 hours of Broadcast News and over 40 hours of Conversational Telephone Speech. The annotated data was originally developed to support the DARPA EARS Metadata Extraction (MDE) Program, and was distributed as training data for the RT-03F evaluation cycle.
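
As a rough sanity check on these figures, the raw audio payload can be estimated from the hour counts and the encodings listed under Data Format. This is a minimal sketch, not an official calculation: it assumes exactly 40 hours of CTS and 20 hours of BN (the text says "over 40" and "approximately 20"), ignores file headers, and treats 1 GB as 10^9 bytes. The function name audio_bytes is hypothetical.

```python
# Back-of-the-envelope size estimate from the corpus description.
# Assumption: hour counts are the rounded figures from the text.

def audio_bytes(hours, sample_rate, channels, bytes_per_sample):
    """Raw sample payload in bytes for uncompressed audio."""
    return int(hours * 3600 * sample_rate * channels * bytes_per_sample)

# CTS: 8000 samples/sec, 2 channels, u-law (1 byte per sample)
cts_bytes = audio_bytes(40, 8000, 2, 1)
# BN: 16000 samples/sec, 1 channel, 16-bit PCM (2 bytes per sample)
bn_bytes = audio_bytes(20, 16000, 1, 2)

total_bytes = cts_bytes + bn_bytes
print(f"CTS  : {cts_bytes / 1e9:.2f} GB")
print(f"BN   : {bn_bytes / 1e9:.2f} GB")
print(f"Total: {total_bytes / 1e9:.2f} GB")
```

Under these assumptions each portion comes to about 2.3 GB, so the estimate lands somewhat below the quoted 5.39 GB. That is the direction one would expect when the hour counts are rounded down, but the sketch is only a plausibility check and does not reconstruct the exact figure.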

The CTS data was drawn from the Switchboard-1 Release 2 corpus.

The BN speech data was drawn from the 1997 English Broadcast News Speech (HUB4) corpus, from four distinct sources:

American Broadcasting Company (ABC) (1998, 2001)
National Broadcasting Company (NBC) (1998, 2001)
Public Radio International (PRI) (1998)
Cable News Network (CNN) (2001)

Data Format

The audio data in this corpus conforms to the following technical specifications:

    Type  Format  Encoding    Channels  Sample Rate
    CTS   WAVE    u-law       2         8000 samples/sec
    BN    WAVE    16-bit PCM  1         16000 samples/sec

Note that the data is distributed in WAVE format because this is the audio file format supported by our open-source annotation tool (MDE Tool), with which the annotation data is best explored.
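
To check that a given file matches the specifications above, the WAVE header can be inspected directly. The following is a minimal sketch rather than an LDC-provided utility: the function names read_wave_format and make_header are hypothetical, and the demo builds a synthetic header in memory instead of reading a corpus file. It parses the 'fmt ' chunk by hand because Python's standard wave module reads only PCM-encoded files, which would exclude the u-law CTS data.

```python
import struct

# Format tags from the RIFF/WAVE registry: 1 = PCM, 7 = u-law
FORMAT_TAGS = {0x0001: "PCM", 0x0007: "u-law"}

def read_wave_format(data: bytes) -> dict:
    """Return the main fields of the 'fmt ' chunk of a RIFF/WAVE byte string."""
    if data[0:4] != b"RIFF" or data[8:12] != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")
    pos = 12
    while pos + 8 <= len(data):
        chunk_id, chunk_size = struct.unpack_from("<4sI", data, pos)
        if chunk_id == b"fmt ":
            tag, channels, rate, _byte_rate, _align, bits = struct.unpack_from(
                "<HHIIHH", data, pos + 8
            )
            return {
                "encoding": FORMAT_TAGS.get(tag, f"unknown ({tag})"),
                "channels": channels,
                "sample_rate": rate,
                "bits_per_sample": bits,
            }
        # Chunks are padded to an even number of bytes
        pos += 8 + chunk_size + (chunk_size & 1)
    raise ValueError("no 'fmt ' chunk found")

def make_header(tag, channels, rate, bits):
    """Build a synthetic WAVE header with an empty data chunk (demo only)."""
    block_align = channels * bits // 8
    fmt = struct.pack("<HHIIHH", tag, channels, rate,
                      rate * block_align, block_align, bits)
    body = (b"WAVE" + b"fmt " + struct.pack("<I", len(fmt)) + fmt
            + b"data" + struct.pack("<I", 0))
    return b"RIFF" + struct.pack("<I", len(body)) + body

cts_like = read_wave_format(make_header(7, 2, 8000, 8))
bn_like = read_wave_format(make_header(1, 1, 16000, 16))
print(cts_like)
print(bn_like)
```

For a real file, the first few kilobytes read in binary mode would normally be enough to pass to read_wave_format, assuming the 'fmt ' chunk appears near the start of the file as it usually does.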

Annotations

The transcripts corresponding to this speech have been annotated for various kinds of metadata. The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means creating automatic transcripts that are maximally readable. To this end, LDC has defined a SimpleMDE annotation task.

Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh" and "um," discourse markers like "you know," asides and parentheticals, and editing terms like "sorry" and "I mean." Edit disfluencies are also identified; the full extent of each disfluency (or string of adjacent disfluencies) and its interruption points are tagged.

Annotators further identify SUs (variously glossed as semantic units, sense units, syntactic units, slash units or sentence units); that is, units within the discourse that express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, here by presenting information in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs).
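
The annotation categories described above can be pictured as labels attached to transcript tokens. The sketch below is purely illustrative and does not reproduce the corpus's actual annotation file format or the MDE Tool's data model: the class names, the field names and the readable function are all hypothetical, and the cleanup rule (drop every filler and every token inside an edit disfluency, then break the text at sentence-level SU boundaries) is a simplifying assumption about what a "maximally readable" transcript might look like.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Optional

class Filler(Enum):
    FILLED_PAUSE = "filled pause"
    DISCOURSE_MARKER = "discourse marker"
    ASIDE_PARENTHETICAL = "aside/parenthetical"
    EDITING_TERM = "editing term"

class SUType(Enum):
    STATEMENT = "statement"
    QUESTION = "question"
    BACKCHANNEL = "backchannel"
    INCOMPLETE = "incomplete"

@dataclass
class Token:
    text: str
    filler: Optional[Filler] = None   # set if the token belongs to a filler
    in_edit: bool = False             # True inside an edit disfluency
    su_end: Optional[SUType] = None   # set on the last token of a sentence-level SU

# Hypothetical punctuation per SU type; anything else ends with a period
SU_PUNCT = {SUType.QUESTION: "?", SUType.INCOMPLETE: "..."}

def readable(tokens: List[Token]) -> List[str]:
    """Drop fillers and edit regions, then emit one line per SU."""
    lines, current = [], []
    for tok in tokens:
        if tok.filler is None and not tok.in_edit:
            current.append(tok.text)
        if tok.su_end is not None:
            if current:
                lines.append(" ".join(current) + SU_PUNCT.get(tok.su_end, "."))
            current = []
    if current:  # trailing words with no SU boundary
        lines.append(" ".join(current))
    return lines

# Invented utterance: "uh i want i need the report you know / is it ready"
demo = [
    Token("uh", filler=Filler.FILLED_PAUSE),
    Token("i", in_edit=True), Token("want", in_edit=True),
    Token("i"), Token("need"), Token("the"), Token("report"),
    Token("you", filler=Filler.DISCOURSE_MARKER),
    Token("know", filler=Filler.DISCOURSE_MARKER, su_end=SUType.STATEMENT),
    Token("is"), Token("it"), Token("ready", su_end=SUType.QUESTION),
]
lines = readable(demo)
print(lines)
```

Sub-sentence SU boundaries (coordination and clausal SUs) and interruption points are left out of this sketch to keep it short.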

General information about the EARS MDE annotation effort, including free annotation tools and annotation guidelines, can be found on LDC's main EARS MDE Project Page.

Updates

There are no updates available at this time.

Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania
