RT-04 MDE Training Data Speech


Item Name: RT-04 MDE Training Data Speech
Authors: Haejoong Lee and Stephanie Strassel
LDC Catalog No.: LDC2005S16
ISBN: 1-58563-355-0
Release Date: Aug 17, 2005
Data Type: speech
Sample Rate: varied Hz
Sampling Format: varied
Data Source(s): broadcast news, telephone conversations
Project(s): EARS, GALE
Language(s): English
Language ID(s): eng
Distribution: 2 DVD
Member fee: $0 for 2005 members
Non-member Fee: US $2000.00
Reduced-License Fee: US $1000.00
Extra-Copy Fee: US $400.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Haejoong Lee and Stephanie Strassel. 2005. RT-04 MDE Training Data Speech. Linguistic Data Consortium, Philadelphia.

Introduction

This file contains documentation for RT-04 MDE Training Data Speech, Linguistic Data Consortium (LDC) catalog number LDC2005S16, ISBN 1-58563-355-0.

This corpus was created and distributed by the Linguistic Data Consortium to provide training data for the RT-04 Fall Metadata Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The data was previously released to the EARS MDE community as LDC2004E31.

The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are more useful to humans and to downstream automatic processes. In simple terms, this means creating automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: flagging non-content words such as filled pauses and discourse markers for optional removal; marking sections of disfluent speech; and creating boundaries at natural breakpoints in the flow of speech, so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation, and standardized spelling, plus sensible conventions for representing speaker turns and identity, are further elements of the readable transcript. LDC has defined a SimpleMDE annotation task specification and has annotated English telephone conversation and broadcast news data to provide training data for MDE.

Samples

For an example of the data in this publication, see the broadcast news sample and the telephone conversation sample.

Content Copyright

Portions © 2004 Trustees of the University of Pennsylvania, © 2003 American Broadcasting Company, © 2003 National Broadcasting Company, © 2003 Public Radio International, © 2003 Cable News Network, Inc. All Rights Reserved, © 2003 National Cable Satellite Corporation.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.