RT-03 MDE Training Data Text and Annotations

Item Name:	RT-03 MDE Training Data Text and Annotations
Author(s):	Stephanie Strassel, Christopher Walker, Haejoong Lee
LDC Catalog No.:	LDC2004T12
ISBN:	1-58563-301-1
ISLRN:	754-359-961-593-5
DOI:	https://doi.org/10.35111/ztjc-kx37
Release Date:	June 15, 2004
Member Year(s):	2004
DCMI Type(s):	Text
Data Source(s):	broadcast news, telephone conversations
Project(s):	EARS, GALE
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2004T12 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Strassel, Stephanie, Christopher Walker, and Haejoong Lee. RT-03 MDE Training Data Text and Annotations LDC2004T12. Web Download. Philadelphia: Linguistic Data Consortium, 2004.
Related Works:	isAnnotationOf: LDC97S62 Switchboard-1 Release 2; LDC98S71 1997 English Broadcast News Speech (HUB4); LDC2004S08 RT-03 MDE Training Data Speech. hasContinuation: LDC2005T24 RT-04 MDE Training Data Text/Annotations. isSimilarWith: LDC2009T01 English CTS Treebank with Structural Metadata. isCreatedBy: Annotation Graph Toolkit (http://agtk.sourceforge.net/).

Introduction

RT-03 MDE Training Data Text and Annotations was produced by the Linguistic Data Consortium (LDC) and contains transcripts and metadata annotations for approximately 20 hours of Broadcast News (BN) and 40 hours of Conversational Telephone Speech (CTS) in English.

This data was originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes.

The corresponding speech data for these files is available as RT-03 MDE Training Data Speech (LDC2004S08).

Data

The release contains 633 files, totaling approximately 747 MB and 764,978 tokens. The annotated data was originally distributed as training data for the RT-03F evaluation cycle.

The CTS data was drawn from Switchboard-1 Release 2 (LDC97S62). There are two sets of CTS data. The main set is located in the "train" folder of the release and contains 377 files of text and annotation representing 40 hours of audio. The "meteer-mapped" folder contains another 40 files with Meteer annotation representing approximately 6 hours of audio. The Meteer annotation specifications differ from the SimpleMDE specifications in important ways; these files are included so that the two annotation modes can be compared.

The BN speech data was drawn from 1997 English Broadcast News Speech (HUB4) (LDC98S71), from four distinct sources:

American Broadcasting Company (1998, 2001)
National Broadcasting Company (1998, 2001)
Public Radio International (1998)
Cable News Network (2001)

In simple terms, the main goal of MDE is the creation of automatic transcripts that are maximally readable. To this end, LDC has defined a SimpleMDE annotation task.

Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh" and "um," discourse markers like "you know," asides and parentheticals, and editing terms like "sorry" and "I mean." Edit disfluencies are also identified; the full extent of the disfluency (or string of adjacent disfluencies) and the interruption points are tagged.

Annotators further identify SUs (variously expanded as semantic units, sense units, syntactic units, slash units, or sentence units); that is, units within the discourse that function to express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, in this case by creating a transcript in which information is presented in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels, and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs).
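The SimpleMDE categories described above can be pictured as a small data structure. The following Python sketch is illustrative only: the class names, the EDIT constant, the Span type, and the clean_tokens helper are hypothetical and are not LDC's label inventory or tooling. It also assumes, for the sake of the example, that a "more readable" transcript is obtained by dropping fillers and the tagged extent of each edit disfluency.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Union


class FillerType(Enum):
    # The four filler types named in the SimpleMDE description.
    FILLED_PAUSE = "filled pause"            # e.g. "uh", "um"
    DISCOURSE_MARKER = "discourse marker"    # e.g. "you know"
    ASIDE_PARENTHETICAL = "aside/parenthetical"
    EDITING_TERM = "editing term"            # e.g. "sorry", "I mean"


class SUType(Enum):
    # Sentence-level SUs.
    STATEMENT = "statement"
    QUESTION = "question"
    BACKCHANNEL = "backchannel"
    INCOMPLETE = "incomplete"
    # Sub-sentence SU boundaries.
    COORDINATION = "coordination"
    CLAUSAL = "clausal"


SENTENCE_LEVEL_SUS = frozenset(
    {SUType.STATEMENT, SUType.QUESTION, SUType.BACKCHANNEL, SUType.INCOMPLETE}
)

EDIT = "edit"  # hypothetical label for the extent of an edit disfluency


@dataclass(frozen=True)
class Span:
    start: int                      # index of first token (inclusive)
    end: int                        # index after last token (exclusive)
    label: Union[FillerType, str]   # a FillerType member or EDIT


def clean_tokens(tokens: List[str], spans: List[Span]) -> List[str]:
    """Drop every token covered by a filler span or an edit span."""
    drop = set()
    for span in spans:
        if isinstance(span.label, FillerType) or span.label == EDIT:
            drop.update(range(span.start, span.end))
    return [tok for i, tok in enumerate(tokens) if i not in drop]


# Invented utterance, not taken from the corpus.
tokens = ["uh", "we", "went", "I", "mean", "we", "drove", "home"]
spans = [
    Span(0, 1, FillerType.FILLED_PAUSE),   # "uh"
    Span(1, 3, EDIT),                      # "we went" (abandoned start)
    Span(3, 5, FillerType.EDITING_TERM),   # "I mean"
]
cleaned = clean_tokens(tokens, spans)
print(" ".join(cleaned))
```

Separating the sentence-level SU types from the sub-sentence boundaries in SENTENCE_LEVEL_SUS mirrors the distinction drawn in the text; whether a downstream system treats them differently is a design decision outside the scope of this sketch.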

The data appears in two formats. The AG Atlas (ag.xml) format represents the native annotation format, and utilizes the Annotation Graph Library.

The data is also provided in the RTTM format, developed by NIST to support the EARS Program. RTTM labels each token in the reference transcript according to the properties it displays: lexeme vs. non-lexeme; edit, filler, SU, etc.
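As a rough illustration of working with token-level RTTM labels, the Python sketch below reads RTTM-style lines and tallies them by object type. It rests on assumptions that should be checked against the documentation distributed with the corpus: that each record has nine whitespace-separated fields (type, file, channel, begin time, duration, orthography, subtype, speaker name, confidence), that "<NA>" marks an empty field, and that lines beginning with ";;" are comments. The RttmRecord type, the parse_rttm function, and the sample lines are invented for this example.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Optional


class RttmRecord(NamedTuple):
    rtype: str                 # object type, e.g. LEXEME, FILLER, SU
    file: str
    chnl: str
    tbeg: Optional[float]      # start time in seconds
    tdur: Optional[float]      # duration in seconds
    ortho: Optional[str]       # token text, when the object has one
    stype: Optional[str]       # subtype of the object
    name: Optional[str]        # speaker name
    conf: Optional[float]      # confidence, usually empty in references


def _text(value: str) -> Optional[str]:
    return None if value == "<NA>" else value


def _number(value: str) -> Optional[float]:
    return None if value == "<NA>" else float(value)


def parse_rttm(lines: Iterable[str]) -> Iterator[RttmRecord]:
    """Yield one record per data line, skipping blanks and ';;' comments."""
    for line in lines:
        line = line.strip()
        if not line or line.startswith(";;"):
            continue
        fields = line.split()
        if len(fields) < 9:
            raise ValueError(f"expected at least 9 fields, got: {line!r}")
        f = fields[:9]  # ignore any trailing fields beyond the assumed nine
        yield RttmRecord(
            f[0], f[1], f[2],
            _number(f[3]), _number(f[4]),
            _text(f[5]), _text(f[6]), _text(f[7]),
            _number(f[8]),
        )


# Illustrative lines only; file name, times, and subtypes are made up.
SAMPLE = """\
;; hypothetical example, not taken from the corpus
LEXEME sw_demo 1 0.50 0.20 we lex spkr1 <NA>
FILLER sw_demo 1 0.10 0.30 <NA> filled_pause spkr1 <NA>
SU sw_demo 1 0.10 1.90 <NA> statement spkr1 <NA>
"""

records = list(parse_rttm(SAMPLE.splitlines()))
counts = Counter(r.rtype for r in records)
print(counts)
```

To run this over a real file, pass an open file handle to parse_rttm instead of SAMPLE.splitlines(); the strict field-count check is deliberate, so that a layout differing from the assumed one fails loudly rather than being silently misread.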

General information about the EARS MDE Annotation effort can be found under the EARS heading at LDC's Past Projects Page.

Updates

There are no updates available at this time.

Copyright

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston. Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania