RT-03 MDE Training Data Text and Annotations
Item Name: | RT-03 MDE Training Data Text and Annotations |
Author(s): | Stephanie Strassel, Christopher Walker, Haejoong Lee |
LDC Catalog No.: | LDC2004T12 |
ISBN: | 1-58563-301-1 |
ISLRN: | 754-359-961-593-5 |
DOI: | https://doi.org/10.35111/ztjc-kx37 |
Release Date: | June 15, 2004 |
Member Year(s): | 2004 |
DCMI Type(s): | Text |
Data Source(s): | broadcast news, telephone conversations |
Project(s): | EARS, GALE |
Language(s): | English |
Language ID(s): | eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2004T12 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Strassel, Stephanie, Christopher Walker, and Haejoong Lee. RT-03 MDE Training Data Text and Annotations LDC2004T12. Web Download. Philadelphia: Linguistic Data Consortium, 2004. |
Introduction
RT-03 MDE Training Data Text and Annotations was produced by the Linguistic Data Consortium (LDC) and contains transcripts and metadata annotations for approximately 20 hours of Broadcast News (BN) and 40 hours of Conversational Telephone Speech (CTS) in English.
This data was originally created to support the Metadata Extraction (MDE) task of the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program. The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are more useful to humans and to downstream automatic processes.
The corresponding speech data for these files is available as RT-03 MDE Training Data Speech (LDC2004S08).
Data
The release contains 633 files totalling approximately 747 MB and 764,978 tokens. The annotated data was originally distributed as training data for the RT-03F evaluation cycle.
The CTS data was drawn from Switchboard-1 Release 2 (LDC97S62). There are two sets of CTS data. The main set is located in the "train" folder of the release and contains 377 files of text and annotation representing 40 hours of audio. The "meteer-mapped" folder contains another 40 files with Meteer annotation, representing approximately 6 hours of audio. The Meteer annotation specifications differ from the SimpleMDE specifications in important ways; these files are included so that the two annotation approaches can be compared.
The BN speech data was drawn from 1997 English Broadcast News Speech (HUB4) (LDC98S71), from four distinct sources:
- American Broadcasting Company (1998, 2001)
- National Broadcasting Company (1998, 2001)
- Public Radio International (1998)
- Cable News Network (2001)
In simple terms, the main goal of MDE is to produce automatic transcripts that are maximally readable. To this end, LDC has defined a SimpleMDE annotation task. Under SimpleMDE, annotators identify four types of fillers: filled pauses such as "uh" and "um," discourse markers such as "you know," asides and parentheticals, and editing terms such as "sorry" and "I mean." Edit disfluencies are also identified; the full extent of each disfluency (or string of adjacent disfluencies) and its interruption points are tagged.
Annotators further identify SUs (variously glossed as semantic units, sense units, syntactic units, slash units, or sentence units); that is, units within the discourse that function to express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, in this case by creating a transcript in which information is presented in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels, and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs).
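The SimpleMDE categories described above can be summarized as a small data model. The following is only an illustrative sketch: the class names, member names, and the `filler_type_of` helper are hypothetical and are not the tag strings used in the corpus files. The example words are the ones quoted in the description above.

```python
from enum import Enum
from typing import Optional


class Filler(Enum):
    """The four filler types named in the SimpleMDE description."""
    FILLED_PAUSE = "filled pause"
    DISCOURSE_MARKER = "discourse marker"
    ASIDE_PARENTHETICAL = "aside or parenthetical"
    EDITING_TERM = "editing term"


class SentenceSU(Enum):
    """The four sentence-level SU types."""
    STATEMENT = "statement"
    QUESTION = "question"
    BACKCHANNEL = "backchannel"
    INCOMPLETE = "incomplete"


class SubSentenceSU(Enum):
    """The two sub-sentence SU boundary types."""
    COORDINATION = "coordination"
    CLAUSAL = "clausal"


# Example words taken from the description; real annotation depends on
# context, so this table is illustrative only.
FILLER_EXAMPLES = {
    "uh": Filler.FILLED_PAUSE,
    "um": Filler.FILLED_PAUSE,
    "you know": Filler.DISCOURSE_MARKER,
    "sorry": Filler.EDITING_TERM,
    "i mean": Filler.EDITING_TERM,
}


def filler_type_of(phrase: str) -> Optional[Filler]:
    """Hypothetical helper: look up a phrase in the example table."""
    return FILLER_EXAMPLES.get(phrase.strip().lower())
```

A naive lookup like this ignores context (for instance, "you know" is not always a discourse marker), which is exactly the kind of judgment the human annotation provides.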
The data appears in two formats. The AG Atlas (ag.xml) format is the native annotation format and uses the Annotation Graph Library.
The data is also provided in the RTTM format, developed by NIST to support the EARS Program. RTTM labels each token in the reference transcript according to the properties it displays: lexeme vs. non-lexeme; edit, filler, SU, etc.
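As a rough illustration of how RTTM files might be read, the sketch below parses whitespace-separated records and counts them by type. It rests on several assumptions that should be checked against the NIST specification in the corpus documentation: that each record has nine fields in the order type, file, channel, start time, duration, orthography, subtype, speaker name, confidence; that `<NA>` marks an empty field; and that lines beginning with `;;` are comments. The sample records, file ID, and speaker name are invented for the example and are not taken from the corpus.

```python
from collections import Counter
from typing import NamedTuple, Optional


class RttmRecord(NamedTuple):
    # Assumed nine-field layout; verify against the NIST RTTM specification.
    type: str
    file: str
    channel: str
    start: Optional[float]
    duration: Optional[float]
    ortho: Optional[str]
    subtype: Optional[str]
    speaker: Optional[str]
    conf: Optional[float]


def _opt(value: str, cast=str):
    """Map the assumed '<NA>' placeholder to None, otherwise cast the value."""
    return None if value == "<NA>" else cast(value)


def parse_rttm_line(line: str) -> Optional[RttmRecord]:
    """Parse one record; return None for blank lines and ';;' comments."""
    line = line.strip()
    if not line or line.startswith(";;"):
        return None
    fields = line.split()
    if len(fields) < 9:
        raise ValueError(f"expected at least 9 fields, got {len(fields)}: {line!r}")
    rtype, fname, chan, tbeg, tdur, ortho, stype, name, conf = fields[:9]
    return RttmRecord(
        rtype, fname, chan,
        _opt(tbeg, float), _opt(tdur, float),
        _opt(ortho), _opt(stype), _opt(name), _opt(conf, float),
    )


# Invented sample records for illustration only.
SAMPLE = """\
;; hypothetical records, not taken from the corpus
SU sw_0001 1 10.00 2.50 <NA> statement spkr_a <NA>
LEXEME sw_0001 1 10.00 0.30 uh fp spkr_a <NA>
LEXEME sw_0001 1 10.30 0.40 hello lex spkr_a <NA>
FILLER sw_0001 1 10.00 0.30 <NA> filled_pause spkr_a <NA>
"""

records = [r for r in map(parse_rttm_line, SAMPLE.splitlines()) if r is not None]
counts = Counter(r.type for r in records)
```

Here `counts` maps each record type to the number of times it occurs, which is a quick sanity check before any further processing of the annotations.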
General information about the EARS MDE Annotation effort can be found under the EARS heading at LDC's Past Projects Page.
Updates
There are no updates available at this time.
Portions (c) 1998 American Broadcasting Company, Inc., (c) 1997-98 Cable News Network, Inc., (c) 1997 Public Radio International, (c) 1997 National Cable Satellite Corporation, (c) 2004 Trustees of the University of Pennsylvania