1.0 Introduction

The data contained in this release was originally created to support the DARPA EARS (Efficient, Affordable, Reusable Speech-to-Text) Program in Metadata Extraction (MDE). The goal of EARS MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes.

The data in this release consists of English Conversational Telephone Speech (CTS) and Broadcast News (BN) transcripts and annotations. The corresponding speech data can be found in MDE RT-03 Training Data Speech (LDC2004S08).

The CTS data is drawn from the SWITCHBOARD-1 Release 2 corpus (LDC97S62).

The BN data is drawn from 4 distinct sources, selected from the 1997 Hub-4 Broadcast News Corpus (LDC98S71, LDC98T28):

  American Broadcasting Company (ABC) (1998, 2001)
  National Broadcasting Company (NBC) (1998, 2001)
  Public Radio International (PRI) (1998)
  Cable News Network (CNN) (2001)

Note that the BN MDE annotation was performed on parts of broadcasts rather than complete programs. Information about the exact segments of the audio files included in this release is provided in clipinfo.tbl, within this directory.

2.0 Data

There are 633 files, totalling approximately 747 MB with a total of 764,978 tokens. There are approximately 20 hours of Broadcast News and over 40 hours of Conversational Telephone Speech contained in the corpus.

The annotated data was originally developed to support the DARPA EARS Metadata Extraction (MDE) Program, and was distributed as training data for the RT-03F evaluation cycle (LDC2003E19).

2.1 Data Set Descriptions

annot/cts/meteer-mapped/

The 40 CTS files contained within this directory are drawn from the SWITCHBOARD-1 Release 2 corpus, using updated transcripts provided by ISIP. The conversations in this directory also appear in the Treebank 3/Meteer annotations.
The Meteer annotation specification shares some features with the current SimpleMDE approach, but also contains important differences. These 40 CTS files provide data for comparing Meteer and SimpleMDE annotations.

annot/cts/train/

This directory contains 377 conversational telephone speech files from the Switchboard corpus, developed as training data for the RT-03F evaluation.

annot/bn/train/

This directory contains 216 broadcast news files from various sources, developed as training data for the RT-03F evaluation.

Approximately 10% (66 files) of the files contained in this release are "gold standard" files that have been improved through dual annotation and discrepancy resolution. The adjudicated files are as follows:

CTS:

  sw2166 sw2173 sw2332 sw2348 sw2375
  sw2522 sw2553 sw2698 sw2702 sw2742
  sw2822 sw2861 sw2933 sw2964 sw2986
  sw2997 sw3089 sw3264 sw3301 sw3422
  sw3444 sw3540 sw3634 sw3783 sw3846
  sw3878 sw3943 sw3953 sw3954 sw3999
  sw4070 sw4146 sw4189 sw4259 sw4265
  sw4299 sw4300 sw4497 sw4594 sw4680
  sw4769 sw4843 sw4891 sw4901 sw4925

BN:

  ed980106-split001 ed980108-split005 ea980114-split004
  ee970627-split009 ee970701-split005 ee970702-split004
  ee970723-split008 eh971008-split010 eh971015-split009
  eh971016-split006 eh971027-split004 eh971030-split009
  em971003-split005 eo970829-split002 eo970906-split005
  eo970907-split003 eo971224-split002 eo971225-split006
  eo971225-split008 ew970626-split001 ew970701-split004

2.2 Data format

Within the ./annot directory the data appears in two formats. The AG Atlas (ag.xml) format represents the native annotation format, and utilizes the Annotation Graph Library. This data is best explored using the LDC MDE Toolkit, which is freely available at http://www.ldc.upenn.edu/Projects/MDE/Tools.

The data is also provided in the RTTM format developed by NIST. The RTTM format labels each token in the reference transcript according to the properties it displays: lexeme vs. non-lexeme; edit, filler, SU, etc.
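
As a rough illustration of working with the RTTM files, the sketch below reads RTTM-style records and counts them by type. It assumes that each record is a single line of nine whitespace-separated fields (type, file, channel, begin time, duration, orthography, subtype, speaker name, confidence), that "<NA>" marks an empty field, and that lines starting with ";" are comments. These assumptions, the field names, and the sample records (including the file ID sw0001) are illustrative only and are not taken from the corpus; the RTTM description in the docs directory is the authority on the actual layout.

```python
from collections import Counter
from typing import Iterable, Iterator, NamedTuple, Optional


class RttmRecord(NamedTuple):
    # Field names are assumptions for this sketch; check the RTTM
    # description in the docs directory for the authoritative layout.
    rtype: str
    file: str
    chnl: str
    tbeg: str
    tdur: str
    ortho: str
    stype: str
    name: str
    conf: str


def parse_rttm(lines: Iterable[str]) -> Iterator[RttmRecord]:
    """Yield one record per non-empty, non-comment line."""
    for lineno, line in enumerate(lines, start=1):
        line = line.strip()
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        fields = line.split()
        if len(fields) != 9:
            raise ValueError(f"line {lineno}: expected 9 fields, got {len(fields)}")
        yield RttmRecord(*fields)


def to_seconds(value: str) -> Optional[float]:
    """Convert a time field to float, mapping the assumed '<NA>' marker to None."""
    return None if value == "<NA>" else float(value)


# Hypothetical sample records, not taken from the corpus.
SAMPLE = """\
;; illustrative records only
SPEAKER sw0001 1 0.00 2.50 <NA> <NA> A <NA>
LEXEME sw0001 1 0.10 0.20 um fp A <NA>
LEXEME sw0001 1 0.30 0.25 hello lex A <NA>
SU sw0001 1 0.10 0.45 <NA> statement A <NA>
"""

records = list(parse_rttm(SAMPLE.splitlines()))
counts = Counter(rec.rtype for rec in records)
for rtype, n in sorted(counts.items()):
    print(rtype, n)
```

To run the same count over a real file, an open file handle can be passed to parse_rttm in place of SAMPLE.splitlines(). Time fields are kept as strings in the record so that empty values survive parsing; to_seconds converts them on demand.
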
The docs directory in this release contains a complete description of the RTTM format as well as the Evaluation Plan for the RT-03 Fall Evaluation, for which this data was created.

Note that for the broadcast news files, the token times (word times) and morpheme times are not manually created. Token times are interpolated from manually created segment times (provided as part of the original reference transcripts).

3.0 Annotations

The transcripts within this corpus have been annotated for various kinds of metadata. The goal of MDE is to enable technology that can take raw Speech-to-Text output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable.

To this end, LDC has defined a SimpleMDE annotation task. Under SimpleMDE, annotators identify four types of fillers: filled pauses like "uh" and "um", discourse markers like "you know", asides and parentheticals, and editing terms like "sorry" and "I mean". Edit disfluencies are also identified; the full extent of the disfluency (or string of adjacent disfluencies) and interruption points are tagged.

Annotators further identify SUs (alternately semantic units, sense units, syntactic units, slash units or sentence units); that is, units within the discourse that function to express a complete thought or idea on the part of a speaker. As with disfluency annotation, the goal of SU labeling is to improve transcript readability, here by creating a transcript in which information is presented in small, structured, coherent chunks rather than long turns or stories. There are four types of sentence-level SUs: statements, questions, backchannels and incomplete SUs. To enhance inter-annotator consistency, the annotation task also identifies a number of sub-sentence SU boundaries (coordination and clausal SUs).
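
To make the readability goal concrete, the sketch below shows one way a downstream process might use filler, edit and SU labels to turn a raw token stream into cleaner text. The Token class, the "word"/"filler"/"edit" labels, the punctuation chosen for each SU type, and the example utterance are all hypothetical and exist only for this illustration; they are not the SimpleMDE data model, the RTTM encoding, or material from the corpus.

```python
from dataclasses import dataclass
from typing import List, Optional

# Hypothetical punctuation for the four sentence-level SU types.
SU_PUNCT = {
    "statement": ".",
    "question": "?",
    "backchannel": ".",
    "incomplete": "...",
}


@dataclass
class Token:
    text: str
    kind: str = "word"            # "word", "filler" or "edit" (labels assumed here)
    su_end: Optional[str] = None  # SU type if an SU boundary follows this token


def render(tokens: List[Token]) -> str:
    """Drop fillers and edited material, then emit one line per SU."""
    lines: List[str] = []
    current: List[str] = []
    for tok in tokens:
        if tok.kind == "word":
            current.append(tok.text)
        if tok.su_end is not None:
            if current:
                sentence = " ".join(current)
                lines.append(sentence[0].upper() + sentence[1:] + SU_PUNCT[tok.su_end])
            current = []
    if current:  # trailing words with no closing SU boundary
        lines.append(" ".join(current))
    return "\n".join(lines)


# Invented utterance: "um i went i mean we drove to boston you know did you go"
example = [
    Token("um", "filler"),
    Token("i", "edit"), Token("went", "edit"),
    Token("i mean", "filler"),
    Token("we"), Token("drove"), Token("to"), Token("boston", su_end="statement"),
    Token("you know", "filler"),
    Token("did"), Token("you"), Token("go", su_end="question"),
]

print(render(example))
```

The filled pause, the edited words, the editing term and the discourse marker are removed, and each SU becomes its own line, which is the kind of small, structured chunk the SU annotation is meant to support.
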
The docs directory contains the complete set of SimpleMDE annotation guidelines used to create this data.

4.0 Additional Information

Further information about the EARS MDE Project, including annotation guidelines, tools and updates about ongoing work, can be found on LDC's EARS MDE Project site:

  http://www.ldc.upenn.edu/Projects/MDE/

Further information about the EARS Program and the Rich Transcription Evaluations administered by the National Institute of Standards and Technology (NIST) can be found at:

  http://www.nist.gov/speech/tests/rt/

5.0 Content Copyright

Portions of this release are covered by the following copyrights:

  (c) 2004, Trustees of the University of Pennsylvania
  Copyright (c) 1998, American Broadcasting Companies, Inc.
  (c) 1997-98, Cable News Network, Inc.
  (c) 1997, Public Radio International
  (c) 1997, National Cable Satellite Corporation

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.