README File for LDC2005S16
MDE RT-04 Training Speech
July 28, 2005
Linguistic Data Consortium


1. Introduction

This corpus was created and is distributed by the Linguistic Data
Consortium (LDC) to provide training data for the RT-04 Fall Metadata
Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
Affordable, Reusable Speech-to-Text) Program. This data was previously
released to the EARS MDE community as LDC2004E31.

The goal of MDE is to enable technology that can take raw
Speech-to-Text output and refine it into forms that are of more use to
humans and to downstream automatic processes. In simple terms, this
means the creation of automatic transcripts that are maximally
readable. This readability might be achieved in a number of ways:
flagging non-content words like filled pauses and discourse markers for
optional removal; marking sections of disfluent speech; and creating
boundaries between natural breakpoints in the flow of speech so that
each sentence or other meaningful unit of speech might be presented on
a separate line within the resulting transcript. Natural
capitalization, punctuation and standardized spelling, plus sensible
conventions for representing speaker turns and identity, are further
elements in the readable transcript.

LDC has defined a SimpleMDE annotation task specification and has
annotated English telephone and broadcast news data to provide training
data for MDE.

The transcript and annotation files corresponding to this release are
available as LDC2005S24 (MDE RT04 Training Text/Annotations). In
general, there is a one-to-one correspondence between speech files and
annotation files, with one exception: in the bnews corpus, several
.ag.xml annotation files correspond to one speech file. This is because
the bnews files were divided into chunks of roughly 5 minutes, and each
chunk was annotated as a unit. These chunks are labeled with
"-split001", "-split002", etc.
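The chunk labeling described above can be used to collect the
annotation files that belong to one speech file. The following is a
minimal sketch only, not part of the release: it assumes that chunk
names have the form "<base>-splitNNN.ag.xml" and that removing the
"-splitNNN" label and the ".ag.xml" extension yields the base name of
the speech file. The function name group_chunks and the file names in
the usage note are hypothetical.

```python
import re
from collections import defaultdict

# Assumption: chunk names look like "<base>-split<NNN>.ag.xml"; annotation
# files without a "-splitNNN" label are treated as a single chunk.
_CHUNK = re.compile(r"^(?P<base>.+?)(?:-split(?P<index>\d+))?\.ag\.xml$")


def group_chunks(annotation_names):
    """Map each speech-file base name to its annotation files, in chunk order."""
    groups = defaultdict(list)
    for name in annotation_names:
        match = _CHUNK.match(name)
        if match is None:
            continue  # skip anything that is not an .ag.xml annotation file
        index = int(match.group("index") or 0)
        groups[match.group("base")].append((index, name))
    # Sort each group by chunk index and keep only the file names.
    return {base: [n for _, n in sorted(items)] for base, items in groups.items()}
```

For example, given the hypothetical names "showA-split002.ag.xml",
"showA-split001.ag.xml" and "sw02001.ag.xml", group_chunks returns one
entry for "showA" listing its two chunks in order and one entry for
"sw02001" with its single annotation file.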
Note that the ag-to-rttm script combines these chunks, so the
one-to-one correspondence is preserved between the speech files and the
.rttm files.


2. Corpus Description

There are 419 files, totalling approximately 5 GB (uncompressed) and
representing over 62 hours of recorded speech. The corpus contains
approximately 22 hours of Broadcast News (BN) and over 40 hours of
Conversational Telephone Speech (CTS).

The CTS data was drawn from the SWITCHBOARD-1 Release 2 corpus.

The BN speech data was drawn from the 1997 English Broadcast News
Speech (Hub-4) corpus, from 4 distinct sources:

  American Broadcasting Company (ABC) (1998, 2001)
  National Broadcasting Company (NBC) (1998, 2001)
  Public Radio International (PRI) (1998)
  Cable News Network (CNN) (2001)


3. Data Format

The audio data in this corpus conforms to the following technical
specifications.

  Type   Format   Encoding     Channels   Sample Rate
  CTS    WAVE     u-Law        2          8000/sec
  BN     WAVE     16-bit PCM   1          16000/sec

Note that the data is in WAVE format. This is the audio file format
supported by our annotation tool (MDE Tool). Since the annotation data
is best explored with this open-source annotation tool, WAVE is our
choice of data format.


4. Directory Structure

The speech files are divided into two data directories: cts for
conversational telephone speech and bnews for broadcast news data. The
cts and bnews directories each comprise one DVD disc, and each is
located under the data/audio directory.

  bnews disc:
    data/
      audio/
        bnews/

  cts disc:
    data/
      audio/
        cts/


5. Further Information

 o Further information about the EARS Program and the Rich
   Transcription 2004 (RT-04) Evaluations administered by the National
   Institute of Standards and Technology (NIST) can be found at:
     http://www.nist.gov/speech/tests/rt/rt2004/fall/

 o The complete annotation guidelines and further information about
   the EARS MDE Project at LDC can be found at:
     http://www.ldc.upenn.edu/Projects/MDE/

 o LDC EARS web site:
     http://www.ldc.upenn.edu/Projects/EARS/

 o Annotation Graphs (AG):
     http://www.ldc.upenn.edu/AG/
     http://agtk.sourceforge.net/

 o The MDE AG format spec:
     docs/MDE/mdeformat.txt


6. Copyrights

Portions of this release are covered by the following copyrights:

  (c) 2004 Trustees of the University of Pennsylvania
  (c) 2003 American Broadcasting Company
  (c) 2003 National Broadcasting Company
  (c) 2003 Public Radio International
  (c) 2003 Cable News Network, Inc. All Rights Reserved
  (c) 2003 National Cable Satellite Corporation

The World is a co-production of Public Radio International and the
British Broadcasting Corporation and is produced at WGBH Boston.