README File

				       for

                                   LDC2005S16
                            MDE RT-04 Training Speech
		      
                                  July 28, 2005

                            Linguistic Data Consortium


1. Introduction

   This corpus was created and is distributed by the Linguistic Data
   Consortium (LDC) to provide training data for the RT-04 Fall Metadata
   Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
   Affordable, Reusable Speech-to-Text) Program.  The data was previously
   released to the EARS MDE community as LDC2004E31.

   The goal of MDE is to enable technology that can take raw Speech-to-Text
   output and refine it into forms that are of more use to humans and to
   downstream automatic processes.  In simple terms, this means the
   creation of automatic transcripts that are maximally readable.  This
   readability might be achieved in a number of ways: flagging non-content
   words like filled pauses and discourse markers for optional removal; marking
   sections of disfluent speech; and marking boundaries at natural
   breakpoints in the flow of speech so that each sentence or other
   meaningful unit of speech might be presented on a separate line within
   the resulting transcript.  Natural capitalization, punctuation and
   standardized spelling, plus sensible conventions for representing
   speaker turns and identity are further elements in the readable
   transcript.  LDC has defined a SimpleMDE annotation task specification
   and has annotated English telephone and broadcast news data to
   provide training data for MDE.  

   The transcript and annotation files corresponding to this release are
   available as LDC2005S24 (MDE RT04 Training Text/Annotations).  In
   general, there is a one-to-one correspondence between speech files and
   annotation files, with one exception: in the bnews corpus, several
   .ag.xml annotation files correspond to one speech file.  This is because
   the bnews files were divided into chunks of roughly 5 minutes, and each
   chunk was annotated as a unit.  These chunks are labeled with
   "-split001", "-split002", etc.  Note that the ag-to-rttm script combines
   these chunks, so the one-to-one correspondence is preserved between the
   speech files and the .rttm files.
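
   As an illustration of the chunk naming described above, the following
   minimal Python sketch groups annotation file names by the speech file
   they belong to.  It assumes that the "-splitNNN" label sits immediately
   before the .ag.xml extension; the file names used here are hypothetical
   and are not taken from the corpus.

```python
import re
from collections import defaultdict

# Assumption: the chunk label ("-split001", "-split002", ...) comes
# directly before the ".ag.xml" extension.
SPLIT_SUFFIX = re.compile(r"-split\d+$")
AG_EXTENSION = ".ag.xml"


def speech_basename(annotation_name):
    """Return the speech-file base name for an annotation file name."""
    stem = annotation_name
    if stem.endswith(AG_EXTENSION):
        stem = stem[: -len(AG_EXTENSION)]
    # Files that were never split have no suffix and pass through unchanged.
    return SPLIT_SUFFIX.sub("", stem)


def group_by_speech_file(annotation_names):
    """Map each speech-file base name to its sorted annotation files."""
    groups = defaultdict(list)
    for name in sorted(annotation_names):
        groups[speech_basename(name)].append(name)
    return dict(groups)


# Hypothetical file names, for illustration only.
example_names = [
    "show_a-split002.ag.xml",
    "show_a-split001.ag.xml",
    "conv_b.ag.xml",
]
grouped = group_by_speech_file(example_names)
for base, files in grouped.items():
    print(base, files)
```

   Sorting the names before grouping keeps the chunks of each speech file
   in their numbered order.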


2. Corpus Description

   There are 419 files, totalling approximately 5 GB (uncompressed) and
   representing over 62 hours of recorded speech.  The corpus contains
   approximately 22 hours of Broadcast News (BN) and over 40 hours of
   Conversational Telephone Speech (CTS).
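
   As a rough consistency check on these figures, the sketch below
   estimates the uncompressed size from the durations above and the audio
   parameters listed in Section 3.  The durations are rounded to 40 and 22
   hours, and WAVE header overhead is ignored, so the result is only an
   approximation.

```python
def audio_bytes(hours, sample_rate, channels, bytes_per_sample):
    """Approximate size in bytes of uncompressed audio, ignoring headers."""
    return int(hours * 3600 * sample_rate * channels * bytes_per_sample)


# CTS: 8000 samples/sec, 2 channels, u-Law (1 byte per sample).
cts_bytes = audio_bytes(40, 8000, 2, 1)

# BN: 16000 samples/sec, 1 channel, 16-bit PCM (2 bytes per sample).
bn_bytes = audio_bytes(22, 16000, 1, 2)

total_gb = (cts_bytes + bn_bytes) / 1e9
print(f"CTS: {cts_bytes / 1e9:.2f} GB")
print(f"BN:  {bn_bytes / 1e9:.2f} GB")
print(f"Total: {total_gb:.2f} GB")
```

   The estimate comes to about 4.8 GB, in line with the stated total of
   approximately 5 GB.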

   The CTS data was drawn from the SWITCHBOARD-1 Release 2 corpus.

   The BN speech data was drawn from the 1997 English Broadcast News Speech
   (Hub-4) corpus, from 4 distinct sources:

	American Broadcasting Company 	(ABC) 	(1998, 2001)
	National Broadcasting Company 	(NBC) 	(1998, 2001)
	Public Radio International 	(PRI) 	(1998)
	Cable News Network		(CNN) 	(2001)

3. Data Format 

   The audio data in this corpus conforms to the following technical
   specifications.

      Type   Format   Encoding     Channels   Sample Rate
      CTS    WAVE     u-Law        2          8000/sec
      BN     WAVE     16-bit PCM   1          16000/sec

   Note that the data is in WAVE format.  We chose this format because it
   is the audio file format supported by our open-source annotation tool
   (MDE Tool), with which the annotation data is best explored.
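
   To confirm that a given file matches the specifications above, the
   RIFF/WAVE header can be read directly.  The sketch below is one way to
   do this in Python; it is not part of the corpus tools, and the function
   name is hypothetical.  It parses the header by hand because Python's
   standard "wave" module reads only PCM data and would reject the u-Law
   CTS files.

```python
import struct
import sys

# Standard WAVE format tags: 1 = linear PCM, 7 = u-Law.
FORMAT_NAMES = {1: "PCM", 7: "u-Law"}


def read_wave_format(path):
    """Return encoding, channels, sample rate and bit depth of a WAVE file."""
    with open(path, "rb") as f:
        header = f.read(12)
        if len(header) < 12:
            raise ValueError("file too short to be RIFF/WAVE")
        riff, _size, wave_id = struct.unpack("<4sI4s", header)
        if riff != b"RIFF" or wave_id != b"WAVE":
            raise ValueError("not a RIFF/WAVE file")
        # Walk the chunk list until the "fmt " chunk is found.
        while True:
            chunk_header = f.read(8)
            if len(chunk_header) < 8:
                raise ValueError("no fmt chunk found")
            chunk_id, chunk_size = struct.unpack("<4sI", chunk_header)
            if chunk_id == b"fmt ":
                fmt = f.read(chunk_size)
                if len(fmt) < 16:
                    raise ValueError("truncated fmt chunk")
                tag, channels, rate, _byte_rate, _align, bits = struct.unpack(
                    "<HHIIHH", fmt[:16]
                )
                return {
                    "encoding": FORMAT_NAMES.get(tag, f"tag {tag}"),
                    "channels": channels,
                    "sample_rate": rate,
                    "bits_per_sample": bits,
                }
            # Chunks are padded to an even number of bytes.
            f.seek(chunk_size + (chunk_size & 1), 1)


if __name__ == "__main__":
    # Usage: pass one or more WAVE file paths on the command line.
    for wave_path in sys.argv[1:]:
        print(wave_path, read_wave_format(wave_path))
```

   For a CTS file the expected result is u-Law, 2 channels, 8000
   samples/sec; for a BN file, PCM, 1 channel, 16000 samples/sec with 16
   bits per sample.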

4. Directory Structure
   
   The speech files are divided into two data directories (cts for
   conversational telephone speech and bnews for broadcast news data).  Each
   of the cts and bnews directories occupies its own DVD disc and is located
   under the data/audio directory on that disc.

    bnews disc:
        data/
            audio/
                bnews/

    cts disc:
        data/
            audio/
                cts/


5. Further Information

   o Further information about the EARS Program and the Rich
     Transcription 2004 (RT-04) Evaluations administered by the
     National Institute of Standards and Technology (NIST) can be
     found at:

     http://www.nist.gov/speech/tests/rt/rt2004/fall/

   o The complete annotation guidelines and further information about the
     EARS MDE Project at LDC can be found at:

     http://www.ldc.upenn.edu/Projects/MDE/

   o LDC EARS web site:

     http://www.ldc.upenn.edu/Projects/EARS/

   o Annotation Graphs (AG):

     http://www.ldc.upenn.edu/AG/
     http://agtk.sourceforge.net/

   o The MDE AG format spec:

     docs/MDE/mdeformat.txt

6. Copyrights

   Portions of this release are covered by the following copyrights:

   (c) 2004 Trustees of the University of Pennsylvania
   (c) 2003 American Broadcasting Company
   (c) 2003 National Broadcasting Company
   (c) 2003 Public Radio International
   (c) 2003 Cable News Network, Inc. All Rights Reserved
   (c) 2003 National Cable Satellite Corporation

   The World is a co-production of Public Radio International and the
   British Broadcasting Corporation and is produced at WGBH Boston.