README File

				       for

                                   LDC2005T24 
                    MDE RT-04 Training Data Text/Annotations
		      
                                  July 19, 2005

                            Linguistic Data Consortium


1. Introduction

   This corpus was created and distributed by the Linguistic Data
   Consortium (LDC) to provide training data for the RT-04 Fall Metadata
   Extraction (MDE) Evaluation, part of the DARPA EARS (Efficient,
   Affordable, Reusable Speech-to-Text) Program.  The data was previously
   released to the EARS MDE community as LDC2004E31.

   The goal of MDE is to enable technology that can take raw Speech-to-Text
   output and refine it into forms that are of more use to humans and to
   downstream automatic processes.  In simple terms, this means the
   creation of automatic transcripts that are maximally readable.  This
   readability might be achieved in a number of ways: flagging non-content
   words like filled pauses and discourse markers for optional removal; marking
   sections of disfluent speech; and creating boundaries between natural
   breakpoints in the flow of speech so that each sentence or other
   meaningful unit of speech might be presented on a separate line within
   the resulting transcript.  Natural capitalization, punctuation and
   standardized spelling, plus sensible conventions for representing
   speaker turns and identity are further elements in the readable
   transcript.  LDC has defined a SimpleMDE annotation task specification
   and has annotated English telephone and broadcast news data to
   provide training data for MDE.  

   In this release, some original annotations contained in LDC2004E31 have
   been re-mapped to new MDE elements to support better annotation
   consistency.  In particular, the mapping affects Discourse Responses
   (DR), Discourse Markers (DM) and Backchannel SUs (BC).  A description of
   the original mapping proposed by ICSI appears in Section 4 below, with
   complete documentation of the mapping rules contained in the
   docs/drmap-discussion directory.  The scripts used to apply the
   mapping can be found in the docs/scripts/drmap directory.


2. Corpus Description

   The following table shows a summary of data included in this release:

   Type  # Files          Hours (Approx.)  Source
   ----- ---------------  ---------------  ----------------------------
   CTS   396              40               Switchboard, with ISIP transcripts
   BN    216 (23 shows)   20               Hub-4 Broadcast News Corpus

   (See docs/fileinfo.tbl for more detail.)

   Note: The BN data comprises 23 full shows, except that the file
   ep970812 is truncated to 76 minutes (the end of the show has not been
   annotated).

   For technical reasons, the BN transcripts were divided into chunks of
   roughly 5 minutes before annotation.  The chunk files are labeled with
   "-split001", "-split002", etc. in their file names.
   
   Approximately 10% of the training data has been annotated independently
   by two annotators and then adjudicated to resolve any discrepancies
   between the two annotations.  These adjudicated versions of the data
   are included in this release.

   The docs/fileinfo.tbl document included in this release contains
   additional information about each file, including its annotation QC
   status (First_Pass, Second_Pass or Adjudication); approximate duration;
   number of tokens; and number of annotations of various types.  This
   last piece of information is provided as a quick way to assess how
   "interesting" each file might be for various MDE phenomena.


3. Directory structure & files
   
   The release is divided into two data directories (cts for conversational
   telephone speech and bnews for broadcast news data) plus a docs directory
   containing additional information about the release.  The data directories
   contain a variety of file formats: MDE AG XML (.ag.xml), RTTM (.rttm) and
   UEM (.uem) files.  MDE AG XML is the LDC internal file format, and RTTM is
   the official file format of the MDE program.  A UEM file specifies the
   portion of a speech file that is subject to MDE evaluation.  The RTTM and
   UEM files have been generated using a conversion script developed by NIST.
   This script (ag-to-rttm+uems-v21.pl) can be found in the docs/scripts
   directory of the annotation package.
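
   As a quick sanity check on the generated .rttm files, the number of
   records of each type can be tallied.  The Python sketch below is
   illustrative only and is not part of this release.  It assumes that
   RTTM records are whitespace-delimited with the record type in the
   first field, and that lines beginning with ";;" are comments; the
   function name count_rttm_types is hypothetical.

```python
from collections import Counter


def count_rttm_types(lines):
    """Tally the first whitespace-delimited field of each record line.

    Assumptions (not stated in this README): the record type is the first
    field, and lines starting with ";;" are comments to be skipped.
    """
    counts = Counter()
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped.startswith(";;"):
            continue  # skip blank lines and comment lines
        counts[stripped.split()[0]] += 1
    return counts
```

   Passing an open file object (for example, open(path, encoding="utf-8"))
   as the lines argument tallies one .rttm file at a time.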
   
   The speech files corresponding to this release are available as
   LDC2005S16 (MDE RT04 Training Data Speech).  In general, there is a
   one-to-one correspondence between speech files and annotation files,
   with one exception: in the bnews corpus, several .ag.xml files
   correspond to one speech file.  This is because the bnews files were
   divided into roughly 5 minute chunks and each chunk was annotated as a
   unit.  These chunks are labeled with "-split001", "-split002", etc.  Note
   that the ag-to-rttm script combines these chunks, so the one-to-one
   correspondence is kept between the speech files and .rttm files.
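
   When working directly with the .ag.xml files, the chunks belonging to
   one speech file can be collected by stripping the "-splitNNN" label.
   The Python sketch below is illustrative only and is not part of this
   release.  It assumes the label sits immediately before the .ag.xml
   extension (for example "<base>-split001.ag.xml"); the exact file name
   layout should be checked against the bnews directory, and the function
   name group_chunks is hypothetical.

```python
import re
from collections import defaultdict

# Assumed layout: "<base>-splitNNN.ag.xml".  Only the "-splitNNN" label and
# the ".ag.xml" extension are taken from this README.
SPLIT_RE = re.compile(r"^(?P<base>.+)-split(?P<index>\d+)\.ag\.xml$")


def group_chunks(filenames):
    """Map each base name to its chunk file names, ordered by chunk index."""
    groups = defaultdict(list)
    for name in filenames:
        match = SPLIT_RE.match(name)
        if match is None:
            continue  # not a split chunk, e.g. an unsplit annotation file
        groups[match.group("base")].append((int(match.group("index")), name))
    # Sort by the numeric chunk index and drop the index from the result.
    return {base: [n for _, n in sorted(chunks)] for base, chunks in groups.items()}
```

   File names that do not carry a "-splitNNN" label are ignored, so the
   same function can be run over a listing of both data directories.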
      
4. Mapping Rules for Discourse Responses, Backchannels and Discourse Markers

   In cooperation with ICSI, LDC has converted a number of annotated 
   objects of one type into annotated objects of another type.  A common
   example of this conversion would take an object of type 
   'responsiveDiscourseMarker' covering the token sequence 'yeah' and 
   convert it into an object of type 'Backchannel SU'.  This mapping is
   represented in the documentation accompanying the present release in
   the following manner:
 
   orig_DR	b/	++	yeah

   A full description of the mappings proposed by ICSI and implemented by
   LDC can be found in:

   docs/drmap-discussion/

   This directory contains a number of documents:

   key.txt      A key describing the notational conventions used in
                the mapping documents;

   prefix1.txt  ICSI's description of the proposed 'prefix rules' for
                extending the domain (token sequences) of the mapping
                rules productively when the mapping algorithm is
                implemented;

   prefix2.txt  LDC's description of the actual implementation of the
                'prefix rules' in the mapping algorithm;

   map.txt      The set of mapping rules which (when augmented by the
                'prefix rules') are used as the basis of the mapping
                algorithm;

   postfix.txt  The set of 'postprocess' corrections that are made to
                the data following the application of the mapping
                algorithm.


5. Further Information

   o Further information about the EARS Program and the Rich
     Transcription 2004 (RT-04) Evaluations administered by the
     National Institute of Standards and Technology (NIST) can be
     found at:

     http://www.nist.gov/speech/tests/rt/rt2004/fall/

   o The complete annotation guidelines and further information about the
     EARS MDE Project at LDC can be found at:

     http://www.ldc.upenn.edu/Projects/MDE/

   o LDC EARS web site:

     http://www.ldc.upenn.edu/Projects/EARS/

   o Annotation Graphs (AG):

     http://www.ldc.upenn.edu/AG/
     http://agtk.sourceforge.net/

   o The MDE AG format spec:

     docs/MDE/mdeformat.txt


6. Copyrights

   Portions of this release are covered by the following copyrights:

   (c) 2004 Trustees of the University of Pennsylvania
   (c) 2003 American Broadcasting Company
   (c) 2003 National Broadcasting Company
   (c) 2003 Public Radio International
   (c) 2003 Cable News Network, Inc. All Rights Reserved
   (c) 2003 National Cable Satellite Corporation

   The World is a co-production of Public Radio International and the
   British Broadcasting Corporation and is produced at WGBH Boston.