ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
September 20, 2006

Introduction
------------

This release contains the English evaluation data prepared for the
2004 Time Expression Recognition and Normalization (TERN) Evaluation,
sponsored by the Automatic Content Extraction (ACE) program. The
evaluation was held in August 2004, and a workshop was held in
September 2004.

This data was used to evaluate participants' performance, and it is
now being released for general use. It is meant as a supplement to
the training data, available as LDC2005T07 "ACE Time Normalization
(TERN) 2004 English Training Data v 1.0".

All files were annotated independently by two annotators and then
reconciled, using the TIDES 2003 Standard for the Annotation of
Temporal Expressions.

The annotation specifications for this corpus were developed under
DARPA's Translingual Information Detection Extraction and
Summarization (TIDES) program, with continuing support from ACE.

Corpus Applications
-------------------

Keywords: temporal reasoning, information extraction, automatic
content extraction, question answering, summarization

The purpose of this corpus and the TERN evaluation is to advance the
state of the art in the automatic recognition and normalization of
natural language temporal expressions. In most language contexts
such expressions are indexical. For example, with "Monday", "last
week", or "three months starting October 1", one must know the
narrative reference time in order to pinpoint the time interval
being conveyed by the expression. In addition, for data exchange
purposes, it is essential that the identified interval be rendered
according to an established standard, i.e., normalized. Accurate
identification and normalization of temporal expressions is in turn
essential for the temporal reasoning demanded by advanced NLP
applications such as question answering, information extraction, and
summarization.
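To illustrate why the reference time matters, here is a minimal
Python sketch that resolves two of the indexical expressions above
against a given reference date. It is only an illustration: the
function name resolve and the rule chosen for bare weekday names are
assumptions, and it is not taken from the TIDES standard or from the
TERN scoring software. The output strings use ISO 8601 calendar-date
and week notation.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve(expression: str, reference: date) -> str:
    """Map an indexical expression to an ISO 8601 style value (illustrative only)."""
    text = expression.strip().lower()
    if text in WEEKDAYS:
        # Assumption: a bare weekday name means the most recent such day
        # on or before the reference date. Real text may point forward.
        offset = (reference.weekday() - WEEKDAYS.index(text)) % 7
        return (reference - timedelta(days=offset)).isoformat()
    if text == "last week":
        # The ISO week that contains the day seven days before the reference.
        year, week, _ = (reference - timedelta(weeks=1)).isocalendar()
        return f"{year:04d}-W{week:02d}"
    raise ValueError(f"unsupported expression: {expression!r}")


if __name__ == "__main__":
    reference = date(2004, 8, 18)  # hypothetical document date (a Wednesday)
    for phrase in ("Monday", "last week"):
        print(phrase, "->", resolve(phrase, reference))
```

With a different reference date the same phrases resolve to
different values, which is why a normalizer needs the document's
reference time as input.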
Data Profile
------------

All of the TERN data is text, grouped into one directory per source
type:

  bnews - Broadcast news data from the TDT-4 Corpus
  nwire - Newswire data from the TDT-4 Corpus

This publication includes both the source data files in SGML format
and the annotated files, also in SGML format (*.tmx.sgml).

Languages
---------

This release of the TERN data contains only English language files.
The 2004 TERN evaluation also made use of Chinese language files,
but these are not being released at this time, pending further
quality control procedures.

Corpus Structure and Size
-------------------------

The data/ directory contains the English corpus, which consists of
192 files (54K words), distributed by news source as follows:

            words    documents
  bnews    26,418          127
  nwire    28,196           65
  TOTAL    54,614          192

DTDs
----

The dtd/ directory contains srctext_for_TIME2.dtd, based on LDC's
srctext.dtd, available from LDC for use with their TDT-4 data. It
has been modified to accommodate files annotated according to the
TIDES 2003 Standard for the Annotation of Temporal Expressions.

DOC
----

In addition to this readme, the doc/ directory contains a copy of
the "TIDES 2003 Standard for the Annotation of Temporal
Expressions," the annotation guidelines used to create this corpus.

Additional Resources
--------------------

For information regarding the ACE program and ACE technology
evaluations administered by the National Institute of Standards and
Technology (NIST), visit

  http://www.nist.gov/speech/tests/ace/2004/

Follow the link for the Time Expression Recognition and
Normalization (TERN) evaluation for additional resources for TERN,
including the evaluation results, interannotator agreement measures,
annotation guidelines, scoring software, and the annotation software
used to create this corpus.

If you have any questions about this corpus, please contact Lisa
Ferro, lferro@mitre.org.
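For readers who want a quick look at the annotated files
(*.tmx.sgml), the following Python sketch collects the attributes of
each TIMEX2 opening tag, the tag defined by the TIDES 2003 standard.
It is a rough sketch under stated assumptions: the function name
timex2_attributes is hypothetical, attribute values are assumed to be
double-quoted, and the sample string is invented for illustration
rather than taken from the corpus. A proper SGML parser driven by the
DTD in dtd/ would be more robust.

```python
import re

# Matches opening TIMEX2 tags only, so nested annotations are each counted.
TAG_RE = re.compile(r"<TIMEX2\b([^>]*)>", re.IGNORECASE)
# Assumption: attributes are written as NAME="value" with double quotes.
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def timex2_attributes(text: str) -> list[dict[str, str]]:
    """Return one attribute dictionary per TIMEX2 opening tag in text."""
    return [dict(ATTR_RE.findall(match.group(1)))
            for match in TAG_RE.finditer(text)]


if __name__ == "__main__":
    # Invented sample, not a line from the corpus.
    sample = 'Talks resume <TIMEX2 VAL="2000-10-02">Monday</TIMEX2>.'
    print(timex2_attributes(sample))
```

To apply it to the release, read each annotated file as text and pass
its contents to the function; the encoding to use when opening the
files is not stated in this readme.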
Authors
--------

Lisa Ferro
Janet Hitzeman
Elizabeth Lima
Beth Sundheim

Copyrights
----------

Portions Copyright 1998 Los Angeles Times-Washington Post News Service, Inc.
Copyright 1998, 2000 American Broadcasting Corporation
Copyright 1998, 2000 Cable News Network, Inc.
Copyright 1998, 2000 Press Association, Inc.
Copyright 1998, 2000 New York Times
Copyright 1998, 2000 National Broadcasting Company, Inc.
Copyright 1998, 2000 Public Radio International

"The World" is a co-production of Public Radio International and the
British Broadcasting Corporation and is produced at WGBH Boston.

---
this readme created September 20, 2006
updated May 2, 2008

Approved for Public Release: 06-1177. Distribution Unlimited.
Copyright 2010 The MITRE Corporation.