ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0

                                 September 20, 2006

Introduction
------------

This release contains the English evaluation data prepared for the
2004 Time Expression Recognition and Normalization (TERN) Evaluation, 
sponsored by the Automatic Content Extraction (ACE) program.  The evaluation
was held in August 2004, followed by a workshop in September 2004.  This data
was used to evaluate participants' performance and is now being released
for general use.  It is meant as a supplement to the training data,
available as LDC2005T07, "ACE Time Normalization (TERN) 2004
English Training Data v 1.0".

All files were annotated independently by two annotators and then
reconciled, following the TIDES 2003 Standard for the Annotation
of Temporal Expressions.

The annotation specifications for this corpus were developed under DARPA's 
Translingual Information Detection Extraction and Summarization (TIDES) 
program, with continuing support from ACE.

Corpus Applications
-------------------

Keywords:  temporal reasoning, information extraction,
     automatic content extraction, question answering, summarization

The purpose of this corpus and the TERN evaluation is to advance the
state of the art in the automatic recognition and normalization of 
natural language temporal expressions.  In most language contexts such 
expressions are indexical.  For example, with "Monday", "last week", or 
"three months starting October 1", one must know the narrative reference 
time in order to pinpoint the time interval being conveyed by the expression.
In addition, for data exchange purposes, it is essential that the identified
interval be rendered according to an established standard, i.e., normalized.
Accurate identification and normalization of temporal expressions is in turn 
essential for the temporal reasoning being demanded by advanced NLP 
applications such as question answering, information extraction, and 
summarization.
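
As a rough illustration of why the reference time matters, the following
Python sketch resolves two of the expressions mentioned above against a
given reference date.  The function names, and the choice to read a bare
weekday as the most recent such day on or before the reference date, are
assumptions made for this example only; they are not taken from the TIDES
standard or from any TERN system.  The output strings follow ISO 8601
conventions (calendar date, week number).

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def resolve_weekday(name, reference):
    """Return the ISO date of the named weekday relative to `reference`.

    Assumption for this sketch: a bare weekday refers to the most recent
    such day on or before the reference date.  Real text may just as well
    refer to a future day; tense and context decide.
    """
    target = WEEKDAYS.index(name.lower())
    days_back = (reference.weekday() - target) % 7
    return (reference - timedelta(days=days_back)).isoformat()


def resolve_last_week(reference):
    """Return the ISO week (YYYY-Www) preceding the week of `reference`."""
    year, week, _ = (reference - timedelta(weeks=1)).isocalendar()
    return f"{year}-W{week:02d}"


if __name__ == "__main__":
    # The same expression yields different values for different
    # reference dates, which is what makes it indexical.
    for ref in (date(2004, 8, 18), date(2006, 9, 20)):
        print(ref, resolve_weekday("Monday", ref), resolve_last_week(ref))
```

With a reference date of 2004-08-18, "Monday" resolves to 2004-08-16 under
this reading, while with a reference date of 2006-09-20 it resolves to
2006-09-18.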

Data Profile
------------

All of the TERN data is text, grouped into separate directories
by source type:

  bnews - Broadcast news data from the TDT-4 Corpus

  nwire - Newswire data from the TDT-4 Corpus

This publication includes both the source data files in SGML format and the
annotated files, also in SGML format (*.tmx.sgml).
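
For readers who want a quick look at the annotations, here is a minimal
Python sketch that pulls temporal expressions out of the text of an
annotated file.  It assumes the annotations appear as TIMEX2 elements with
double-quoted attributes such as VAL, which is how the TIDES 2003 standard
describes them but is not spelled out in this readme; check the guidelines
in doc/ and the DTD before relying on it.  The function name and the sample
string, including its attribute values, are invented for illustration.

```python
import re

# Matches opening and closing TIMEX2 tags (assumed tag name).
TAG_RE = re.compile(r"<(/?)TIMEX2\b([^>]*)>", re.IGNORECASE)
# Matches NAME="value" attribute pairs inside an opening tag.
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')


def extract_timex2(text):
    """Return (expression_text, attributes) pairs for each TIMEX2 element.

    A stack is used so that nested expressions are handled; elements are
    reported in the order their closing tags appear, so inner expressions
    come before the outer expression that contains them.
    """
    stack, found = [], []
    for match in TAG_RE.finditer(text):
        if not match.group(1):
            # Opening tag: remember where its content starts.
            attrs = dict(ATTR_RE.findall(match.group(2)))
            stack.append((match.end(), attrs))
        elif stack:
            # Closing tag: pair it with the most recent opening tag.
            start, attrs = stack.pop()
            inner = TAG_RE.sub("", text[start:match.start()])
            found.append((" ".join(inner.split()), attrs))
    return found


if __name__ == "__main__":
    # Illustrative sample only; not taken from the corpus.
    sample = ('The talks ran for <TIMEX2 VAL="P3M">three months starting '
              '<TIMEX2 VAL="2004-10-01">October 1</TIMEX2></TIMEX2>.')
    for expression, attrs in extract_timex2(sample):
        print(expression, attrs)
```

A regular expression is used rather than an XML parser because SGML files
are not guaranteed to be well-formed XML.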

Languages
---------

This release of the TERN data contains only English language files.  The 2004 
TERN evaluation also made use of Chinese language files, but these are not 
being released at this time, pending further quality control procedures. 

Corpus Structure and Size
-------------------------
The data/ directory contains the English corpus, which consists of
192 files (54K words), distributed by news source as follows:

          words    documents
bnews     26,418   127
nwire     28,196    65
TOTAL     54,614   192
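
To check a local copy against the document counts above, a short Python
sketch such as the following can be used.  It assumes that the annotated
files sit somewhere under data/bnews and data/nwire and end in .tmx.sgml;
the exact layout below data/ is not described in this readme, so adjust the
paths as needed.  The function name is hypothetical.  Word counts are not
checked, since the tokenization behind the figures above is not specified.

```python
from pathlib import Path

# Document counts as listed in this readme.
EXPECTED_DOCUMENTS = {"bnews": 127, "nwire": 65}


def count_annotated_files(data_dir, sources=("bnews", "nwire"),
                          pattern="*.tmx.sgml"):
    """Count annotated files under each source directory (searched recursively)."""
    root = Path(data_dir)
    counts = {}
    for source in sources:
        source_dir = root / source
        # A missing directory is reported as zero files rather than an error.
        counts[source] = (sum(1 for _ in source_dir.rglob(pattern))
                          if source_dir.is_dir() else 0)
    return counts


if __name__ == "__main__":
    counts = count_annotated_files("data")
    for source, expected in EXPECTED_DOCUMENTS.items():
        status = "ok" if counts[source] == expected else "MISMATCH"
        print(f"{source}: found {counts[source]}, expected {expected} ({status})")
```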


DTDs
----

The dtd/ directory contains srctext_for_TIME2.dtd, based 
on LDC's srctext.dtd, available from LDC for use with their 
TDT-4 data. It has been modified to accommodate files annotated 
according to the TIDES 2003 Standard for the Annotation of Temporal 
Expressions.

DOC
---- 

In addition to this readme, the doc/ directory contains a copy of
the "TIDES 2003 Standard for the Annotation of Temporal Expressions,"
the annotation guidelines used to create this corpus.


Additional Resources
--------------------

For information regarding the ACE program and ACE technology evaluations
administered by the National Institute of Standards and Technology (NIST),
visit

  http://www.nist.gov/speech/tests/ace/2004/

Follow the link for the Time Expression Recognition and Normalization (TERN)
evaluation for additional resources for TERN, including the evaluation 
results, interannotator agreement measures, annotation guidelines, 
scoring software, and the annotation software used to create this corpus.

If you have any questions about this corpus, please contact
Lisa Ferro, lferro@mitre.org.

Authors
--------  

Lisa Ferro <lferro@mitre.org> 
Janet Hitzeman
Elizabeth Lima
Beth Sundheim

Copyrights
----------

Portions 
Copyright 1998 Los Angeles Times-Washington Post News Service, Inc.
Copyright 1998, 2000 American Broadcasting Corporation
Copyright 1998, 2000 Cable News Network, Inc.
Copyright 1998, 2000 Press Association, Inc.
Copyright 1998, 2000 New York Times
Copyright 1998, 2000 National Broadcasting Company, Inc.
Copyright 1998, 2000 Public Radio International

"The World" is a co-production of Public Radio International and the 
British Broadcasting Corporation and is produced at WGBH Boston. 


---

This readme was created September 20, 2006
and updated May 2, 2008.

Approved for Public Release: 06-1177. Distribution Unlimited.
Copyright 2010 The MITRE Corporation.