
ACE Time Normalization (TERN) 2004 English Training Data v 1.0

Item Name:	ACE Time Normalization (TERN) 2004 English Training Data v 1.0
Author(s):	Lisa Ferro, Laurie Gerber, Janet Hitzeman, Elizabeth Lima, Beth Sundheim
LDC Catalog No.:	LDC2005T07
ISBN:	1-58563-331-3
ISLRN:	357-991-519-054-6
DOI:	https://doi.org/10.35111/9nye-wg76
Release Date:	February 15, 2005
Member Year(s):	2005
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	ACE, GALE, TIDES
Application(s):	automatic content extraction, information extraction, question-answering, summarization, temporal analysis
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2005T07 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Ferro, Lisa, et al. ACE Time Normalization (TERN) 2004 English Training Data v 1.0 LDC2005T07. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:	hasContinuation: LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0; isSimilarWith: LDC2005T09 ACE 2004 Multilingual Training Corpus, LDC2006T06 ACE 2005 Multilingual Training Corpus, LDC2014T18 ACE 2007 Multilingual Training Corpus

Introduction

ACE Time Normalization (TERN) 2004 English Training Data v 1.0 was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST) with support from the Automatic Content Extraction (ACE) program. It contains 862 files totalling approximately 306,000 words of English news and treebank text.

This release contains the English training data prepared for the 2004 Time Expression Recognition and Normalization (TERN) Evaluation, sponsored by the ACE program. The evaluation was held in August 2004, followed by a workshop in September 2004. Evaluation participants received this data for training purposes, and it is now being released for general use.

The annotation specifications for this corpus were developed under DARPA's Translingual Information Detection, Extraction and Summarization (TIDES) program, with continuing support from ACE.

The purpose of this corpus and the TERN evaluation is to advance the state of the art in the automatic recognition and normalization of natural language temporal expressions. In most language contexts such expressions are indexical. For example, with "Monday," "last week," or "three months starting October 1," one must know the narrative reference time in order to pinpoint the time interval being conveyed by the expression. In addition, for data exchange purposes, it is essential that the identified interval be rendered according to an established standard, i.e., normalized. Accurate identification and normalization of temporal expressions is in turn essential for the temporal reasoning being demanded by advanced NLP applications such as question answering, information extraction, and summarization.
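The dependence on a reference time can be illustrated with a minimal Python sketch. It is not the annotation scheme or algorithm used for this corpus: the function names are hypothetical, the reading of "Monday" as the most recent Monday on or before the reference date is an assumption (real text may mean the coming Monday), and the ISO 8601-style output is chosen only as one example of an established standard.

```python
from datetime import date, timedelta

WEEKDAYS = ["monday", "tuesday", "wednesday", "thursday",
            "friday", "saturday", "sunday"]


def most_recent_weekday(name, reference):
    """Resolve a bare weekday name against a reference date.

    Assumption: the expression points to the most recent such day
    on or before the reference date.
    """
    target = WEEKDAYS.index(name.lower())
    days_back = (reference.weekday() - target) % 7
    return reference - timedelta(days=days_back)


def last_week(reference):
    """Resolve "last week" to an ISO 8601 week string (YYYY-Www)."""
    iso_year, iso_week, _ = (reference - timedelta(weeks=1)).isocalendar()
    return f"{iso_year}-W{iso_week:02d}"


# Hypothetical reference time: a document dated Wednesday, 18 August 2004.
reference_date = date(2004, 8, 18)
print(most_recent_weekday("Monday", reference_date).isoformat())  # 2004-08-16
print(last_week(reference_date))                                  # 2004-W33
```

The same two expressions resolve to different values for any other reference date, which is why the reference time has to be identified before normalization.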

Data

The data in this corpus is divided into three data sets: ace_2002, ace_2003, and ace_2004. The genres and sources included in this corpus are:

bnews - Broadcast news data from TDT4 Multilingual Text and Annotations (LDC2005T16)
nwire - Newswire data from TDT4 Multilingual Text and Annotations (LDC2005T16)
npaper - Washington Post articles (ace_2002 only)
arabic_treebank - Data from the Arabic Treebank 1 Corpus; English translations from the MT-2003 translation data set
chinese_treebank - Data from the Chinese Treebank Version 4; English translations from the Chinese Treebank English Parallel Text Corpus

And here are the details for the data sets:

Data Set	Genre	Words	Documents
ace_2002	bnews	17,922	85
	npaper	14,682	17
	nwire	34,134	78
	Total	66,738	180
ace_2003	bnews	34,681	147
	nwire	58,592	102
	Total	93,273	249
ace_2004	bnews	61,621	222
	nwire	58,543	116
	arabic_treebank	13,466	58
	chinese_treebank	12,522	37
	Total	146,452	433
Grand Totals		306,463	862

The data in this corpus includes the original source files in SGML format (.sgm) and the annotated files, also in SGML format (.tmx.sgml).
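As a rough illustration of working with the two file types, the Python sketch below pairs each source file with its annotated counterpart. It assumes that a source file and its annotated version share the same base name and differ only in extension (.sgm versus .tmx.sgml); that naming convention is an assumption, not something stated here, and the function name and file names are hypothetical.

```python
from pathlib import PurePath

SOURCE_SUFFIX = ".sgm"
ANNOTATED_SUFFIX = ".tmx.sgml"


def pair_files(names):
    """Map each shared base name to a (source, annotated) pair.

    Assumption: "X.sgm" and "X.tmx.sgml" describe the same document.
    Files without a counterpart are left out of the result.
    """
    sources, annotated = {}, {}
    for name in names:
        base = PurePath(name).name
        if base.endswith(ANNOTATED_SUFFIX):
            annotated[base[:-len(ANNOTATED_SUFFIX)]] = name
        elif base.endswith(SOURCE_SUFFIX):
            sources[base[:-len(SOURCE_SUFFIX)]] = name
    return {stem: (sources[stem], annotated[stem])
            for stem in sources if stem in annotated}


# Hypothetical file names, for illustration only.
pairs = pair_files(["doc_a.sgm", "doc_a.tmx.sgml", "doc_b.sgm"])
print(pairs)  # {'doc_a': ('doc_a.sgm', 'doc_a.tmx.sgml')}
```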

Samples

For examples of the data in this corpus, please view the source sample (SGML) and the annotation sample (SGML).

Updates

None at this time.

Copyright

Portions © 1998 Los Angeles Times-Washington Post News Service, Inc., © 1998, 2000 American Broadcasting Corporation, © 1998, 2000 Cable News Network, LP, LLLP, © 1998, 2000 The Associated Press, © 1998, 2000 New York Times, © 1998, 2000 National Broadcasting Company, Inc., © 1998, 2000 Public Radio International, © 2005 Trustees of the University of Pennsylvania

"The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.