ACE 2004 Multilingual Training Corpus

Item Name:	ACE 2004 Multilingual Training Corpus
Author(s):	Alexis Mitchell, Stephanie Strassel, Shudong Huang, Ramez Zakhary
LDC Catalog No.:	LDC2005T09
ISBN:	1-58563-334-8
ISLRN:	789-870-824-708-5
DOI:	https://doi.org/10.35111/8m4r-v312
Release Date:	March 15, 2005
Member Year(s):	2005
DCMI Type(s):	Text
Data Source(s):	broadcast news, newswire, telephone conversations
Project(s):	ACE, GALE, TIDES
Application(s):	automatic content extraction
Language(s):	English, Standard Arabic, Mandarin Chinese
Language ID(s):	eng, arb, cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2005T09 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Mitchell, Alexis, et al. ACE 2004 Multilingual Training Corpus LDC2005T09. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:
hasOutcome:	LDC2011T08 Datasets for Generic Relation Extraction (reACE)
isContinuationOf:	LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
hasContinuation:	LDC2006T06 ACE 2005 Multilingual Training Corpus; LDC2014T18 ACE 2007 Multilingual Training Corpus
isSimilarWith:	LDC2003T11 ACE-2 Version 1.0; LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0

Introduction

ACE 2004 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains text from several genres in English (158,000 words), Chinese (307,000 characters, 154,000 words), and Arabic (151,000 words), annotated for entities and relations.

This corpus represents the complete set of English, Arabic, and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. It was created by LDC with support from the ACE Program and additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This data was previously distributed as an e-corpus (LDC2004E17) to participants in the 2004 ACE evaluation.

The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

The current publication consists of the official training data for these evaluation tasks. A seventh evaluation area, Timex Detection and Recognition, is supported by ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07). The TERN corpus source data largely overlaps with the English source data contained in the current release.

For more information about linguistic resources for the ACE program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.

Data

Here is a breakdown of the data amounts by language:

Genre	English Files	English Words	Chinese Files	Chinese Words	Chinese Characters	Arabic Files	Arabic Words
Broadcast News	220	60,291	314	67,702	135,405	304	63,238
Newswire	128	59,840	226	60,251	120,502	253	63,122
Chinese Treebank	37	12,337	106	25,749	51,499	-	-
Arabic Treebank	58	12,855	-	-	-	132	25,010
Fisher CTS	8	12,630	-	-	-	-	-
Totals	451	157,953	646	153,703	307,406	689	151,360
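
The per-genre figures can be cross-checked against the totals row with a short script. The sketch below copies the numbers from the table above; the column keys (`en_files`, `zh_words`, and so on) are made-up labels for illustration, and empty cells are represented as `None`.

```python
# Column labels are illustrative shorthand, not names used by the corpus.
COLUMNS = ["en_files", "en_words", "zh_files", "zh_words",
           "zh_chars", "ar_files", "ar_words"]

# Per-genre counts copied from the table; None marks an empty cell.
ROWS = {
    "Broadcast News":   [220, 60291, 314, 67702, 135405, 304, 63238],
    "Newswire":         [128, 59840, 226, 60251, 120502, 253, 63122],
    "Chinese Treebank": [37, 12337, 106, 25749, 51499, None, None],
    "Arabic Treebank":  [58, 12855, None, None, None, 132, 25010],
    "Fisher CTS":       [8, 12630, None, None, None, None, None],
}

# Totals row exactly as published in the table.
PUBLISHED_TOTALS = [451, 157953, 646, 153703, 307406, 689, 151360]


def column_sums(rows):
    """Sum each column over all genres, skipping empty cells."""
    sums = [0] * len(COLUMNS)
    for values in rows.values():
        for i, value in enumerate(values):
            if value is not None:
                sums[i] += value
    return sums


def differences(rows, published):
    """Return {column: (computed, published)} for columns that disagree."""
    computed = column_sums(rows)
    return {
        name: (c, p)
        for name, c, p in zip(COLUMNS, computed, published)
        if c != p
    }


if __name__ == "__main__":
    for name, (computed, published) in differences(ROWS, PUBLISHED_TOTALS).items():
        print(f"{name}: rows sum to {computed}, totals row says {published}")
```

Summing the rows as printed reproduces all file counts, the English word count, and the Chinese character count. The Chinese and Arabic word columns sum to 153,702 and 151,370, slightly off the published 153,703 and 151,360, so the totals row is best read as the catalog's own figure rather than a recomputed one.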

All files are annotated for entities and relations. Annotators tag all mentions of each entity within a document, whether named, nominal or pronominal. For every mention, the annotator identifies the maximal extent of the string that represents the entity, and labels the head of each mention. Annotators also identify relations between entities and their temporal attributes. Relations that are supported by explicit textual evidence are distinguished from those that depend on contextual inference on the part of the reader.
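
As a rough illustration of the mention annotation described above, the sketch below models a mention as a pair of character spans, a maximal extent and a head that lies inside it. The class names, the half-open offset convention, and the example sentence are all assumptions made for illustration; they are not taken from the corpus files or the ACE guidelines.

```python
from dataclasses import dataclass

# The three mention categories named in the description above.
MENTION_TYPES = {"named", "nominal", "pronominal"}


@dataclass(frozen=True)
class Span:
    """Character span using Python slice offsets (start inclusive, end exclusive).

    This offset convention is an assumption of the sketch.
    """
    start: int
    end: int

    def __post_init__(self):
        if not 0 <= self.start < self.end:
            raise ValueError(f"invalid span: {self.start}-{self.end}")

    def contains(self, other):
        return self.start <= other.start and other.end <= self.end

    def text(self, document):
        return document[self.start:self.end]


@dataclass(frozen=True)
class Mention:
    """One mention of an entity: its type, full extent, and head."""
    entity_id: str
    mention_type: str
    extent: Span
    head: Span

    def __post_init__(self):
        if self.mention_type not in MENTION_TYPES:
            raise ValueError(f"unknown mention type: {self.mention_type}")
        if not self.extent.contains(self.head):
            raise ValueError("head must lie inside the extent")


# Hypothetical example sentence and entity identifier.
document = "The president of France spoke."
mention = Mention(
    entity_id="E1",
    mention_type="nominal",
    extent=Span(0, 23),   # "The president of France"
    head=Span(4, 13),     # "president"
)
```

Keeping the extent and the head as separate spans mirrors the two things the annotator marks for every mention, and the containment check catches a head that has been recorded outside its extent.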

The files are stored in four separate formats:

APF (.apf.xml) - The official ACE Program Format.
ALF (.alf.xml) - The ACE LDC Format, an intermediate format similar to APF, designed to store all annotation content represented in the AG files.
AG (-pp.ag.xml) - The LDC Annotation Graph Format (postprocessed), LDC's internal annotation file format for ACE. These files can be viewed with LDC's free ACE annotation tools.
Source (.sgm) - Source text files with SGML tagging.
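
When working with a directory of corpus files, the four suffixes listed above can be used to sort files by format. The sketch below does this in Python; it assumes that the files belonging to one document share a common name stem before the suffix, which this description does not state, and the file names in the example are invented.

```python
# Suffix-to-format mapping taken from the list above.
FORMAT_SUFFIXES = [
    ("-pp.ag.xml", "AG"),
    (".apf.xml", "APF"),
    (".alf.xml", "ALF"),
    (".sgm", "Source"),
]


def classify(filename):
    """Return (format_label, stem) for a recognised suffix, else None."""
    for suffix, label in FORMAT_SUFFIXES:
        if filename.endswith(suffix):
            return label, filename[: -len(suffix)]
    return None


def group_by_document(filenames):
    """Group files by stem, assuming one stem per document (an assumption)."""
    grouped = {}
    for name in filenames:
        result = classify(name)
        if result is None:
            continue  # ignore files that are not in one of the four formats
        label, stem = result
        grouped.setdefault(stem, {})[label] = name
    return grouped


# Hypothetical file names, for illustration only.
example_files = [
    "doc_0001.sgm",
    "doc_0001.apf.xml",
    "doc_0001.alf.xml",
    "doc_0001-pp.ag.xml",
    "README.txt",
]
grouped = group_by_document(example_files)
```

With the example list, `grouped` has a single key, `doc_0001`, mapping each format label to its file, and `README.txt` is skipped.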

Samples

The files listed below are samples from the English data. They should provide a good example of the material in this corpus.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Updates

None at this time.

Copyright

Portions (c) 1994-1998, 2000 Xinhua News Agency, (c) 1997 Department of Information Services, Hong Kong Special Administrative Region, (c) 1996-1998, 2000-2001 Sinorama Magazine, (c) 2000 Agence France-Presse, (c) 2000 New York Times, (c) 2000 Associated Press, (c) 2000 SPH AsiaOne, Ltd. (Zaobao), (c) 2000 An-Nahar, (c) 2000 Al-Hayat, (c) 2000 Nile TV, (c) 2000 Cable News Network, All Rights Reserved, (c) 2000 American Broadcasting Corporation, (c) 2000 National Broadcasting Company, Inc., (c) 2000 China National Radio, (c) 2000 China Television System, (c) 2000 China Central TV, (c) 2000 China Broadcasting System, (c) 2000 Public Radio International, (c) 2005 Trustees of the University of Pennsylvania
