
ACE 2005 Multilingual Training Corpus

Item Name:	ACE 2005 Multilingual Training Corpus
Author(s):	Christopher Walker, Stephanie Strassel, Julie Medero, Kazuaki Maeda
LDC Catalog No.:	LDC2006T06
ISBN:	1-58563-376-3
ISLRN:	458-031-085-383-4
DOI:	https://doi.org/10.35111/mwxc-vh88
Release Date:	February 15, 2006
Member Year(s):	2006
DCMI Type(s):	Text
Data Source(s):	broadcast conversation, broadcast news, newsgroups, telephone conversations, weblogs
Project(s):	ACE
Application(s):	automatic content extraction
Language(s):	Mandarin Chinese, Standard Arabic, English
Language ID(s):	cmn, arb, eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2006T06 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Walker, Christopher, et al. ACE 2005 Multilingual Training Corpus LDC2006T06. Web Download. Philadelphia: Linguistic Data Consortium, 2006.
Related Works:
hasAnnotation:	LDC2008T03 ACE 2005 English SpatialML Annotations; LDC2010T09 ACE 2005 Mandarin SpatialML Annotations; LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
hasOutcome:	LDC2011T08 Datasets for Generic Relation Extraction (reACE); LDC2024T05 Automatic Content Extraction for Portuguese
isContinuationOf:	LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data; LDC2005T09 ACE 2004 Multilingual Training Corpus
hasContinuation:	LDC2014T18 ACE 2007 Multilingual Training Corpus
isSimilarWith:	LDC2003T11 ACE-2 Version 1.0; LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0

Introduction

ACE 2005 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains approximately 1,800 files of mixed-genre text in English, Arabic, and Chinese annotated for entities, relations, and events. This represents the complete set of training data in those languages for the 2005 Automatic Content Extraction (ACE) technology evaluation. The genres include newswire, broadcast news, broadcast conversation, weblog, discussion forums, and conversational telephone speech. The data was annotated by LDC with support from the ACE Program and additional assistance from LDC.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form.

In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation, and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese, and Arabic. Event tasks were evaluated in English and Chinese only. This release comprises the official training data for these evaluation tasks.

For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, see LDC's ACE website.

Data

Below is information about the amount of data in this release and its annotation status. Further information such as breakdown of genres and formats can be found in the associated README file.

1P: data subject to first pass (complete) annotation
DUAL: data also subject to dual first pass (complete) annotation
ADJ: data also subject to discrepancy resolution/adjudication
NORM: data also subject to TIMEX2 normalization

English
	1P	DUAL	ADJ	NORM
words	303833	297185	216545	259889
files	666	650	535	599

Chinese
Note: Chinese data is expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.
	1P	DUAL	ADJ
chars	334121	325834	307991
files	687	671	633
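
To compare the Chinese figures with the word counts given for English and Arabic, the character counts can be converted using the 1.5 characters/word ratio from the note above. The following is a minimal sketch of that arithmetic; the names `chinese_chars`, `CHARS_PER_WORD`, and `approx_words` are illustrative and are not part of the release, and the results are only as accurate as the rough ratio itself.

```python
# Character counts copied from the Chinese table (1P, DUAL, ADJ).
chinese_chars = {"1P": 334121, "DUAL": 325834, "ADJ": 307991}

# Rough correspondence stated in the note; an approximation, not a measured value.
CHARS_PER_WORD = 1.5


def approx_words(chars, ratio=CHARS_PER_WORD):
    """Estimate a word count from a character count using the given ratio."""
    return round(chars / ratio)


approx = {stage: approx_words(n) for stage, n in chinese_chars.items()}

for stage, words in approx.items():
    print(f"{stage}: ~{words} words")
```

Under this assumption the first pass Chinese data comes to roughly 223,000 words.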

Arabic
	1P	DUAL	ADJ
words	112233	103504	100114
files	433	409	403
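
The file counts in the three tables can be cross-checked against the "approximately 1,800 files" figure in the Introduction. The sketch below sums the first pass (1P) file counts and computes the share of those files that also went through adjudication (ADJ). It assumes that the 1P column covers every file in the release, which the legend suggests but does not state outright; the names `files`, `total_files`, and `adj_share` are illustrative only.

```python
# File counts copied from the English, Chinese, and Arabic tables.
files = {
    "English": {"1P": 666, "DUAL": 650, "ADJ": 535},
    "Chinese": {"1P": 687, "DUAL": 671, "ADJ": 633},
    "Arabic": {"1P": 433, "DUAL": 409, "ADJ": 403},
}

# Assumption: every file in the release received first pass annotation,
# so the 1P counts sum to the total number of files.
total_files = sum(counts["1P"] for counts in files.values())

# Fraction of first pass files that were also adjudicated, per language.
adj_share = {lang: counts["ADJ"] / counts["1P"] for lang, counts in files.items()}

print(f"total files: {total_files}")
for lang, share in adj_share.items():
    print(f"{lang}: {share:.1%} of 1P files adjudicated")
```

The total is 1,786 files, in line with the approximate figure quoted in the Introduction.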

Samples

For examples of the data in this publication, please review the following samples:

Updates

None at this time.

Copyright

Portions © 2000-2003 Agence France Presse, © 2003 The Associated Press, © 2003 New York Times, © 2000-2001, 2003 Xinhua News Agency, © 2003 Cable News Network LP, LLLP, © 2000-2001 SPH AsiaOne Ltd, © 2000-2001 China Broadcasting System, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 2000-2001 China Central TV, © 2000-2001 Al Hayat, © 2000-2001 An-Nahar, © 2000-2001 Nile TV, © 2005, 2006 Trustees of the University of Pennsylvania