ACE 2005 Multilingual Training Corpus


Item Name: ACE 2005 Multilingual Training Corpus
Authors: Christopher Walker, Stephanie Strassel, Julie Medero, and Kazuaki Maeda
LDC Catalog No.: LDC2006T06
ISBN: 1-58563-376-3
Release Date: Feb 15, 2006
Data Type: text
Data Source(s): broadcast conversation, broadcast news, newsgroups, weblogs
Project(s): ACE
Application(s): automatic content extraction
Language(s): English, Mandarin Chinese, Modern Standard Arabic
Language ID(s): arb, cmn, eng
Distribution: 1 DVD
Member fee: $0 for 2006 members
Non-member Fee: US $4000.00
Reduced-License Fee: US $2000.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Christopher Walker, et al. 2006. ACE 2005 Multilingual Training Corpus. Linguistic Data Consortium, Philadelphia.

Introduction

This publication contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities, relations and events. It was created by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC. This data was previously distributed as an e-corpus (LDC2005E18) to participants in the 2005 ACE evaluation.

The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form.

In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. The current publication comprises the official training data for these evaluation tasks.

A complete description of the ACE 2005 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST).

For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.

Below is information about the amount of data included in the current release and its annotation status.

  • 1P: data subject to first pass (complete) annotation
  • DUAL: data also subject to dual first pass (complete) annotation
  • ADJ: data also subject to discrepancy resolution/adjudication
  • NORM: data also subject to TIMEX2 normalization

English

                   words                        files
           1P    DUAL     ADJ    NORM    1P  DUAL   ADJ  NORM
NW      60658   57807   33459   48399   128   124    81   106
BN      59239   58144   52444   55967   239   234   217   226
BC      46612   46110   33874   40415    68    67    52    60
WL      45210   43648   35529   37897   127   122   114   119
UN      45161   44473   26371   37366    58    57    37    49
CTS     47003   47003   34868   39845    46    46    34    39
Total  303833  297185  216545  259889   666   650   535   599

Chinese

Note: Chinese data is expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.

               chars                files
           1P    DUAL     ADJ    1P  DUAL   ADJ
NW     127319  124175  121797   248   242   238
BN     134963  133696  120513   332   328   298
WL      71839   68063   65681   107   101    97
Total  334121  325834  307991   687   671   633
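
The note on Chinese data gives a ratio of roughly 1.5 characters per word, so approximate word counts can be derived from the character counts. The following is a minimal Python sketch of that arithmetic, assuming simple rounding to the nearest integer. The function and variable names are illustrative and are not part of the corpus or its tools.

```python
# Ratio stated in the note on Chinese data (an approximation, not an exact count).
CHARS_PER_WORD = 1.5

# First-pass (1P) character counts per source, copied from the Chinese table.
chinese_chars_1p = {"NW": 127319, "BN": 134963, "WL": 71839}


def estimate_words(chars, chars_per_word=CHARS_PER_WORD):
    """Return an approximate word count for a given character count."""
    return round(chars / chars_per_word)


# Per-source estimates and an estimate for the summed 1P character count.
estimated = {src: estimate_words(n) for src, n in chinese_chars_1p.items()}
total_estimate = estimate_words(sum(chinese_chars_1p.values()))

print(estimated)
print(total_estimate)
```

Under this assumption the Chinese first-pass data corresponds to roughly 223,000 words, which puts it between the English and Arabic first-pass volumes.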

Arabic

               words                files
           1P    DUAL     ADJ    1P  DUAL   ADJ
NW      61287   56158   53026   239   226   221
BN      29259   27165   26907   134   128   127
WL      21687   20181   20181    60    55    55
Total  112233  103504  100114   433   409   403
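
Because each later annotation stage is applied to data that has "also" passed the earlier one, the counts in a row should not increase from 1P to DUAL to ADJ, and each Total should equal the sum of its column. The sketch below checks both properties for the Arabic word counts above. It is only an illustrative consistency check with hypothetical helper names, not a tool distributed with the corpus.

```python
# Arabic word counts per source as (1P, DUAL, ADJ), copied from the table.
arabic_words = {
    "NW": (61287, 56158, 53026),
    "BN": (29259, 27165, 26907),
    "WL": (21687, 20181, 20181),
}
arabic_total = (112233, 103504, 100114)


def column_sums(rows):
    """Sum each annotation-stage column across all sources."""
    return tuple(sum(col) for col in zip(*rows.values()))


def stages_nonincreasing(counts):
    """True if each stage covers no more data than the stage before it."""
    return all(a >= b for a, b in zip(counts, counts[1:]))


sums_match = column_sums(arabic_words) == arabic_total
stages_ok = all(stages_nonincreasing(r) for r in arabic_words.values())

print("column sums match Total row:", sums_match)
print("stage counts are non-increasing:", stages_ok)
```

The same two helpers can be pointed at the English or Chinese figures by swapping in the corresponding rows and Total tuple.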

Samples

For examples of the data in this publication, please review the following samples:

Content Copyright

Portions © 2000-2003 Agence France Presse, © 2003 The Associated Press, © 2003 New York Times, © 2000-2001, 2003 Xinhua News Agency, © 2003 Cable News Network LP, LLLP, © 2000-2001 SPH AsiaOne Ltd, © 2000-2001 China Broadcasting System, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 2000-2001 China Central TV, © 2000-2001 Al Hayat, © 2000-2001 An-Nahar, © 2000-2001 Nile TV, © 2005, 2006 Trustees of the University of Pennsylvania