ACE 2005 Multilingual Training Corpus
Item Name: | ACE 2005 Multilingual Training Corpus |
Author(s): | Christopher Walker, Stephanie Strassel, Julie Medero, Kazuaki Maeda |
LDC Catalog No.: | LDC2006T06 |
ISBN: | 1-58563-376-3 |
ISLRN: | 458-031-085-383-4 |
DOI: | https://doi.org/10.35111/mwxc-vh88 |
Release Date: | February 15, 2006 |
Member Year(s): | 2006 |
DCMI Type(s): | Text |
Data Source(s): | weblogs, broadcast news, newsgroups, broadcast conversation |
Project(s): | ACE |
Application(s): | automatic content extraction |
Language(s): | Mandarin Chinese, Standard Arabic, English |
Language ID(s): | cmn, arb, eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2006T06 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Walker, Christopher, et al. ACE 2005 Multilingual Training Corpus LDC2006T06. Web Download. Philadelphia: Linguistic Data Consortium, 2006. |
Introduction
ACE 2005 Multilingual Training Corpus contains the complete set of English, Arabic and Chinese training data for the 2005 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of several types of source data annotated for entities, relations and events. The annotation was performed by the Linguistic Data Consortium (LDC) with support from the ACE Program and additional assistance from LDC.
The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form.
In November 2005, sites were evaluated on system performance in five primary areas: the recognition of entities, values, temporal expressions, relations, and events. Entity, relation and event mention detection were also offered as diagnostic tasks. All tasks except the event tasks were performed for three languages: English, Chinese and Arabic. Event tasks were evaluated in English and Chinese only. This release comprises the official training data for these evaluation tasks.
For more information about linguistic resources for the ACE Program, including annotation guidelines, task definitions and other documentation, see LDC's ACE website.
Data
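The tables below give the amount of data in this release by language and source type, along with its annotation status. Source types are abbreviated NW (newswire), BN (broadcast news), BC (broadcast conversation), WL (weblog), UN (Usenet newsgroups and discussion forums) and CTS (conversational telephone speech). Annotation status is coded as follows: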
- 1P: data subject to first pass (complete) annotation
- DUAL: data also subject to dual first pass (complete) annotation
- ADJ: data also subject to discrepancy resolution/adjudication
- NORM: data also subject to TIMEX2 normalization
English
Source | Words: 1P | Words: DUAL | Words: ADJ | Words: NORM | Files: 1P | Files: DUAL | Files: ADJ | Files: NORM |
NW | 60658 | 57807 | 33459 | 48399 | 128 | 124 | 81 | 106 |
BN | 59239 | 58144 | 52444 | 55967 | 239 | 234 | 217 | 226 |
BC | 46612 | 46110 | 33874 | 40415 | 68 | 67 | 52 | 60 |
WL | 45210 | 43648 | 35529 | 37897 | 127 | 122 | 114 | 119 |
UN | 45161 | 44473 | 26371 | 37366 | 58 | 57 | 37 | 49 |
CTS | 47003 | 47003 | 34868 | 39845 | 46 | 46 | 34 | 39 |
Total | 303833 | 297185 | 216545 | 259889 | 666 | 650 | 535 | 599 |
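The annotation stages in the legend read as cumulative ("also subject to"), so one way to use the English table is to ask what share of first-pass data reached each later stage. The following is a minimal sketch under that assumption; the names `english_words`, `STAGES` and `coverage` are illustrative and not part of the corpus documentation, and the counts are copied from the table above.

```python
# Word counts copied from the English table: (1P, DUAL, ADJ, NORM) per source.
english_words = {
    "NW": (60658, 57807, 33459, 48399),
    "BN": (59239, 58144, 52444, 55967),
    "BC": (46612, 46110, 33874, 40415),
    "WL": (45210, 43648, 35529, 37897),
    "UN": (45161, 44473, 26371, 37366),
    "CTS": (47003, 47003, 34868, 39845),
}

# Later stages, in the same order as the last three entries of each tuple.
STAGES = ("DUAL", "ADJ", "NORM")


def coverage(counts):
    """Return each later stage's word count as a fraction of the 1P count."""
    first_pass = counts[0]
    return {stage: n / first_pass for stage, n in zip(STAGES, counts[1:])}


# Hypothetical helper output: per-source share of 1P words at each stage.
english_coverage = {src: coverage(c) for src, c in english_words.items()}

for src, cov in english_coverage.items():
    shares = ", ".join(f"{stage} {frac:.1%}" for stage, frac in cov.items())
    print(f"{src}: {shares}")
```

The same function would apply to the file counts, or to the Chinese and Arabic tables with `STAGES` shortened to `("DUAL", "ADJ")`, since those tables have no NORM column.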
Chinese
Note: Chinese data expressed in terms of characters. We assume a correspondence of roughly 1.5 characters/word.
Source | Chars: 1P | Chars: DUAL | Chars: ADJ | Files: 1P | Files: DUAL | Files: ADJ |
NW | 127319 | 124175 | 121797 | 248 | 242 | 238 |
BN | 134963 | 133696 | 120513 | 332 | 328 | 298 |
WL | 71839 | 68063 | 65681 | 107 | 101 | 97 |
Total | 334121 | 325834 | 307991 | 687 | 671 | 633 |
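To compare the Chinese figures with the word counts given for English and Arabic, the character counts can be converted with the rough ratio of 1.5 characters per word stated in the note for the Chinese table. The sketch below does only that arithmetic; `CHARS_PER_WORD`, `chinese_1p_chars` and `approx_words` are illustrative names, and the results are estimates, not counts published with the corpus.

```python
# Assumption taken from the note on the Chinese table:
# roughly 1.5 Chinese characters correspond to one word.
CHARS_PER_WORD = 1.5

# First-pass (1P) character counts copied from the Chinese table.
chinese_1p_chars = {"NW": 127319, "BN": 134963, "WL": 71839}


def approx_words(chars, chars_per_word=CHARS_PER_WORD):
    """Estimate a word count from a character count, rounded to the nearest integer."""
    return round(chars / chars_per_word)


# Estimated word counts per source and for all sources combined.
approx_1p_words = {src: approx_words(n) for src, n in chinese_1p_chars.items()}
approx_total = approx_words(sum(chinese_1p_chars.values()))

for src, words in approx_1p_words.items():
    print(f"{src}: about {words} words")
print(f"Total: about {approx_total} words")
```

Because the ratio is only approximate, the estimates are best read as order-of-magnitude figures for comparing corpus sizes across languages.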
Arabic
Source | Words: 1P | Words: DUAL | Words: ADJ | Files: 1P | Files: DUAL | Files: ADJ |
NW | 61287 | 56158 | 53026 | 239 | 226 | 221 |
BN | 29259 | 27165 | 26907 | 134 | 128 | 127 |
WL | 21687 | 20181 | 20181 | 60 | 55 | 55 |
Total | 112233 | 103504 | 100114 | 433 | 409 | 403 |
Samples
For examples of the data in this publication, please review the following samples: