ACE 2007 Multilingual Training Corpus
Item Name: | ACE 2007 Multilingual Training Corpus |
Author(s): | Song Chen, Kazuaki Maeda, Christopher Walker, Stephanie Strassel |
LDC Catalog No.: | LDC2014T18 |
ISBN: | 1-58563-688-6 |
ISLRN: | 600-375-253-846-9 |
DOI: | https://doi.org/10.35111/ygjb-7f15 |
Release Date: | September 15, 2014 |
Member Year(s): | 2014 |
DCMI Type(s): | Text |
Data Source(s): | weblogs, newswire |
Project(s): | ACE |
Application(s): | automatic content extraction |
Language(s): | Spanish, Standard Arabic |
Language ID(s): | spa, arb |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2014T18 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Chen, Song, et al. ACE 2007 Multilingual Training Corpus LDC2014T18. Web Download. Philadelphia: Linguistic Data Consortium, 2014. |
Introduction
ACE 2007 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC). It contains the complete set of Arabic and Spanish training data for the 2007 Automatic Content Extraction (ACE) technology evaluation: Arabic and Spanish newswire data and Arabic weblogs, all annotated for entities and temporal expressions.
The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.
The LDC Catalog contains a series of publications from the ACE project and from researchers building on that work. Among them are:
- ACE-2 Version 1.0 (LDC2003T11)
- TIDES Extraction (ACE) 2003 Multilingual Training Data (LDC2004T09)
- ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07)
- ACE 2004 Multilingual Training Corpus (LDC2005T09)
- ACE 2005 Multilingual Training Corpus (LDC2006T06)
- ACE 2005 English SpatialML Annotations (LDC2008T03)
- ACE 2005 Mandarin SpatialML Annotations (LDC2010T09)
- ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 (LDC2010T18)
- ACE 2005 English SpatialML Annotations Version 2 (LDC2011T02)
- Datasets for Generic Relation Extraction (reACE) (LDC2011T08)
Data
The Arabic data consists of newswire (about 60% of the words) published from October through December 2000 and weblogs (about 40%) published from November 2004 through February 2005. The Spanish data set consists entirely of newswire material from multiple sources published from January through April 2005.
Data selection was semi-automatic. A document pool was established for each language based on genre and epoch requirements. Humans reviewed the pool to select individual documents suitable for ACE annotation, such as documents that were representative of their genre and contained targeted ACE entity types. One annotator completed the entity and temporal expression (TIMEX2) markup in the first pass annotation. This work was reviewed in the second pass by a senior annotator. TIMEX2 values were normalized by an annotator specifically trained for that task.
The table below describes the amount of data included in the current release and its annotation status. Corpus content for each language and data type is represented in the three stages of annotation: first pass annotation (1P), second pass annotation (2P) and TIMEX2 normalization and additional quality control (NORM).
Language | Genre | Words (1P) | Words (2P) | Words (NORM) | Files (1P) | Files (2P) | Files (NORM) |
---|---|---|---|---|---|---|---|
Arabic | NW | 58,015 | 58,015 | 58,015 | 257 | 257 | 257 |
Arabic | WL | 40,338 | 40,338 | 40,338 | 121 | 121 | 121 |
Arabic | Total | 98,353 | 98,353 | 98,353 | 378 | 378 | 378 |
Spanish | NW | 100,401 | 100,401 | 100,401 | 352 | 352 | 352 |
Spanish | Total | 100,401 | 100,401 | 100,401 | 352 | 352 | 352 |
For a given document, each of the three directories "1p", "2p" and "timex2norm" contains a source .sgm file together with its annotation files. In other words, for each newswire story or weblog entry, the three annotation directories each contain an identical copy of the source text (SGML .sgm file) along with distinct versions of the associated annotations (XML .ag.xml and .apf.xml files and plain text .tab files). Note that in many cases two annotation stages produced identical output for a given source text, because no changes were made in the later stage. All files are presented in UTF-8.
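The directory layout described above can be explored with a short script. The following is a minimal Python sketch, not part of the corpus or its documentation. It assumes that the stage directories "1p", "2p" and "timex2norm" appear somewhere beneath a local corpus root, and that a document's files share a base name that differs only in the .sgm, .ag.xml, .apf.xml or .tab suffix. The function names (split_name, index_corpus, unchanged_between) are hypothetical helpers introduced only for illustration.

```python
from pathlib import Path

# Stage directory names and file suffixes as given in the corpus description.
STAGES = ("1p", "2p", "timex2norm")
SUFFIXES = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def split_name(filename):
    """Return (doc_id, suffix) for a recognised file name, else None.

    Assumption: the document ID is the file name minus one known suffix.
    """
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)], suffix
    return None


def index_corpus(root):
    """Build a nested dict: doc_id -> stage -> suffix -> Path."""
    root = Path(root)
    index = {}
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        parsed = split_name(path.name)
        # Look for a stage directory among the parent directories only,
        # relative to the root, so the root's own path cannot match.
        parents = path.relative_to(root).parts[:-1]
        stage = next((part for part in parents if part in STAGES), None)
        if parsed is None or stage is None:
            continue
        doc_id, suffix = parsed
        index.setdefault(doc_id, {}).setdefault(stage, {})[suffix] = path
    return index


def unchanged_between(index, doc_id, suffix, stage_a, stage_b):
    """Return True if the two stages hold byte-identical files.

    Returns None when either file is missing from the index.
    """
    a = index.get(doc_id, {}).get(stage_a, {}).get(suffix)
    b = index.get(doc_id, {}).get(stage_b, {}).get(suffix)
    if a is None or b is None:
        return None
    return a.read_bytes() == b.read_bytes()
```

Calling index_corpus on the corpus root and then unchanged_between(index, doc_id, ".apf.xml", "1p", "2p") for each document ID would show which documents were left unchanged by the second pass. A byte-level comparison is used here for simplicity; it would report a difference even for changes that do not affect the annotation content, such as whitespace.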
Updates
None at this time.