ACE 2007 Multilingual Training Corpus

Item Name:	ACE 2007 Multilingual Training Corpus
Author(s):	Song Chen, Kazuaki Maeda, Christopher Walker, Stephanie Strassel
LDC Catalog No.:	LDC2014T18
ISBN:	1-58563-688-6
ISLRN:	600-375-253-846-9
DOI:	https://doi.org/10.35111/ygjb-7f15
Release Date:	September 15, 2014
Member Year(s):	2014
DCMI Type(s):	Text
Data Source(s):	weblogs, newswire
Project(s):	ACE
Application(s):	automatic content extraction
Language(s):	Spanish, Standard Arabic
Language ID(s):	spa, arb
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2014T18 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Chen, Song, et al. ACE 2007 Multilingual Training Corpus LDC2014T18. Web Download. Philadelphia: Linguistic Data Consortium, 2014.
Related Works:
isContinuationOf
LDC2004T09 TIDES Extraction (ACE) 2003 Multilingual Training Data
LDC2005T09 ACE 2004 Multilingual Training Corpus
LDC2006T06 ACE 2005 Multilingual Training Corpus
isSimilarWith
LDC2003T11 ACE-2 Version 1.0
LDC2005T07 ACE Time Normalization (TERN) 2004 English Training Data v 1.0
LDC2008T03 ACE 2005 English SpatialML Annotations
LDC2010T09 ACE 2005 Mandarin SpatialML Annotations
LDC2010T18 ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0
LDC2011T02 ACE 2005 English SpatialML Annotations Version 2
LDC2011T08 Datasets for Generic Relation Extraction (reACE)
LDC2015T20 ACE 2007 Spanish DevTest - Pilot Evaluation

Introduction

ACE 2007 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC) and contains the complete set of Arabic and Spanish training data for the 2007 Automatic Content Extraction (ACE) technology evaluation: Arabic and Spanish newswire data and Arabic weblogs, all annotated for entities and temporal expressions.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

The LDC Catalog contains a series of publications from the ACE project and from researchers building on that work. Among them are:

ACE-2 Version 1.0 (LDC2003T11)
TIDES Extraction (ACE) 2003 Multilingual Training Data (LDC2004T09)
ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07)
ACE 2004 Multilingual Training Corpus (LDC2005T09)
ACE 2005 Multilingual Training Corpus (LDC2006T06)
ACE 2005 English SpatialML Annotations (LDC2008T03)
ACE 2005 Mandarin SpatialML Annotations (LDC2010T09)
ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 (LDC2010T18)
ACE 2005 English SpatialML Annotations Version 2 (LDC2011T02)
Datasets for Generic Relation Extraction (reACE) (LDC2011T08)

Data

The Arabic data is composed of newswire (60%) published in October 2000-December 2000 and weblogs (40%) published during the period November 2004-February 2005. The Spanish data set consists entirely of newswire material from multiple sources published in January 2005-April 2005.

Data selection was semi-automatic. A document pool was established for each language based on genre and epoch requirements. Humans reviewed the pool to select individual documents suitable for ACE annotation, such as documents that were representative of their genre and contained targeted ACE entity types. One annotator completed the entity and temporal expression (TIMEX2) markup in the first pass annotation. This work was reviewed in the second pass by a senior annotator. TIMEX2 values were normalized by an annotator specifically trained for that task.

The table below describes the amount of data included in the current release and its annotation status. Corpus content for each language and data type is represented in the three stages of annotation: first pass annotation (1P), second pass annotation (2P) and TIMEX2 normalization and additional quality control (NORM).

Arabic
         Words                            Files
         1P        2P        NORM         1P     2P     NORM
NW       58,015    58,015    58,015       257    257    257
WL       40,338    40,338    40,338       121    121    121
Total    98,353    98,353    98,353       378    378    378

Spanish
         Words                            Files
         1P        2P        NORM         1P     2P     NORM
NW       100,401   100,401   100,401      352    352    352
Total    100,401   100,401   100,401      352    352    352

NW = newswire, WL = weblogs.
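The totals and the approximate 60%/40% newswire/weblog split for Arabic can be checked directly from the table figures. The short Python sketch below does that arithmetic; the dictionary layout and the function name `genre_shares` are illustrative choices, not part of the corpus documentation, and the sketch assumes the percentages quoted in the text refer to word counts rather than file counts.

```python
# Word and file counts copied from the table (identical across 1P, 2P and NORM).
WORDS = {
    "Arabic": {"NW": 58015, "WL": 40338},
    "Spanish": {"NW": 100401},
}
FILES = {
    "Arabic": {"NW": 257, "WL": 121},
    "Spanish": {"NW": 352},
}


def genre_shares(counts):
    """Return each genre's fraction of the total for one language."""
    total = sum(counts.values())
    return {genre: n / total for genre, n in counts.items()}


if __name__ == "__main__":
    for language, counts in WORDS.items():
        total = sum(counts.values())
        shares = genre_shares(counts)
        summary = ", ".join(f"{g}: {s:.1%}" for g, s in shares.items())
        print(f"{language}: {total:,} words ({summary})")
```

By word count the Arabic split comes out close to the stated 60/40; computed over files instead, the newswire share is noticeably higher, since the weblog files are longer on average.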

For a given document, each of the three directories "1p", "2p" and "timex2norm" contains a source .sgm file together with the corresponding .ag.xml and .apf.xml annotation files. In other words, for each newswire story or weblog entry, the three annotation directories each contain an identical copy of the source text (SGML .sgm file) along with distinct versions of the associated annotations (XML .ag.xml and .apf.xml files and plain text .tab files). Note that in many cases two annotation stages have produced identical output for a given source text, because no changes were made in the later stage. All files are presented in UTF-8.
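The layout described above lends itself to a simple per-document index. The following Python sketch is one possible way to build it, under stated assumptions: the files sit directly inside directories named "1p", "2p" and "timex2norm" somewhere below a corpus root, and the part of each file name before the suffix identifies the document. The function names `index_corpus` and `stages_identical` are hypothetical helpers, not part of the release.

```python
import filecmp
from collections import defaultdict
from pathlib import Path

# Directory names and file suffixes as given in the corpus description.
STAGES = ("1p", "2p", "timex2norm")
SUFFIXES = (".sgm", ".ag.xml", ".apf.xml", ".tab")


def split_name(filename):
    """Split 'DOCID.apf.xml' into ('DOCID', '.apf.xml'); None if unrecognized."""
    for suffix in SUFFIXES:
        if filename.endswith(suffix):
            return filename[: -len(suffix)], suffix
    return None


def index_corpus(root):
    """Map doc_id -> stage -> suffix -> path for every recognized file.

    Assumption: each file's immediate parent directory is one of STAGES.
    """
    index = defaultdict(lambda: defaultdict(dict))
    for path in Path(root).rglob("*"):
        if not path.is_file() or path.parent.name not in STAGES:
            continue
        parsed = split_name(path.name)
        if parsed is None:
            continue
        doc_id, suffix = parsed
        index[doc_id][path.parent.name][suffix] = path
    return index


def stages_identical(index, doc_id, stage_a, stage_b, suffix=".apf.xml"):
    """True if both stages hold byte-identical annotation files for doc_id."""
    file_a = index[doc_id][stage_a].get(suffix)
    file_b = index[doc_id][stage_b].get(suffix)
    if file_a is None or file_b is None:
        return False
    return filecmp.cmp(file_a, file_b, shallow=False)
```

With an index built from the corpus root, `stages_identical(index, doc_id, "1p", "2p")` would flag the documents whose second pass left the first pass annotation unchanged. Files should be opened with `encoding="utf-8"` when their contents are read.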
