ACE 2007 Multilingual Training Corpus

Item Name: ACE 2007 Multilingual Training Corpus
Author(s): Zhiyi Song, Kazuaki Maeda, Christopher Walker, Stephanie Strassel
LDC Catalog No.: LDC2014T18
ISBN: 1-58563-688-6
ISLRN: 600-375-253-846-9
Release Date: September 15, 2014
Member Year(s): 2014
DCMI Type(s): Text
Data Source(s): weblogs, newswire
Project(s): ACE
Application(s): automatic content extraction
Language(s): Spanish, Standard Arabic
Language ID(s): spa, arb
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2014T18 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Song, Zhiyi, et al. ACE 2007 Multilingual Training Corpus LDC2014T18. Web Download. Philadelphia: Linguistic Data Consortium, 2014.

Introduction

ACE 2007 Multilingual Training Corpus was developed by the Linguistic Data Consortium (LDC). It contains the complete set of Arabic and Spanish training data for the 2007 Automatic Content Extraction (ACE) technology evaluation: Arabic and Spanish newswire data and Arabic weblogs, all annotated for entities and temporal expressions.

The objective of the ACE program was to develop automatic content extraction technology to support automatic processing of human language in text form from a variety of sources including newswire, broadcast programming and weblogs. In the 2007 evaluation, participants were tested on system performance for the recognition of entities, values, temporal expressions, relations, and events in Chinese and English and for the recognition of entities and temporal expressions in Arabic and Spanish. LDC's work in the ACE program is described in more detail on the LDC ACE project pages.

The LDC Catalog contains a series of publications from the ACE project and from researchers building on that work. Among them are:

  • ACE-2 Version 1.0 (LDC2003T11)
  • TIDES Extraction (ACE) 2003 Multilingual Training Data (LDC2004T09)
  • ACE Time Normalization (TERN) 2004 English Training Data v 1.0 (LDC2005T07)
  • ACE 2004 Multilingual Training Corpus (LDC2005T09)
  • ACE 2005 Multilingual Training Corpus (LDC2006T06)
  • ACE 2005 English SpatialML Annotations (LDC2008T03)
  • ACE 2005 Mandarin SpatialML Annotations (LDC2010T09)
  • ACE Time Normalization (TERN) 2004 English Evaluation Data V1.0 (LDC2010T18)
  • ACE 2005 English SpatialML Annotations Version 2 (LDC2011T02)
  • Datasets for Generic Relation Extraction (reACE) (LDC2011T08)

Data

The Arabic data is composed of newswire (60%) published from October through December 2000 and weblogs (40%) published from November 2004 through February 2005. The Spanish data set consists entirely of newswire material from multiple sources published from January through April 2005.
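
The genre split quoted above can be checked against the Arabic word and file counts reported in the table later in this section. The following is a minimal sketch of that arithmetic; the dictionary layout and the function name `shares` are illustrative choices, not part of the corpus. By word count the split comes out at roughly 59% newswire and 41% weblogs, which is consistent with the rounded 60%/40% figure, whereas by file count it is roughly 68% and 32%.

```python
# Word and file counts for the Arabic data, copied from the table in this section.
ARABIC = {
    "NW": {"words": 58_015, "files": 257},  # newswire
    "WL": {"words": 40_338, "files": 121},  # weblogs
}


def shares(counts, key):
    """Return each genre's percentage of the total for the given key."""
    total = sum(row[key] for row in counts.values())
    return {genre: 100 * row[key] / total for genre, row in counts.items()}


if __name__ == "__main__":
    for key in ("words", "files"):
        pct = shares(ARABIC, key)
        print(key, {genre: round(p, 1) for genre, p in pct.items()})
```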

Data selection was semi-automatic. A document pool was established for each language based on genre and epoch requirements. Human reviewers then selected individual documents from the pool that were suitable for ACE annotation, such as documents that were representative of their genre and contained targeted ACE entity types. One annotator completed the entity and temporal expression (TIMEX2) markup in the first-pass annotation. A senior annotator reviewed this work in the second pass. TIMEX2 values were normalized by an annotator specifically trained for that task.

The table below describes the amount of data included in the current release and its annotation status. Corpus content for each language and data type is represented in the three stages of annotation: first pass annotation (1P), second pass annotation (2P) and TIMEX2 normalization and additional quality control (NORM).

Arabic
        Words                        Files
        1P       2P       NORM       1P    2P    NORM
NW      58,015   58,015   58,015     257   257   257
WL      40,338   40,338   40,338     121   121   121
Total   98,353   98,353   98,353     378   378   378

Spanish
        Words                        Files
        1P       2P       NORM       1P    2P    NORM
NW      100,401  100,401  100,401    352   352   352
Total   100,401  100,401  100,401    352   352   352

Each document appears in each of the three annotation directories, "1p", "2p" and "timex2norm". For each newswire story or weblog entry, every one of these directories contains an identical copy of the source text (SGML .sgm file) along with that stage's version of the associated annotations (XML .ag.xml and .apf.xml files and plain-text .tab files). In many cases, two annotation stages produced identical output for a given source text because no changes were made in the later stage. All files are presented in UTF-8.
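
Because consecutive stages often produce identical annotation files, it can be useful to find out which documents actually changed between passes. The sketch below is one way to do that, under stated assumptions: it assumes a single directory (here called `genre_dir`, a hypothetical name) that directly contains the "1p", "2p" and "timex2norm" subdirectories, and that the files for a document share a common base name. The actual directory nesting in the release may differ, so the paths would need adjusting. The function name `unchanged_between_stages` is likewise illustrative.

```python
import hashlib
from pathlib import Path

# Stage directory names as given in the corpus description.
STAGES = ("1p", "2p", "timex2norm")


def file_digest(path):
    """Return the SHA-256 hex digest of a file's bytes."""
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()


def unchanged_between_stages(genre_dir, suffix=".apf.xml"):
    """Map each document ID to the consecutive stage pairs whose
    annotation files (with the given suffix) are byte-identical.

    Assumes genre_dir/<stage>/<doc_id>.sgm and
    genre_dir/<stage>/<doc_id><suffix>; this layout is an assumption.
    """
    genre_dir = Path(genre_dir)
    result = {}
    # Use the source files in the first stage directory to enumerate documents.
    for sgm in sorted((genre_dir / STAGES[0]).glob("*.sgm")):
        doc_id = sgm.name[: -len(".sgm")]
        digests = []
        for stage in STAGES:
            annotation = genre_dir / stage / (doc_id + suffix)
            digests.append(file_digest(annotation) if annotation.is_file() else None)
        result[doc_id] = [
            (STAGES[i], STAGES[i + 1])
            for i in range(len(STAGES) - 1)
            if digests[i] is not None and digests[i] == digests[i + 1]
        ]
    return result
```

A document whose list is empty changed at every stage (or is missing an annotation file), while a list containing `("1p", "2p")` indicates that the second pass left the first-pass annotation untouched. Comparing raw bytes is a deliberately strict choice; files that differ only in whitespace or metadata would count as changed.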

Updates

None at this time.
