ACE 2004 Multilingual Training Corpus

Item Name: ACE 2004 Multilingual Training Corpus
Author(s): Alexis Mitchell, Stephanie Strassel, Shudong Huang, Ramez Zakhary
LDC Catalog No.: LDC2005T09
ISBN: 1-58563-334-8
ISLRN: 789-870-824-708-5
Release Date: March 15, 2005
Member Year(s): 2005
DCMI Type(s): Text
Data Source(s): newswire, broadcast news
Project(s): TIDES, GALE, ACE
Application(s): automatic content extraction
Language(s): English, Standard Arabic, Mandarin Chinese
Language ID(s): eng, arb, cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2005T09 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Mitchell, Alexis, et al. ACE 2004 Multilingual Training Corpus LDC2005T09. Web Download. Philadelphia: Linguistic Data Consortium, 2005.

Introduction

This file contains documentation on the ACE 2004 Multilingual Training Corpus, Linguistic Data Consortium (LDC) catalog number LDC2005T09 and ISBN 1-58563-334-8.

This publication contains the complete set of English, Arabic and Chinese training data for the 2004 Automatic Content Extraction (ACE) technology evaluation. The corpus consists of data of various types annotated for entities and relations. It was created by the Linguistic Data Consortium with support from the ACE Program and additional assistance from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. This data was previously distributed as an e-corpus (LDC2004E17) to participants in the 2004 ACE evaluation.

The objective of the ACE program is to develop automatic content extraction technology to support automatic processing of human language in text form. In September 2004, sites were evaluated on system performance in six areas: Entity Detection and Recognition (EDR), Entity Mention Detection (EMD), EDR Co-reference, Relation Detection and Recognition (RDR), Relation Mention Detection (RMD), and RDR given reference entities. All tasks were evaluated in three languages: English, Chinese and Arabic.

The current publication consists of the official training data for these evaluation tasks. A seventh evaluation area, Timex Detection and Recognition, is supported by the ACE Time Normalization (TERN) 2004 English Training Data Corpus (LDC2005T07). The TERN corpus source data largely overlaps with the English source data contained in the current release.

A complete description of the ACE 2004 Evaluation can be found on the ACE Program website maintained by the National Institute of Standards and Technology (NIST): http://www.nist.gov/speech/tests/ace/

For more information about linguistic resources for the ACE program, including annotation guidelines, task definitions, free annotation tools and other documentation, please visit LDC's ACE website.

Samples

The files listed below are samples from the English data. They are representative of the material in this corpus.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.
