TIDES Extraction (ACE) 2003 Multilingual Training Data

Item Name: TIDES Extraction (ACE) 2003 Multilingual Training Data
Author(s): Alexis Mitchell, Stephanie Strassel, Mark Przybocki, JK Davis, George R. Doddington, Ralph Grishman, Adam Meyers, Ada Brunstein, Lisa Ferro, Beth Sundheim
LDC Catalog No.: LDC2004T09
ISBN: 1-58563-292-9
ISLRN: 685-740-491-198-0
DOI: https://doi.org/10.35111/7xtm-ys65
Release Date: February 16, 2004
Member Year(s): 2004
DCMI Type(s): Text
Data Source(s): broadcast news, newswire, transcribed speech
Project(s): ACE, GALE, TIDES
Application(s): automatic content extraction, information detection
Language(s): English, Standard Arabic, Mandarin Chinese
Language ID(s): eng, arb, cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2004T09 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Mitchell, Alexis, et al. TIDES Extraction (ACE) 2003 Multilingual Training Data LDC2004T09. Web Download. Philadelphia: Linguistic Data Consortium, 2004.

Introduction

TIDES Extraction (ACE) 2003 Multilingual Training Data was produced by the Linguistic Data Consortium (LDC) and contains approximately 231,000 words of broadcast news and newswire text in Arabic, Chinese, and English annotated for entities and relations.

This corpus was created by LDC and previously distributed as an e-corpus (catalog number LDC2003E18) to support the September 2003 TIDES Extraction (ACE) program evaluation. For more information about ACE annotation and ongoing ACE corpus development, including annotation guidelines, task definitions, annotation tools and other project documentation, please visit LDC's ACE Project page.

Data

The source material for this corpus consists of broadcast news and newswire data drawn from October 2000 through the end of December 2000. The sources are listed below with their word and file counts. Arabic data is annotated for Entity Detection and Tracking (EDT) only; Chinese and English data are annotated for both EDT and Relation Detection and Characterization (RDC).

Language Annotation Genre           Source                         Program                           Words  Files
Arabic   EDT        Newswire        Agence France-Presse                                            11,154     66
                                    Al-Hayat                                                         7,437     20
                                    An-Nahar                                                         7,734     20
                    Broadcast News  Voice of America               Arabic news programs              8,360     57
                                    Nile TV                                                          7,512     43
                    Totals                                                                          42,197    206
Chinese  EDT, RDC   Newswire        Xinhua                                                          28,157     57
                                    Zaobao                                                          25,591     42
                    Broadcast News  China National Radio                                             4,758     21
                                    China Television System                                          7,160     22
                                    Voice of America               Chinese news programs            18,160     42
                                    China TV Program Agency                                          6,017     18
                                    China Broadcasting System                                        8,130     19
                    Totals                                                                          97,973    221
English  EDT, RDC   Newswire        New York Times                                                  18,983     24
                                    Associated Press Worldstream                                    38,222     81
                    Broadcast News  Cable News Network             "Headline News"                   5,706     54
                                    American Broadcasting Co.      "World News Tonight"              4,453     15
                                    Public Radio International     "The World"                       9,785     27
                                    Voice of America               English news programs             4,203     28
                                    MSNBC                          "The News With Brian Williams"    4,356      8
                                    National Broadcasting Company  "Nightly News"                    4,976     15
                    Totals                                                                          90,684    252
Grand Totals                                                                                       230,854    679
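
As a quick sanity check, the per-language and grand totals in the table above can be recomputed from the per-source rows. The following is a minimal sketch in Python; the dictionary layout and the names COUNTS and totals are illustrative choices, not part of the corpus or its documentation.

```python
# (words, files) per source, transcribed from the table above,
# in the order the sources are listed for each language.
COUNTS = {
    "Arabic": [(11154, 66), (7437, 20), (7734, 20), (8360, 57), (7512, 43)],
    "Chinese": [(28157, 57), (25591, 42), (4758, 21), (7160, 22),
                (18160, 42), (6017, 18), (8130, 19)],
    "English": [(18983, 24), (38222, 81), (5706, 54), (4453, 15),
                (9785, 27), (4203, 28), (4356, 8), (4976, 15)],
}


def totals(rows):
    """Sum the word counts and file counts over a list of (words, files) rows."""
    return sum(w for w, _ in rows), sum(f for _, f in rows)


# Totals per language, then over every row in the table.
per_language = {lang: totals(rows) for lang, rows in COUNTS.items()}
grand_total = totals([row for rows in COUNTS.values() for row in rows])

for lang, (words, files) in per_language.items():
    print(f"{lang}: {words:,} words in {files} files")
print(f"Grand total: {grand_total[0]:,} words in {grand_total[1]} files")
```

The grand total of 230,854 words is the figure rounded to "approximately 231,000 words" in the introduction.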

This publication includes both the source data files in .sgm format and the annotation files in ACE Pilot Format (.apf.xml), as well as the ACE DTD and supporting documentation.

The data files for each language are divided by source type (bnews, nwire). For Chinese, the annotation files (.apf.xml) are encoded in UTF-8, and the source files (.sgm) are included in both GB and UTF-8 encodings.
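
Because the Chinese source files ship in two encodings, a reader has to pick the matching decoder when loading a .sgm file. The sketch below shows one way to do that in Python. It rests on several assumptions that are not stated in this description: that "GB" can be decoded with Python's gb18030 codec (a superset of GB2312 and GBK), that the .sgm files use simple angle-bracket tags, and the function names read_sgm and strip_tags are hypothetical helpers introduced only for illustration.

```python
import re
from pathlib import Path

# Matches a simple SGML tag such as <DOC> or </TEXT> (assumed markup style).
TAG = re.compile(r"<[^>]+>")


def read_sgm(path, encoding="utf-8"):
    """Read one .sgm source file as bytes and decode it with the given codec.

    Pass encoding="gb18030" for the GB-encoded Chinese files (assumption:
    gb18030 is used here because it is a superset of GB2312 and GBK).
    """
    return Path(path).read_bytes().decode(encoding)


def strip_tags(sgml):
    """Return the text with every angle-bracket tag removed."""
    return TAG.sub("", sgml)
```

Decoding the GB and UTF-8 copies of the same document and comparing the resulting strings is a simple way to confirm that the right codec was chosen.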

Updates

There are no updates available at this time.

© 2000 American Broadcasting Corporation © 2000 Cable News Network, Inc. © 2000 Press Association, Inc. © 2000 New York Times © 2000 National Broadcasting Company, Inc. © 2000 Public Radio International © 2000 Agency France Press © 2000 Al Hayat © 2000 An-Nahar © 2000 Nile TV © 2000 Xinhua News © 2000 SPH AsiaOne Ltd. © 2000 China National Radio © 2000 China Television System © 2000 China TV Program Agency © 2000 China Broadcasting System
