TIDES Extraction (ACE) 2003 Multilingual Training Data

Item Name: TIDES Extraction (ACE) 2003 Multilingual Training Data
Author(s): Alexis Mitchell, Stephanie Strassel, Mark Przybocki, JK Davis, George R. Doddington, Ralph Grishman, Adam Meyers, Ada Brunstein, Lisa Ferro, Beth Sundheim
LDC Catalog No.: LDC2004T09
ISBN: 1-58563-292-9
ISLRN: 685-740-491-198-0
DOI: https://doi.org/10.35111/7xtm-ys65
Release Date: February 16, 2004
Member Year(s): 2004
DCMI Type(s): Text
Data Source(s): transcribed speech, newswire, broadcast news
Project(s): TIDES, GALE, ACE
Application(s): automatic content extraction, information detection
Language(s): English, Standard Arabic, Mandarin Chinese
Language ID(s): eng, arb, cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2004T09 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Mitchell, Alexis, et al. TIDES Extraction (ACE) 2003 Multilingual Training Data LDC2004T09. Web Download. Philadelphia: Linguistic Data Consortium, 2004.

Introduction

TIDES Extraction (ACE) 2003 Multilingual Training Data was produced by Linguistic Data Consortium (LDC) and carries catalog number LDC2004T09 and ISBN 1-58563-292-9.

This corpus was created and previously distributed by Linguistic Data Consortium as an e-corpus (catalog number LDC2003E18) to support the September 2003 TIDES Extraction (ACE) program evaluation. For information regarding the ACE program and ACE technology evaluations administered by the National Institute of Standards and Technology, please visit the NIST website. For more information about ACE annotation and ongoing ACE corpus development, including annotation guidelines, task definitions, annotation tools and other project documentation, please visit LDC's ACE Project page.

The source material for this corpus consists of broadcast and newswire data drawn from October 2000 through the end of December 2000. The sources are listed below.

Newswire:

  • Arabic
    • Agence France-Presse (AFA)
    • Al Hayat (ALH)
    • An-Nahar (ANN)
  • Chinese
    • Xinhua Newswire (XIN)
    • Zaobao (ZBN)
  • English
    • New York Times Newswire Service (NYT)
    • Associated Press Worldstream Service (APW)
Broadcast News:

  • Arabic
    • Voice of America, Arabic news programs (VAR)
    • Nile TV (NTV)
  • Chinese
    • China National Radio (CNR)
    • China Television System (CTS)
    • Voice of America, Chinese news programs (VOM)
    • China TV Program Agency (CTV)
    • China Broadcasting System (CBS)
  • English
    • Cable News Network, "Headline News" (CNN)
    • American Broadcasting Co., "World News Tonight" (ABC)
    • Public Radio International, "The World" (PRI)
    • Voice of America, English news programs (VOA)
    • MSNBC, "The News With Brian Williams" (MNB)
    • National Broadcasting Company, "Nightly News" (NBC)

Data

Annotations for this corpus were produced by Linguistic Data Consortium to support the following tasks broken down by language:

  Arabic

  • Entity Detection and Tracking (EDT)

  Chinese

  • Entity Detection and Tracking (EDT)
  • Relation Detection and Characterization (RDC)

  English

  • Entity Detection and Tracking (EDT)
  • Relation Detection and Characterization (RDC)

This publication includes both the source data files in .sgm format and the annotation files in ACE Pilot Format (APF), as well as the ACE DTD and supporting documentation.
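As a rough illustration of a first look at an annotation file, the following Python sketch counts element tags in an XML document using only the standard library. It assumes nothing about APF beyond its being well-formed XML: the sample string and its element and attribute names are hypothetical stand-ins, not taken from the ACE DTD, and the function name `count_tags` is introduced here purely for illustration. The ACE DTD shipped with the corpus is the authority on the actual element names.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_tags(xml_text: str) -> Counter:
    """Return a Counter mapping each element tag to its number of occurrences."""
    root = ET.fromstring(xml_text)
    # root.iter() walks the root element and all of its descendants.
    return Counter(elem.tag for elem in root.iter())


# Hypothetical stand-in document; real APF files follow the ACE DTD.
SAMPLE = """<doc>
  <item kind="a"><span start="0" end="4"/></item>
  <item kind="b"><span start="10" end="15"/></item>
</doc>"""

if __name__ == "__main__":
    for tag, n in sorted(count_tags(SAMPLE).items()):
        print(tag, n)
```

For a file on disk, `ET.parse(path).getroot()` could replace the `ET.fromstring` call. Counting tags first is a cheap way to see which element types a file actually contains before writing schema-specific code.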

The data files for each language are divided by source type (bnews, nwire). For Chinese, the annotation files (.apf.xml) are encoded in UTF-8, and the source files (.sgm) are included in both GB and UTF-8 encodings. The following tables give word counts (character counts for Chinese) and file counts by language and source.
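Since the Chinese source files are supplied in both GB and UTF-8, transcoding is normally unnecessary, but a pipeline that holds only the GB copies could convert them along the following lines. This is a minimal Python sketch that assumes the GB files decode with Python's `gb18030` codec (a superset of GB2312 and GBK; the exact GB variant is not stated here). The function name `gb_to_utf8` and the sample string are made up for illustration and are not corpus text.

```python
def gb_to_utf8(raw: bytes, gb_codec: str = "gb18030") -> bytes:
    """Decode GB-encoded bytes and re-encode them as UTF-8.

    gb_codec is an assumption; pass "gb2312" or "gbk" if that is
    known to match the files at hand.
    """
    return raw.decode(gb_codec).encode("utf-8")


if __name__ == "__main__":
    # Made-up sample, encoded as GB2312 to simulate a GB source file.
    sample_text = "<DOC>新华社</DOC>"
    gb_bytes = sample_text.encode("gb2312")
    utf8_bytes = gb_to_utf8(gb_bytes)
    # Round trip: the UTF-8 output decodes back to the original text.
    assert utf8_bytes.decode("utf-8") == sample_text
    print(len(gb_bytes), len(utf8_bytes))
```

Byte lengths differ between the two encodings, so any offsets computed in bytes would not carry over from one copy to the other.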

Arabic

Source  Words  Files
AFA     11154     66
ALH      7437     20
ANN      7734     20
VAR      8360     57
NTV      7512     43
Total   42197    206

Chinese

Source  Characters  Files
XIN          28157     57
ZBN          25591     42
CNR           4758     21
CTS           7160     22
VOM          18160     42
CTV           6017     18
CBS           8130     19
Total        97973    221

English

Source  Words  Files
NYT     18983     24
APW     38222     81
CNN      5706     54
ABC      4453     15
PRI      9785     27
VOA      4203     28
MNB      4356      8
NBC      4976     15
Total   90684    252
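
The totals in the tables above can be checked with a short Python sketch. The figures are copied from the tables; the dictionary layout and the function name `table_totals` are illustrative choices only.

```python
# (words or characters, files) per source, copied from the tables above.
COUNTS = {
    "Arabic": {
        "AFA": (11154, 66), "ALH": (7437, 20), "ANN": (7734, 20),
        "VAR": (8360, 57), "NTV": (7512, 43),
    },
    "Chinese": {
        "XIN": (28157, 57), "ZBN": (25591, 42), "CNR": (4758, 21),
        "CTS": (7160, 22), "VOM": (18160, 42), "CTV": (6017, 18),
        "CBS": (8130, 19),
    },
    "English": {
        "NYT": (18983, 24), "APW": (38222, 81), "CNN": (5706, 54),
        "ABC": (4453, 15), "PRI": (9785, 27), "VOA": (4203, 28),
        "MNB": (4356, 8), "NBC": (4976, 15),
    },
}

# "Total" rows as stated in the tables.
STATED_TOTALS = {
    "Arabic": (42197, 206),
    "Chinese": (97973, 221),
    "English": (90684, 252),
}


def table_totals(rows):
    """Sum the unit counts and the file counts over all sources in one table."""
    units = sum(u for u, _ in rows.values())
    files = sum(f for _, f in rows.values())
    return units, files


if __name__ == "__main__":
    for lang, rows in COUNTS.items():
        # Each computed pair should match the table's stated Total row.
        assert table_totals(rows) == STATED_TOTALS[lang], lang
        print(lang, *table_totals(rows))
```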

Updates

There are no updates available at this time.

© 2000 American Broadcasting Corporation © 2000 Cable News Network, Inc. © 2000 Press Association, Inc. © 2000 New York Times © 2000 National Broadcasting Company, Inc. © 2000 Public Radio International © 2000 Agency France Press © 2000 Al Hayat © 2000 An-Nahar © 2000 Nile TV © 2000 Xinhua News © 2000 SPH AsiaOne Ltd. © 2000 China National Radio © 2000 China Television System © 2000 China TV Program Agency © 2000 China Broadcasting System
