This release contains Version 1.0 of the ACE2 corpus, created and distributed by the Linguistic Data Consortium (LDC) to support the Automatic Content Extraction (ACE) program.

This release contains two sets of data. Each of these sets is further divided by source: broadcast news, newspaper, and newswire. The 'ace2_train' directory contains data originally developed as training material for the February 2002 evaluation and used again for the September 2002 evaluation. The 'ace2_devtest' directory contains data originally developed as test data for the February 2002 evaluation and later used as devtest data for the September 2002 evaluation.

The broadcast and newswire source data is drawn from a subset of the TDT2 Multilanguage Text Corpus Version 4.0 [LDC2001T57]; this has been supplemented with additional newspaper data from the Washington Post. A portion of the 'ace2_train' broadcast data was drawn from the 1997 English Broadcast News Transcripts (Hub-4) corpus (LDC98T28). All material comes from the first half of 1998.

The sources for the broadcast, newswire, and newspaper data are listed below.

Newswire
    New York Times Newswire Service (NYT)
    Associated Press Worldstream Service (APW)

Broadcast News
    Cable News Network, "Headline News" (CNN for TDT2, ed for Hub-4)
    American Broadcasting Co., "World News Tonight" (ABC for TDT2, ea for Hub-4)
    Public Radio International, "The World" (PRI)
    Voice of America, English news programs (VOA)
    MSNBC, "The News With Brian Williams" (MNB)
    National Broadcasting Company, "Nightly News" (NBC)

Newspaper
    Washington Post (WAP)

This publication includes the source data files in .sgm format, the annotation files in ACE Pilot Format (APF), supporting documentation, and version 2.0.1 of the ACE DTD, which was used for the September 2002 ACE Evaluation.
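The source codes above can be captured in a small lookup table when sorting files by source type. The following Python sketch is illustrative only: the codes and descriptions come from the list above, but the assumption that a file name begins with its source code, and the example file names themselves, are hypothetical and should be checked against the actual release.

```python
# Source codes as listed in this README, mapped to (source type, description).
SOURCES = {
    "NYT": ("newswire", "New York Times Newswire Service"),
    "APW": ("newswire", "Associated Press Worldstream Service"),
    "CNN": ("broadcast news", 'Cable News Network, "Headline News" (TDT2)'),
    "ed": ("broadcast news", 'Cable News Network, "Headline News" (Hub-4)'),
    "ABC": ("broadcast news", 'American Broadcasting Co., "World News Tonight" (TDT2)'),
    "ea": ("broadcast news", 'American Broadcasting Co., "World News Tonight" (Hub-4)'),
    "PRI": ("broadcast news", 'Public Radio International, "The World"'),
    "VOA": ("broadcast news", "Voice of America, English news programs"),
    "MNB": ("broadcast news", 'MSNBC, "The News With Brian Williams"'),
    "NBC": ("broadcast news", 'National Broadcasting Company, "Nightly News"'),
    "WAP": ("newspaper", "Washington Post"),
}


def source_type(filename: str) -> str:
    """Return the source type for a file name.

    Assumption (not stated in this README): the file name starts with
    its source code. The match is case-sensitive, so the lowercase
    Hub-4 codes 'ea' and 'ed' do not collide with the uppercase codes.
    """
    for code, (kind, _description) in SOURCES.items():
        if filename.startswith(code):
            return kind
    raise ValueError(f"unrecognized source code in {filename!r}")
```

For instance, under the naming assumption above, source_type("APW_example.sgm") returns "newswire" and source_type("ea_example.sgm") returns "broadcast news".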
This release contains 179,007 words of source data in 519 files, broken down as follows:

              Words             Files
          Train     Dev     Train   Dev
NYT       32892    7487        48     9
APW       29144    7037        82    20
CNN        2290    2653        69    11
ABC        1588    2687        24    10
PRI        1272    5284        43     9
VOA         594    2611        24     7
MNB           0    2539         0     6
NBC           0    2633         0     8
WAP       60247   15070        76    17
ea         2019       0        31     0
ed         1094       0        25     0
---------------------------------------------
Total    131023   47984       422    97

Annotations for the ACE-2 corpus were produced by the Linguistic Data Consortium to support two research tasks: Entity Detection and Tracking (EDT) and Relation Detection and Characterization (RDC). The annotation guidelines for these tasks are provided in the /docs directory of this release.

For more information about ACE annotation and ongoing ACE corpus development, including annotation guidelines, task definitions, annotation tools, and other project documentation, please visit LDC's ACE Project page at
http://www.ldc.upenn.edu/Projects/ACE/

For information regarding the ACE program and ACE technology evaluations administered by the National Institute of Standards and Technology (NIST), visit
http://www.nist.gov/iaui/894.01/tests/ace/index.htm

Please contact Stephanie Strassel (strassel@ldc.upenn.edu) or Alexis Mitchell (alexis.mitchell@ldc.upenn.edu) with any questions regarding this corpus.

Authors: Alexis Mitchell, Stephanie Strassel, Mark Przybocki, JK Davis, George Doddington, Ralph Grishman, Adam Meyers, Ada Brunstein, Lisa Ferro, Beth Sundheim

------------------------------------------------------------

Copyright Information

(c) 1998 Los Angeles Times-Washington Post News Service, Inc.
(c) 1998 American Broadcasting Corporation
(c) 1998 Cable News Network, Inc.
(c) 1998 Press Association, Inc.
(c) 1998 New York Times
(c) 1998 National Broadcasting Company, Inc.
(c) 1998 Public Radio International

"The World" is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.