This release contains Version 1.0 of the ACE2 corpus, created and
distributed by the Linguistic Data Consortium (LDC) to support the Automatic
Content Extraction (ACE) program.  The release contains two sets of data,
each further divided by source: broadcast news, newspaper, and newswire.

The 'ace2_train' directory contains data originally developed as training
material for the February 2002 evaluation and again for the September 2002
evaluation.  The 'ace2_devtest' directory contains data originally developed
as test data for the February 2002 evaluation and later used as devtest
data for the September 2002 evaluation.

The broadcast and newswire source data is drawn from a subset of the TDT2
Multilanguage Text Corpus Version 4.0 (LDC2001T57); this has been
supplemented with newspaper data from the Washington Post.  A portion of
the 'ace2_train' broadcast data was drawn from the 1997 English Broadcast
News Transcripts (Hub-4) corpus (LDC98T28).

All material comes from the first half of 1998.  The sources for the
broadcast, newswire, and newspaper data are listed below.

Newswire
 New York Times Newswire Service (NYT)
 Associated Press Worldstream Service (APW)

Broadcast News
 Cable News Network, "Headline News" (CNN for TDT2, ed for Hub-4)
 American Broadcasting Co., "World News Tonight" (ABC for TDT2, ea for Hub-4)
 Public Radio International, "The World" (PRI)
 Voice of America, English news programs (VOA)
 MSNBC, "The News With Brian Williams" (MNB)
 National Broadcasting Company, "Nightly News" (NBC)

Newspaper
 Washington Post (WAP)


This publication includes the source data files in .sgm format, the
annotation files in ACE Pilot Format (APF), supporting documentation, and
version 2.0.1 of the ACE DTD, which was used for the September 2002 ACE
evaluation.

This release contains 179,007 words of source data in 519 files, broken down
as follows:

           Words           Files
         Train     Dev    Train   Dev
NYT      32892    7487       48     9
APW      29144    7037       82    20
CNN       2290    2653       69    11
ABC       1588    2687       24    10
PRI       1272    5284       43     9
VOA        594    2611       24     7
MNB          0    2539        0     6
NBC          0    2633        0     8
WAP      60247   15070       76    17
ea        2019       0       31     0
ed        1094       0       25     0
-------------------------------------
Total   131023   47984      422    97
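
The per-source rows can also be regrouped by source type (newswire,
broadcast news, newspaper).  The following is a minimal Python sketch, not
part of the release: the names COUNTS, GENRE and totals_by_genre are
hypothetical, the figures are copied by hand from the rows of the table
above, and the grouping follows the source list given earlier in this file.
Sums are recomputed from the rows rather than read from the Total line.

```python
from collections import defaultdict

# Figures copied from the table rows in this README (hypothetical layout):
# code -> (train_words, dev_words, train_files, dev_files)
COUNTS = {
    "NYT": (32892, 7487, 48, 9),
    "APW": (29144, 7037, 82, 20),
    "CNN": (2290, 2653, 69, 11),
    "ABC": (1588, 2687, 24, 10),
    "PRI": (1272, 5284, 43, 9),
    "VOA": (594, 2611, 24, 7),
    "MNB": (0, 2539, 0, 6),
    "NBC": (0, 2633, 0, 8),
    "WAP": (60247, 15070, 76, 17),
    "ea": (2019, 0, 31, 0),
    "ed": (1094, 0, 25, 0),
}

# Source type per code, following the source list in this README.
# "ea" and "ed" are the Hub-4 codes for the ABC and CNN broadcasts.
GENRE = {
    "NYT": "newswire", "APW": "newswire",
    "CNN": "broadcast", "ABC": "broadcast", "PRI": "broadcast",
    "VOA": "broadcast", "MNB": "broadcast", "NBC": "broadcast",
    "ea": "broadcast", "ed": "broadcast",
    "WAP": "newspaper",
}


def totals_by_genre(counts, genre):
    """Sum the four count columns over all source codes of each genre."""
    totals = defaultdict(lambda: [0, 0, 0, 0])
    for code, row in counts.items():
        bucket = totals[genre[code]]
        for i, value in enumerate(row):
            bucket[i] += value
    return {name: tuple(values) for name, values in totals.items()}


if __name__ == "__main__":
    for name, (tw, dw, tf, df) in totals_by_genre(COUNTS, GENRE).items():
        print(f"{name:<10} words {tw + dw:>6}  files {tf + df:>3}")
```

Running the script prints one line per source type with the combined
train and devtest word and file counts.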

Annotations for the ACE-2 corpus were produced by the Linguistic Data
Consortium to support two research tasks: Entity Detection and Tracking
(EDT) and Relation Detection and Characterization (RDC).  The annotation
guidelines for these tasks are provided in the /docs directory of this
release.

For more information about ACE annotation and ongoing ACE corpus
development, including annotation guidelines, task definitions, annotation
tools and other project documentation, please visit LDC's ACE Project page
at

   http://www.ldc.upenn.edu/Projects/ACE/

For information regarding the ACE program and ACE technology evaluations
administered by the National Institute of Standards and Technology (NIST),
visit

   http://www.nist.gov/iaui/894.01/tests/ace/index.htm

Please contact Stephanie Strassel (strassel@ldc.upenn.edu) or Alexis
Mitchell (alexis.mitchell@ldc.upenn.edu) with any questions regarding this
corpus.

Authors: Alexis Mitchell, Stephanie Strassel, Mark Przybocki, JK Davis,
George Doddington, Ralph Grishman, Adam Meyers, Ada Brunstein, Lisa Ferro,
Beth Sundheim 

------------------------------------------------------------
Copyright Information

(c) 1998 Los Angeles Times-Washington Post News Service, Inc.
(c) 1998 American Broadcasting Corporation
(c) 1998 Cable News Network, Inc.
(c) 1998 Press Association, Inc.
(c) 1998 New York Times
(c) 1998 National Broadcasting Company, Inc.
(c) 1998 Public Radio International

"The World" is a co-production of Public Radio International and the
British Broadcasting Corporation and is produced at WGBH Boston.