README FILE FOR THE AQUAINT-2 Information-Retrieval Text Research Collection
The AQUAINT-2 Information-Retrieval Text Research Collection consists of
English newswire data drawn from six distinct sources, listed below by
file-name designation and full name:
afp_eng : Agence France Presse
apw_eng : Associated Press Worldstream Service
cna_eng : Central News Agency (Taiwan) English Service
ltw_eng : Los Angeles Times - Washington Post Newswire Service
nyt_eng : New York Times Newswire Service
xin_eng : Xinhua News Agency (Beijing) English Service
For each source, all the usable data collected by the Linguistic Data
Consortium between October 1, 2004 and March 31, 2006 (a period of 18
months) has been processed into a consistent XML format, in which all
the stories for a given month are concatenated in chronological order
into a single "DOCSTREAM" element; each story is a single "DOC"
element within that stream, and has a globally unique "id" attribute.
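As a rough illustration of the layout just described, the following
Python sketch walks a DOCSTREAM one DOC at a time with the standard
library's incremental XML parser. The sample stream, the ids in it, and
the helper name iter_docs are made up for illustration. The sketch makes
no claim about the character encoding or DTD handling of the real files.

```python
import io
import xml.etree.ElementTree as ET


def iter_docs(stream):
    """Yield (id, type) for each DOC element found in a DOCSTREAM."""
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if elem.tag == "DOC":
            yield elem.get("id"), elem.get("type")
            # Drop the story's content once the caller has consumed it.
            elem.clear()


# Hypothetical two-story stream standing in for one month of data.
sample = io.BytesIO(
    b'<DOCSTREAM>'
    b'<DOC id="AFP_ENG_20041001.0001" type="story">'
    b'<TEXT><P>First story.</P></TEXT></DOC>'
    b'<DOC id="AFP_ENG_20041001.0002" type="other">'
    b'<TEXT>Scores table.</TEXT></DOC>'
    b'</DOCSTREAM>'
)

docs = list(iter_docs(sample))
print(docs)
```

In practice the stream argument would be an open file for one monthly
data file. Because stories are processed as they are parsed, the whole
month does not need to be held in memory at once.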
The data files in AQUAINT-2 are actually copied from another LDC
corpus: Gigaword English Release 3 (LDC2007T07). The only differences
between these two releases of the data are:
- the AQUAINT-2 files are published in uncompressed form
- each data directory in AQUAINT-2 includes a DTD file
(a2_newswire_xml.dtd)
- each AQUAINT-2 file has "<DOCSTREAM>" and "</DOCSTREAM>" tags that
bracket the full file content, along with an XML header at the
beginning of the file
Compared to the first AQUAINT text corpus (LDC2002T31), the markup
structure in AQUAINT-2 is simpler and fully consistent across all news
sources. Each DOC element in the DOCSTREAM is structured as follows:
- The HEADLINE element is optional; some DOCs lack it.
- The DATELINE element is optional; some DOCs lack it. The HEADLINE
and DATELINE elements, if present, may have multi-line and
relatively complicated content.
- The TEXT element holds the story body. Paragraphs are line-wrapped.
In some DOC units, the TEXT element does not contain any "P"
(paragraph) tags; see the discussion of the DOC 'type' attribute
below.
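The optional elements and the possible absence of "P" tags can be
handled along the following lines. This is a minimal Python sketch, not
an official reader: the function name doc_fields, the helper _clean, and
the sample DOC (including its id and text) are invented for
illustration, and the nesting of elements is assumed from the
description above.

```python
import xml.etree.ElementTree as ET


def _clean(elem):
    """Join an element's text and collapse line wrapping; None if absent."""
    if elem is None:
        return None
    return " ".join("".join(elem.itertext()).split())


def doc_fields(doc):
    """Return headline, dateline and paragraphs for one DOC element."""
    text = doc.find("TEXT")
    paragraphs = []
    if text is not None:
        p_elems = text.findall("P")
        if p_elems:
            paragraphs = [_clean(p) for p in p_elems]
        else:
            # No P tags: keep the whole TEXT content as a single block.
            body = _clean(text)
            paragraphs = [body] if body else []
    return {
        "headline": _clean(doc.find("HEADLINE")),  # None when missing
        "dateline": _clean(doc.find("DATELINE")),  # None when missing
        "paragraphs": paragraphs,
    }


# Hypothetical DOC with a headline, no dateline, and two paragraphs.
sample = ET.fromstring(
    '<DOC id="XIN_ENG_20050101.0001" type="story">\n'
    '<HEADLINE>\nSample headline\n</HEADLINE>\n'
    '<TEXT>\n<P>\nFirst line\nwrapped here.\n</P>\n'
    '<P>\nSecond paragraph.\n</P>\n</TEXT>\n'
    '</DOC>'
)

fields = doc_fields(sample)
print(fields)
```

Collapsing whitespace discards the original line wrapping, which is
usually harmless for "story" text but would lose the layout of
tabular content, so a real reader might keep the raw text in that case.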
Regarding the two attributes of the DOC tag:
- 'id' provides the globally unique identifier for every DOC; the
initial 7 characters of the id string are simply an UPPER_CASE copy
of the directory path and the initial part of the file name; the
next 8 digits are the year, month and day on which the story was
originally published, and the last 4 digits (following the period)
represent a sequential numbering of stories on the given date.
(Note that there may be gaps in the numbering, but the numeric
order correlates to chronological order.)
- 'type' indicates what sort of content appears in the TEXT element
of the news story; there are four types:
-- "story": by far the most frequent type, these are typical news
stories on a given topic, presented as a sequence of paragraphs
(this is the only type of DOC to contain "P" tags within "TEXT")
-- "multi": similar to "story", in the sense of (usually) having
sentences that describe events, but each DOC presents an
overview of stories, which may or may not be topically related
to each other.
-- "advis": these are not really "news stories", even though they
may contain summaries or portions of news reports; these DOC's
are actually addressed to news editors who receive the wire
services, and as such tend to contain things not found in
typical news stories (word counts, odd abbreviations or terms
familiar only to editors, repetitive "boilerplate" content such
as contact information, etc.).
-- "other": these are "none of the above" -- they are intended for
the news reading audience (not just editors), they tend to be on
a single topic, but they do not contain paragraphs describing an
event. In general, these are tabulations such as sports scores,
stock prices, weather conditions, etc.
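The two DOC attributes described above can be unpacked as in the
following Python sketch. It assumes that an underscore separates the
7-character source prefix from the date (the description above does not
state the separator, so the pattern treats it as optional). The names
ID_PATTERN, parse_doc_id and expects_paragraph_tags, and the sample id,
are hypothetical.

```python
import re
from datetime import date

# The four DOC types listed above.
DOC_TYPES = {"story", "multi", "advis", "other"}

# Assumed layout: 7-character source prefix, optional "_", YYYYMMDD,
# a period, then a 4-digit sequence number.
ID_PATTERN = re.compile(
    r"^([A-Z]{3}_[A-Z]{3})_?(\d{4})(\d{2})(\d{2})\.(\d{4})$"
)


def parse_doc_id(doc_id):
    """Split a DOC id into (source, publication date, sequence number)."""
    match = ID_PATTERN.match(doc_id)
    if match is None:
        raise ValueError(f"unrecognized DOC id: {doc_id!r}")
    source, year, month, day, seq = match.groups()
    return source.lower(), date(int(year), int(month), int(day)), int(seq)


def expects_paragraph_tags(doc_type):
    """Only "story" DOCs contain P tags inside TEXT."""
    if doc_type not in DOC_TYPES:
        raise ValueError(f"unknown DOC type: {doc_type!r}")
    return doc_type == "story"


parsed = parse_doc_id("NYT_ENG_20060331.0042")
print(parsed)
print(expects_paragraph_tags("advis"))
```

Since sequence numbers may have gaps, the parsed number is best used
only for ordering stories within a date, not for counting them.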
For more information about the sources, their various properties, and
additional details about the corpus data, please refer to the online
documentation for the Gigaword English Release 3 corpus:
http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07
COPYRIGHT:
Portions (c) 2004-2006 Agence France Presse, The Associated Press,
Central News Agency (Taiwan), Los Angeles Times-Washington Post News
Service, Inc., New York Times, Xinhua News Agency, (c) 2007 Trustees
of the University of Pennsylvania
David Graff
LDC
May 21, 2007