README FILE FOR THE AQUAINT-2 Information-Retrieval Text Research Collection

The AQUAINT-2 Information-Retrieval Text Research Collection consists
of English newswire data drawn from six distinct sources, listed below
by file name designation and full name:

  afp_eng : Agence France Presse
  apw_eng : Associated Press Worldstream Service
  cna_eng : Central News Agency (Taiwan) English Service
  ltw_eng : Los Angeles Times - Washington Post Newswire Service
  nyt_eng : New York Times Newswire Service
  xin_eng : Xinhua News Agency (Beijing) English Service

For each source, all the usable data collected by the Linguistic Data
Consortium between October 1, 2004 and March 31, 2006 (a period of 18
months) has been processed into a consistent XML format, in which all
the stories for a given month are concatenated in chronological order
into a single "DOCSTREAM" element; each story is a single "DOC"
element within that stream, and has a globally unique "id" attribute.
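
As a quick check on the 18-month figure, the collection period can be
enumerated month by month.  This is only a sketch: the function name
collection_months and the "yyyymm" labels are illustrative choices, not
a statement about how the data files are actually named, and a given
source may not have usable data for every month.

```python
from datetime import date

def collection_months(start=date(2004, 10, 1), end=date(2006, 3, 31)):
    """Return a 'yyyymm' label for every calendar month from start to end."""
    months = []
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        months.append(f"{year:04d}{month:02d}")
        # Advance one calendar month, rolling over after December.
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)
    return months

months = collection_months()
print(len(months), months[0], months[-1])  # 18 200410 200603
```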

The data files in AQUAINT-2 are actually copied from another LDC
corpus: Gigaword English Release 3 (LDC2007T07).  The only differences
between these two releases of the data are:

 - the AQUAINT-2 files are published in uncompressed form
 - each data directory in AQUAINT-2 includes a DTD file
    (a2_newswire_xml.dtd)
 - each AQUAINT-2 file has "<DOCSTREAM>" and "</DOCSTREAM>" tags that
    bracket the full file content, along with the following XML
    header at the beginning:

  <?xml version="1.0" encoding="UTF-8"?>
  <!DOCTYPE DOCSTREAM SYSTEM 'a2_newswire_xml.dtd'>
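
Because each file is a single well-formed DOCSTREAM document, it can be
read with an ordinary XML parser.  The following is a minimal sketch
using Python's standard xml.etree.ElementTree module, which parses the
DOCTYPE line without loading the DTD file.  The function name iter_docs
and the SAMPLE text are invented for illustration; real files are much
larger, which is why the sketch streams the DOC elements rather than
loading the whole month at once.

```python
import io
import xml.etree.ElementTree as ET

# Invented two-story sample in the documented layout (not corpus data).
SAMPLE = b"""<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE DOCSTREAM SYSTEM 'a2_newswire_xml.dtd'>
<DOCSTREAM>
<DOC id="AFP_ENG_20041001.0001" type="story">
<TEXT>
<P>
Invented sample paragraph.
</P>
</TEXT>
</DOC>
<DOC id="AFP_ENG_20041001.0002" type="other">
<TEXT>
Invented sample table.
</TEXT>
</DOC>
</DOCSTREAM>
"""

def iter_docs(source):
    """Yield each DOC element from a DOCSTREAM (file path or binary file object)."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "DOC":
            yield elem
            # Once the caller asks for the next DOC, empty this one so
            # that a full month of stories is not held in memory.
            elem.clear()

for doc in iter_docs(io.BytesIO(SAMPLE)):
    print(doc.get("id"), doc.get("type"))
```

Note that each DOC is emptied as soon as the next one is requested, so
anything needed from it should be extracted inside the loop rather than
by collecting the elements into a list first.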

Compared to the first AQUAINT text corpus (LDC2002T31), the markup
structure in AQUAINT-2 is simpler and fully consistent across all news
sources.  Each DOC element in the DOCSTREAM is rendered as follows:

 <DOC id="SRC_ENG_yyyymmdd.indx" type="string">
 <HEADLINE>
 The HEADLINE element is optional -- some DOC's lack this element
 </HEADLINE>
 <DATELINE>
 The DATELINE element is optional -- some DOC's lack this element;
 the HEADLINE and DATELINE elements, if present, may have
 multi-line and relatively complicated content
 </DATELINE>
 <TEXT>
 <P>
 Paragraphs are line-wrapped.  In some DOC units, the TEXT element
 does not contain any "P" (paragraph) tags.  See the discussion of
 the DOC 'type' attribute below.
 </P>
 </TEXT>
 </DOC>
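
The layout above can be turned into a simple record as sketched below.
This is one possible approach, not a prescribed one: the names
doc_fields and element_text are hypothetical, the sample DOC is
invented, and the decision to re-join wrapped lines inside "P" elements
while preserving line breaks in TEXT elements without "P" tags is an
assumption about what a typical user would want.

```python
import xml.etree.ElementTree as ET

def element_text(parent, tag):
    """Return the whitespace-normalized text of a child element, or None if absent."""
    elem = parent.find(tag)
    if elem is None:
        return None
    # Undo line wrapping: collapse every run of whitespace to one space.
    return " ".join("".join(elem.itertext()).split())

def doc_fields(doc):
    """Collect the id, type, optional HEADLINE/DATELINE and text of one DOC."""
    paragraphs = []
    text_elem = doc.find("TEXT")
    if text_elem is not None:
        paragraphs = [" ".join("".join(p.itertext()).split())
                      for p in text_elem.findall("P")]
        if not paragraphs:
            # No "P" tags: keep the content as one block and preserve
            # its line breaks, since it may be tabular.
            body = "".join(text_elem.itertext()).strip()
            paragraphs = [body] if body else []
    return {
        "id": doc.get("id"),
        "type": doc.get("type"),
        "headline": element_text(doc, "HEADLINE"),
        "dateline": element_text(doc, "DATELINE"),
        "paragraphs": paragraphs,
    }

# Invented sample DOC with a HEADLINE but no DATELINE.
doc = ET.fromstring("""<DOC id="XIN_ENG_20050101.0001" type="story">
<HEADLINE>
Invented sample
headline
</HEADLINE>
<TEXT>
<P>
First invented paragraph,
wrapped over two lines.
</P>
<P>
Second invented paragraph.
</P>
</TEXT>
</DOC>""")
print(doc_fields(doc))
```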

Regarding the two attributes of the DOC tag:

 - 'id' provides the globally unique identifier for every DOC; the
   initial 7 characters of the id string (e.g. "AFP_ENG") are simply
   an UPPER_CASE copy of the directory path and the initial part of
   the file name; after an underscore, the next 8 digits are the
   year, month and day when the story was originally published, and
   the last 4 digits (following the period) represent a sequential
   numbering of stories on the given date.  (Note that there may be
   gaps in the numbering, but the numeric order correlates with
   chronological order.)

 - 'type' indicates what sort of content appears in the TEXT element
   of the news story; there are four types:

   -- "story": by far the most frequent type, these are typical news
      stories on a given topic, presented as a sequence of paragraphs
      (this is the only type of DOC to contain "P" tags within "TEXT")

   -- "multi": similar to "story", in that the text (usually) consists
      of sentences that describe events, but each DOC presents an
      overview of several stories, which may or may not be topically
      related to each other.

   -- "advis": these are not really "news stories", even though they
      may contain summaries or portions of news reports; these DOC's
      are actually addressed to news editors who receive the wire
      services, and as such tend to contain things not found in
      typical news stories (word counts, odd abbreviations or terms
      familiar only to editors, repetitive "boilerplate" content such
      as contact information, etc.).

   -- "other": these are "none of the above" -- they are intended for
      the news reading audience (not just editors), they tend to be on
      a single topic, but they do not contain paragraphs describing an
      event.  In general, these are tabulations such as sports scores,
      stock prices, weather conditions, etc.
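
The two attributes described above can be decoded as sketched below.
The names ID_PATTERN, DOC_TYPES, parse_doc_id and has_paragraph_tags
are hypothetical, and the regular expression assumes that every id
follows the "SRC_ENG_yyyymmdd.indx" template exactly (three upper-case
letters, "_ENG", an underscore, 8 digits, a period, 4 digits).

```python
import re
from datetime import date

# The four documented values of the DOC 'type' attribute.
DOC_TYPES = {"story", "multi", "advis", "other"}

# Assumed shape of the 'id' attribute: SRC_ENG_yyyymmdd.indx
ID_PATTERN = re.compile(r"([A-Z]{3}_ENG)_(\d{4})(\d{2})(\d{2})\.(\d{4})")

def parse_doc_id(doc_id):
    """Split a DOC id into (source, publication date, sequence number)."""
    match = ID_PATTERN.fullmatch(doc_id)
    if match is None:
        raise ValueError(f"unexpected DOC id: {doc_id!r}")
    source, year, month, day, index = match.groups()
    # date() also rejects impossible dates such as month 13.
    return source, date(int(year), int(month), int(day)), int(index)

def has_paragraph_tags(doc_type):
    """Return True only for "story", the one type whose TEXT contains "P" tags."""
    if doc_type not in DOC_TYPES:
        raise ValueError(f"unknown DOC type: {doc_type!r}")
    return doc_type == "story"

print(parse_doc_id("NYT_ENG_20060331.0042"))
# ('NYT_ENG', datetime.date(2006, 3, 31), 42)
print(has_paragraph_tags("advis"))  # False
```

Since the date and sequence fields are fixed-width, a plain string sort
of the ids from one source gives the same order as sorting by the
parsed date and sequence number.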

For more information about the sources, their various properties, and
additional details about the corpus data, please refer to the online
documentation for the Gigaword English Release 3 corpus:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07


COPYRIGHT:

Portions (c) 2004-2006 Agence France Presse, The Associated Press,
Central News Agency (Taiwan), Los Angeles Times-Washington Post News
Service, Inc., New York Times, Xinhua News Agency, (c) 2007 Trustees
of the University of Pennsylvania


David Graff
LDC
May 21, 2007