README FILE FOR THE AQUAINT-2 Information-Retrieval Text Research Collection

The AQUAINT-2 Information-Retrieval Text Research Collection consists of newswire data in English drawn from six distinct sources, listed below by their file name designations and full names:

  afp_eng : Agence France Presse
  apw_eng : Associated Press Worldstream Service
  cna_eng : Central News Agency (Taiwan) English Service
  ltw_eng : Los Angeles Times - Washington Post Newswire Service
  nyt_eng : New York Times Newswire Service
  xin_eng : Xinhua News Agency (Beijing) English Service

For each source, all the usable data collected by the Linguistic Data Consortium between October 1, 2004 and March 31, 2006 (a period of 18 months) has been processed into a consistent XML format. All the stories for a given month are concatenated in chronological order into a single "DOCSTREAM" element; each story is a single "DOC" element within that stream and has a globally unique "id" attribute.

The data files in AQUAINT-2 are copied from another LDC corpus: Gigaword English Release 3 (LDC2007T07). The only differences between these two releases of the data are:

 - the AQUAINT-2 files are published in uncompressed form
 - each data directory in AQUAINT-2 includes a DTD file (a2_newswire_xml.dtd)
 - each AQUAINT-2 file has opening and closing "DOCSTREAM" tags that bracket the full file content, along with an XML header at the beginning

Compared to the first AQUAINT text corpus (LDC2002T31), the markup structure in AQUAINT-2 is simpler and fully consistent across all news sources. Each DOC element in the DOCSTREAM has "id" and "type" attributes and contains a TEXT element, optionally accompanied by HEADLINE and DATELINE elements:

 - The HEADLINE element is optional -- some DOCs lack this element.
 - The DATELINE element is optional -- some DOCs lack this element.
 - The HEADLINE and DATELINE elements, if present, may have multi-line and relatively complicated content.

Paragraphs are line-wrapped. In some DOC units, the TEXT element does not contain any "P" (paragraph) tags. See the discussion of the DOC 'type' attribute below.
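
As an illustration of the DOC structure described above, the following is a minimal Python sketch that reads DOC elements and copes with the optional HEADLINE and DATELINE elements, line-wrapped paragraphs, and TEXT elements that have no "P" tags. It is not part of the corpus distribution. The sample data, its ids, and the function name doc_text are hypothetical, and only the standard xml.etree.ElementTree module is used.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data; the ids and text are made up for illustration.
SAMPLE = """<DOCSTREAM>
<DOC id="XXX_ENG_20050101.0001" type="story">
<HEADLINE>
Example headline
</HEADLINE>
<TEXT>
<P>
First paragraph, wrapped
over two lines.
</P>
<P>
Second paragraph.
</P>
</TEXT>
</DOC>
<DOC id="XXX_ENG_20050101.0002" type="other">
<TEXT>
Team A 3
Team B 1
</TEXT>
</DOC>
</DOCSTREAM>"""


def _collapse(value):
    """Collapse whitespace runs (including line wraps); pass None through."""
    return None if value is None else " ".join(value.split())


def doc_text(doc):
    """Return (headline, dateline, body) for one DOC element.

    headline and dateline are None when the element is absent.
    body is a list of strings: one per "P" element, or a single
    string holding the raw TEXT content when there are no "P" tags.
    """
    headline = _collapse(doc.findtext("HEADLINE"))
    dateline = _collapse(doc.findtext("DATELINE"))
    text = doc.find("TEXT")
    paragraphs = text.findall("P") if text is not None else []
    if paragraphs:
        # Undo the line wrapping inside each paragraph.
        body = [_collapse(p.text or "") for p in paragraphs]
    else:
        # No "P" tags: keep the line breaks, since the content
        # may be a tabulation rather than running prose.
        raw = (text.text or "").strip() if text is not None else ""
        body = [raw] if raw else []
    return headline, dateline, body


root = ET.fromstring(SAMPLE)
for doc in root.findall("DOC"):
    print(doc.get("id"), doc.get("type"), doc_text(doc))
```

For a full month of stories it may be preferable to stream the file (for example with ET.iterparse) instead of loading it whole; that choice is left to the reader.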

Regarding the two attributes of the DOC tag:

 - 'id' provides the globally unique identifier for every DOC; the initial 7 characters of the id string are simply an UPPER_CASE copy of the directory path and the initial part of the file name; the next 8 digits are the year, month and day when the story was originally published, and the last 4 digits (following the period) represent a sequential numbering of stories on the given date. (Note that there may be gaps in the numbering, but the numeric order correlates to chronological order.)

 - 'type' indicates what sort of content appears in the TEXT element of the news story; there are four types:

   -- "story": by far the most frequent type, these are typical news stories on a given topic, presented as a sequence of paragraphs (this is the only type of DOC to contain "P" tags within "TEXT")

   -- "multi": similar to "story", in the sense of (usually) having sentences that describe events, but each DOC presents an overview of stories, which may or may not be topically related to each other.

   -- "advis": these are not really "news stories", even though they may contain summaries or portions of news reports; these DOCs are actually addressed to news editors who receive the wire services, and as such tend to contain things not found in typical news stories (word counts, odd abbreviations or terms familiar only to editors, repetitive "boilerplate" content such as contact information, etc).

   -- "other": these are "none of the above" -- they are intended for the news reading audience (not just editors), they tend to be on a single topic, but they do not contain paragraphs describing an event. In general, these are tabulations such as sports scores, stock prices, weather conditions, etc.

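
The 'id' layout and the 'type' values described above can be sketched in Python as follows. This is only an illustrative sketch: the names ID_RE, DOC_TYPES, parse_doc_id and expects_paragraph_tags are hypothetical, and the pattern assumes that an optional underscore may separate the 7-character prefix from the date digits, which the description above does not state. The example id in the usage lines is likewise made up to fit that assumption.

```python
import re
from datetime import date

# Assumption: an optional "_" may sit between the 7-character prefix
# and the 8 date digits; the README text does not mention a separator.
ID_RE = re.compile(
    r"^(?P<source>[A-Z]{3}_ENG)_?(?P<ymd>\d{8})\.(?P<seq>\d{4})$"
)

# The four values of the 'type' attribute listed in this README.
DOC_TYPES = {"story", "multi", "advis", "other"}


def parse_doc_id(doc_id):
    """Split a DOC id into (source, publication date, sequence number).

    The source is returned in lower case so that it matches the
    file name designations (afp_eng, apw_eng, ...).
    """
    m = ID_RE.match(doc_id)
    if m is None:
        raise ValueError(f"unrecognized DOC id: {doc_id!r}")
    ymd = m.group("ymd")
    published = date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))
    return m.group("source").lower(), published, int(m.group("seq"))


def expects_paragraph_tags(doc_type):
    """Only "story" DOCs contain "P" tags within TEXT."""
    if doc_type not in DOC_TYPES:
        raise ValueError(f"unknown DOC type: {doc_type!r}")
    return doc_type == "story"


# Usage with a made-up id that follows the assumed layout.
source, published, seq = parse_doc_id("AFP_ENG_20041001.0001")
print(source, published.isoformat(), seq)
print(expects_paragraph_tags("other"))
```

Because the numeric order of the last 4 digits correlates with chronological order, sorting DOCs from one source by the (date, sequence number) pair returned here should reproduce the order in which they appear in the DOCSTREAM.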
For more information about the sources, their various properties, and additional details about the corpus data, please refer to the online documentation for the Gigaword English Release 3 corpus:

  http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2007T07

COPYRIGHT:

Portions (c) 2004-2006 Agence France Presse, The Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post News Service, Inc., New York Times, Xinhua News Agency, (c) 2007 Trustees of the University of Pennsylvania

David Graff
LDC
May 21, 2007