AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for NIST's (National Institute for Standards and Technology) AQUAINT 2007 Question-Answer (QA) track. It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31). The AQUAINT (Advanced Question-Answering for Intelligence) program addresses interactivity with scenarios or tasks. The scenario provides a context in which questions will be asked and answered, and the task reflects the overall assignment. The program is committed to solve a single problem: how to find topically relevant, semantically related, timely information in massive amounts of data in diverse languages, formats, and genres.
AQUAINT technology is advancing the development of components and functions that allows users to pose a series of intertwined, complex questions and obtain comprehensive answers in the context of broad information-gathering tasks. In addition, while most information retrieval systems present only links to documents, AQUAINT is producing technology that will present answers to the user's questions. This question-answering technology is being developed with features for managing semantic similarity, co-reference, event characterization, opinions, linguistic and social and world inferencing, redundancy, deception, and missing or contradictory information. In order to allow the analyst to guide the exploration in concert with the machine, AQUAINT technology employs interactive question-answering, the automatic suggestion of additional paths of exploration, and the inferencing of the social context of the information search.
Data AQUAINT-2 Information-Retrieval Text Research Collection is a subset of LDC's English Gigaword Third Edition (LDC2007T07). The collection comprises approximately 2.5 GB of text (about 907K documents) spanning the time period October 2004 - March 2006. For each source, all of the usable data collected by LDC was processed into a consistent XML format in which the stories for a given month are concatenated in chronological order into a single "DOCSTREAM" element; each story is a single "DOC" element within that stream and has a globally unique "id" attribute. The collection consists of newswire data in English drawn from six distinct sources, listed below in terms of their file name designations and full names:
| afp_eng || Agence France Presse |
| apw_eng || Associated Press |
| cna_eng || Central News Agency (Taiwan) English Service |
| ltw_eng || Los Angeles Times - Washington Post News Service |
| nyt_eng || New York Times |
| xin_eng || Xinhua News Agency (Beijing) English Service |
For an example of the data in this publication, please examine this image of the XML.
Portions © 2004-2006 Agence France Presse, The Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post News Service, Inc., New York Times, Xinhua News Agency, © 2004-2006, 2007, 2008 Trustees of the University of Pennsylvania