AQUAINT-2 Information-Retrieval Text Research Collection

Item Name: AQUAINT-2 Information-Retrieval Text Research Collection
Author(s): Ellen Vorhees, David Graff
LDC Catalog No.: LDC2008T25
ISBN: 1-58563-494-8
ISLRN: 541-800-982-573-5
DOI: https://doi.org/10.35111/af48-z779
Release Date: December 19, 2008
Member Year(s): 2008
DCMI Type(s): Text
Data Source(s): newswire
Project(s): AQUAINT
Application(s): information retrieval
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2008T25 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Vorhees, Ellen, and David Graff. AQUAINT-2 Information-Retrieval Text Research Collection LDC2008T25. Web Download. Philadelphia: Linguistic Data Consortium, 2008.
Related Works: View

Introduction:

AQUAINT-2 Information-Retrieval Text Research Collection, Linguistic Data Consortium (LDC) catalog number LDC2008T25 and ISBN 1-58563-494-8, was developed by LDC for NIST's (National Institute for Standards and Technology) AQUAINT 2007 Question-Answer (QA) track. It consists of approximately 2.5 GB of English news text from six distinct sources collected by LDC (Agence France Presse, Associated Press, Central News Agency (Taiwan), Los Angeles Times-Washington Post, New York Times and Xinhua News Agency) covering the period from October 2004 through March 2006. The AQUAINT-2 collection is the second part of a series intended to provide data useful for developing, evaluating and testing information extraction and retrieval systems. It follows the publication of The AQUAINT Corpus of English News Text (LDC2002T31).

The AQUAINT (Advanced Question-Answering for Intelligence)  program addresses interactivity with scenarios or tasks. The scenario provides a context in which questions will be asked and answered, and the task reflects the overall assignment. The program is committed to solve a single problem: how to find topically relevant, semantically related, timely information in massive amounts of data in diverse languages, formats, and genres.

AQUAINT technology is advancing the development of components and functions that allows users to pose a series of intertwined, complex questions and obtain comprehensive answers in the context of broad information-gathering tasks. In addition, while most information retrieval systems present only links to documents, AQUAINT is producing technology that will present answers to the user's questions. This question-answering technology is being developed with features for managing semantic similarity, co-reference, event characterization, opinions, linguistic and social and world inferencing, redundancy, deception, and missing or contradictory information. In order to allow the analyst to guide the exploration in concert with the machine, AQUAINT technology employs interactive question-answering, the automatic suggestion of additional paths of exploration, and the inferencing of the social context of the information search.

Data

AQUAINT-2 Information-Retrieval Text Research Collection is a subset of LDC's English Gigaword Third Edition (LDC2007T07). The collection comprises approximately 2.5 GB of text (about 907K documents) spanning the time period October 2004 - March 2006.  For each source, all of the usable data collected by LDC was processed into a consistent XML format in which the stories for a given month are concatenated in chronological order into a single "DOCSTREAM" element; each story is a single "DOC" element within that stream and has a globally unique "id" attribute. The collection consists of newswire data in English drawn from six distinct sources, listed below in terms of their file name designations and full names:
afp_eng Agence France Presse
apw_eng Associated Press
cna_eng Central News Agency (Taiwan) English Service
ltw_eng Los Angeles Times - Washington Post News Service
nyt_eng New York Times
xin_eng Xinhua News Agency (Beijing) English Service

Samples

For an example of the data in this publication, please examine this image of the XML.

Available Media

View Fees





Login for the applicable fee