TIPSTER Volume 2
Item Name: | TIPSTER Volume 2 |
Author(s): | Donna Harman, Mark Liberman |
LDC Catalog No.: | LDC93T3C |
ISBN: | 1-58563-022-5 |
ISLRN: | 532-662-320-210-9 |
DOI: | https://doi.org/10.35111/yr79-6v49 |
Member Year(s): | 1993 |
DCMI Type(s): | Text |
Data Source(s): | newswire, varied |
Project(s): | MUC, Tipster, TREC |
Application(s): | information retrieval, language modeling |
Language(s): | English |
Language ID(s): | eng |
License(s): | Tipster Volume 2 Agreement Individual; Tipster Volume 2 Agreement Organization |
Online Documentation: | LDC93T3C Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Harman, Donna, and Mark Liberman. TIPSTER Volume 2 LDC93T3C. Web Download. Philadelphia: Linguistic Data Consortium, 1993. |
Introduction
TIPSTER is also sometimes called the Text Research Collection Volume, or TREC. TIPSTER Volume 2 contains disk 2; TIPSTER Complete (LDC93T3A) contains all disks.
The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.
The detection data consists of a test collection built at NIST for the TIPSTER project and the related TREC project. The TREC project involves many other information retrieval research groups, which work on the same task as the TIPSTER groups but meet once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by NIST.
Data
The documents in the test collection vary in style, size and subject domain. The second disk contains material from the Wall Street Journal (1990, 1991, 1992), the AP Newswire (1988), the Federal Register (1988), information from Computer Select disks (Ziff-Davis Publishing) and short abstracts from the Department of Energy. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging or breakdown into individual sentences or paragraphs, as the purpose of this collection is to test retrieval against real-world data.
Source (vol) | Year | Approx. # Words (Millions) |
Associated Press (2) | 1988 | 37 |
Wall Street Journal (2) | 1990 | 11 |
Wall Street Journal (2) | 1991 | 22 |
Wall Street Journal (2) | 1992 | 5 |
Federal Register (2) | 1988 | 30 |
Ziff/Davis (2) | 1989-90 | 26 |
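As a rough illustration of the "SGML-like tags separating documents and document fields" described above, the following Python sketch splits a text stream into documents and collects their top-level fields. It assumes the `<DOC>` and `<DOCNO>` tag names conventionally used in TREC-style collections; the other field names, the document IDs and the sample text are hypothetical, and the actual tags differ by source, so the LDC93T3C documentation should be checked before relying on it.

```python
import re

# Assumption: each document is wrapped in <DOC>...</DOC>, as is conventional
# for TREC-style collections. Field names inside a document vary by source.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)

# Matches a top-level field such as <DOCNO>...</DOCNO> or <TEXT>...</TEXT>.
FIELD_RE = re.compile(r"<([A-Z][A-Z0-9]*)>(.*?)</\1>", re.DOTALL)


def split_documents(raw):
    """Yield one {tag: text} dict per <DOC>...</DOC> block in raw."""
    for match in DOC_RE.finditer(raw):
        fields = {}
        for tag, body in FIELD_RE.findall(match.group(1)):
            # Keep the first occurrence of each field, whitespace-trimmed.
            fields.setdefault(tag, body.strip())
        yield fields


# Hypothetical sample input; IDs and text are invented for illustration.
SAMPLE = """<DOC>
<DOCNO> EX-0001 </DOCNO>
<HL> Hypothetical headline </HL>
<TEXT>
First hypothetical document body.
</TEXT>
</DOC>
<DOC>
<DOCNO> EX-0002 </DOCNO>
<TEXT>
Second hypothetical document body.
</TEXT>
</DOC>
"""

docs = list(split_documents(SAMPLE))
for doc in docs:
    print(doc["DOCNO"], "->", len(doc["TEXT"].split()), "words")
```

A regular expression is used here rather than a full SGML parser because the markup is described only as SGML-like; real files may contain unescaped characters or nested tags that need more careful handling.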
Updates
The three TIPSTER disks released so far have been reissued with updates and corrections, and all recipients of the earlier versions should have received these replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation.