
TIPSTER Complete

Item Name:	TIPSTER Complete
Author(s):	Donna Harman, Mark Liberman
LDC Catalog No.:	LDC93T3A
ISBN:	1-58563-020-9
ISLRN:	741-001-210-040-2
DOI:	https://doi.org/10.35111/bhec-t442
Member Year(s):	1993
DCMI Type(s):	Text
Data Source(s):	newswire, varied
Project(s):	TREC, Tipster, TIDES, MUC, GALE
Application(s):	language modeling, information retrieval
Language(s):	English
Language ID(s):	eng
License(s):	Tipster Volume 1 Agreement Individual; Tipster Volume 1 Agreement Organization; Tipster Volume 2 Agreement Individual; Tipster Volume 2 Agreement Organization; Tipster Volume 3 Agreement Individual; Tipster Volume 3 Agreement Organization
Online Documentation:	LDC93T3A Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Harman, Donna, and Mark Liberman. TIPSTER Complete LDC93T3A. Web Download. Philadelphia: Linguistic Data Consortium, 1993.
Related Works:
hasPart:	LDC93T3B TIPSTER Volume 1; LDC93T3C TIPSTER Volume 2; LDC93T3D TIPSTER Volume 3
hasAnnotation:	LDC95T7 Treebank-2; LDC99T42 Treebank-3; LDC2005T08 Discourse Graphbank
hasOutcome:	LDC95T6 CSR-III Text; LDC99L23 American English Spoken Lexicon
relatesTo:	LDC95T21 North American News Text Corpus; LDC95T9 Spanish News Text; LDC98T30 North American News Text Supplement; LDC2001T55 Arabic Newswire Part 1; LDC2001T58 TDT3 Multilanguage Text Version 2.0

LDC93T3A - Complete TIPSTER corpus

LDC93T3B - Volume 1 of the TIPSTER corpus

LDC93T3C - Volume 2 of the TIPSTER corpus

LDC93T3D - Volume 3 of the TIPSTER corpus

TIPSTER is sometimes called the Text Research Collection Volume, or TREC.

The TIPSTER project was sponsored by the Software and Intelligent Systems Technology Office of the Advanced Research Projects Agency (ARPA/SISTO) in an effort to significantly advance the state of the art in effective document detection (information retrieval) and data extraction from large, real-world data collections.

The detection data consists of a test collection built at NIST for the TIPSTER project and the related TREC project. Many other information retrieval research groups participate in TREC; they work on the same task as the TIPSTER groups but meet once a year in a workshop to compare results (similar to MUC). The test collection consists of three CD-ROMs of SGML-encoded documents distributed by LDC, plus queries and answers (relevant documents) distributed by NIST.

Source (vol)	Year	Approx. # Words (Millions)
Associated Press (1)	1989	40
Associated Press (2)	1988	37
Associated Press (3)	1990	37
Wall Street Journal (1)	1987	20
Wall Street Journal (1)	1988	17
Wall Street Journal (1)	1989	6
Wall Street Journal (2)	1990	11
Wall Street Journal (2)	1991	22
Wall Street Journal (2)	1992	5
Dept. of Energy (1)		28
Federal Register (1)	1989	38
Federal Register (2)	1988	30
Ziff/Davis (1)		36
Ziff/Davis (2)	1989-90	26
Ziff/Davis (3)	1991-92	50
San Jose Mercury News (3)	1991	45
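
The per-volume size of the collection can be tallied from the table above. The following is a minimal sketch that only adds up the listed figures; the names `ROWS`, `words_per_volume`, and `VOLUME_TOTALS` are illustrative and not part of any LDC tooling, and the totals cover only the rows shown in the table.

```python
from collections import defaultdict

# (source, volume, approx. millions of words), copied from the table above.
ROWS = [
    ("Associated Press", 1, 40),
    ("Associated Press", 2, 37),
    ("Associated Press", 3, 37),
    ("Wall Street Journal", 1, 20),
    ("Wall Street Journal", 1, 17),
    ("Wall Street Journal", 1, 6),
    ("Wall Street Journal", 2, 11),
    ("Wall Street Journal", 2, 22),
    ("Wall Street Journal", 2, 5),
    ("Dept. of Energy", 1, 28),
    ("Federal Register", 1, 38),
    ("Federal Register", 2, 30),
    ("Ziff/Davis", 1, 36),
    ("Ziff/Davis", 2, 26),
    ("Ziff/Davis", 3, 50),
    ("San Jose Mercury News", 3, 45),
]


def words_per_volume(rows):
    """Sum the approximate word counts (in millions) for each volume."""
    totals = defaultdict(int)
    for _source, volume, millions in rows:
        totals[volume] += millions
    return dict(totals)


VOLUME_TOTALS = words_per_volume(ROWS)
for volume in sorted(VOLUME_TOTALS):
    print(f"Volume {volume}: ~{VOLUME_TOTALS[volume]} million words")
print(f"All volumes: ~{sum(VOLUME_TOTALS.values())} million words")
```

Because the table figures are themselves approximate, the sums should be read as rough sizes only.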

The documents in the test collection vary in style, size, and subject domain. The first disk contains material from the Wall Street Journal (1986, 1987, 1988, 1989), the AP Newswire (1989), the Federal Register (1989), information from Computer Select disks (Ziff-Davis Publishing), and short abstracts from the Department of Energy. The second disk contains material from the same sources, but from different years. The third disk contains more material from the Computer Select disks, plus material from the San Jose Mercury News (1991), more AP Newswire (1990), and about 250 megabytes of formatted U.S. Patents. The format of all the documents is relatively clean and easy to use, with SGML-like tags separating documents and document fields. There is no part-of-speech tagging and no breakdown into individual sentences or paragraphs, because the purpose of this collection is to test retrieval against real-world data.
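
Because documents and fields are delimited by SGML-like tags, a file can be split with simple pattern matching. The sketch below is one possible approach, not LDC's or NIST's tooling. It assumes each document is wrapped in `<DOC>...</DOC>` and that fields use uppercase, non-nested tags such as `<DOCNO>` and `<TEXT>`; these tag names, the sample text, and the helper `parse_documents` are assumptions for illustration, so check the corpus documentation for the tags each source actually uses.

```python
import re

# Assumed layout: one <DOC>...</DOC> element per document, containing
# flat (non-nested) fields with uppercase tag names.
DOC_RE = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
FIELD_RE = re.compile(r"<([A-Z][A-Z0-9]*)>(.*?)</\1>", re.DOTALL)


def parse_documents(text):
    """Return one dict per document, mapping tag name to stripped content."""
    docs = []
    for match in DOC_RE.finditer(text):
        fields = {}
        for name, value in FIELD_RE.findall(match.group(1)):
            # Keep the first occurrence if a tag repeats within a document.
            fields.setdefault(name, value.strip())
        docs.append(fields)
    return docs


# Made-up sample text; it is not taken from the corpus.
SAMPLE = """<DOC>
<DOCNO> EXAMPLE-0001 </DOCNO>
<HEAD>Hypothetical headline</HEAD>
<TEXT>
First made-up document body.
</TEXT>
</DOC>
<DOC>
<DOCNO> EXAMPLE-0002 </DOCNO>
<TEXT>
Second made-up document body.
</TEXT>
</DOC>
"""

docs = parse_documents(SAMPLE)
for doc in docs:
    print(doc["DOCNO"], "->", doc["TEXT"])
```

A regular expression is used here instead of an XML parser because SGML-like text is not guaranteed to be well-formed XML (for example, it may contain unescaped `&` characters).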

The three TIPSTER discs have been re-issued with updates and corrections, and all recipients of the earlier versions should have received the replacements free of charge. If you think you have the unrevised original, contact LDC for confirmation.
