TDT Pilot Study Corpus

Item Name: TDT Pilot Study Corpus
Authors: James Allan, Yiming Yang, Jaime Carbonell, Jon Yamron, George Doddington, and Charles Wayne
LDC Catalog No.: LDC98T25
ISBN: 1-58563-140-X
Data Type: text
Data Source(s): broadcast news, newswire, transcribed speech
Project(s): EARS, GALE, TDT, TIDES
Application(s): topic detection and tracking
Language(s): English
Language ID(s): ENG
Distribution: Web Download
Member fee: $0 for 1998 members
Non-member Fee: US $500.00
Reduced-License Fee: US $250.00
Extra-Copy Fee: N/A
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: James Allan, et al.
TDT Pilot Study Corpus
Linguistic Data Consortium, Philadelphia


The TDT Pilot Study corpus was created to support an initiative in "topic detection and tracking." This initiative is directed toward computer processing of language data, both text and speech. Its objective is to explore techniques for detecting the appearance of new and unexpected topics and for tracking their reappearance and evolution.


The TDT corpus comprises a set of stories that includes both newswire (text) and broadcast news (speech). Each story is represented as a stream of text, in which the text is either taken directly from the newswire (Reuters) or is a manual transcription of the broadcast news speech (CNN). The corpus spans the period from July 1, 1994 to June 30, 1995. It contains approximately 16,000 stories, with about half taken from Reuters newswire and half from CNN broadcast news transcripts.

A key part of the corpus is its annotation in terms of the events discussed in the stories. Twenty-five events were defined that span a variety of event types and cover a subset of the events discussed in the corpus stories. Annotation data for these events are included in the corpus and provide a basis for training TDT systems.


There are no updates at this time.


Portions © 1994-1995 Cable News Network, LP, LLLP, © 1994-1995 Reuters America, Inc., © 1998 Trustees of the University of Pennsylvania