
TDT Pilot Study Corpus

Item Name:	TDT Pilot Study Corpus
Author(s):	James Allan, Yiming Yang, Jaime Carbonell, Jon Yamron, George R. Doddington, Charles Wayne
LDC Catalog No.:	LDC98T25
ISBN:	1-58563-140-X
ISLRN:	770-765-444-577-6
DOI:	https://doi.org/10.35111/sxw0-5q10
Member Year(s):	1998
DCMI Type(s):	Text
Data Source(s):	transcribed speech, newswire, broadcast news
Project(s):	TIDES, TDT, GALE, EARS
Application(s):	topic detection and tracking
Language(s):	English
Language ID(s):	eng
License(s):	TDT Pilot Study Agreement
Online Documentation:	LDC98T25 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Allan, James, et al. TDT Pilot Study Corpus LDC98T25. Web Download. Philadelphia: Linguistic Data Consortium, 1998.
Related Works:	LDC99S84 TDT2 English Audio; LDC2000S92 TDT2 Careful Transcription Audio; LDC2000T44 TDT2 Careful Transcription Text; LDC2001S93 TDT2 Mandarin Audio Corpus; LDC2001S94 TDT3 English Audio; LDC2001S95 TDT3 Mandarin Audio; LDC2001T57 TDT2 Multilanguage Text Version 4.0; LDC2001T58 TDT3 Multilanguage Text Version 2.0; LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus; LDC2005T16 TDT4 Multilingual Text and Annotations; LDC2006T19 TDT5 Topics and Annotations; LDC2006T18 TDT5 Multilingual Text

Introduction

The TDT Pilot Study corpus was created to support an initiative in "topic detection and tracking" (TDT). This initiative is directed toward computer processing of language data, both text and speech. Its objective is to explore techniques for detecting the appearance of new and unexpected topics and for tracking their reappearance and evolution.

Data

The TDT corpus comprises a set of stories that includes both newswire (text) and broadcast news (speech). Each story is represented as a stream of text, in which the text is either taken directly from the newswire (Reuters) or is a manual transcription of the broadcast news speech (CNN). The corpus spans the period from July 1, 1994 to June 30, 1995. It contains approximately 16,000 stories, with about half taken from Reuters newswire and half from CNN broadcast news transcripts.

A key part of the corpus is its annotation in terms of the events discussed in the stories. Twenty-five events were defined, spanning a variety of event types and covering a subset of the events discussed in the corpus stories. Annotation data for these events are included in the corpus and provide a basis for training TDT systems.

Updates

There are no updates at this time.
