
TDT2 Multilanguage Text Version 4.0

Item Name:	TDT2 Multilanguage Text Version 4.0
Author(s):	Mark Liberman, Jennifer Alabiso, David Graff, Christopher Cieri, Charles Wayne, George R. Doddington, Jonathan G. Fiscus
LDC Catalog No.:	LDC2001T57
ISBN:	1-58563-183-3
ISLRN:	662-457-089-041-7
DOI:	https://doi.org/10.35111/zfj3-tp72
Member Year(s):	2001
DCMI Type(s):	Text
Data Source(s):	broadcast news, newswire, transcribed speech
Project(s):	EARS, GALE, TDT, TIDES
Application(s):	topic detection and tracking
Language(s):	English, Mandarin Chinese
Language ID(s):	eng, cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2001T57 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Liberman, Mark, et al. TDT2 Multilanguage Text Version 4.0 LDC2001T57. Web Download. Philadelphia: Linguistic Data Consortium, 2001.
Related Works:
isAnnotationOf:	LDC2001S93 TDT2 Mandarin Audio Corpus
hasAnnotation:	LDC2003T11 ACE-2 Version 1.0
isContinuationOf:	LDC99S84 TDT2 English Audio
	LDC2000S92 TDT2 Careful Transcription Audio
	LDC2000T44 TDT2 Careful Transcription Text
isSimilarWith:	LDC98T25 TDT Pilot Study Corpus
	LDC2001S94 TDT3 English Audio
	LDC2001S95 TDT3 Mandarin Audio
	LDC2001T58 TDT3 Multilanguage Text Version 2.0
	LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
	LDC2005T16 TDT4 Multilingual Text and Annotations
	LDC2006T19 TDT5 Topics and Annotations
	LDC2006T18 TDT5 Multilingual Text

Introduction

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. The TDT2 corpus was created to support three TDT2 tasks: finding topically homogeneous sections (segmentation), detecting the occurrence of new events (detection), and tracking the reoccurrence of old or new events (tracking).

Data

TDT2 Multilanguage Text Corpus Version 4.0 contains news data collected daily from nine news sources in two languages (American English and Mandarin Chinese) over a period of six months (January - June 1998). Both manually created reference text and automatically generated text (ASR and/or machine translation) are provided for all broadcast and all Mandarin data.

This version has been prepared to complement the first general release of the TDT3 Multilanguage Text Corpus, providing new enhancements to make the data content more accessible to a broader research community. The news sources and approximate number of stories per source (in thousands) are as follows:

English sources (thousands of stories)

New York Times Newswire Service	11.8
Associated Press Worldstream Service	12.8
Cable News Network, Headline News	15.8
American Broadcasting Co., World News Tonight	2.1
Public Radio International, The World	2.9
Voice of America (news programs)	8.2
Total English stories:	53.6 thousand

Mandarin sources (thousands of stories)

Xinhua News Agency	11.3
Zaobao News Agency	5.2
Voice of America (news programs)	2.3
Total Mandarin stories:	18.8 thousand
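
As a quick consistency check, the per-source figures above can be summed and compared with the stated totals. This is a minimal sketch: the dictionary and function names are illustrative and are not part of the corpus or its documentation, and the combined figure is derived here rather than stated on this page.

```python
# Approximate story counts in thousands, copied from the source list above.
english = {
    "New York Times Newswire Service": 11.8,
    "Associated Press Worldstream Service": 12.8,
    "Cable News Network, Headline News": 15.8,
    "American Broadcasting Co., World News Tonight": 2.1,
    "Public Radio International, The World": 2.9,
    "Voice of America (news programs)": 8.2,
}
mandarin = {
    "Xinhua News Agency": 11.3,
    "Zaobao News Agency": 5.2,
    "Voice of America (news programs)": 2.3,
}


def total(counts):
    """Sum the counts, rounded to one decimal to hide floating-point noise."""
    return round(sum(counts.values()), 1)


english_total = total(english)    # 53.6, matching the stated English total
mandarin_total = total(mandarin)  # 18.8, matching the stated Mandarin total
corpus_total = round(english_total + mandarin_total, 1)  # 72.4 (derived)

print(english_total, mandarin_total, corpus_total)
```

Because the published counts are approximate, the combined figure of about 72.4 thousand stories should be read as an estimate.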

This release consists of the English and Mandarin text components of the TDT2 corpus. The data was collected daily over a period of six months (January-June 1998) from the following sources.

American Broadcasting Company (ABC)
Associated Press
Cable News Network, Inc. (CNN)
New York Times
Public Radio International (PRI)
Voice of America (VOA)
Xinhua News Agency
ZaoBao News

The data is provided in the following formats.

.sgm: Reference true-text, with markup providing story boundaries and descriptive information
.tkn: Tokenized version of SGML data, with all descriptive and boundary information removed
.as0: Output of the Dragon ASR system in tokenized form with information on timing, speaker clusters, and confidence
.as1: Output of the BBN ASR system in tokenized form with timing information (English only)
.mttkn: SYSTRAN output from .tkn (Mandarin only)
.mtas0: SYSTRAN output from .as0 (Mandarin only)
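
When working with a local copy, it can help to identify each file's format from its extension. The following is a minimal sketch based only on the extensions listed above; the lookup table, the `describe` function, and the sample file names are hypothetical, and the corpus's actual directory layout and file naming are not described on this page.

```python
from pathlib import PurePath

# Extension -> (short description, language restriction), taken from the
# format list above. "not restricted" means no restriction is stated there.
FORMATS = {
    ".sgm": ("reference text with story-boundary markup", "not restricted"),
    ".tkn": ("tokenized reference text, markup removed", "not restricted"),
    ".as0": ("Dragon ASR output, tokenized", "not restricted"),
    ".as1": ("BBN ASR output, tokenized", "English only"),
    ".mttkn": ("SYSTRAN output from .tkn", "Mandarin only"),
    ".mtas0": ("SYSTRAN output from .as0", "Mandarin only"),
}


def describe(filename):
    """Return (description, restriction) for a file name, or None if unknown."""
    suffix = PurePath(filename).suffix.lower()
    return FORMATS.get(suffix)


# Hypothetical file names, used only to exercise the lookup.
for name in ["sample.sgm", "sample.as1", "sample.mttkn", "notes.txt"]:
    print(name, "->", describe(name))
```

Unknown extensions return `None` rather than raising, so the function can be applied to a mixed directory listing without special handling.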

The corpus also includes topic relevance tables as well as tables for locating story boundaries.

Updates

7/21/16 - Topic tables were added to the release and the online documentation folder.

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.

Copyright

Portions © 1998 American Broadcasting Company, The Associated Press, Cable News Network, LP, LLLP, New York Times, Public Radio International, SPH AsiaOne Ltd, Xinhua News Agency, © 1998-2001 Trustees of the University of Pennsylvania

