TDT5 Topics and Annotations

Item Name:	TDT5 Topics and Annotations
Author(s):	Meghan Glenn, Stephanie Strassel, Junbo Kong, Kazuaki Maeda
LDC Catalog No.:	LDC2006T19
ISBN:	1-58563-418-2
ISLRN:	396-836-683-088-8
DOI:	https://doi.org/10.35111/fjaq-y976
Release Date:	December 19, 2006
Member Year(s):	2006
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	information detection, information extraction, language modeling, machine learning, machine translation, topic detection and tracking
Language(s):	English, Mandarin Chinese, Standard Arabic
Language ID(s):	eng, cmn, arb
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2006T19 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Glenn, Meghan, et al. TDT5 Topics and Annotations LDC2006T19. Web Download. Philadelphia: Linguistic Data Consortium, 2006.
Related Works:
isAnnotationOf
	LDC2006T18 TDT5 Multilingual Text
isSimilarWith
	LDC98T25 TDT Pilot Study Corpus
	LDC99S84 TDT2 English Audio
	LDC2000S92 TDT2 Careful Transcription Audio
	LDC2000T44 TDT2 Careful Transcription Text
	LDC2001S93 TDT2 Mandarin Audio Corpus
	LDC2001S94 TDT3 English Audio
	LDC2001S95 TDT3 Mandarin Audio
	LDC2001T57 TDT2 Multilanguage Text Version 4.0
	LDC2001T58 TDT3 Multilanguage Text Version 2.0
	LDC2005S11 TDT4 Multilingual Broadcast News Speech Corpus
	LDC2005T16 TDT4 Multilingual Text and Annotations
	LDC2007T22 2001 Topic Annotated Enron Email Data Set

Introduction

TDT5 Topics and Annotations was developed by the Linguistic Data Consortium (LDC) and includes about 10,000 topic relevance judgments and other associated information for the TDT5 2004 evaluation topics.

This release contains complete relevance judgments, including the results of adjudication, in which discrepancies between system submissions and LDC annotations are reviewed and relevance judgments updated. This release also contains answer keys for the link detection task.

The TDT5 corpora were created by the Linguistic Data Consortium with support from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The multilingual news text corresponding to this publication can be found in TDT5 Multilingual Text (LDC2006T18).

Data

A total of 250 topics, numbered 55001 to 55250, were annotated by LDC using a search-guided annotation technique. Details of the annotation process are described in the annotation task definition.

Approximately 25% of the topics are monolingual English (ENG), 25% are monolingual Mandarin Chinese (MAN), 25% are monolingual Arabic (ARB), and 25% are multilingual:

63	ENG
62	MAN
62	ARB
35	ARB ENG MAN
21	ENG MAN
7	ARB ENG
250	total

Broken down by language, with each multilingual topic counted once for every language it covers:

126	ENG
118	MAN
104	ARB
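
The per-language totals follow from the combination table: each multilingual topic counts once toward every language it includes. The sketch below reproduces that arithmetic in Python, with the counts copied from the table above. The names `topic_counts` and `per_language_totals` are illustrative assumptions and are not part of the corpus or any LDC tooling.

```python
from collections import Counter

# Topic counts per language combination, copied from the table above.
topic_counts = {
    ("ENG",): 63,
    ("MAN",): 62,
    ("ARB",): 62,
    ("ARB", "ENG", "MAN"): 35,
    ("ENG", "MAN"): 21,
    ("ARB", "ENG"): 7,
}


def per_language_totals(counts):
    """Count, for each language, every topic that includes that language."""
    totals = Counter()
    for languages, n in counts.items():
        for lang in languages:
            totals[lang] += n
    return dict(totals)


total_topics = sum(topic_counts.values())
# Topics that cover more than one language.
multilingual = sum(n for langs, n in topic_counts.items() if len(langs) > 1)
by_language = per_language_totals(topic_counts)

print(total_topics)   # 250
print(multilingual)   # 63
print(by_language)    # {'ENG': 126, 'MAN': 118, 'ARB': 104}
```

The 63 multilingual topics are 25.2% of the 250 topics, which matches the approximate 25% share stated above.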

Samples

For an example of the data in this corpus, please review this sample (TXT) from the link detection files.

Updates

None at this time.
