TDT5 Topics and Annotations


Item Name: TDT5 Topics and Annotations
Authors: Meghan Glenn, Stephanie Strassel, Junbo Kong, Kazuaki Maeda
LDC Catalog No.: LDC2006T19
ISBN: 1-58563-418-2
Release Date: Dec 19, 2006
Data Type: text
Data Source(s): newswire
Application(s): information detection, information extraction, language modeling, machine learning, machine translation, topic detection and tracking
Language(s): English, Mandarin Chinese, Modern Standard Arabic
Language ID(s): arb, cmn, eng
Distribution: Web Download
Member Fee: $0 for 2006 members
Non-member Fee: US $500.00
Reduced-License Fee: US $250.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Meghan Glenn, et al. 2006. TDT5 Topics and Annotations. Linguistic Data Consortium, Philadelphia.

Introduction

This file contains documentation on the TDT5 Topics and Annotations, Linguistic Data Consortium (LDC) catalog number LDC2006T19 and ISBN 1-58563-418-2.

This release includes topic relevance judgments and associated information for the TDT5 2004 evaluation topics. The relevance judgments are complete and incorporate the results of adjudication, in which discrepancies between system submissions and LDC annotations were reviewed and the judgments updated accordingly. The release also contains answer keys for the link detection task.

The TDT5 corpora were created by the Linguistic Data Consortium with support from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program. The multilingual news text corresponding to this publication can be found in LDC publication LDC2006T18, TDT5 Multilingual News Text.

Data

A total of 250 topics, numbered 55001 - 55250, were annotated by LDC using a search-guided annotation technique. Details of the annotation process are described in the annotation task definition.

Approximately 25% of the topics are monolingual English (ENG), 25% are monolingual Mandarin Chinese (MAN), 25% are monolingual Arabic (ARB), and 25% are multilingual:

Topic language(s)    Topics
ENG                  63
MAN                  62
ARB                  62
ARB ENG MAN          35
ENG MAN              21
ARB ENG               7
Total               250

Broken down by language, counting both monolingual and multilingual topics:

Language    Topics
ENG         126
MAN         118
ARB         104
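
The per-language totals follow directly from the counts per language combination: a multilingual topic counts once for each language it covers. The short Python sketch below reproduces that arithmetic. The dictionary layout and the function name per_language_totals are illustrative assumptions and are not part of the corpus or its tools; only the numbers come from the breakdown above.

```python
# Topic counts per language combination, copied from the breakdown above.
# The tuple-keyed dictionary is an illustrative layout, not a corpus format.
topic_counts = {
    ("ENG",): 63,
    ("MAN",): 62,
    ("ARB",): 62,
    ("ARB", "ENG", "MAN"): 35,
    ("ENG", "MAN"): 21,
    ("ARB", "ENG"): 7,
}


def per_language_totals(counts):
    """Sum topic counts per language; a multilingual topic counts once per language."""
    totals = {}
    for langs, n in counts.items():
        for lang in langs:
            totals[lang] = totals.get(lang, 0) + n
    return totals


total_topics = sum(topic_counts.values())
multilingual = sum(n for langs, n in topic_counts.items() if len(langs) > 1)
totals = per_language_totals(topic_counts)

print("total topics:", total_topics)
print("multilingual topics:", multilingual)
for lang in sorted(totals):
    print(lang, totals[lang])
```

Because multilingual topics are counted under every language they cover, the per-language totals sum to more than 250.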

Samples

For an example of the data in this corpus, please review this sample from the link detection files.

Content Copyright

Portions © 2004, 2006 Trustees of the University of Pennsylvania