TDT5 Topics and Annotations
LDC2006T19
December 4, 2006

I. Introduction

This file contains documentation about TDT5 Topics and Annotations,
Linguistic Data Consortium (LDC) catalog number LDC2006T19 and ISBN
number 1-58563-418-2.

This release includes topic relevance judgments and associated information
for the TDT5 2004 evaluation topics. The relevance judgments are complete
and include the results of adjudication, in which discrepancies between
system submissions and LDC annotations were reviewed and relevance judgments
updated. This release also contains answer keys for the link detection task.

The TDT5 corpora were created by the Linguistic Data Consortium with support
from the DARPA TIDES (Translingual Information Detection, Extraction and
Summarization) Program. The multilingual news text corresponding to this
publication can be found in LDC publication LDC2006T18, TDT5 Multilingual
News Text.

A total of 250 topics, numbered 55001 - 55250, were annotated by LDC using a
search-guided annotation technique. Details of the annotation process are
described in the annotation task definition, here:

  http://projects.ldc.upenn.edu/TDT5/Annotation/TDT2004V1.2.pdf

A copy of these guidelines is also contained in the /docs directory.

Approximately 25% of the topics are monolingual English (ENG), 25% are
monolingual Mandarin Chinese (MAN), 25% are monolingual Arabic (ARB), and
25% are multilingual:

   63  ENG
   62  MAN
   62  ARB
   35  ARB ENG MAN
   21  ENG MAN
    7  ARB ENG
  ---
  250  total

Broken down by language, counting both mono- and multi-lingual topics:

  126  ENG
  118  MAN
  104  ARB

II. Directory structure

The TDT2004 release is organized by the following directory structure:

tdt2004_topic_annotations/

  - README -- top-level readme, which includes or points to all relevant
    information for this release

  - annotations/

    o link_detection/
      This directory contains answer keys for the link detection task.

    o relevance_judgments/
      This directory contains topic relevance tables, as well as a table
      indicating "completeness" of human relevance judgments for each topic.

  - docs/
    This directory contains documentation of the annotation process and
    additional information about the annotated topics.

    o TDT2004-topic_profiles.html - HTML document containing full topic
      descriptions for all 250 topics

    o TDT2004.topic_titles - tab-delimited list of topic IDs and titles,
      in this format:

        topicID  title

    o TDT2004.topics_by_language - tab-delimited list of topic IDs and the
      languages in which each was annotated, in this format:

        topicID  ARB ENG MAN

    o TDT2004V1.2.pdf - annotation guidelines

III. Description of Annotations

A. Relevance Judgments

Topic relevance assessment involved multi-stage, search-guided annotation.
Annotators submitted sets of query terms for each topic to a search engine
and labeled each returned document for its relevance to the topic.

Results of topic labeling are presented in two tables, one for on-topic
judgments and one for off-topic judgments. The tables display one
topic-document judgment per line. The lines also contain comments of this
form:

  # "none" "hard"

"hard" indicates that the annotator deemed that particular topic-document
relevance judgment a "difficult decision" (an illustrative parsing sketch
appears below).

A preliminary version (V1.0) of the topic tables contained the results of
initial topic labeling; these were used as input for initial system scoring.
After system results were submitted, LDC performed limited adjudication of
discrepancies between system and human annotations; some topic labels were
updated as a result of this process.
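As a small illustration of the comment convention described above, the
following Python sketch tallies how many judgments in the on-topic table
were flagged as "difficult decisions". It is not part of this release and
relies only on the trailing comment of the form # "none" "hard"; the file
path and whitespace handling are assumptions, not a specification of the
table layout.

# Minimal sketch, assuming the on-topic table path below and that everything
# after the first '#' on a line is the annotator comment field.
from collections import Counter

def count_hard_judgments(path="annotations/relevance_judgments/TDT2004.topic_rel.v2.0"):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            counts["judgments"] += 1
            # Treat the text after the first '#' as the comment field.
            _, _, comment = line.partition("#")
            if '"hard"' in comment:
                counts["hard"] += 1
    return counts

if __name__ == "__main__":
    print(count_hard_judgments())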
The final version of the topic tables (V2.0, included here) contains all of
the records from the preliminary version, plus additional records for
topic-document tuples that were newly labeled during adjudication. Some
records that had been in the old on-topic table may have moved to the
off-topic table, and vice versa. For newly added records, the comment field
indicates "new in v2.0". For records repeated from the initial tables,
"mod.in v2.0" is appended to the comment (or replaces "none") if the initial
label was "toggled" (old "YES" changed to "NO" or vice versa).

The relevance judgments and associated annotations are included in three
tables within the annotations/relevance_judgments/ directory:

  o TDT2004.topic_rel.v2.0
    Contains the unique document ids that were judged on-topic for each
    topic-language pair, one topic-document judgment per line as described
    above.

  o TDT2004.off_topic.v2.0
    Contains the unique document ids that were judged off-topic for each
    topic-language pair, in the same format.

  o TDT2004.topic_completeness
    Gives a "completeness" status for each topic. Annotation was limited to
    three hours per topic. Upon finishing a topic, the annotator was asked:
    "Your time is up, and/or you have marked this topic as done. If you had
    more time, do you think you could find more stories for this topic in
    the corpus?" The answer for each topic-language pair appears in this
    table, in the following format:

      55nnn LNG complete/incomplete

B. Link Detection

Answer keys for the link detection task were created by NIST and are
derived from the topic tables described above. All link detection (lnk)
tests use newswire texts (SR=nwt). There are six files of keys within the
annotations/link_detection/ directory. Each file corresponds to a different
evaluation condition: for each of the three language conditions, Arabic
(TE=arb), Mandarin (TE=man), and multilingual (TE=mul), there are two
content encodings, either native-encoded foreign language content (TE=,nat)
or English translations of non-English material (TE=,eng). The keys take
into account the adjudicated topic labels.

IV. Adjudication Strategy

In order to identify misses (on-topic stories that are not identified as
such), LDC relies on adjudication of research sites' results. NIST provides
LDC with each research site's results for the topic tracking task. The
sites' systems are scored against the LDC's human-produced topic relevance
tables, with the annotators' judgments taken as ground truth. Each system
false alarm is a potential LDC miss.

It is not feasible to adjudicate every case where LDC annotators differ from
system output; the effort needed to adjudicate all cases of discrepancy
would exceed the original corpus creation effort. Instead, LDC reviews cases
where a majority of systems disagree with the original annotation and
modifies the topic labels as required. In previous TDT corpus adjudication
efforts, the probability that a system false alarm corresponds to an
annotator miss grew in proportion to the number of systems reporting
disagreement with the original annotation. Because LDC agrees to adjudicate
as many results as possible in the time allotted, the data to be judged is
prioritized in the following manner: first, annotators review all purported
LDC false alarms; second, the cases where four or more sites have identified
potential LDC misses; third, the misses where three or more sites disagree
with the original annotation, continuing until all research sites' results
have been reviewed.
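The following Python sketch restates the review priority just described. It
is illustrative only; the input structures (a list of purported LDC false
alarms and a count of how many sites flagged each potential miss) are
hypothetical stand-ins and do not reflect the format of the submissions NIST
provided to LDC.

# Minimal sketch of the adjudication priority, under the assumptions above.
from typing import Dict, List, Tuple

Case = Tuple[str, str]  # (topic_id, document_id)

def adjudication_queue(false_alarms: List[Case],
                       miss_flags: Dict[Case, int],
                       min_sites: int = 3) -> List[Case]:
    # All purported LDC false alarms are reviewed first.
    queue = list(false_alarms)
    # Potential LDC misses follow, most widely flagged first, down to the
    # minimum number of disagreeing sites considered (three in this release).
    misses = [case for case, n in miss_flags.items() if n >= min_sites]
    misses.sort(key=lambda case: miss_flags[case], reverse=True)
    return queue + misses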
Because time for adjudication was limited, only a portion of research sites'
results was adjudicated for all 250 topics. LDC received a total of
2,304,835 relevance judgments to review, reflecting submissions from four
sites. 10,443 of these were potential LDC false alarms; the other 2,294,392
documents were potential LDC misses. (Submissions from one site were not
received in time to incorporate into the adjudication task.)

LDC adjudicated 100% of purported false alarms for all topics in all
languages. At minimum, LDC adjudicated all cases where 4 of 4 systems
suggested an LDC miss, with the following exceptions: Three topics contained
purported misses in Chinese; these documents were not adjudicated for
Chinese because the original annotation was English-only. Those topics are
55026, 55058, and 55063. Additionally, one English topic did not receive
complete adjudication of the cases where 4 out of 4 sites suggested an LDC
miss, due to its sheer size. That topic is 55058.

V. Contacts

Coordination and Annotation Supervision - Meghan Glenn
Management - Stephanie Strassel
Technical Consultation - Kazuaki Maeda, David Graff

-----------
README created by Meghan Glenn, 11/21/2006
Updated by Stephanie Strassel, 12/4/2006