TDT5 Topics and Annotations
LDC2006T19
December 4, 2006

I. Introduction

This file contains documentation about TDT5 Topics and Annotations,
Linguistic Data Consortium (LDC) catalog number LDC2006T19 and ISBN
number 1-58563-418-2.

This release includes topic relevance judgments and associated information
for the TDT5 2004 evaluation topics. The relevance judgments are complete
and include the results of adjudication, in which discrepancies between
system submissions and LDC annotations were reviewed and relevance judgments
updated. This release also contains answer keys for the link detection task.

The TDT5 corpora were created by the Linguistic Data Consortium with support
from the DARPA TIDES (Translingual Information Detection, Extraction and
Summarization) Program. The multilingual news text corresponding to this
publication can be found in LDC publication LDC2006T18, TDT5 Multilingual
News Text.

A total of 250 topics, numbered 55001 - 55250, were annotated by LDC using a
search-guided annotation technique. Details of the annotation process are
described in the annotation task definition, here:

  http://projects.ldc.upenn.edu/TDT5/Annotation/TDT2004V1.2.pdf

A copy of these guidelines is also contained in the /docs directory.

Approximately 25% of the topics are monolingual English (ENG), 25% are
monolingual Mandarin Chinese (MAN), 25% are monolingual Arabic (ARB), and
25% are multilingual:

   63  ENG
   62  MAN
   62  ARB
   35  ARB ENG MAN
   21  ENG MAN
    7  ARB ENG
  ---
  250  total

Broken down by language, counting both mono- and multi-lingual topics:

  126  ENG
  118  MAN
  104  ARB

II. Directory structure

The TDT2004 release is organized by the following directory structure:

tdt2004_topic_annotations/

  - README -- top-level readme, which includes or points to all relevant
    information for this release

  - annotations/

    o link_detection/
      This directory contains answer keys for the link detection task.

    o relevance_judgments/
      This directory contains topic relevance tables, as well as a table
      indicating "completeness" of human relevance judgments for each topic.

  - docs/
    This directory contains documentation of the annotation process and
    additional information about the annotated topics.

    o TDT2004-topic_profiles.html - HTML document containing full topic
      descriptions for all 250 topics

    o TDT2004.topic_titles - tab-delimited list of topic IDs and titles,
      in this format:

        topicID  title

    o TDT2004.topics_by_language - tab-delimited list of topic IDs and the
      languages in which each was annotated, in this format:

        topicID  ARB ENG MAN

    o TDT2004V1.2.pdf - annotation guidelines

III. Description of Annotations

A. Relevance Judgments

Topic relevance assessment involved multi-stage, search-guided annotation.
Annotators submitted sets of query terms for each topic to a search engine
and labeled each returned document for its relevance to the topic.

Results of topic labeling are presented in two tables, one for on-topic
judgments and one for off-topic judgments. The tables display one
topic-document judgment per line. The lines also contain comments of this
form:

  # "none" "hard"

"hard" indicates that the annotator deemed that particular topic-document
relevance judgment a "difficult decision" (an illustrative parsing sketch
appears below).

A preliminary version (V1.0) of the topic tables contained the results of
initial topic labeling; these were used as input for initial system scoring.
After system results were submitted, LDC performed limited adjudication of
discrepancies between system and human annotations; some topic labels were
updated as a result of this process.
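As a small illustration of the comment convention described above, the
following Python sketch tallies how many judgments in the on-topic table
were flagged as "difficult decisions". It is not part of this release and
relies only on the trailing comment of the form # "none" "hard"; the file
path and whitespace handling are assumptions, not a specification of the
table layout.

# Minimal sketch, assuming the on-topic table path below and that everything
# after the first '#' on a line is the annotator comment field.
from collections import Counter

def count_hard_judgments(path="annotations/relevance_judgments/TDT2004.topic_rel.v2.0"):
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            counts["judgments"] += 1
            # Treat the text after the first '#' as the comment field.
            _, _, comment = line.partition("#")
            if '"hard"' in comment:
                counts["hard"] += 1
    return counts

if __name__ == "__main__":
    print(count_hard_judgments())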
The final version of the topic tables (V2.0, included here) contains all of
the records from the preliminary version, plus additional records for
topic-document tuples that were newly labeled during adjudication. Some
records that had been in the old on-topic table may have moved to the
off-topic table, and vice versa. For newly added records, the comment field
indicates "new in v2.0". For records repeated from the initial tables,
"mod.in v2.0" is appended to the comment (or replaces "none") if the initial
label was "toggled" (old "YES" changed to "NO" or vice versa).

The relevance judgments and associated annotations are included in three
tables within the annotations/relevance_judgments/ directory:

  o TDT2004.topic_rel.v2.0
    Contains the unique document ids that were judged on-topic for each
    topic-language pair, one topic-document judgment per line as described
    above.

  o TDT2004.off_topic.v2.0
    Contains the unique document ids that were judged off-topic for each
    topic-language pair, in the same format.

  o TDT2004.topic_completeness
    Gives a "completeness" status for each topic. Annotation was limited to
    three hours per topic. Upon finishing a topic, the annotator was asked:
    "Your time is up, and/or you have marked this topic as done. If you had
    more time, do you think you could find more stories for this topic in
    the corpus?" The answer for each topic-language pair appears in this
    table, in the following format:

      55nnn LNG complete/incomplete

B. Link Detection

Answer keys for the link detection task were created by NIST and are
derived from the topic tables described above. All link detection (lnk)
tests use newswire texts (SR=nwt). There are six files of keys within the
annotations/link_detection/ directory. Each file corresponds to a different
evaluation condition: for each of the three language conditions, Arabic
(TE=arb), Mandarin (TE=man), and multilingual (TE=mul), there are two
content encodings, either native-encoded foreign language content (TE=,nat)
or English translations of non-English material (TE=,eng). The keys take
into account the adjudicated topic labels.

IV. Adjudication Strategy

In order to identify misses (on-topic stories that are not identified as
such), LDC relies on adjudication of research sites' results. NIST provides
LDC with each research site's results for the topic tracking task. The
sites' systems are scored against the LDC's human-produced topic relevance
tables, with the annotators' judgments taken as ground truth. Each system
false alarm is a potential LDC miss.

It is not feasible to adjudicate every case where LDC annotators differ from
system output; the effort needed to adjudicate all cases of discrepancy
would exceed the original corpus creation effort. Instead, LDC reviews cases
where a majority of systems disagree with the original annotation and
modifies the topic labels as required. In previous TDT corpus adjudication
efforts, the probability that a system false alarm corresponds to an
annotator miss grew in proportion to the number of systems reporting
disagreement with the original annotation. Because LDC agrees to adjudicate
as many results as possible in the time allotted, the data to be judged is
prioritized in the following manner: first, annotators review all purported
LDC false alarms; second, the cases where four or more sites have identified
potential LDC misses; third, the misses where three or more sites disagree
with the original annotation, continuing until all research sites' results
have been reviewed.
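The following Python sketch restates the review priority just described. It
is illustrative only; the input structures (a list of purported LDC false
alarms and a count of how many sites flagged each potential miss) are
hypothetical stand-ins and do not reflect the format of the submissions NIST
provided to LDC.

# Minimal sketch of the adjudication priority, under the assumptions above.
from typing import Dict, List, Tuple

Case = Tuple[str, str]  # (topic_id, document_id)

def adjudication_queue(false_alarms: List[Case],
                       miss_flags: Dict[Case, int],
                       min_sites: int = 3) -> List[Case]:
    # All purported LDC false alarms are reviewed first.
    queue = list(false_alarms)
    # Potential LDC misses follow, most widely flagged first, down to the
    # minimum number of disagreeing sites considered (three in this release).
    misses = [case for case, n in miss_flags.items() if n >= min_sites]
    misses.sort(key=lambda case: miss_flags[case], reverse=True)
    return queue + misses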
Because time for adjudication was limited, only a portion of research sites'
results was adjudicated for all 250 topics. LDC received a total of
2,304,835 relevance judgments to review, reflecting submissions from four
sites. 10,443 of these were potential LDC false alarms; the other 2,294,392
documents were potential LDC misses. (Submissions from one site were not
received in time to incorporate into the adjudication task.)

LDC adjudicated 100% of purported false alarms for all topics in all
languages. At minimum, LDC adjudicated all cases where 4 of 4 systems
suggested an LDC miss, with the following exceptions: Three topics contained
purported misses in Chinese; these documents were not adjudicated for
Chinese because the original annotation was English-only. Those topics are
55026, 55058, and 55063. Additionally, one English topic did not receive
complete adjudication of the cases where 4 out of 4 sites suggested an LDC
miss, due to its sheer size. That topic is 55058.

V. Contacts

Coordination and Annotation Supervision - Meghan Glenn
Management - Stephanie Strassel
Technical Consultation - Kazuaki Maeda, David Graff

-----------
README created by Meghan Glenn, 11/21/2006
Updated by Stephanie Strassel, 12/4/2006