TDT5 Multilingual News Text
LDC2006T18
December 4, 2006

I. Introduction

This file contains documentation about TDT5 Multilingual News Text, Linguistic Data Consortium (LDC) catalog number LDC2006T18 and ISBN number 1-58563-417-4. The TDT5 corpora were created by the Linguistic Data Consortium with support from the DARPA TIDES (Translingual Information Detection, Extraction and Summarization) Program.

This release contains the complete set of English, Arabic and Chinese newswire text used in the 2004 Topic Detection and Tracking technology evaluations. The topic relevance annotations corresponding to this publication can be found in LDC publication LDC2006T19, TDT5 Topics and Annotations.

Topic Detection and Tracking (TDT) refers to automatic techniques for finding topically related material in streams of data such as newswire and broadcast news. Four TDT tasks were defined for the 2004 evaluation: tracking of known topics, detection of unknown topics, detection of initial stories on unknown topics, and detection of pairs of stories on the same topic (links). Of these four tasks, topic tracking and link detection are considered "primary." Previous TDT evaluations also included a story segmentation task, which applied only to broadcast news; since TDT5 does not include broadcast news, there is no story segmentation task in the 2004 TDT evaluation.

Complete documentation on the TDT evaluation program can be found on NIST's TDT website. For further information about corpora and annotations to support the TDT Program, visit LDC's TDT information pages.

II. Data Profile

The TDT5 corpus spans collections from April through September 2003 and consists of English, Chinese, and Arabic news text. A total of 15 distinct news "sources" are included, where a "source" comprises data from a given news agency in a particular language; when an agency publishes in multiple languages, each language is considered a different "source".
In contrast to earlier TDT corpora, TDT5 has no broadcast/audio content, only printed news from wire and web sources.

  Arabic
    AFA  Agence France Presse
    ANN  An-Nahar
    UMM  Ummah Press
    XIA  Xinhua News Agency

  English
    AFE  Agence France Presse
    APE  Associated Press
    CNE  Central News Agency - Taiwan
    LAT  LA Times/Washington Post
    NYT  New York Times
    UME  Ummah Press
    XIE  Xinhua News Agency

  Chinese
    AFC  Agence France Presse
    CNA  Central News Agency - Taiwan
    XIN  Xinhua News Agency
    ZBN  Zaobao News Agency

TDT5 comprises a subset of the previous LDC release, HARD 2004 Text (LDC2005T28). In the HARD 2004 Text release, data is organized into one file per day per source. For the current TDT5 corpus, roughly half of the daily files are partitioned into chunks of some maximum size (around 30,000 words on average), in order to support the "look-ahead" condition in the TDT5 evaluation plan and to keep sample file sizes relatively consistent. Summary statistics about the volume of data by source are available in the tdt5_stats_tables.txt document found in the /docs directory of this release.

III. Notes on Time/Date Properties of the Corpus

Most of the sources included in TDT5 are web sites from which we typically download all content at daily intervals, or other types of electronic archives that we receive in bulk, as opposed to wire services that run continuously on dedicated modems. This difference has an impact on one of the underlying assumptions about TDT data. Modem/wire collections tend to behave like 24-hour news channels, giving a sequence of reports on a given event with details being added over time; each story comes with a time-stamp, and stories are written to data files in chronological order. The web/bulk sources behave differently: for each date, we get a snapshot of information that the source asserts is current at a given moment.
The web sources often do not provide anything like time-stamps on the stories (or LDC's download/conditioning may have failed to locate or retain time-stamps); in any case, the sequence in which stories for a given date are received -- and the order in which they are stored to a given daily collection file -- may be random relative to when each story was posted on the web site. We view this timing variation as unrecoverable and of little consequence.

Of the 15 sources in TDT5, we have time-stamp data and chronological ordering for only 5. These 5 all happen to be sources whose daily quantity requires partitioning of the data into two or more files per day. An additional 5 sources without time-stamps, and with "indeterminate" ordering of stories within each day's collection, need to be split up as well, into as many as 4 partitions per day. To make this fit within the TDT framework, we needed to "invent" time-stamps for these stories and place them into files that occupy particular time periods each day. (The remaining 5 sources with no time-stamps are low-volume, and their files are not partitioned; their data files are assigned to arbitrarily chosen times of day, such that each source always occupies the same time-slot.)

On balance, it is unlikely that the partitioning of the non-time-stamped sources will produce anomalies in which a more detailed follow-up story appears earlier in the day than a shorter, "first-on-topic" story within a given source. The archival nature of these sources tends to eliminate multiple, time-ordered versions of a given story within a single daily snapshot. It is conceivable that a given event might show up in AP (a time-stamped modem feed) at 15:50 with "first-on-topic" brevity, while the same event is reported with follow-up detail in a Zaobao file with a time-stamp of 08:10; but this sort of variance has always been a feature of TDT.
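The "invented" time-stamps described above could, for example, be realized by spacing each day's stories evenly through a source's assigned time period. A minimal Python sketch follows (an illustration only, not the LDC's actual procedure; the function name and the "HHMM_HHMM" slot format are our assumptions):

```python
from datetime import datetime

def invent_timestamps(date, slot, n_stories):
    """Spread n_stories evenly across a daily time slot.

    date -- collection date as "YYYYMMDD"
    slot -- assigned period as "HHMM_HHMM", e.g. "0300_0500"

    Returns a chronologically ordered list of datetimes, one per
    story, so each story gets a plausible "made-up" time-stamp
    inside the slot.  (A sketch; not the LDC's actual procedure.)
    """
    start_s, end_s = slot.split("_")
    start = datetime.strptime(date + start_s, "%Y%m%d%H%M")
    end = datetime.strptime(date + end_s, "%Y%m%d%H%M")
    step = (end - start) / max(n_stories, 1)
    return [start + i * step for i in range(n_stories)]
```

Evenly spaced stamps preserve the (arbitrary) within-file story order while keeping every story inside its source's fixed slot.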
Here are some general observations/rules for splitting up the original TDT5 data files (one file per source per day) in order to produce sample units that are more appropriate for TDT. Since most of the variance in file size correlates with source and language, the rules are stated in those terms.

  afc -> AFP_CHN  m : all files remain unsplit
  cna -> CNA_CHN  m : all files remain unsplit
  cne -> CNA_ENG  m : all files remain unsplit
  ume -> UMM_ENG  m : all files remain unsplit
  umm -> UMM_ARB  m : all files remain unsplit
  ann -> ANN_ARB  m : most files remain unsplit, a few split in half
  xia -> XIN_ARB  m : most files remain unsplit, a few split in half
  nyt -> NYT_ENG  t : files either split in half or remain unsplit
  zbn -> ZBN_CHN  m : files either split in half or remain unsplit
  afa -> AFP_ARB  i : most files split in half, a few remain unsplit
  lat -> LAT_ENG  t : most files split into 2 to 4 partitions each
  xie -> XIN_ENG  m : all files split into 2 to 3 partitions each
  xin -> XIN_CHN  m : all files split into 3 or 4 partitions each
  afe -> AFP_ENG  i : files split into 5 to 10 partitions each
  ape -> APW_ENG  t : files split into 6 to 12 partitions each

The lower-case letter just before each colon shows the status of time-stamps for each source:

  t = true time-stamps already exist in src_sgm markup
  i = time-stamps can be imported from raw data
  m = must invent "made-up" time-stamps for stories

The strategy for splitting is based on the following thresholds:

   <  36000 tokens  -- do not split
   36000 -  70000   -- split into 2
   70000 - 100000   -- split into 3
  100000 - 130000   -- split into 4

... and so on at intervals of 30000.
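The threshold table above amounts to a simple step function from a daily file's token count to its partition count. A short Python sketch (the function name is ours, not part of the release):

```python
def num_partitions(token_count):
    """Map a daily file's token count to its partition count.

    Implements the thresholds above: files under 36,000 tokens are
    left unsplit, files up to 70,000 split in two, and each further
    30,000-token interval adds one more partition.
    """
    if token_count < 36000:
        return 1
    if token_count < 70000:
        return 2
    # 70000-100000 -> 3, 100000-130000 -> 4, and so on at
    # intervals of 30000 tokens.
    return 3 + (token_count - 70000) // 30000
```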
Regarding the 10 sources that do not come with any time-stamp data (marked with "m" above), we can allocate times as follows:

  afc -> AFP_CHN : 0300_0500
  ann -> ANN_ARB : 1100_1230, 1230_1400
  cna -> CNA_CHN : 0530_0700
  cne -> CNA_ENG : 1300_1430
  ume -> UMM_ENG : 2100_2300
  umm -> UMM_ARB : 1900_2100
  xia -> XIN_ARB : 0600_0800, 1500_1700
  xie -> XIN_ENG : 0900_1100, 1300_1500, 1700_1900
  xin -> XIN_CHN : 0130_0330, 0730_0930, 1130_1330, 1530_1730
  zbn -> ZBN_CHN : 1000_1200, 1900_2100

IV. Annotations of the Corpus

The TDT5 corpus has been annotated in multiple ways, including topic relevance judgments, link detection and adjudication of site submissions. All annotations are available in the TDT5 Topics and Annotations corpus, LDC2006T19. Additional information about annotation of the TDT5 corpus is available at http://www.ldc.upenn.edu/Projects/TDT5/Annotation/TDT2004V1.2.pdf

V. Corpus Structure

The organization of data in the corpus is intended to provide direct support for the research tasks defined in the yearly TDT evaluation plans (available at http://www.nist.gov/speech/tests/tdt/index.htm), while also providing a data format compatible with other research projects, including information extraction, information retrieval, summarization and other technologies. Each data sample is presented in a variety of forms, with each form placed in a separate directory under /data. The forms of data in this release (and their directory names) are:

  tkn_sgm   -- Reference text data derived from "tkn" files, in an SGML markup format similar to the TIPSTER text corpora
  mttkn_sgm -- Machine translation output from ISI, in an SGML markup format similar to the TIPSTER text corpora

The other data formats used in accordance with the NIST TDT5 evaluation plan involved "token-stream" data, which were originally designed to support the story-segmentation task for broadcast data.
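Since the tkn_sgm and mttkn_sgm files use TIPSTER-like SGML markup, individual stories can be located by their document tags. As a quick-inspection sketch in Python (assuming the usual TIPSTER <DOC>/<DOCNO> wrapping; a full workflow would instead use an SGML parser such as nsgmls with the dtd files included in this release):

```python
import re

# Assumed TIPSTER-style layout: each story wrapped in <DOC>...</DOC>
# with a <DOCNO> identifier on its own element.  This is a rough
# inspection aid only; SGML is not regex-safe in general.
DOCNO_RE = re.compile(r"<DOCNO>\s*(.*?)\s*</DOCNO>")

def list_docnos(sgml_text):
    """Return the story identifiers found in one *_sgm file's text."""
    return DOCNO_RE.findall(sgml_text)
```

The example identifier format used below in testing is hypothetical; consult the actual data files for the real DOCNO conventions.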
Because these token-stream formats are much bulkier, and are useful only for replicating the NIST TDT5 evaluation procedures, they are provided in the form of a compressed unix "tar" file: tdt5proj.tgz contains the token-stream and boundary table files for the "tkn" and "mttkn" data as used in the NIST TDT5 evaluation. The tar file also contains the "pre-tokenized" versions of all source text data (files identified as "src_sgm").

Users of the GNU "tar" utility (or an equivalent command line tool) can unpack the tar file contents as follows:

  # copy tdt5proj.tgz to the current working directory, then:
  tar xzf tdt5proj.tgz

Some users may need to uncompress the file before extracting the contents:

  gunzip < tdt5proj.tgz > tdt5proj.tar
  tar xf tdt5proj.tar

A complete listing of the contents of the tar file (including names and sizes of all data files) is provided in "docs/proj_filelist.txt".

Our thanks to ISI (especially Ignacio Thayer & Kevin Knight) for providing MT output for the corpus.

VI. Supporting Materials

In addition to the data directories cited above, this release contains the following additional directories:

  dtd -- contains SGML Document Type Definition files that specify the markup format of the boundary table files, token-stream files, and the topic tables; the dtd files are necessary for using an SGML parsing utility (e.g. nsgmls) to process the various data files. The functions of the dtd files are:

    - boundset.dtd -- for all "boundary table" files
    - docset.dtd   -- for all "token stream" files (tkn, mttkn)
    - tiptext.dtd  -- for all "tipsterized sgm" files (tkn_sgm, mttkn_sgm)
    - srctext.dtd  -- for all "src_sgm" files
    - topicset.dtd -- for results of topic annotations (provided in the TDT5 Topics and Annotations Corpus, LDC2006T19)
  docs -- tables and listings that describe the corpus content:

    - tdt5_stats_tables.txt  -- summary of quantities by source and month
    - content_summary.txt    -- this file
    - tdt5proj_filelist.txt  -- list of contents in the release

-----------

README created by Stephanie Strassel, 12/4/2006
Updated by David Graff, 12/11/2006