README for the TDT4 Multilingual News Text and Annotations Corpus
LDC2005T16
April 5, 2005

I. Introduction
===============

This file contains documentation about the TDT4 Multilingual News Text
and Annotations Corpus, Linguistic Data Consortium (LDC) catalog number
LDC2005T16, ISBN 1-58563-339-9.

The TDT4 corpora were created by the Linguistic Data Consortium with
support from the DARPA TIDES (Translingual Information Detection,
Extraction and Summarization) Program.

This release contains the complete set of English, Arabic and Chinese
news text (broadcast news transcripts and newswire data) used in the
2002 and 2003 Topic Detection and Tracking technology evaluations,
along with the topic annotations created for those evaluations. The
audio corresponding to the broadcast news transcripts in this release
is available as LDC publication LDC2005S11, TDT4 Multilingual Broadcast
News Speech Corpus.

Topic Detection and Tracking (TDT) refers to automatic techniques for
finding topically related material in streams of data such as newswire
and broadcast news. Evaluation tasks in 2002 and 2003 included the
segmentation of a news source into stories, the tracking of known
topics, the detection of unknown topics, the detection of initial
stories on unknown topics, and the detection of pairs of stories on the
same topic.

Complete documentation on the TDT evaluation program can be found on
NIST's TDT web site. For further information about corpora and
annotations that support the TDT Program, visit LDC's TDT information
pages.

II. Data Sources
================

The TDT4 corpus contains news data collected daily from 20 news sources
in three languages (American English, Mandarin Chinese and Modern
Standard Arabic) over a period of four months (October 2000 through
January 2001). The sources and their frequency of sampling are as
follows:

English sources
---------------
APW_ENG (2)  Associated Press Worldstream Service (English content only)
NYT_NYT (3)  New York Times Newswire Service (excluding non-NYT sources)
CNN_HDL (3)  Cable News Network, "Headline News"
ABC_WNT (5)  American Broadcasting Company, "World News Tonight"
NBC_NNW (5)  National Broadcasting Company, "NBC Nightly News"
PRI_TWD (5)  Public Radio International, "The World"
VOA_ENG (5)  Voice of America, English news programs
MNB_NBW (6)  MS-NBC, "News with Brian Williams" (no data from Jan. 2001)

Mandarin sources
----------------
XIN_MAN (2)  Xinhua News Agency
ZBN_MAN (2)  Zaobao News Agency
CBS_MAN (3)  China Broadcasting System
CTS_MAN (3)  China Television System
VOA_MAN (5)  Voice of America, Mandarin Chinese news programs
CNR_MAN (5)  China National Radio
CTV_MAN (5)  China Central Television

Arabic sources
--------------
AFP_ARB (1)  Agence France-Presse
ALH_ARB (1)  Al-Hayat (no data from Jan. 2001)
ANN_ARB (4)  An-Nahar (no data from Jan. 2001)
VOA_ARB (5)  Voice of America, Modern Standard Arabic news programs
NTV_ARB (6)  Nile TV

Sampling levels
---------------
(1) >= 90 days sampled, over 100 stories/day
(2) >= 90 days sampled, about 70 to 80 stories/day
(3) >= 90 days sampled, 10 to 40 stories/day
(4) 60-90 days sampled, over 100 stories/day
(5) 60-90 days sampled, 10 to 40 stories/day
(6) 40-45 days sampled, 10 to 15 stories/day

The sampling frequencies indicated above are approximate; all sources
were prone to occasional failures in the data collection process. A
more detailed summary of data quantities by source and month is
provided in the file "tdt4_stats.tables".
A complete listing of all stories and sample files is provided in the
file "tdt4_docno.table".

III. Annotations of the TDT4 Corpus
===================================

Multiple manual annotations have been applied to the TDT4 data.
Briefly, these include:

 - Transcription of audio data
 - Manual segmentation of audio data into individual story units, and
   time-alignment of audio with transcripts
 - Topic selection
 - Topic definition and research
 - Search-guided topic relevance annotation
 - Adjudication of relevance judgments against system output

Each of these tasks is fully described in the Topic Detection and
Tracking Annotation Guidelines V1.5, available in the "doc" directory
of this release. The topic annotation tables created for the 2002 and
2003 evaluations are included in the "annotations" directory of this
release.

IV. Corpus Structure
====================

The organization of data in the corpus is intended to provide direct
support for the research tasks defined in the yearly TDT evaluation
plans (available at http://www.nist.gov/speech/tests/tdt/index.htm),
while also providing a data format compatible with other research
projects, including information extraction, information retrieval,
summarization and other technologies.

IV.A. Basic units of data
-------------------------

The basic units of the corpus are news stories and sample files.

Each news story is uniquely identified by a "DOCNO" (story-id) that
indicates the source and date of the story; e.g.:

    ALH20001207.1300.0073

identifies a story from Al-Hayat collected on December 7, 2000; the
final four digits distinguish this story from all other stories
collected on the same date from the same source. In the case of
broadcast sources (as opposed to newswire sources), the DOCNO also
contains four digits to indicate the start time of the broadcast;
e.g.:

    CNR20001001.1700.0000

identifies a China National Radio story from the 5:00pm broadcast
(EST); again, the final four digits distinguish this story from others
in the same broadcast.

Each sample file represents a contiguous collection of stories from a
given source on a given date over a specific period of time; the file
name of the sample (file-id) provides all of this information; e.g.:

    20001207_1300_1500_ALH_ARB
    20001001_1700_1730_CNR_MAN

These are the file names that happen to contain the example story-ids
mentioned above. The ALH file spans a period of collection from 1:00pm
to 3:00pm on December 7, and the CNR file covers a half-hour broadcast
starting at 5:00pm on October 1.
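Since these naming conventions are completely regular, story-ids and
file-ids can be taken apart mechanically. The following minimal Python
sketch illustrates the layout described above; the function names and
the regular expression are our own illustration, not part of the corpus
tools:

    import re

    # Illustrative parser for TDT4 story-ids and sample file-ids, based
    # on the naming conventions described in this README.
    DOCNO_RE = re.compile(
        r"^(?P<source>[A-Z]{3})"   # source code, e.g. ALH, CNR
        r"(?P<date>\d{8})"         # collection date, YYYYMMDD
        r"\.(?P<time>\d{4})"       # start time of file/broadcast (HHMM)
        r"\.(?P<serial>\d{4})$"    # serial number within the sample
    )

    def parse_docno(docno):
        """Split a story-id such as 'ALH20001207.1300.0073'."""
        m = DOCNO_RE.match(docno)
        if m is None:
            raise ValueError("not a TDT4 story-id: %r" % docno)
        return m.groupdict()

    def parse_file_id(file_id):
        """Split a file-id such as '20001207_1300_1500_ALH_ARB'."""
        date, start, end, source, lang = file_id.split("_")
        return {"date": date, "start": start, "end": end,
                "source": source, "lang": lang}

    print(parse_docno("CNR20001001.1700.0000"))
    print(parse_file_id("20001001_1700_1730_CNR_MAN"))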
IV.B. List of data types
------------------------

Each data sample is presented in a variety of forms, with each form
placed in a separate directory under /data. In this DVD-ROM release,
two forms of data ("tkn_sgm" and "asr_sgm") are directly accessible as
uncompressed files. All other data forms have been packed into the
compressed unix tar file "tdt3proj.tgz"; unpacking this tar file will
create the additional directories under "tdt3_em".

The forms of data in this release (and their directory names) are:

annotations -- Topic annotation tables created for the 2002 and 2003
   TDT evaluations (see Section III).

src_sgm -- Original source text data (newswire, manual transcript or
   closed-caption text) from which reference texts are derived, in an
   SGML markup format similar to the TIPSTER text corpora.

tkn -- Reference text data in tokenized ("token stream") form: story
   boundaries and other descriptive markup in the src_sgm format are
   removed, and each English word or Mandarin GB character is assigned
   a unique identifier (a sequential "recid" number) and presented on a
   separate line with an SGML tag: "<W recid=N> word".

asr -- For all English sources, output of the LIMSI automatic
   transcription engine; for all non-English sources, output of the BBN
   automatic transcription engine. For Mandarin broadcast sources, the
   BBN ASR engine outputs multi-character words.

mttkn -- For all Mandarin sources, output of SYSTRAN machine
   translation from "tkn" reference text data into English, created in
   2001; for all Arabic sources, output of the IBM Arabic-to-English
   translation engine, created in 2001. See the notes below about mttkn
   data.

mttkn2 -- For all non-English sources, output of the ISI translation
   engine, created in 2004. See the notes below about mttkn data.

mtasr -- For the Mandarin broadcast data, output of SYSTRAN translation
   from "asr" text data into English; for the Arabic broadcast data,
   output of the IBM Arabic-to-English translation engine from "asr"
   text data into English; all created in 2001.

tkn_sgm -- Reference text data derived from "tkn" files, in an SGML
   markup format similar to the TIPSTER text corpora.

asr_sgm -- ASR text data derived from "as0" and "as1" broadcast files,
   in an SGML markup format similar to the TIPSTER text corpora.

mttkn_sgm -- Machine translation output from IBM and SYSTRAN, in an
   SGML markup format similar to the TIPSTER text corpora.

mttkn2_sgm -- Machine translation output from ISI, in an SGML markup
   format similar to the TIPSTER text corpora.

Notes on mttkn data:

Like the "asr" and "tkn" data sets, the mttkn and mttkn2 files are
token streams that do not include story boundaries; all tokens in these
files are uniquely identified by token-ID numbers, and you need to
refer to the corresponding "_bnd" directories and files to locate
reference story boundary information indexed by token-ID. Some strings
of Mandarin characters have been left untranslated by SYSTRAN, and
these are included in the file in unmodified form (using the GB
character set); each token carries a tag indicating whether or not it
is a translated token ("tr=Y" or "tr=N"); a sketch of reading these
flags follows below.

Note that for non-English broadcast sources, only the "reference text"
(manual transcriptions) was submitted to ISI for retranslation into
English using more recent machine-translation technology. Only the
older translations from IBM and SYSTRAN are available for the ASR text
data.
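As a concrete illustration of the "tr" flags, the following minimal
Python sketch separates translated from untranslated tokens in an mttkn
file. The exact token line layout is our assumption ("<W recid=N tr=Y>
token", combining the recid and tr attributes described above); verify
it against a few files, or against docset.dtd, before relying on it:

    import re

    # Assumed token line layout: "<W recid=N tr=Y> token".
    TOKEN_RE = re.compile(r"^<W\s+recid=(\d+)\s+tr=([YN])>\s*(.*)$")

    def mttkn_tokens(path):
        """Yield (recid, translated?, token) from one mttkn file."""
        # Untranslated strings are raw GB-encoded Mandarin, so use a
        # GB-aware codec; the English output is plain ASCII, a subset.
        with open(path, encoding="gbk", errors="replace") as f:
            for line in f:
                m = TOKEN_RE.match(line.strip())
                if m:
                    yield int(m.group(1)), m.group(2) == "Y", m.group(3)

    def untranslated_count(path):
        """Count tokens that were left untranslated (tr=N)."""
        return sum(1 for _, ok, _ in mttkn_tokens(path) if not ok)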
IV.C. The "TIPSTER-style" data types
------------------------------------

The "src_sgm", "tkn_sgm" and "asr_sgm" data sets are the only ones in
which story boundary information is included as part of the text stream
of each file. Of these three, the "tkn_sgm" and "asr_sgm" data sets
provide the simplest, most compact formatting of the data, and are the
most consistent and useful forms in terms of data content.

The tkn_sgm and asr_sgm files use the following SGML tag structure for
each story unit:

    <DOC>
    <DOCNO> SRC20001... </DOCNO>
    <DOCTYPE> ... </DOCTYPE>    (NEWS or MISCELLANEOUS)
    <TXTTYPE> ... </TXTTYPE>    (NEWSWIRE, CAPTION, TRANSCRIPT or
                                 ASRTEXT)
    <TEXT>
    ...
    </TEXT>
    </DOC>

The region between the TEXT tags provides the full content of the
story, which has been drawn from the corresponding "tkn", "as0" or
"as1" data file. Note the following properties of text content in these
two data sets:

 - In English files, all word tokens are space-separated. In tkn_sgm
   files, word tokens may include adjacent punctuation, brackets and
   quotes; in asr_sgm files, punctuation, brackets and quotes are not
   present at all, since these are not produced by the ASR system.

 - In Mandarin tkn_sgm files, there is space separation only among
   tokens that consist of ASCII content (i.e. digits, punctuation,
   occasional names), since the original GB Mandarin text content was
   not segmented into words. In Mandarin asr_sgm files, there is space
   separation between word tokens (because the ASR system produced
   word-segmented output), but there is no punctuation.

 - In all files, story text is presented with a consistent pattern of
   line-wrapping, but without paragraph breaks (which exist only in the
   src_sgm data format).

 - In files from broadcast sources, the TEXT elements of some story
   units may be completely empty, because there was no speech in the
   corresponding segment of audio, or because no transcript or
   captioning was provided for that segment (in such units, the
   "DOCTYPE" is always "MISCELLANEOUS" or "UNTRANSCRIBED").

In the src_sgm data, there is a variable amount of additional SGML
markup within each <DOC>, but outside of the <TEXT> element, providing
extra information associated with each story:

 - "<DATE_TIME>" provides a time stamp of when the story was broadcast
   or transmitted.

 - "<PREAMBLE>" is used as a bracketing element around "<DATE_TIME>",
   to contain other elements and information about the story that are
   not part of the actual story text.

 - Elements such as "<SLUG>" and "<HEADLINE>" appear within the
   "<PREAMBLE>" portion of most newswire files, providing keywords,
   headline strings and other data about stories that are supplied as
   part of the wire transmission but are external to the story text.

 - "<END_TIME>" appears in broadcast files, providing a time stamp for
   the end of the story.

The <TEXT> portion of each story in src_sgm files may also contain
additional tagging, to convey "meta-information" about the story
content; in particular:

 - "<TURN>" tags are used in broadcast sources to mark speaker changes,
   when these are known from the original transcription.

 - "<COMMENT> ... </COMMENT>" tags are used to bracket other
   information about the story content, in both broadcast and newswire
   sources (e.g. comments about noise in the audio, or instructions to
   editors in longer newswire stories) -- in other words, material
   enclosed between these tags is NOT part of the actual story content;
   note that the opening tag, enclosed commentary and closing tag are
   usually on separate lines in the data files.

 - "<P>" is used in some sources to mark paragraph breaks.
IV.D. The "token stream" data types
-----------------------------------

These five data sets (tkn, as0, as1, mttkn, mtas0) have all been
packaged together in the compressed tar file "tdt3proj.tgz"; these
directories will be created within "tdt3_em" when the tar file is
unpacked (see the top-level "index.html" file on the DVD-ROM for
instructions on unpacking the tar file).

The files in these data sets all share the same basic SGML markup
strategy:

    <W recid=1> A
    <W recid=2> word
    ...

For each of these data sets, there is a separate directory containing a
set of "boundary table" files, one boundary table for each sample file,
which provides the mapping of story boundaries to the corresponding
token stream in terms of the "recid" values assigned to the tokens. A
boundary table contains one SGML "<BOUNDARY>" tag for each story unit,
and the attributes in this tag identify the DOCNO, the DOCTYPE, the
beginning and ending "recid" numbers in the token stream file that make
up the token content of the story (if any), and, for broadcast sources,
the beginning and ending time offsets for the story in the
corresponding audio file (found in the TDT4 Multilingual Broadcast News
Speech Corpus, LDC2005S11, which is distributed separately); for
example:

    <BOUNDARY docno=... doctype=... Brecid=... Erecid=...
        Bsec=... Esec=...>
    ...

Note that broadcast files may contain "MISCELLANEOUS" story units in
which nothing is spoken or transcribed; the boundary table entries for
such units will lack the "Brecid" and "Erecid" attributes. Also, the
"Bsec" and "Esec" attributes apply only to broadcast sources -- they
are present in all boundary entries for these sources, and are absent
from all boundary tables for newswire sources.
IV.E. Summary of data type distributions
----------------------------------------

So, for each data sample in the corpus (i.e. each contiguous recording
from a given source on a given date covering a specific period of
time), there are several files, stored in separate directories,
containing different versions of the data, or information about the
data, derived from that sample. For example, a VOA_MAN broadcast has
the reference text with TIPSTER-style markup, a tokenized version of
the reference text, the output of an ASR system, a "TIPSTER-ized"
markup version of the ASR output, machine-translated versions of both
the reference text and ASR token streams, and boundary tables for all
the various token stream files; their various path names are as
follows:

    asr_sgm/20001220_0700_0800_VOA_MAN.asr_sgm
    tkn_sgm/20001220_0700_0800_VOA_MAN.tkn_sgm
    tkn/20001220_0700_0800_VOA_MAN.tkn
    tkn_bnd/20001220_0700_0800_VOA_MAN.tkn_bnd
    as0/20001220_0700_0800_VOA_MAN.as0
    as0_bnd/20001220_0700_0800_VOA_MAN.as0_bnd
    mttkn/20001220_0700_0800_VOA_MAN.mttkn
    mttkn_bnd/20001220_0700_0800_VOA_MAN.mttkn_bnd
    mtas0/20001220_0700_0800_VOA_MAN.mtas0
    mtas0_bnd/20001220_0700_0800_VOA_MAN.mtas0_bnd

In each case, the file name extension string is identical to the name
of the directory containing the file. The file-id is common to all
versions of data derived from the one sample.

The number of files present for a given sample depends on the
particular source, as follows:

    Source     tkn   asr   mttkn  mtasr
    ------------------------------------
    APW_ENG     x
    NYT_NYT     x
    ABC_WNT     x     x
    CNN_HDL     x     x
    MNB_NBW     x     x
    NBC_NNW     x     x
    PRI_TWD     x     x
    VOA_ENG     x     x
    XIN_MAN     x           x
    ZBN_MAN     x           x
    AFP_ARB     x           x
    ALH_ARB     x           x
    ANN_ARB     x           x
    CBS_MAN     x     x     x      x
    CNR_MAN     x     x     x      x
    CTS_MAN     x     x     x      x
    CTV_MAN     x     x     x      x
    VOA_MAN     x     x     x      x
    NTV_ARB     x     x     x      x
    VOA_ARB     x     x     x      x

IV.F. Differences in content among data types
---------------------------------------------

Naturally, when there are two or more distinct token streams drawn from
the same data sample, the number of tokens in each story will vary
depending on how the token stream was produced. This can be seen by
comparing the boundary table entries for the same story across the
parallel files

    as0_bnd/20001220_0700_0800_VOA_MAN.as0_bnd
    mtas0_bnd/20001220_0700_0800_VOA_MAN.mtas0_bnd
    mttkn_bnd/20001220_0700_0800_VOA_MAN.mttkn_bnd
    tkn_bnd/20001220_0700_0800_VOA_MAN.tkn_bnd

where the "Brecid" and "Erecid" values for a given DOCNO differ from
one token stream to the next.

Apart from these obvious differences among the token streams, there are
also more subtle differences between the "src_sgm" data and the
corresponding "tkn" token stream and "tkn_sgm" ("tipsterized") sets,
particularly in the case of newswire sources. These differences are
created by the "tokenize" perl scripts, and are intended to assure that
the "tkn" and "tkn_sgm" data sets contain only the narrative content of
each story, in the most consistent form possible. The tokenization
process addressed the following issues:

 - The content of <COMMENT> tags in all src_sgm files is removed.

 - In newswire sources, each story typically begins with a "dateline"
   at the start of the first paragraph (usually a place name, a date,
   an abbreviation of the newswire service, and/or an author's name);
   the dateline is removed.

 - In English newswires, the text often includes special "typesetting"
   codes; these are removed.

 - Mandarin newswires occasionally use "dingbat" characters (circles,
   X's or other special marks, typically intended as paragraph
   "bullets"); these are removed.

 - Xinhua news always ends each story with a single GB character
   enclosed in parentheses, and this is always the same character; this
   is removed.

 - Xinhua uses only 16-bit GB character encoding in its transmission,
   even when the story content includes alphanumeric or other ASCII
   symbols (i.e. digits, proper names, acronyms, bracketing and some
   punctuation); the GB character set provides 16-bit codes for
   rendering these symbols, and all "XIN_MAN.src_sgm" files use these
   codes, whereas the other Mandarin sources (ZBN and VOA) use
   single-byte ASCII values. The tokenization recognizes the GB codes
   for ASCII symbols and converts them to single-byte ASCII values (see
   the sketch following this list).
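The GB-to-ASCII normalization in the last bullet can be reproduced
after decoding, because GB's 16-bit codes for ASCII symbols decode to
the Unicode "fullwidth forms" block (U+FF01 through U+FF5E), which sits
at a fixed offset of 0xFEE0 above the corresponding ASCII characters.
The following Python sketch is our illustration of that mapping, not
the original tokenize script:

    # Map fullwidth (GB-style) digits, letters and punctuation to ASCII.
    FULLWIDTH_OFFSET = 0xFEE0      # U+FF01..U+FF5E -> U+0021..U+007E

    def normalize_fullwidth(s):
        out = []
        for ch in s:
            code = ord(ch)
            if 0xFF01 <= code <= 0xFF5E:   # fullwidth '!' through '~'
                out.append(chr(code - FULLWIDTH_OFFSET))
            elif code == 0x3000:           # ideographic space
                out.append(" ")
            else:
                out.append(ch)             # leave Han characters alone
        return "".join(out)

    assert normalize_fullwidth("ＶＯＡ２０００") == "VOA2000"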
V. Origins of reference text data for broadcast sources
========================================================

The broadcast sources fall into two distinct groups in terms of
broadcast media:

    Television sources with closed captions:  ABC, CNN, NBC, MNB
    Radio and other video sources:  PRI, VOA, CBS, CNR, CTS, CTV, NTV

These two groups are distinct in terms of how the reference text data
were created, and this affects the relative quality of the reference
text with respect to word accuracy (i.e. faithfulness to what was
actually spoken in the audio signal).

For the first group of television sources (that is, the ones in
English), all reference text has been drawn from the closed-caption
signal that accompanied the video broadcast. As a result, the text may
be relatively "telegraphic" in nature, because closed-caption text
often simplifies or reduces the spoken content. We have also observed
that closed captions sometimes contain errors (misspellings or
misinterpretations of what was spoken).

For the radio sources, and for all non-English video sources, all
reference text has been manually transcribed from digital recordings by
professional transcription services. In general, the quality of these
transcripts is quite good in terms of lexical accuracy, and the English
data are virtually free of spelling errors.

VI. Supporting Materials
========================

In addition to the data directories cited above, this release contains
the following additional directories:

dtd -- contains SGML Document Type Definition files that specify the
   markup format of the boundary table files, token stream files, and
   the topic tables; the dtd files are necessary for using an SGML
   parsing utility (e.g. nsgmls) to process the various data files. The
   functions of the dtd files are:

   - boundset.dtd -- for all "boundary table" files
   - docset.dtd -- for all "token stream" files (as0, as1, tkn, mt*)
   - tiptext.dtd -- for all "tipsterized sgm" files (asr_sgm, tkn_sgm)
   - srctext.dtd -- for all "src_sgm" files

doc -- tables and listings that describe the corpus content:

   - tdt4_stats.tables -- summary of quantities by source and month
   - tdt4_docno.table -- list of all stories (DOCNO, file, DOCTYPE)
   - content_summary.txt -- this file
   - tdt4guidelines_v1_5.pdf -- details of annotation procedures

The topic annotations produced by LDC to support the 2002 and 2003 TDT
evaluations are included in this release (see Section III); additional
information about them is available via both the LDC and NIST web
sites:

    http://www.ldc.upenn.edu/Projects/TDT4/
    http://www.nist.gov/speech/tests/tdt/

Both web sites also provide additional information and resources for
the TDT project: the LDC site includes archives of email discussions
among TDT participants, and access to related resources, such as
English/Mandarin glossing lexicons and parallel text collections. The
NIST site includes complete documentation and software resources for
running TDT system evaluations, and papers presented at TDT workshops
by participants.