==========================================================
SUMMARY OF CONTENTS IN THE TDT3 MULTI-LANGUAGE TEXT CORPUS
==========================================================

Release date: April 25, 2001
Version: 2.0

I. Data Sources
================

The TDT3 corpus contains news data collected daily from 11 news
sources in two languages (American English and Mandarin Chinese),
over a period of three months (October - December, 1998). The
sources and their frequency of sampling are as follows:

English sources
---------------
  NYT_NYT (1)  New York Times Newswire Service (excluding non-NYT sources)
  APW_ENG (1)  Associated Press Worldstream Service (English content only)
  CNN_HDL (1)  Cable News Network, "Headline News"
  ABC_WNT (2)  American Broadcasting Company, "World News Tonight"
  NBC_NNW (2)  National Broadcasting Company, "NBC Nightly News"
  MNB_NBW (3)  MS-NBC, "News with Brian Williams"
  PRI_TWD (3)  Public Radio International, "The World"
  VOA_ENG (4)  Voice of America, English news programs

Mandarin sources
----------------
  XIN_MAN (5)  Xinhua News Agency
  ZBN_MAN (6)  Zaobao News Agency
  VOA_MAN (7)  Voice of America, Mandarin Chinese news programs

Daily sampling
--------------
  (1) about 80 stories, in four sample files, per day
  (2) about 15 stories, in one sample file, per day
  (3) about 20 stories, in one sample file, per day, 5 days/week
  (4) about 40 stories, in two sample files, per day
  (5) about 60 stories, in three sample files, per day
  (6) about 50 stories, in two sample files, per day
  (7) 40 to 80 stories, in one or two sample files, per day

The quantities indicated above for sampling frequencies are
approximate; all sources were prone to occasional failures in the
data collection process. A more detailed summary of data quantities
by source and month is provided in the file "tdt3_stats_tables.txt".
A complete listing of all stories and sample files is provided in
the file "tdt3_docno_table.txt".

II. Corpus Structure
=====================

The organization of data in the corpus is intended to provide direct
support for the research tasks defined in the yearly TDT evaluation
plans (available at http://www.nist.gov/speech/tests/tdt/index.htm),
while also providing a data format compatible with other research
projects involving information extraction.

II.A. Basic units of data
--------------------------

The basic units of the corpus are news stories and sample files.

Each news story is uniquely identified by a "DOCNO" (story-id) that
indicates the source and date of the story; e.g.:

    XIN19981001.0005

identifies a story from Xinhua collected on Oct. 1, 1998; the final
four digits distinguish this story from all other stories collected
on the same date from the same source. In the case of broadcast
sources (as opposed to newswire sources), the DOCNO also contains
four digits to indicate the start time of the broadcast; e.g.:

    VOM19981220.0700.0233

identifies a Voice of America Mandarin story from the Dec. 20
broadcast that began at 7:00am (EST); again, the final four digits
distinguish this story from others in the same broadcast.

Each sample file represents a contiguous collection of stories from
a given source on a given date over a specific period of time; the
file name of the sample (file-id) provides all this information;
e.g.:

    19981001_0023_1310_XIN_MAN
    19981220_0700_0800_VOA_MAN

These are the file names that happen to contain the example
story-ids mentioned above. The XIN file spans a period of collection
from 12:23am to 1:10pm on Oct. 1, and the VOA file covers a 1-hour
broadcast starting at 7:00am on Dec. 20.
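To make the naming conventions concrete, here is a minimal Perl
sketch (not part of the corpus tools; the field layout is simply
read off the examples above) that splits a file-id into its parts:

    #!/usr/bin/perl
    # Illustrative sketch: parse a TDT3 file-id such as
    # "19981220_0700_0800_VOA_MAN" into date, begin time, end time
    # and source code.
    use strict;

    my $file_id = shift || "19981220_0700_0800_VOA_MAN";
    if ($file_id =~ /^(\d{8})_(\d{4})_(\d{4})_(\w{3}_\w{3})$/) {
        my ($date, $btime, $etime, $source) = ($1, $2, $3, $4);
        print "date=$date begin=$btime end=$etime source=$source\n";
    } else {
        die "not a file-id: $file_id\n";
    }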
Each sample from broadcast sources was manually segmented into story
units, and each story unit was manually classified as either a "news
story" or as "miscellaneous text": a unit was classified as "news"
if it was judged by annotators to contain informative content about
any topic or event. Miscellaneous text units include commercial
breaks, music interludes, and "introductory" portions of broadcasts
where an anchor person is providing a list of "upcoming stories"
(typically by making a single statement about each event to be
reported on during the broadcast). Only the "news story" units
underwent topic relevance annotation, but the content and time
stamps of the "miscellaneous text" units have been retained in the
data files.

II.B. List of data types
-------------------------

Each data sample is presented in a variety of forms, with each form
placed in a separate directory under "tdt3_em". In this cdrom
release, two forms of data ("tkn_sgm" and "asr_sgm") are directly
accessible as uncompressed files. All other data forms have been
packed into the compressed unix tar file "tdt3proj.tgz"; unpacking
this tar file will create the additional directories under
"tdt3_em". The forms of data in this release (and their directory
names) are:

src_sgm -- original source text data (newswire, manual transcript or
    closed caption text) from which reference texts are derived, in
    an SGML markup format similar to the TIPSTER text corpora

tkn -- reference text data in tokenized ("token stream") form: story
    boundaries and other descriptive markup in the src_sgm format
    are removed, and each English word or Mandarin GB character is
    assigned a unique identifier (a sequential "recid" number) and
    presented on a separate line with an SGML tag: "<W recid=N> word"

as0 -- for the Mandarin broadcast data (VOA_MAN), output of the
    Dragon Systems speech recognizer, in token stream form, without
    story boundaries or punctuation; each Mandarin word is assigned
    a unique recid, with information on starting time and duration
    (in sec), speaker cluster, and asr confidence score (some
    Mandarin words from the recognizer comprise multiple GB
    characters)

as1 -- for English broadcast sources, output of the BBN Byblos
    speech recognizer, in the same format as as0 (token stream, one
    word per line), except that speaker cluster and asr confidence
    score information is not available ("NA")

mttkn -- for all Mandarin sources, output of SYSTRAN machine
    translation from "tkn" reference text data into English,
    tokenized, without story boundaries; some strings of Mandarin
    characters have been left untranslated by SYSTRAN, and these are
    included in the file in unmodified form (using the GB character
    set); each token is provided with a tag to indicate whether or
    not it is a translated token ("tr=Y" or "tr=N")

mtas0 -- for the Mandarin broadcast data (VOA_MAN), output of
    SYSTRAN translation from as0 text data into English; same format
    as mttkn

tkn_sgm -- reference text data derived from "tkn" files, in an SGML
    markup format similar to the TIPSTER text corpora

asr_sgm -- ASR text data derived from "as0" and "as1" (Mandarin and
    English) broadcast files, in an SGML markup format similar to
    the TIPSTER text corpora
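The per-token attributes described above are easiest to see side by
side. The following lines are illustrative sketches rather than
excerpts from the corpus: the "recid" and "tr" attributes are the
ones described above, but the spellings of the timing, speaker and
confidence attributes shown for "as0" are invented placeholders (the
dtd files described in section IV define the actual names):

    tkn:    <W recid=57> market
    as0:    <W recid=57 Btime=184.2 Dur=0.31 Spkr=4 Conf=0.92> [GB word]
    mttkn:  <W recid=57 tr=Y> market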
II.C. The "TIPSTER-style" data types
------------------------------------

The "src_sgm", "tkn_sgm" and "asr_sgm" data sets are the only ones
in which story boundary information is included as part of the text
stream of each file. Of these three, the "tkn_sgm" and "asr_sgm"
data sets provide the simplest, most compact formatting of the data,
and are the most consistent and useful forms in terms of data
content.

The tkn_sgm and asr_sgm files use the following SGML tag structure
for each story unit:

    <DOC>
    <DOCNO> SRC19981... </DOCNO>
    <DOCTYPE> (NEWS or MISCELLANEOUS) </DOCTYPE>
    <TXTTYPE> (NEWSWIRE, CAPTION, TRANSCRIPT or ASRTEXT) </TXTTYPE>
    <TEXT>
    This region, between the TEXT tags, provides the full content
    of the story, which has been drawn from the corresponding
    "tkn", "as0" or "as1" data file.
    </TEXT>
    </DOC>

Note the following properties of text content in these two data
sets:

 - In English files, all word tokens are space separated. In tkn_sgm
   files, word tokens may include adjacent punctuation, brackets and
   quotes; in asr_sgm files, punctuation, brackets and quotes are
   not present at all, since these are not produced by the ASR
   system.

 - In Mandarin tkn_sgm files, there is space separation only among
   tokens that consist of ASCII content (i.e. digits, punctuation,
   occasional names), since the original GB Mandarin text content
   was not segmented into words. In Mandarin asr_sgm files, there is
   space separation between word tokens (because the ASR system
   produced word-segmented output), but there is no punctuation.

 - In all files, story text is presented with a consistent pattern
   of line-wrapping, but without paragraph breaks (which exist only
   in the src_sgm data format).

 - In files from broadcast sources, the TEXT elements of some story
   units may be completely empty, because there was no speech in the
   corresponding segment of audio, or because no transcript or
   captioning was provided for that segment (in such units, the
   "DOCTYPE" is always "MISCELLANEOUS" or "UNTRANSCRIBED").

In the src_sgm data, there is a variable amount of additional SGML
markup within each <DOC>, but outside of the <TEXT> element,
providing extra information associated with each story (a skeletal
example combining these elements appears after the lists below):

 - "<DATE_TIME>" provides a time stamp of when the story was
   broadcast or transmitted.

 - "<BODY>" is used as a bracketing element around "<TEXT>", to
   contain other elements and information about the story that are
   not part of the actual story text.

 - Elements such as "<HEADLINE>" and "<KEYWORD>" appear within the
   "<BODY>" portion of most newswire files, providing keywords,
   headline strings and other data about stories that are provided
   as part of the wire transmission but are external to the story
   text.

 - "<END_TIME>" appears in broadcast files, providing a time stamp
   for the end of the story.

The <TEXT> portion of each story in src_sgm files may also contain
additional tagging, to convey "meta-information" about the story
content; in particular:

 - "<TURN>" tags are used in broadcast sources to mark speaker
   changes, when these are known from the original transcription.

 - "<ANNOTATION> ... </ANNOTATION>" tags are used to bracket other
   information about the story content, in both broadcast and
   newswire sources (e.g. comments about noise in the audio, or
   instructions to editors in longer newswire stories) -- in other
   words, material enclosed between these tags is NOT part of the
   actual story content; note that the opening tag, enclosed
   commentary and closing tag are usually on separate lines in the
   data files.

 - "<P>" is used in some sources to mark paragraph breaks.
II.D. The "token stream" data types
------------------------------------

These five data sets (tkn, as0, as1, mttkn, mtas0) have all been
packaged together in the compressed tar file "tdt3proj.tgz"; these
directories will be created within "tdt3_em" when the tar file is
unpacked (see the top-level "index.html" file on the cdrom for
instructions to unpack the tar file).

These data files all share the same basic SGML markup strategy: each
token is presented on a line of its own, marked with a tag that
carries its sequential "recid" value (plus, for ASR data, the timing
and other attributes described in section II.B), e.g.:

    <W recid=1> A
    <W recid=2> word
    ...

For each of these data sets, there is a separate directory
containing a set of "boundary table" files, one boundary table for
each sample file, which provides the mapping of story boundaries to
the corresponding token stream in terms of the "recid" values
assigned to the tokens. A boundary table contains one SGML tag for
each story unit (the element is defined in dtd/boundset.dtd). The
attributes in this tag identify the DOCNO, the DOCTYPE, and the
beginning and ending "recid" numbers in the token stream file that
make up the token content of the story (if any); for broadcast
sources they also give the beginning and ending time offsets for the
story in the corresponding audio file (to be found in the TDT3
Speech corpora, which are distributed separately). For example,
entries for a VOA_MAN broadcast have this general shape (the element
name and the docno, recid and time values here are placeholders for
illustration):

    <BOUNDARY docno=VOM19981220.0700.0233 doctype=NEWS
     Brecid=57 Erecid=214 Bsec=232.5 Esec=345.1>
    <BOUNDARY docno=VOM19981220.0700.1359 doctype=MISCELLANEOUS
     Bsec=345.1 Esec=360.0>

Note that broadcast files may contain "miscellaneous text" story
units in which nothing is spoken or transcribed; the boundary table
entries for such units will lack the "Brecid" and "Erecid"
attributes (as in the second entry above). Also, the "Bsec" and
"Esec" attributes apply only to broadcast sources -- they are
present in all boundary entries for these sources, and are lacking
in all boundary tables for newswire sources.
II.E. Summary of data type distributions
-----------------------------------------

So, for each data sample in the corpus (i.e. each contiguous
recording from a given source on a given date covering a specific
period of time), there are several files, stored in separate
directories, containing different versions of data or information
about the data derived from that sample. For example, a VOA_MAN
broadcast has the reference text with TIPSTER-style markup, a
tokenized version of the reference text, the output of an ASR
system, a "TIPSTER-ized" markup version of the ASR output,
machine-translated versions of both the reference text and ASR token
streams, and boundary tables for all the various token stream files;
their various path names are as follows:

    asr_sgm/19981220_0700_0800_VOA_MAN.asr_sgm
    tkn_sgm/19981220_0700_0800_VOA_MAN.tkn_sgm
    tkn/19981220_0700_0800_VOA_MAN.tkn
    tkn_bnd/19981220_0700_0800_VOA_MAN.tkn_bnd
    as0/19981220_0700_0800_VOA_MAN.as0
    as0_bnd/19981220_0700_0800_VOA_MAN.as0_bnd
    mttkn/19981220_0700_0800_VOA_MAN.mttkn
    mttkn_bnd/19981220_0700_0800_VOA_MAN.mttkn_bnd
    mtas0/19981220_0700_0800_VOA_MAN.mtas0
    mtas0_bnd/19981220_0700_0800_VOA_MAN.mtas0_bnd

In each case, the file name extension string is identical to the
name of the directory containing the file. The file-id is common to
all versions of data derived from the one sample.

The number of files present for a given sample depends on the
particular source, as follows:

    Source    tkn_sgm  tkn  asr_sgm  as0  as1  mttkn  mtas0
    --------------------------------------------------------
    ABC_WNT      x      x      x          x
    CNN_HDL      x      x      x          x
    PRI_TWD      x      x      x          x
    VOA_ENG      x      x      x          x
    APW_ENG      x      x
    NYT_NYT      x      x
    VOA_MAN      x      x      x     x          x      x
    XIN_MAN      x      x                       x
    ZBN_MAN      x      x                       x

II.F. Differences in content among data types
---------------------------------------------

Naturally, when there are two or more distinct token streams drawn
from the same data sample, the number of tokens in each story will
vary depending on how the token stream was produced. For example,
the VOA_MAN story cited above has a boundary table entry in each of
the following four files, and the "Brecid" and "Erecid" spans differ
from one version to the next:

    as0_bnd/19981220_0700_0800_VOA_MAN.as0_bnd
    mtas0_bnd/19981220_0700_0800_VOA_MAN.mtas0_bnd
    mttkn_bnd/19981220_0700_0800_VOA_MAN.mttkn_bnd
    tkn_bnd/19981220_0700_0800_VOA_MAN.tkn_bnd

Apart from these obvious differences among the token streams, there
are also more subtle differences between "src_sgm" data and the
corresponding "tkn" token stream and "tkn_sgm" ("tipsterized") sets,
particularly in the case of newswire sources. These differences are
created by the "tokenize" perl scripts, and are intended to ensure
that the "tkn" and "tkn_sgm" data sets contain only the narrative
content of each story, in the most consistent form possible. The
tokenization process addressed the following issues:

 - The content of "<ANNOTATION>" tags in all src_sgm files is
   removed.

 - In newswire sources, each story typically begins with a
   "dateline" at the start of the first paragraph (usually a place
   name, a date, an abbreviation of the newswire service, and/or an
   author's name); the dateline is removed.

 - In English newswires, the text often includes special
   "typesetting" codes; these are removed.

 - Mandarin newswires occasionally use "dingbat" characters
   (circles, X's or other special marks, typically intended as
   paragraph "bullets"); these are removed.

 - Xinhua news always ends each story with a single GB character
   enclosed in parentheses, and this is always the same character;
   this is removed.

 - Xinhua uses only 16-bit GB character encoding in its
   transmission, even when the story content includes alphanumeric
   or other ASCII symbols (i.e. for digits, proper names, acronyms,
   bracketing and some punctuation); the GB character set provides
   16-bit codes for rendering these symbols, and all
   "XIN_MAN.src_sgm" files use these codes, whereas the other
   Mandarin sources (ZBN and VOA) use single-byte ASCII values; the
   tokenization recognizes the GB codes for ASCII symbols, and
   converts them to single-byte ASCII values, as sketched below.
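The GB-to-ASCII mapping in the last point can be expressed
compactly. The following Perl sketch is illustrative, not an excerpt
from the release's tokenize scripts; it relies on the fact that
GB-2312 encodes "full-width" ASCII symbols as the byte 0xA3 followed
by a byte equal to the ASCII code plus 0x80:

    #!/usr/bin/perl
    # Illustrative sketch: map GB-2312 full-width ASCII symbols
    # (0xA3 0xA1-0xFE) to single-byte ASCII (0x21-0x7E), leaving
    # all other bytes untouched.
    use strict;

    while (my $line = <STDIN>) {
        $line =~ s/\xA3([\xA1-\xFE])/chr(ord($1) - 0x80)/ge;
        print $line;
    }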
III. Origins of reference text data for broadcast sources
=========================================================

The broadcast sources fall into two distinct groups in terms of
broadcast media:

    Television sources:  ABC, CNN, NBC, MNB
    Radio sources:       PRI, VOA_ENG, VOA_MAN

These two groups are distinct in terms of how the reference text
data were created, and this affects the relative quality of the
reference text with respect to word accuracy (i.e. faithfulness to
what was actually spoken in the audio signal).

For the television sources, all reference text has been drawn from
the closed-caption signal that accompanied the video broadcast. As a
result, the text may be relatively "telegraphic" in nature, because
closed-caption text often simplifies or reduces the spoken content.
We have also observed that closed captions sometimes contain errors
(misspellings or misinterpretations of what is spoken).

For the radio sources, all reference text has been manually
transcribed from digital recordings by professional transcription
services. In general, the quality of these transcripts is quite good
in terms of lexical accuracy, and the English data are nearly or
entirely free of spelling errors.

IV. Supporting Materials
========================

In addition to the data directories cited above, this release
contains the following additional directories:

tdt3_em/dtd -- contains SGML Document Type Definition files to
    specify the markup format of the boundary table files, token
    stream files, and the topic tables; the dtd files are necessary
    for using an SGML parsing utility (e.g. nsgmls) to process the
    various data files. The functions of the dtd files are:

    - boundset.dtd -- for all "boundary table" files
    - docset.dtd   -- for all "token stream" files (as0,as1,tkn,mt*)
    - srctext.dtd  -- for all "src_sgm" files
    - tiptext.dtd  -- for all "tipsterized sgm" files (asr_sgm,tkn_sgm)
    - topicset.dtd -- for all "topic table" files (available from
                      the LDC and/or NIST web sites)
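For example, a token stream file could be validated against its dtd
as follows. This is an illustrative invocation, assuming that nsgmls
(from the SP package) is installed and that docset.dtd begins with
the document type declaration, so that the dtd and the data file can
be parsed as one concatenated input stream:

    cd tdt3_em
    nsgmls -s dtd/docset.dtd tkn/19981220_0700_0800_VOA_MAN.tkn

The "-s" option suppresses normal output, so the command prints only
markup errors (if any).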
"alt_tkn_bnd" - use the "tipsterize" script with "-t table_dir" on the command line, in addition to the other arguments described above; e.g.: mkdir alt_tkn_sgm ../corpus_info/tipsterize_tdt.perl -i tkn -o alt_tkn_sgm -t alt_tkn_bnd Topic annotations that were produced by LDC to support the 1999 and 2000 TDT evaluations are being made available separately, via both LDC and NIST web sites: http://www.ldc.upenn.edu/Projects/TDT3/ http://www.nist.gov/speech/tests/tdt/ Both web sites also provide additional information and resources for the TDT project: the LDC site includes the archives of email discussions among TDT participants via , and access to related resources, such as English/Mandarin glossing lexicons and parallel text collections. The NIST site includes complete documentation and software resources for running TDT system evaluations, and papers presented at TDT workshops by participants.