==========================================================
SUMMARY OF CONTENTS IN THE TDT2 MULTI-LANGUAGE TEXT CORPUS
==========================================================

Release date: April 25, 2001
Version: 4.0

I. Data Sources
================

The TDT2 corpus contains news data collected daily from 9 news sources
in two languages (American English and Mandarin Chinese), over a period
of six months (January - June, 1998).  The sources and their frequency
of sampling are as follows:

English sources
---------------
NYT_NYT (1)  New York Times Newswire Service (excluding non-NYT sources)
APW_ENG (1)  Associated Press Worldstream Service (English content only)
CNN_HDL (1)  Cable News Network, "Headline News"
ABC_WNT (2)  American Broadcasting Company, "World News Tonight"
PRI_TWD (3)  Public Radio International, "The World"
VOA_ENG (4)  Voice of America, English news programs

Mandarin sources
----------------
XIN_MAN (5)  Xinhua News Agency
ZBN_MAN (6)  Zaobao News Agency
VOA_MAN (7)  Voice of America, Mandarin Chinese news programs

Daily sampling
--------------
(1) about 80 stories, in four sample files, per day
(2) about 15 stories, in one sample file, per day
(3) about 20 stories, in one sample file, per day, 5 days/week
(4) about 40 stories, in two sample files, per day
(5) about 60 stories, in three sample files, per day
(6) about 50 stories, in two sample files, per day (starting Feb.26)
(7) irregular:
    - no samples Jan.1 - Feb.19
    - 40 to 80 stories, in one or two sample files, per day, but with
      some gaps in the collection, Feb.20 - Apr.4
    - 10 to 40 stories, in up to three sample files, per day, again
      with some gaps, Apr.5 - Jun.30

The quantities indicated above for sampling frequencies are
approximate; all sources were prone to occasional failures in the data
collection process.  A more detailed summary of data quantities by
source and month is provided in the file "tdt2_stats_tables.txt", and a
complete listing of all stories and sample files is provided in the
file "tdt2_docno_table.txt" (both in the "doc" directory).

II. Corpus Structure
=====================

The organization of data in the corpus is intended to provide direct
support for the research tasks defined in the yearly TDT evaluation
plans (available at http://www.nist.gov/speech/tests/tdt/index.htm),
while also providing a data format compatible with other research
projects involving information extraction.

II.A. Basic units of data
--------------------------

The basic units of the corpus are news stories and sample files.

Each news story is uniquely identified by a "DOCNO" (story-id) that
indicates the source and date of the story; e.g.:

    XIN19980101.0001

identifies a story from Xinhua collected on Jan. 1, 1998; the final
four digits distinguish this story from all other stories collected on
the same date from the same source.  In the case of broadcast sources
(as opposed to newswire sources), the DOCNO also contains four digits
to indicate the start time of the broadcast; e.g.:

    VOM19980220.0700.0221

identifies a Voice of America Mandarin story from the Feb. 20 broadcast
that began at 7:00am (EST); again, the final four digits distinguish
this story from others in the same broadcast.

Each sample file represents a contiguous collection of stories from a
given source on a given date over a specific period of time; the file
name of the sample (file-id) provides all this information; e.g.:

    19980101_0016_1116_XIN_MAN
    19980220_0700_0800_VOA_MAN

These are the file names that happen to contain the example story-ids
mentioned above: the XIN file spans a period of collection from 12:16am
to 11:16am on Jan.1, and the VOA file covers a 1-hour broadcast
starting at 7:00am on Feb.20.
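Both identifiers are fixed-format strings, so they can be taken apart
mechanically.  The following perl sketch simply restates the naming
conventions described above; the subroutine and field names are our
own, chosen for illustration:

    #!/usr/bin/perl -w
    use strict;

    # Split a DOCNO (story-id) into source, date, optional broadcast
    # start time (broadcast sources only), and story serial number.
    sub parse_docno {
        my ($docno) = @_;
        $docno =~ /^([A-Z]+)(\d{8})(?:\.(\d{4}))?\.(\d{4})$/
            or die "unrecognized DOCNO: $docno\n";
        return (source => $1, date => $2, btime => $3, serial => $4);
    }

    # Split a file-id into date, begin/end times, and source name.
    sub parse_file_id {
        my ($file_id) = @_;
        $file_id =~ /^(\d{8})_(\d{4})_(\d{4})_([A-Z]+_[A-Z]+)$/
            or die "unrecognized file-id: $file_id\n";
        return (date => $1, begin => $2, end => $3, source => $4);
    }

    my %doc  = parse_docno("VOM19980220.0700.0221");
    my %file = parse_file_id("19980220_0700_0800_VOA_MAN");
    print "$doc{source} story $doc{serial} from $file{source}\n";

Run on the example identifiers above, the last line prints "VOM story
0221 from VOA_MAN".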
Each sample from broadcast sources was manually segmented into story
units, and each story unit was manually classified as either a "news
story" or as "miscellaneous text": a unit was classified as "news" if
it was judged by annotators to contain informative content about any
topic or event.  Miscellaneous text units include commercial breaks,
music interludes, and "introductory" portions of broadcasts where an
anchor person is providing a list of "upcoming stories" (typically by
making a single statement about each event to be reported on during
the broadcast).  Only the "news story" units underwent topic relevance
annotation, but the content and time stamps of the "miscellaneous
text" units have been retained in the data files.

II.B. List of data types
-------------------------

Each data sample is presented in a variety of forms, with each form
placed in a separate directory under "tdt2_em".  In this cdrom
release, two forms of data ("tkn_sgm" and "asr_sgm") are directly
accessible as uncompressed files.  All other data forms have been
packed into the compressed unix tar file "tdt2proj.tgz"; unpacking
this tar file will create the additional directories under "tdt2_em".

The forms of data in this release (and their directory names) are:

src_sgm -- original source text data (newswire, manual transcript or
    closed caption text) from which reference texts are derived, in an
    SGML markup format similar to the TIPSTER text corpora

tkn -- reference text data in tokenized ("token stream") form: story
    boundaries and other descriptive markup in the src_sgm format are
    removed, and each English word or Mandarin GB character is
    assigned a unique identifier (a sequential "recid" number) and
    presented on a separate line with an SGML tag: "<W recid=N> word"

as0 -- for the Mandarin broadcast data (VOA_MAN), output of the Dragon
    Systems speech recognizer, in token stream form, without story
    boundaries or punctuation; each Mandarin word is assigned a unique
    recid, with information on starting time and duration (in sec),
    speaker cluster, and asr confidence score (some Mandarin words
    from the recognizer comprise multiple GB characters)

as1 -- for English broadcast sources, output of the BBN Byblos speech
    recognizer, in the same format as as0 (token stream, one word per
    line), except that speaker cluster and asr confidence score
    information is not available ("NA")

mttkn -- for all Mandarin sources, output of SYSTRAN machine
    translation from "tkn" reference text data into English,
    tokenized, without story boundaries; some strings of Mandarin
    characters have been left untranslated by SYSTRAN, and these are
    included in the file in unmodified form (using the GB character
    set); each token is provided with a tag attribute to indicate
    whether or not it is a translated token ("tr=Y" or "tr=N")

mtas0 -- for the Mandarin broadcast data (VOA_MAN), output of SYSTRAN
    translation from "as0" text data into English; same format as
    mttkn

tkn_sgm -- reference text data derived from "tkn" files, in an SGML
    markup format similar to the TIPSTER text corpora

asr_sgm -- ASR text data derived from "as0" and "as1" (Mandarin and
    English) broadcast files, in an SGML markup format similar to the
    TIPSTER text corpora
II.C. The "TIPSTER-style" data types
------------------------------------

The "src_sgm", "tkn_sgm" and "asr_sgm" data sets are the only ones in
which story boundary information is included as part of the text
stream of each file.  Of these three, the "tkn_sgm" and "asr_sgm" data
sets provide the simplest, most compact formatting of the data, and
are the most consistent and useful forms in terms of data content.
The tkn_sgm and asr_sgm files use the following SGML tag structure for
each story unit:

    <DOC>
    <DOCNO> SRC19980... </DOCNO>
    <DOCTYPE> NEWS </DOCTYPE>       (NEWS or MISCELLANEOUS)
    <TXTTYPE> CAPTION </TXTTYPE>    (NEWSWIRE, CAPTION, TRANSCRIPT or
                                     ASRTEXT)
    <TEXT>
    This region, between the TEXT tags, provides the full content of
    the story, which has been drawn from the corresponding "tkn",
    "as0" or "as1" data file.
    </TEXT>
    </DOC>

Note the following properties of text content in these two data sets:

- In English files, all word tokens are space separated.  In tkn_sgm
  files, word tokens may include adjacent punctuation, brackets and
  quotes; in asr_sgm files, punctuation, brackets and quotes are not
  present at all, since these are not produced by the ASR systems.

- In Mandarin tkn_sgm files, there is space separation only among
  tokens that consist of ASCII content (i.e. digits, punctuation,
  occasional names), since the original GB Mandarin text content was
  not segmented into words.  In Mandarin asr_sgm files, there is space
  separation between word tokens (because the ASR system produced
  word-segmented output), but there is no punctuation.

- In all files, story text is presented with a consistent pattern of
  line-wrapping, but without paragraph breaks (which exist only in the
  src_sgm data format).

- In files from broadcast sources, the TEXT elements of some story
  units may be completely empty, because there was no speech in the
  corresponding segment of audio, or because no transcript or
  captioning was provided for that segment (in such units, the
  "DOCTYPE" is always "MISCELLANEOUS" or "UNTRANSCRIBED").

In the src_sgm data, there is a variable amount of additional SGML
markup within each <DOC>, but outside of the <TEXT> element, providing
extra information associated with each story:

- "<DATE_TIME>" provides a time stamp of when the story was broadcast
  or transmitted.

- "<HEADER>" is used as a bracketing element around "<DATE_TIME>", to
  contain other elements and information about the story that are not
  part of the actual story text.
", "", "" and "" appear within the "" portion of most newswire files, providing keywords, headline strings and other data about stories that are provided as part of the wire transmission but are external to the story text. - "" appears in broadcast files, providing a time stamp for the end of the story. The portion of each story in src_sgm files may also contain additional tagging, to convey "meta-information" about the story content; in particular: - "" tags are used in broadcast sources to mark speaker changes, when these are known from the original transcription. - " ... " tags are used to bracket other information about the story content, in both broadcast and newswire sources (e.g. comments about noise in the audio, or instructions to editors in longer newswire stories) -- in other words, material enclosed between these tags is NOT part of the actual story content; note that the opening tag, enclosed commentary and closing tag are usually on separate lines in the data files. - "
- "<P>" is used in some sources to mark paragraph breaks.
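Since the <ANNOTATION> material is not part of the story content, any
processing of src_sgm files must discard it before using the text.
Here is a minimal perl sketch of that step, assuming only the tag
inventory described above; the "tokenize_*_src.perl" scripts included
in the "doc" directory (see section IV) are the authoritative
implementation:

    #!/usr/bin/perl -w
    use strict;

    # Pull the story text out of each src_sgm <DOC>, dropping
    # <ANNOTATION> material (not story content) and all other tags.
    local $/;                       # slurp the whole file
    my $doc = <>;
    while ($doc =~ m{<TEXT>(.*?)</TEXT>}sg) {
        my $text = $1;
        $text =~ s{<ANNOTATION>.*?</ANNOTATION>}{}sg;  # drop commentary
        $text =~ s{<[^>]+>}{ }g;                       # drop TURN, P, etc.
        print "$text\n----\n";
    }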
" is used in some sources to mark paragraph breaks. II.D. The "token stream" data types ------------------------------------ These five data sets (tkn,as0,as1,mttkn,mtas0) have all been packaged together in the compressed tar file "tdt2proj.tgz"; these directories will be created within "tdt2_em" when the tar file is unpacked (see the top-level "index.html" file on the cdrom for instructions to unpack the tar file). The content in these data files all share the same basic SGML markup strategy: A word ... For each of these data sets, there is a separate directory containing a set of "boundary table" files, one boundary table for each sample file, which provides the mapping of story boundaries to the corresponding token stream in terms of the "recid" values assigned to the tokens. A boundary table contains one SGML "" tag for each story unit, and the attributes in this tag identify the DOCNO, the DOCTYPE, the beginning and ending "recid" numbers in the token stream file that make up the token content of the story (if any), and for broadcast sources, the beginning and ending time offsets for the story in the corresponding audio file (to be found in the TDT2 Speech corpora, which are distributed separately); for example: ... ... Note that broadcast files may contain "MISCELLANEOUS TEXT" story units in which nothing is spoken or transcribed; the boundary table entries for such units will lack the "Brecid" and "Erecid" attributes. Also, the "Bsec" and "Esec" attributes apply only to broadcast sources -- they are present in all boundary entries for these sources, and are lacking in all boundary tables for newswire sources. II.E. Summary of data type distributions ----------------------------------------- So, for each data sample in the corpus (i.e. each contiguous recording from a given source on a given date covering a specific period of time), there are several files, stored in separate directories, containing different versions of data or information about the data derived from that sample. For example, a VOA_MAN broadcast has the reference text with TIPSTER-style markup, a tokenized version of the reference text, the output of an ASR system, a "TIPSTER-ized" markup version of the ASR output, machine-translated versions of both the reference text and ASR token streams, and boundary tables for all the various token stream files; their various path names are as follows: asr_sgm/19980220_0700_0800_VOA_MAN.asr_sgm tkn_sgm/19980220_0700_0800_VOA_MAN.tkn_sgm tkn/19980220_0700_0800_VOA_MAN.tkn tkn_bnd/19980220_0700_0800_VOA_MAN.tkn_bnd as0/19980220_0700_0800_VOA_MAN.as0 as0_bnd/19980220_0700_0800_VOA_MAN.as0_bnd mttkn/19980220_0700_0800_VOA_MAN.mttkn mttkn_bnd/19980220_0700_0800_VOA_MAN.mttkn_bnd mtas0/19980220_0700_0800_VOA_MAN.mtas0 mtas0_bnd/19980220_0700_0800_VOA_MAN.mtas0_bnd In each case, the file name extension string is identical to the name of the directory containing the file. The file-id is common to all versions of data derived from the one sample. 
The number of files present for a given sample depends on the
particular source, as follows:

    Source     tkn_sgm   tkn   asr_sgm   as0   as1   mttkn   mtas0
    ---------------------------------------------------------------
    ABC_WNT       x       x       x       x     x
    CNN_HDL       x       x       x       x     x
    PRI_TWD       x       x       x       x     x
    VOA_ENG       x       x       x       x     x
    APW_ENG       x       x
    NYT_NYT       x       x
    VOA_MAN       x       x       x       x             x       x
    XIN_MAN       x       x                             x
    ZBN_MAN       x       x                             x

II.F. Differences in content among data types
---------------------------------------------

Naturally, when there are two or more distinct token streams drawn
from the same data sample, the number of tokens in each story will
vary depending on how the token stream was produced.  For example,
here are the various boundary table entries for one VOA_MAN story (the
recid and time values, elided here, differ from one table to the
next):

    as0_bnd/19980220_0700_0800_VOA_MAN.as0_bnd:
      <BOUND docno=VOM19980220.0700.0221 doctype=NEWS
             Brecid=... Erecid=... Bsec=... Esec=...>
    mtas0_bnd/19980220_0700_0800_VOA_MAN.mtas0_bnd:
      <BOUND docno=VOM19980220.0700.0221 doctype=NEWS
             Brecid=... Erecid=... Bsec=... Esec=...>
    mttkn_bnd/19980220_0700_0800_VOA_MAN.mttkn_bnd:
      <BOUND docno=VOM19980220.0700.0221 doctype=NEWS
             Brecid=... Erecid=... Bsec=... Esec=...>
    tkn_bnd/19980220_0700_0800_VOA_MAN.tkn_bnd:
      <BOUND docno=VOM19980220.0700.0221 doctype=NEWS
             Brecid=... Erecid=... Bsec=... Esec=...>

Apart from these obvious differences among the token streams, there
are also more subtle differences between "src_sgm" data and the
corresponding "tkn" token stream and "tkn_sgm" ("tipsterized") sets,
particularly in the case of newswire sources.  These differences are
created by the "tokenize" perl scripts, and are intended to assure
that the "tkn" and "tkn_sgm" data sets contain only the narrative
content of each story, in the most consistent form possible.  The
tokenization process addressed the following issues:

- The content of <ANNOTATION> tags in all src_sgm files is removed.

- In newswire sources, each story typically begins with a "dateline"
  at the start of the first paragraph (usually a place name, a date,
  an abbreviation of the newswire service, and/or an author's name);
  the dateline is removed.

- In English newswires, the text often includes special "typesetting"
  codes; these are removed.

- Mandarin newswires occasionally use "dingbat" characters (circles,
  X's or other special marks, typically intended as paragraph
  "bullets"); these are removed.

- Xinhua always ends each story with a single GB character enclosed in
  parentheses, and it is always the same character; this is removed.

- Xinhua uses only 16-bit GB character encoding in its transmission,
  even when the story content includes alphanumeric or other ASCII
  symbols (i.e. for digits, proper names, acronyms, bracketing and
  some punctuation); the GB character set provides 16-bit codes for
  rendering these symbols, and all "XIN_MAN.src_sgm" files use these
  codes, whereas the other Mandarin sources (ZBN and VOA) use
  single-byte ASCII values; the tokenization recognizes the GB codes
  for ASCII symbols, and converts them to single-byte ASCII values.
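The last of these conversions is easy to illustrate: in the GB2312
encoding, the "full-width" forms of the printable ASCII characters
occupy the row whose first byte is 0xA3, with the second byte at a
fixed offset of 0x80 above the ASCII code.  The following perl sketch
shows the mapping on raw GB bytes; the distributed tokenize scripts
are the authoritative implementation:

    #!/usr/bin/perl -w
    use strict;

    # Convert GB2312 full-width ASCII codes (0xA3 0xA1 .. 0xA3 0xFE)
    # to single-byte ASCII (0x21 .. 0x7E).  The string is scanned one
    # character at a time, so that an 0xA3 occurring as the SECOND
    # byte of some other GB character is never misinterpreted.
    sub gb_ascii_to_single_byte {
        my ($s) = @_;
        my $out = '';
        while (length $s) {
            if ($s =~ s/^([\x00-\x7F])//) {          # plain ASCII byte
                $out .= $1;
            } elsif ($s =~ s/^\xA3([\xA1-\xFE])//) { # full-width ASCII
                $out .= chr(ord($1) - 0x80);
            } elsif ($s =~ s/^(..)//s) {             # any other GB pair
                $out .= $1;
            } else {
                $out .= $s; last;                    # stray final byte
            }
        }
        return $out;
    }

    while (<>) { print gb_ascii_to_single_byte($_); }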
III. Origins of reference and ASR text data for broadcast sources
=================================================================

The following sections provide more detailed information about the
creation and properties of the reference and ASR text data; these
issues varied depending on the data source.

III.A. Closed-caption text -- ABC and CNN broadcasts
----------------------------------------------------

All sample files from these two television sources were accompanied by
a closed-caption signal, which was converted to ASCII text for capture
via a standard serial port on a workstation.  The text content may be
relatively "telegraphic" in nature, because closed-caption text often
simplifies or reduces the spoken content.  We have also observed that
closed captions sometimes contain errors (misspellings or
misinterpretations of what is spoken).

III.B. FDCH transcripts -- ABC broadcasts only
----------------------------------------------

In order to support calibration of differences between closed-caption
text and more careful, accurate human transcription, the LDC collected
commercially produced transcripts, created by Federal Documents
Clearing House (FDCH), for 155 of the 162 ABC sample files.  FDCH was
operating under a contract with ABC to provide verbatim transcripts of
"World News Tonight" broadcasts, for general distribution and archival
records.  The accuracy and quality of the transcripts is quite high,
omitting only disfluencies in the speech and non-news content
(commercials, etc).

These transcripts are included as alternate versions of the reference
text data for the ABC broadcasts, and they are distinguished from the
corresponding closed-caption data files by the additional string
".fdch" as part of the file name.  Because the closed-caption text is
taken to be the "common" (default) form of reference text, none of the
FDCH data is included among the "tkn_sgm" files that are directly
accessible on this cdrom; instead, all of the FDCH files are stored in
the "tdt2proj" tar file.  When fully unpacked according to the
directions given in the top-level "index.html" documentation, the
array of reference text data for ABC samples will appear as shown in
the following example:

    src_sgm/19980106_1830_1900_ABC_WNT.src_sgm
    src_sgm/19980106_1830_1900_ABC_WNT.fdch.src_sgm
    tkn/19980106_1830_1900_ABC_WNT.tkn
    tkn/19980106_1830_1900_ABC_WNT.fdch.tkn
    tkn_bnd/19980106_1830_1900_ABC_WNT.tkn_bnd
    tkn_bnd/19980106_1830_1900_ABC_WNT.fdch.tkn_bnd
    tkn_sgm/19980106_1830_1900_ABC_WNT.tkn_sgm
    tkn_sgm/19980106_1830_1900_ABC_WNT.fdch.tkn_sgm

The seven ABC samples that lack FDCH transcripts (i.e. for which we
have only closed-caption text) are:

    19980104_1830_1900_ABC_WNT
    19980111_1830_1900_ABC_WNT
    19980125_1830_1900_ABC_WNT
    19980322_1830_1900_ABC_WNT
    19980414_1830_1900_ABC_WNT
    19980509_1830_1900_ABC_WNT
    19980523_1830_1900_ABC_WNT

III.C. Transcripts from other commercial services -- PRI, VOA
-------------------------------------------------------------

The radio broadcasts from PRI and VOA required manual transcription by
commercial services that were specifically contracted by the LDC for
this purpose.  Three different services were employed: one to handle
the complete set of VOA_MAN broadcasts, and two to share the load of
the PRI and VOA_ENG broadcasts.  Because of the large quantity of
audio material involved, along with budget and schedule limitations
for the initial production of the TDT2 text corpus, it was agreed that
these services should perform only limited quality control on the text
they produced.  The expectation was that the overall quality of the
resulting transcripts would be roughly equivalent to that of
closed-caption text.

To date, no careful assessment has been made of the accuracy of the
VOA_MAN transcripts, though we believe their overall quality is quite
good -- close or comparable to that of the FDCH texts.  The output of
the two English transcription services was checked against careful
transcriptions, created later by the LDC, over a 4-hour sample of news
stories from VOA_ENG and PRI; a similar check was also done for the
FDCH and closed-caption texts, against careful transcriptions over a
6-hour sample of ABC and CNN stories.
Overall, the FDCH texts showed the best quality: 5.9% word-error rate
(WER, counting insertions, deletions and substitutions, relative to
our most careful transcription standards); most of these "errors" in
the FDCH texts were presumably related to disfluencies in the speech
(e.g. when speakers stuttered or repeated portions of phrases).  The
two English transcription services were fairly close to this level of
quality, each with about 7.5% WER, while the closed-caption texts
showed about 14.6% WER, on average.

III.D. The Dragon ASR System (as0)
----------------------------------

Dragon Systems used a streamlined version of its research-grade speech
recognizer on most of the English broadcast files and all the VOA
Mandarin broadcast files.  The output of this system included not only
the hypothesized text in word-tokenized form, but also, for each word:

- the starting time offset and word duration
- a confidence score for the word, between 0 and 1
- a label for the particular "speaker cluster" that was selected as
  the best-performing speaker model in the recognition at that point
  in the file

We do not have "as0" data for the following sample files; in most
cases, this was due to problems in tracking the sample files at the
LDC and conveying them to Dragon while collection, manual
transcription and other annotations were in progress:

    19980222_1830_1900_ABC_WNT
    19980424_1600_1630_CNN_HDL
    19980528_1600_1630_CNN_HDL
    19980528_2000_2100_PRI_TWD
    19980611_0130_0200_CNN_HDL
    19980615_2000_2100_PRI_TWD
    19980617_2000_2100_PRI_TWD
    19980618_1600_1630_CNN_HDL
    19980619_2000_2100_PRI_TWD
    19980622_2000_2100_PRI_TWD
    19980628_1600_1630_CNN_HDL
    19980629_2000_2100_PRI_TWD

III.E. The BBN "Byblos" ASR System (as1)
----------------------------------------

NIST used a streamlined version of the BBN Byblos English speech
recognizer on most of the English broadcast files.  The output of this
system consisted only of the hypothesized text in word-tokenized form,
plus the starting time offset and duration for each word.  We do not
have "as1" data for the following sample files:

    19980109_2000_2100_PRI_TWD
    19980128_1130_1200_CNN_HDL

The "asr_sgm" version of the data uses "as0" as the source for all
VOA_MAN files, and for the two English broadcast files listed just
above.  All other English broadcast files in "asr_sgm" originate from
the "as1" data set.

In both sets of ASR data ("as0" and "as1"), the LDC post-processed the
token stream files produced by the ASR systems: we explicitly labeled
time gaps between successive words when these exceeded 0.1 sec, and we
inserted "place-holder" attributes in the "as1" data for confidence
score and speaker cluster (assigning a value of "NA" to these
attributes for all words), so that both ASR data streams would have
equivalent markup.
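The time-gap labeling just described can be re-derived from the token
streams themselves.  The sketch below scans an ASR token stream for
inter-word gaps above the 0.1 sec threshold; note that the attribute
names "Bsec" and "Dur" are assumptions made for illustration only --
consult "docset.dtd" and the as0/as1 files themselves for the actual
attribute names:

    #!/usr/bin/perl -w
    use strict;

    # Report gaps of more than 0.1 sec between successive ASR words.
    # NOTE: "Bsec" (start time) and "Dur" (duration) are assumed
    # attribute names, for illustration only; see docset.dtd.
    my $prev_end;
    while (<>) {
        next unless /<W\s[^>]*Bsec=([\d.]+)[^>]*Dur=([\d.]+)/;
        my ($start, $dur) = ($1, $2);
        if (defined($prev_end) && $start - $prev_end > 0.1) {
            printf "gap of %.2f sec before line %d\n",
                   $start - $prev_end, $.;
        }
        $prev_end = $start + $dur;
    }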
IV. Supporting Materials
========================

In addition to the data directories cited above, this release contains
the following additional directories:

tdt2_em/dtd -- contains SGML Document Type Definition files to specify
    the markup format of the boundary table files, token stream files,
    and the topic tables; the dtd files are necessary for using an
    SGML parsing utility (e.g. nsgmls) to process the various data
    files.  The functions of the dtd files are:

    - boundset.dtd -- for all "boundary table" files
    - docset.dtd   -- for all "token stream" files (as0, as1, tkn, mt*)
    - srctext.dtd  -- for all "src_sgm" files
    - tiptext.dtd  -- for all "tipsterized sgm" files (asr_sgm,
      tkn_sgm)
    - topicset.dtd -- for all "topic table" files (available from the
      LDC and/or NIST web sites)

doc -- tables and listings that describe the corpus content:

    - pub_file_list.txt -- list of all files on the cdrom release
    - tarset_file_list.txt -- list of all files in "tdt2proj.tgz"
    - tdt2_stats_tables.txt -- summary of quantities by source and
      month
    - tdt2_docno_table.txt -- list of all stories (DOCNO, file,
      DOCTYPE)
    - tdt2_release_notes.txt -- description of differences relative to
      v1.0
    - voa_names.tbl -- list of older names for VOA_ENG files (see
      release notes for explanation)
    - tokenize_*_src.perl -- scripts that were used to create "tkn"
      and "tkn_bnd" files from "src_sgm" data
    - tipsterize_tdt.perl -- used to create "tkn_sgm" files from "tkn"
      data, and "asr_sgm" files from "as0" and "as1" data

The "tipsterize_tdt" script can be used as follows to create
"TIPSTER-style" SGML format for the machine-translated data sets
"mttkn" and "mtas0", assuming that you have copied the full corpus
onto a writable disk:

    cd tdt2_em
    mkdir mttkn_sgm mtas0_sgm
    ../corpus_info/tipsterize_tdt.perl -i mttkn -o mttkn_sgm
    ../corpus_info/tipsterize_tdt.perl -i mtas0 -o mtas0_sgm

In each case, the script processes every file from the given input
directory, and produces a corresponding "tipsterized" file in the
output directory; each output file has the same file name, except that
the extension is changed to match the name of the output directory.

Another use of this script is to create "tipsterized" data files using
an alternative set of boundary tables.  By default, "tipsterize_tdt"
will use the "ground truth" boundary tables included in this corpus
release (i.e. "tkn_bnd" for "tkn" data, etc); since one of the tasks
in TDT evaluations is automatic story boundary detection, there can be
an alternative set of boundary tables, generated by a detection
system.  You can create "tipsterized" files from any token stream data
set using an alternative set of boundary tables, as follows:

- make sure the automatic story boundary information is rendered in a
  manner equivalent to the original boundary tables, and place the set
  of new tables in a separate directory under tdt2_em (next to the
  associated token stream directory), e.g. "alt_tkn_bnd"

- use the "tipsterize" script with "-t table_dir" on the command line,
  in addition to the other arguments described above; e.g.:

      mkdir alt_tkn_sgm
      ../corpus_info/tipsterize_tdt.perl -i tkn -o alt_tkn_sgm -t alt_tkn_bnd

Topic annotations that were produced by the LDC to support the 1998
TDT evaluations are provided at the LDC web site mentioned below.

Additional information about TDT is available at the following web
sites:

    http://www.ldc.upenn.edu/Projects/TDT2/
    http://www.nist.gov/speech/tests/tdt/

Both web sites also provide additional information and resources for
the TDT project: the LDC site includes the archives of email
discussions among TDT participants, and access to related resources,
such as English/Mandarin glossing lexicons and parallel text
collections.  The NIST site includes complete documentation and
software resources for running TDT system evaluations, and papers
presented at TDT workshops by participants.