Release Notes for the TDT2 Multi-language Text Corpus, Version 4.0
==================================================================

The initial portions of this document (parts I. through III.) explain
differences between the last release of TDT2 Text (version 3.2) and the
current release.  The remainder of the material (part IV.) represents
the "legacy" release notes, explaining the differences between version
3.2 and earlier releases; this latter portion also reviews the entire
history of TDT2 Text releases, dating back to 1998.

I. Organization of the corpus
-----------------------------

The directory structure of the corpus is essentially the same as in
version 3.2, except for the following points:

 - The "sgm" directory in v3.2 has been renamed to "src_sgm".

 - The "asr_sgm" and "tkn_sgm" directories, containing "tipsterized"
   versions of the as0, as1 and tkn token stream files, have been
   added; the data files in these new directories have SGML markup that
   is similar to (but simpler and more consistent than) the markup in
   the "src_sgm" data files.  The "asr_sgm" and "tkn_sgm" data files
   are directly accessible from the cdrom (as uncompressed text data),
   whereas all the other data formats are provided in a compressed tar
   file on the cdrom (just as they were in v3.2).

 - Closed-caption text files for ABC_WNT (in src_sgm, tkn and tkn_bnd)
   used to have ".ccap" as part of the file name; in the current
   release, the ".ccap" string has been removed from the file names, to
   reinforce their status as the "default" reference-text version.
   (The alternative reference text data for ABC_WNT, derived from the
   more accurate FDCH transcripts for these broadcasts, are still
   present, and still have ".fdch" as part of their file names.)
 - The as0 and src_sgm directories still contain alternative versions
   of text data -- most of the as0 files for English broadcasts have
   equivalent sample files in as1 (generated by a different ASR
   system); all of the "fdch" files for ABC broadcasts have equivalent
   "closed-caption" files in src_sgm.  But in the new "asr_sgm" and
   "tkn_sgm" data directories, only one version of each broadcast is
   provided: when two versions of a given sample file are present, the
   "as1" data is the default source for English ASR, and the
   closed-caption text is the default source for ABC_WNT reference
   text.

II. Formatting of the data files
--------------------------------

The "token stream", "boundary table", "src_sgm" and "topic table" file
formats have not changed.

The newly added "tipsterized sgm" file format is created from the token
stream and boundary table data by means of the "tipsterize_tdt.perl"
script, which is included in this release.  Users can apply this script
as they see fit, to create the same markup format for the
machine-translation data sets (mttkn, mtas0), as well as for the
alternative ASR data files (as0) for English broadcasts.

III. Data content
-----------------

This release has the same inventory of sample files and story units as
the previous release.  The topic table files are identical to those of
the earlier release.

We have added DTD files for the "src_sgm" and the "tipsterized token
stream" (asr_sgm, tkn_sgm) data sets; these DTD files are "srctext.dtd"
and "tiptext.dtd", respectively.
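To give a feel for the "tipsterized" conversion described in section
II., here is a minimal Python sketch of wrapping per-story token
sequences in simple TIPSTER-style SGML.  This is an illustration only:
the tag inventory shown (<DOC>, <DOCNO>, <DOCTYPE>, <TEXT>), the input
structure, and the sample docno value are our assumptions, not a
description of the output of "tipsterize_tdt.perl" -- the authoritative
definition of the actual markup is "tiptext.dtd".

```python
def tipsterize(stories):
    """Wrap stories in TIPSTER-style SGML markup.

    `stories` is an iterable of (docno, doctype, tokens) triples; the
    triple layout is a hypothetical stand-in for the information that
    the real script draws from the token stream and boundary table.
    """
    out = []
    for docno, doctype, tokens in stories:
        out.append("<DOC>")
        out.append("<DOCNO> %s </DOCNO>" % docno)
        out.append("<DOCTYPE> %s </DOCTYPE>" % doctype)
        out.append("<TEXT>")
        out.append(" ".join(tokens))       # one story's token sequence
        out.append("</TEXT>")
        out.append("</DOC>")
    return "\n".join(out) + "\n"

# hypothetical docno, for illustration only
print(tipsterize([("19980101_0000_0000_TEST.0001", "NEWS",
                   ["This", "is", "a", "story", "."])]))
```

The point of such markup is that each story becomes a self-delimiting
<DOC> unit, so downstream tools need not consult the boundary tables.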
There are two classes of repairs to the corpus text files that have
affected the content of particular stories:

(1) Repaired and re-tokenized some "src_sgm" data files

    In the earlier release, some data files in "sgm" (now "src_sgm")
    were known to contain unusable character data, caused by problems
    in newswire modem transmission, closed-caption reception, or
    non-standard character codes in transcripts and closed captions;
    the unusable character data was always very sparse, and was simply
    filtered out by the tokenization process that created the "tkn"
    data files.

    In the current release, the "src_sgm" files have been updated to
    repair or remove the unusable character data, and the tokenization
    has been run again, such that the "noise removal" logic used in the
    earlier release no longer applies in this process.  As a result,
    there are slight differences in the word or GB-character token
    inventories of some files, relative to the earlier release.

(2) Altered the tokenization logic for Mandarin newswire sources

    The first paragraph of a Mandarin newswire story typically begins
    with "dateline" and/or "byline" information; this information is
    enclosed in parentheses, and is placed before the beginning of the
    first sentence in the story.  The tokenization script eliminates
    the parenthesized string, but the version of this script used in
    the earlier release produced incorrect output in cases like this:

        (Byline) The initial sentence (in some stories) contains parens...

    where the "byline removal" logic deleted everything through the
    second close-paren character, rather than stopping at the first
    close-paren.  This error was fixed and the new tokenization script
    was run on all Mandarin newswire files, yielding slight changes in
    the token inventory across many of these files.

The following table summarizes the content differences in terms of the
number of affected files, by source and by file type (note that NYT and
VOA_ENG data were totally unaffected).
The "added" and "lost" columns sum up the file-by-file token-count
differences for each source (files that gained tokens are summed in the
"added" column; those that lost tokens are summed in the "lost"
column):

    TDT2            #files      #files changed        #tokens
     SRC             total    SGM   TKN   TKN_BND    added  lost
    ------------------------------------------------------------
    ABC_WNT(ccap)     162      20    20      20          0   238
    ABC_WNT.fdch      155       6     4       1          3     0
    APW_ENG           711       6     0       0          0     0
    CNN_HDL           641      51    49      46          0   134
    PRI_TWD           121       1     0       0          0     0
    VOA_MAN           177       2     1       1         10     0
    XIN_MAN           484     137   133     127       3689   101
    ZBN_MAN           250      32    32      32       1927     0

In XIN and ZBN, the tokens (GB characters) recovered by fixing the
tokenizer amount to 0.25% and 0.41%, respectively, of all tokens in the
affected files (these 169 updated source files contain a total of about
1.96 million tokens).

IV. Notes from the prior TDT2 release
-------------------------------------

Below are the release notes that accompanied the previous release of
TDT2 Multilanguage Text (v3.2) -- some of the wording has been adjusted
slightly to avoid confusion about the version being discussed.  Still,
please be aware that some of the information provided below is now
superseded by changes described above for version 4.0.

Dave Graff
LDC
April 25, 2001


DESCRIPTION OF CHANGES INTRODUCED IN TDT2 VERSION 3
===================================================

The following sections describe how the TDT2 Version 3 data differs
from earlier releases.  The changes involve restructuring of the corpus
directories, slight modifications to the designation of topic-ids and
to some file formats, and a variety of bug fixes.

Summary of TDT2 release history:
--------------------------------

Version 1: This was the form of the corpus that was used in the 1998
TDT2 benchmark tests, consisting of six English news sources annotated
against 100 target topics (of which only 96 topics yielded on-topic
"hits" in the text collection); training and development test data were
released in October 1998, and evaluation test data were released in
December 1998.
Version 2: This was the form of the corpus made available for the first
dry-run test for TDT3 benchmark participants, consisting of six English
news sources and 3 Mandarin news sources; the Mandarin sources were
annotated against 20 target topics selected from the original 96, such
that each topic had at least four on-topic stories in each language.
The full six-month, nine-source collection was designated as training
and development test data, and released by NIST, June 6, 1999.

Version 3: This is the release used by TDT participants as training and
development data during the 1999 and 2000 evaluation programs,
comprising the same sources and target topics as Version 2, plus an
additional 97 new topics that have been partially annotated against the
English sources, primarily for purposes of the "First Story Detection"
research task in the TDT 1999 Evaluation Plan.

The initial release to TDT 1999 participants, version 3.0, contained a
number of problems, some of which were carried over from version 2; the
last release in this series, 3.2, has been checked much more carefully
-- the formatting of all data files has been verified to be correct
according to current specifications, and all known content errors in
the data and topic tables have been fixed.  (There is still a chance
that some corrections or additions to the topic annotations will be
made in the future, but there will be few, if any, of these.)

Differences between Version 2 and Version 3:
--------------------------------------------

1.
Directory structure and file names

Version 2 was organized into the following data directories, and the
file name extensions applied to the directory contents were as shown
here:

    Path        Contents
    ---------------------------------------------------------------
    sgml/       *.sgm  (reference texts including descriptive markup)
    tkntext/    *.tkn  (tokenized version of reference texts)
    asrtext/    *.asr  (output of Dragon ASR systems for all
                        broadcast data)
    as1text/    *.as1  (output of the BBN ASR system for English
                        broadcast data)
    mtrtext/    *.mtr  (SYSTRAN machine translation of Mandarin
                        tkntext data)
    mtatext/    *.mta  (SYSTRAN machine translation of Mandarin
                        asrtext data)
    tables/     *.bndtkn, *.bndasr, *.bndas1, *.bndmtr, *.bndmta
                       (boundary tables for data files in all the
                        "*text" paths) and also the file
                        "topic_relevance.table"

In Version 3, the various boundary table files have been partitioned
into separate directories depending on the type of content they pertain
to; the directory names have been altered, and the file name extensions
are now set to be identical to the name of the directory that contains
each file; i.e.:

    Path        Contents
    -----------------------------------------------------
    sgm/        *.sgm       (reference text with markup)
    tkn/        *.tkn       (tokenized version of ref. text)
    as0/        *.as0       (Dragon ASR output, English and Mandarin)
    as1/        *.as1       (BBN ASR output, English only)
    mttkn/      *.mttkn     (SYSTRAN output from Mandarin *.tkn)
    mtas0/      *.mtas0     (SYSTRAN output from Mandarin *.as0)
    tkn_bnd/    *.tkn_bnd   (boundary tables for *.tkn)
    as0_bnd/    *.as0_bnd   (boundary tables for *.as0)
    as1_bnd/    *.as1_bnd   (boundary tables for *.as1)
    mttkn_bnd/  *.mttkn_bnd (boundary tables for *.mttkn)
    mtas0_bnd/  *.mtas0_bnd (boundary tables for *.mtas0)
    topics/     tdt2_topic_rel.*  (topic relevance tables)

This reorganization of boundary tables and path names is intended to
make individual files more accessible, reduce the overpopulation of any
single directory, and allow for the creation of alternative sets of
boundary tables for any given
form of data.  (For example, a user could create a directory called
"tkn_bnd_a" to store boundary tables that are generated by an automatic
story segmentation function applied to the "tkn" data files, and could
easily use this set of tables, in place of the reference boundary
tables in "tkn_bnd", to test system performance.)

2. Names of VOA English files

Although the VOA English news service is described and treated as a
single source in TDT2, Version 2 used three different patterns to name
the VOA English files: from January through May, there were two news
programs that aired daily, "VOA Today" and "VOA World Report"; the
difference in program names was preserved in the corresponding file
names (VOA_TDY and VOA_WRP), even though the content and structure of
the two programs were quite similar -- both were 60-minute shows
providing "news and features".  In June, VOA abandoned the use of
different names for news programs, and switched to a schedule in which
hour-long "news and features" programming made up the bulk of the
broadcast day.  This schedule change was reflected in the Version 2
file names by switching to "VOA_ENG" for all June recordings.

After the Version 2 release, it was decided that the distinctions among
VOA English file names were of little or no practical use, and were
instead a hindrance to using this one source in a simple and uniform
way.  The discontinuity in VOA English file names, combined with the
inclusion of VOA Mandarin data (named VOA_MAN), made it difficult to
reference all VOA English data as a coherent set.  In Version 3, all
VOA English files use the string "VOA_ENG" in their file names.  In
case some users may want to investigate possible differences among the
shows that used to be differently named, a table is provided in the
"corpus_info" directory that records the file name correspondences
between Version 2 and Version 3 ("voa_names.tbl").

3.
Topic designations

Version 2 identified the target topics using sequential numbers, 1
through 100.  In Version 3, the topic identifiers have been expanded to
fixed-length strings of 5 digits, by adding 20000 to each original
topic ID; the original 100 topics are now identified, in the same
sequence, as 20001 through 20100.

This change was intended to differentiate TDT2 topic IDs from those of
other TDT phases.  The TDT Pilot corpus (TDT1) will be re-released with
a similar modification, using topic IDs 10001 through 10025, and the
main target topics in TDT3 will be designated 30001 through 30060.
This change also accommodates expansion in the set of annotated topics
for each phase, and allows for easier sorting of topic data by ID.

4. Additional topic tables

Version 2 provided a single topic_relevance.table, containing all
on-topic judgments ("YES" and "BRIEF") resulting from full annotation
of 100 target topics against all news stories.

Prior to releasing Version 3, the LDC carried out additional topic
annotations on TDT2 data to support the JHU CLSP 1999 Summer Workshop
project on First Story Detection.  This effort involved selecting an
additional 97 target topics, and judging up to 60 stories against each
new topic, with a focus on finding the earliest report in the corpus on
each new topic, as well as some number of additional (subsequent)
on-topic stories and a number of off-topic stories.  Only a fairly
small number of stories was judged for each new topic.
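The ID renumbering described in item 3 amounts to a one-line
conversion.  A minimal Python sketch (the function name is ours, not
part of any corpus tool):

```python
def tdt2_topic_id(orig_id):
    """Map a Version 2 sequential topic number (1-100) to its
    Version 3 fixed-length five-digit form (20001-20100)."""
    if not 1 <= orig_id <= 100:
        raise ValueError("original TDT2 topic IDs run from 1 to 100")
    return "%05d" % (orig_id + 20000)

print(tdt2_topic_id(7))    # original topic 7 becomes "20007"
```

Because all phases share the five-digit width (1xxxx for TDT1, 2xxxx
for TDT2, 3xxxx for TDT3), topic IDs from different phases sort cleanly
as plain strings.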
This "First Story" annotation has led to the inclusion of two
additional topic tables:

 - "tdt2_topic_rel.partial_annot" contains records for all the stories
   that were judged against each of the new topics (in this table, the
   "level" attribute can have a value of "YES", "BRIEF" or "NO" --
   stories that are not listed with a given topic in this table have
   NOT been judged against that topic)

 - "tdt2_topic_rel.first_story" contains a listing of just those
   stories which are chronologically first for each of the 193 defined
   topics (the original 96 plus the newly added 97); this table can be
   derived from the other two tables, and does not represent any new
   information -- it is provided simply as a convenience

5. Format of *.as1 files

In Version 2, the token records (<W> elements) of BBN *.as1 files
contained only "recid", "Bsec" and "Dur" attributes, whereas the Dragon
*.asr files contained these attributes plus "Clust" and "Conf" (speaker
cluster and recognition confidence score information) for each word.
In Version 3, the same attributes are used in all <W> elements of all
*.as0 and *.as1 files.  In the *.as1 files, because the BBN system does
not currently provide speaker cluster or confidence information in its
output, the "Clust" and "Conf" attributes are always assigned the
constant value "NA".

6. Format of *.mttkn and *.mtas0 files

The SYSTRAN machine translation program, which is used by the LDC to
provide English renditions of Mandarin data files, has the property
that it fails to translate some strings of Mandarin text; when this
happens, it simply includes the untranslated string as part of the
translated output.  As a result, the English output file may contain a
scattering of "word" tokens that consist of unmodified 16-bit GB
encoded characters intermixed among the English words.  In Version 2,
these GB strings were simply treated as word tokens just like the
English words, and were not explicitly marked in any way as being
untranslated.
(They were distinct from English words, in terms of being composed of
pairs of bytes in which all bytes had the 8th bit set.)  In Version 3,
an attribute has been added to each <W> element to indicate whether the
corresponding token represents a "successful" translation to English.
The attribute is "tr", and it receives a value of "Y" if the
corresponding token is English, or "N" if the token is an untranslated
GB Mandarin string.  For example:

    <W tr=Y>Is</W> ... <W recid=53 tr=N>[GB string]</W> <W tr=Y>healthy</W> ...

(The character data for recid=53 consists of two bytes: 0xCE 0xBE)

7. Tokenization of Mandarin *.sgm files into *.tkn

There were three issues affecting the tokenization of reference texts
in Mandarin that were not properly dealt with in Version 2:

 (a) newswire articles contained "dateline" strings, "end-of-story"
     strings, and various "pictorial" characters (symbols used to
     provide "bullet" highlighting of certain paragraphs) that should
     have been eliminated from the tokenized output, but were not.

 (b) newswire articles (particularly Xinhua) contained regions of
     corrupted data, yielding byte codes that were uninterpretable as
     either GB or ASCII characters; either the corrupted bytes, or
     whole stories that contained them, should have been excluded from
     the tokenized output, but were not.

 (c) often (especially in Xinhua), there were 16-bit codes in the text
     that mapped to a portion of the GB character table used to
     replicate the standard ASCII characters -- in other words, the
     text contained strings of digits and roman-alphabet letters (even
     spaces) that were rendered using 16-bit codes; these should have
     been replaced by the corresponding 7-bit ASCII characters, but
     were not.

For Version 3, the tokenization function was improved to eliminate
dateline, byline and end-of-story strings from the newswire sources, as
well as "highlighting" characters (this made Mandarin newswire
tokenization comparable to the treatment of NYT and APW in English).
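As an aside, the greedy-match behavior of the old byline-removal step
(whose fix is described in part III of the version 4.0 notes above) is
easy to illustrate.  The sketch below uses Python regular expressions
purely for illustration; the actual tokenizer is a separate script, and
the patterns shown are our simplification:

```python
import re

line = "(Byline) The initial sentence (in some stories) contains parens..."

# Old (buggy) behavior: a greedy ".*" runs through the LAST close-paren,
# deleting part of the story text along with the byline.
old = re.sub(r"^\(.*\)\s*", "", line)

# Fixed behavior: a non-greedy ".*?" stops at the FIRST close-paren,
# removing only the parenthesized byline string.
new = re.sub(r"^\(.*?\)\s*", "", line)

print(old)  # "contains parens..."
print(new)  # "The initial sentence (in some stories) contains parens..."
```

The single `?` is the whole difference: it changes the quantifier from
"match as much as possible" to "match as little as possible".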
Extra care was taken to isolate byte sequences that were untreatable as
GB or printable ASCII character sequences, and to produce only valid,
printable tokens as output (in some cases, stories were manually
inspected, and deleted from the corpus if the data corruption was
severe).  Also, the new method identified GB characters with 7-bit
ASCII equivalents, made sure that these alphanumerics and punctuation
marks were rendered invariantly in ASCII form, and structured the
tokenized output so that each <W> element contains either a single GB
character or a string of one or more contiguous ASCII characters.

8. Derivation of machine-translated text

In Version 2, the machine translation of Mandarin reference text data
was affected by the presence of dateline, byline and end-of-story
strings (as well as data corruption) in the Mandarin newswires, as
described in the previous section.  In Version 3, the machine
translation used the newly tokenized reference data files (*.tkn) as
input, to assure that the translations would be of better general
quality, and that there would be proper equivalence of content between
corresponding "native" and "translated" token stream files.

9. Consistency among various boundary tables

In Version 2, there were a number of cases in which a comparison of
different boundary tables for the same file-id (e.g. comparing the
"bndtkn" file to the "bndasr" file) showed different inventories of
stories; e.g. the "bndasr" table may have included fewer story entries
than the "bndtkn" table, or the "doctype" of a given story might have
differed in the two files.  Also, the treatment of story boundaries in
ASR data sometimes involved the addition of an extra entry at the end
of the "bndasr" table, with "docno=UNASSIGNED".
In Version 3, the creation of boundary tables was modified to assure
that all boundary tables sharing a given file-id would have the same
set of story entries, that there would only be entries for identified
stories, and that the doctype of each story would be constant across
all tables referring to that story.

For example, there are four distinct boundary tables for each VOA_MAN
program (for the tkn, mttkn, as0 and mtas0 forms of the data); in this
version, the four tables for a given file-id will have the same number
of lines and the same set of docno and doctype values.  (The "Brecid"
and "Erecid" values will of course differ across tables; in fact, a
story may lack these values in one table and not in another, e.g. if an
ASR system produced words where the human transcriber or closed-caption
service did not.  Also, the "Esec" value of the final story in a file
may differ when comparing the tkn_bnd to the as0_bnd or as1_bnd file,
because time stamps on the ASR tokens may have extended beyond those of
the manual transcription; it is still the case that all time spans and
all tokens are accounted for in each boundary table.)

10. Miscellaneous bug fixes

 - Version 2 contained a set of files for 19980209_2000_2100_PRI_TWD;
   these were derived from an incorrect audio recording, which was
   actually a duplication of 19980216_2000_2100_PRI_TWD.  The former
   file set has been deleted from the corpus.

 - Version 2 had bad asr and as1 data for 19980528_1600_1630_CNN_HDL,
   again due to a bad audio recording; a correct recording was used for
   closed-caption text and topic annotation, and NIST has provided a
   corrected version of the as1 data for this file; the as0 file for
   this broadcast has been deleted.

 - The first-story annotation and recent work at the JHU summer
   workshop turned up a small number of incorrect topic labels in the
   Version 2 topic_relevance.table; these have been corrected.
 - The Version 2 topic_relevance.table contained a number of on-topic
   stories collected in the first three days of July 1998, even though
   text data for these dates were not part of the corpus; these unused
   topic labels have been removed.

 - All of the *.as1 files in Version 2 were lacking a final line-feed
   character at the end of the last line (after the final SGML
   end-tag); this has been corrected.

 - Some boundary tables in Version 2 (and in version 3.0) did not
   tabulate all word or character tokens in the corresponding token
   stream files -- i.e. if tokens were extracted from the token stream
   on a story-by-story basis using the boundary table entries, some
   tokens from the stream would not be retrieved; this has been fixed
   in version 3.1 (and in this release, version 3.2) -- every boundary
   table accounts for every token identified in the corresponding token
   stream file.

David Graff
LDC
September 7, 1999