Release Notes for the TDT3 Multi-language Text Corpus, Version 2.0
===================================================================

I. Organization of the corpus
-----------------------------

The directory structure of the corpus is essentially the same as in
version 1.0, except for the following points:

 - The "sgm" directory in v1.0 has been renamed to "src_sgm".

 - The "asr_sgm" and "tkn_sgm" directories, containing "tipsterized"
   versions of the as0, as1 and tkn token stream files, have been
   added; the data files in these new directories have SGML markup
   that is similar to (but simpler and more consistent than) the
   markup in the "src_sgm" data files.

The "asr_sgm" and "tkn_sgm" data files are directly accessible from
the cdrom (as uncompressed text data), whereas all the other data
formats are provided in a compressed tar file on the cdrom (just as
they were in v1.0).


II. Formatting of the data files
--------------------------------

The "token stream", "boundary table" and "src_sgm" file formats have
not changed.

The newly added "tipsterized sgm" file format is created from the
token stream and boundary table data by means of the
"tipsterize_tdt.perl" script, which is included in this release.  (A
rough sketch of the general markup style is given at the end of these
notes.)


III. Data content
-----------------

This release has the same inventory of sample files and story units
as the previous release.  There are two classes of repairs to the
corpus that have affected the content of particular stories:

(1) Repaired and re-tokenized some "src_sgm" data files

    In the earlier release, some data files in "sgm" (now "src_sgm")
    were known to contain unusable character data, caused by problems
    in newswire modem transmission, closed-caption reception, or
    non-standard character codes in transcripts and closed captions;
    the unusable character data was always very sparse, and was
    simply filtered out by the tokenization process that created the
    "tkn" data files.

    In the current release, the "src_sgm" files have been updated to
    repair or remove the unusable character data, and the
    tokenization has been run again, so the "noise removal" logic
    used in the earlier release is no longer needed in this process.
    As a result, there are slight differences in the word or
    GB-character token inventories of some files, relative to the
    earlier release.

(2) Altered the tokenization logic for Mandarin newswire sources

    The first paragraph of Mandarin newswire stories typically begins
    with "dateline" and/or "byline" information; this information is
    enclosed in parentheses, and is placed before the beginning of
    the first sentence in the story.  The tokenization script
    eliminates the parenthesized string, but the version of this
    script used in the earlier release produced incorrect output in
    cases like this:

        (Byline) The initial sentence (in some stories) contains parens...

    where the "byline removal" logic deleted everything through the
    second close-paren character, rather than stopping at the first
    close-paren.  This error was fixed and the new tokenization
    script was run on all Mandarin newswire files, yielding slight
    changes in the token inventory across many of these files.
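The sketch below illustrates the class of error described above.  It
is not code from the actual tokenization script; it simply assumes,
purely for illustration, that the byline is stripped with a Perl
substitution over ASCII parentheses (the real data uses GB-encoded
characters).  The point is only the difference between a greedy
match, which runs to the last close-paren (the earlier behavior), and
a match that stops at the first close-paren (the corrected behavior):

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Sketch only -- NOT the actual TDT tokenization code.
    my $line = "(Byline) The initial sentence (in some stories) contains parens...";

    # Greedy: ".*" matches as much as possible, so the pattern ends at
    # the LAST close-paren and the first sentence is deleted along
    # with the byline (the earlier-release behavior).
    ( my $greedy = $line ) =~ s/^\(.*\)\s*//;
    print "greedy    : $greedy\n";      # -> "contains parens..."

    # Non-greedy: ".*?" matches as little as possible, so the pattern
    # stops at the FIRST close-paren and only the byline is deleted.
    ( my $fixed = $line ) =~ s/^\(.*?\)\s*//;
    print "non-greedy: $fixed\n";
    # -> "The initial sentence (in some stories) contains parens..."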
The following table summarizes the content differences in terms of
the number of affected files, by source and by file type (note that
NYT data was totally unaffected).  The "added" and "lost" columns sum
up the file-by-file token-count differences for each source (files
that gained tokens are summed in the "added" column, those that lost
tokens are summed in the "lost" column):

    TDT3      #files       #files changed          #tokens
    SRC        total      SGM    TKN  TKN_BND    added    lost
    -----------------------------------------------------------
    ABC_WNT       76       13     13       12        1      20
    APW_ENG      360        7      0        0        0       0
    CNN_HDL      349       30     30       30       23      90
    MNB_NBW       51        6      4        3        0       4
    NBC_NNW       87        5      5        5        0       9
    PRI_TWD       65        2      2        0        0       0
    VOA_ENG      103        2      2        0        0       0
    VOA_MAN      121        3      1        1        0       1
    XIN_MAN      217       53    136      136     4572       0
    ZBN_MAN      180       29    124      124     2197       0

In XIN and ZBN, the tokens (GB characters) recovered by fixing the
tokenizer amount to, respectively, 0.26% and 0.11% of all tokens in
the affected files (these 260 tkn files contain > 3.6 million
tokens).
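As a rough cross-check of the figures just quoted (this is arithmetic
inferred from the table and the percentages above, not an independent
count of the corpus), dividing the "added" token counts by the stated
percentages gives the implied token totals of the affected XIN and
ZBN tkn files:

    #!/usr/bin/perl
    use strict;
    use warnings;

    # Back-of-the-envelope check, inferred from the table and the
    # quoted percentages (not an independent count).
    my %added = ( XIN_MAN => 4572, ZBN_MAN => 2197 );   # tokens gained
    my %pct   = ( XIN_MAN => 0.26, ZBN_MAN => 0.11 );   # % of affected-file tokens

    my $total = 0;
    for my $src ( sort keys %added ) {
        my $implied = $added{$src} / ( $pct{$src} / 100 );
        printf "%-7s ~%.1f million tokens in affected files\n",
               $src, $implied / 1e6;
        $total += $implied;
    }
    # Roughly 1.8M (XIN) + 2.0M (ZBN) = ~3.8M, consistent with the
    # "> 3.6 million" figure quoted above.
    printf "total   ~%.1f million tokens\n", $total / 1e6;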
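Finally, as mentioned in section II, the record layout of the
"tipsterized" files can be pictured roughly as follows.  This is only
a sketch of the general Tipster-style SGML shape; the document ID and
tag names shown here are illustrative placeholders, and the actual
tag inventory should be taken from the "asr_sgm" and "tkn_sgm" files
themselves:

    <DOC>
    <DOCNO> doc-id-for-one-story </DOCNO>
    <TEXT>
      token stream text for one story, segmented into documents
      according to the story boundaries in the boundary table
    </TEXT>
    </DOC>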