Release Notes for the TDT2 Multi-language Text Corpus, Version 4.0
==================================================================

The initial portions of this document (parts I. through III.) explain
differences between the last release of TDT2 Text (version 3.2) and the
current release.  The remainder of the material (part IV.) represents
the "legacy" release notes, explaining the differences between version
3.2 and earlier releases; this latter portion also reviews the entire
history of TDT2 Text releases, dating back to 1998.

I. Organization of the corpus
-----------------------------

The directory structure of the corpus is essentially the same as in
version 3.2, except for the following points:

 - The "sgm" directory in v3.2 has been renamed to "src_sgm".

 - The "asr_sgm" and "tkn_sgm" directories, containing "tipsterized"
   versions of the as0, as1 and tkn token stream files, have been
   added; the data files in these new directories have SGML markup that
   is similar to (but simpler and more consistent than) the markup in
   the "src_sgm" data files.  The "asr_sgm" and "tkn_sgm" data files
   are directly accessible from the cdrom (as uncompressed text data),
   whereas all the other data formats are provided in a compressed tar
   file on the cdrom (just as they were in v3.2).

 - Closed-caption text files for ABC_WNT (in src_sgm, tkn and tkn_bnd)
   used to have ".ccap" as part of the file name; in the current
   release, the ".ccap" string has been removed from the file names, to
   reinforce their status as the "default" reference-text version.
   (The alternative reference text data for ABC_WNT, derived from the
   more accurate FDCH transcripts for these broadcasts, are still
   present, and still have ".fdch" as part of their file names.)
 - The as0 and src_sgm directories still contain alternative versions
   of text data -- most of the as0 files for English broadcasts have
   equivalent sample files in as1 (generated by a different ASR
   system); all of the "fdch" files for ABC broadcasts have equivalent
   "closed-caption" files in src_sgm.  But in the new "asr_sgm" and
   "tkn_sgm" data directories, only one version of each broadcast is
   provided: when two versions of a given sample file are present, the
   "as1" data is the default source for English ASR, and the
   closed-caption text is the default source for ABC_WNT reference
   text.

II. Formatting of the data files
--------------------------------

The "token stream", "boundary table", "src_sgm" and "topic table" file
formats have not changed.

The newly added "tipsterized sgm" file format is created from the token
stream and boundary table data by means of the "tipsterize_tdt.perl"
script, which is included in this release.  Users can apply this script
as they see fit, to create the same markup format for the
machine-translation data sets (mttkn, mtas0), as well as for the
alternative ASR data files (as0) for English broadcasts.

III. Data content
-----------------

This release has the same inventory of sample files and story units as
the previous release.  The topic table files are identical to those of
the earlier release.

We have added DTD files for the "src_sgm" and the "tipsterized token
stream" (asr_sgm, tkn_sgm) data sets; these DTD files are "srctext.dtd"
and "tiptext.dtd", respectively.
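To give a feel for the "tipsterized" conversion described in section
II., here is a minimal Python sketch of wrapping per-story token
sequences in simple TIPSTER-style SGML.  This is an illustration only:
the tag inventory shown (<DOC>, <DOCNO>, <DOCTYPE>, <TEXT>), the input
structure, and the sample docno value are our assumptions, not a
description of the output of "tipsterize_tdt.perl" -- the authoritative
definition of the actual markup is "tiptext.dtd".

```python
def tipsterize(stories):
    """Wrap stories in TIPSTER-style SGML markup.

    `stories` is an iterable of (docno, doctype, tokens) triples; the
    triple layout is a hypothetical stand-in for the information that
    the real script draws from the token stream and boundary table.
    """
    out = []
    for docno, doctype, tokens in stories:
        out.append("<DOC>")
        out.append("<DOCNO> %s </DOCNO>" % docno)
        out.append("<DOCTYPE> %s </DOCTYPE>" % doctype)
        out.append("<TEXT>")
        out.append(" ".join(tokens))       # one story's token sequence
        out.append("</TEXT>")
        out.append("</DOC>")
    return "\n".join(out) + "\n"

# hypothetical docno, for illustration only
print(tipsterize([("19980101_0000_0000_TEST.0001", "NEWS",
                   ["This", "is", "a", "story", "."])]))
```

The point of such markup is that each story becomes a self-delimiting
<DOC> unit, so downstream tools need not consult the boundary tables.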
There are two classes of repairs to the corpus text files that have
affected the content of particular stories:

(1) Repaired and re-tokenized some "src_sgm" data files

    In the earlier release, some data files in "sgm" (now "src_sgm")
    were known to contain unusable character data, caused by problems
    in newswire modem transmission, closed-caption reception, or
    non-standard character codes in transcripts and closed captions;
    the unusable character data was always very sparse, and was simply
    filtered out by the tokenization process that created the "tkn"
    data files.

    In the current release, the "src_sgm" files have been updated to
    repair or remove the unusable character data, and the tokenization
    has been run again, such that the "noise removal" logic used in the
    earlier release no longer applies in this process.  As a result,
    there are slight differences in the word or GB-character token
    inventories of some files, relative to the earlier release.

(2) Altered the tokenization logic for Mandarin newswire sources

    The first paragraph of a Mandarin newswire story typically begins
    with "dateline" and/or "byline" information; this information is
    enclosed in parentheses, and is placed before the beginning of the
    first sentence in the story.  The tokenization script eliminates
    the parenthesized string, but the version of this script used in
    the earlier release produced incorrect output in cases like this:

        (Byline) The initial sentence (in some stories) contains parens...

    where the "byline removal" logic deleted everything through the
    second close-paren character, rather than stopping at the first
    close-paren.  This error was fixed and the new tokenization script
    was run on all Mandarin newswire files, yielding slight changes in
    the token inventory across many of these files.

The following table summarizes the content differences in terms of the
number of affected files, by source and by file type (note that NYT and
VOA_ENG data were totally unaffected).
The "added" and "lost" columns sum up the file-by-file token-count
differences for each source (files that gained tokens are summed in the
"added" column; those that lost tokens are summed in the "lost"
column):

    TDT2            #files      #files changed        #tokens
     SRC             total    SGM   TKN   TKN_BND    added  lost
    ------------------------------------------------------------
    ABC_WNT(ccap)     162      20    20      20          0   238
    ABC_WNT.fdch      155       6     4       1          3     0
    APW_ENG           711       6     0       0          0     0
    CNN_HDL           641      51    49      46          0   134
    PRI_TWD           121       1     0       0          0     0
    VOA_MAN           177       2     1       1         10     0
    XIN_MAN           484     137   133     127       3689   101
    ZBN_MAN           250      32    32      32       1927     0

In XIN and ZBN, the tokens (GB characters) recovered by fixing the
tokenizer amount to 0.25% and 0.41%, respectively, of all tokens in the
affected files (these 169 updated source files contain a total of about
1.96 million tokens).

IV. Notes from the prior TDT2 release
-------------------------------------

Below are the release notes that accompanied the previous release of
TDT2 Multilanguage Text (v3.2) -- some of the wording has been adjusted
slightly to avoid confusion about the version being discussed.  Still,
please be aware that some of the information provided below is now
superseded by changes described above for version 4.0.

Dave Graff
LDC
April 25, 2001


DESCRIPTION OF CHANGES INTRODUCED IN TDT2 VERSION 3
===================================================

The following sections describe how the TDT2 Version 3 data differs
from earlier releases.  The changes involve restructuring of the corpus
directories, slight modifications to the designation of topic-ids and
to some file formats, and a variety of bug fixes.

Summary of TDT2 release history:
--------------------------------

Version 1: This was the form of the corpus that was used in the 1998
TDT2 benchmark tests, consisting of six English news sources annotated
against 100 target topics (of which only 96 topics yielded on-topic
"hits" in the text collection); training and development test data were
released in October 1998, and evaluation test data were released in
December 1998.
Version 2: This was the form of the corpus made available for the first
dry-run test for TDT3 benchmark participants, consisting of six English
news sources and 3 Mandarin news sources; the Mandarin sources were
annotated against 20 target topics selected from the original 96, such
that each topic had at least four on-topic stories in each language.
The full six-month, nine-source collection was designated as training
and development test data, and released by NIST, June 6, 1999.

Version 3: This is the release used by TDT participants as training and
development data during the 1999 and 2000 evaluation programs,
comprising the same sources and target topics as Version 2, plus an
additional 97 new topics that have been partially annotated against the
English sources, primarily for purposes of the "First Story Detection"
research task in the TDT 1999 Evaluation Plan.

The initial release to TDT 1999 participants, version 3.0, contained a
number of problems, some of which were carried over from version 2; the
last release in this series, 3.2, has been checked much more carefully
-- the formatting of all data files has been verified to be correct
according to current specifications, and all known content errors in
the data and topic tables have been fixed.  (There is still a chance
that some corrections or additions to the topic annotations will be
made in the future, but there will be few, if any, of these.)

Differences between Version 2 and Version 3:
--------------------------------------------

1.
Directory structure and file names

Version 2 was organized into the following data directories, and the
file name extensions applied to the directory contents were as shown
here:

    Path        Contents
    ---------------------------------------------------------------
    sgml/       *.sgm  (reference texts including descriptive markup)
    tkntext/    *.tkn  (tokenized version of reference texts)
    asrtext/    *.asr  (output of Dragon ASR systems for all
                        broadcast data)
    as1text/    *.as1  (output of the BBN ASR system for English
                        broadcast data)
    mtrtext/    *.mtr  (SYSTRAN machine translation of Mandarin
                        tkntext data)
    mtatext/    *.mta  (SYSTRAN machine translation of Mandarin
                        asrtext data)
    tables/     *.bndtkn, *.bndasr, *.bndas1, *.bndmtr, *.bndmta
                       (boundary tables for data files in all the
                        "*text" paths) and also the file
                        "topic_relevance.table"

In Version 3, the various boundary table files have been partitioned
into separate directories depending on the type of content they pertain
to; the directory names have been altered, and the file name extensions
are now set to be identical to the name of the directory that contains
each file; i.e.:

    Path        Contents
    -----------------------------------------------------
    sgm/        *.sgm       (reference text with markup)
    tkn/        *.tkn       (tokenized version of ref. text)
    as0/        *.as0       (Dragon ASR output, English and Mandarin)
    as1/        *.as1       (BBN ASR output, English only)
    mttkn/      *.mttkn     (SYSTRAN output from Mandarin *.tkn)
    mtas0/      *.mtas0     (SYSTRAN output from Mandarin *.as0)
    tkn_bnd/    *.tkn_bnd   (boundary tables for *.tkn)
    as0_bnd/    *.as0_bnd   (boundary tables for *.as0)
    as1_bnd/    *.as1_bnd   (boundary tables for *.as1)
    mttkn_bnd/  *.mttkn_bnd (boundary tables for *.mttkn)
    mtas0_bnd/  *.mtas0_bnd (boundary tables for *.mtas0)
    topics/     tdt2_topic_rel.*  (topic relevance tables)

This reorganization of boundary tables and path names is intended to
make individual files more accessible, reduce the overpopulation of any
single directory, and allow for the creation of alternative sets of
boundary tables for any given
form of data.  (For example, a user could create a directory called
"tkn_bnd_a" to store boundary tables that are generated by an automatic
story segmentation function applied to the "tkn" data files, and could
easily use this set of tables, in place of the reference boundary
tables in "tkn_bnd", to test system performance.)

2. Names of VOA English files

Although the VOA English news service is described and treated as a
single source in TDT2, Version 2 used three different patterns to name
the VOA English files: from January through May, there were two news
programs that aired daily, "VOA Today" and "VOA World Report"; the
difference in program names was preserved in the corresponding file
names (VOA_TDY and VOA_WRP), even though the content and structure of
the two programs were quite similar -- both were 60-minute shows
providing "news and features".  In June, VOA abandoned the use of
different names for news programs, and switched to a schedule in which
hour-long "news and features" programming made up the bulk of the
broadcast day.  This schedule change was reflected in the Version 2
file names by switching to "VOA_ENG" for all June recordings.

After the Version 2 release, it was decided that the distinctions among
VOA English file names were of little or no practical use, and were
instead a hindrance to using this one source in a simple and uniform
way.  The discontinuity in VOA English file names, combined with the
inclusion of VOA Mandarin data (named VOA_MAN), made it difficult to
reference all VOA English data as a coherent set.  In Version 3, all
VOA English files use the string "VOA_ENG" in their file names.  In
case some users may want to investigate possible differences among the
shows that used to be differently named, a table is provided in the
"corpus_info" directory that records the file name correspondences
between Version 2 and Version 3 ("voa_names.tbl").

3.
Topic designations

Version 2 identified the target topics using sequential numbers, 1
through 100.  In Version 3, the topic identifiers have been expanded to
fixed-length strings of 5 digits, by adding 20000 to each original
topic ID; the original 100 topics are now identified, in the same
sequence, as 20001 through 20100.

This change was intended to differentiate TDT2 topic IDs from those of
other TDT phases.  The TDT Pilot corpus (TDT1) will be re-released with
a similar modification, using topic IDs 10001 through 10025, and the
main target topics in TDT3 will be designated 30001 through 30060.
This change also accommodates expansion in the set of annotated topics
for each phase, and allows for easier sorting of topic data by ID.

4. Additional topic tables

Version 2 provided a single topic_relevance.table, containing all
on-topic judgments ("YES" and "BRIEF") resulting from full annotation
of 100 target topics against all news stories.

Prior to releasing Version 3, the LDC carried out additional topic
annotations on TDT2 data to support the JHU CLSP 1999 Summer Workshop
project on First Story Detection.  This effort involved selecting an
additional 97 target topics, and judging up to 60 stories against each
new topic, with a focus on finding the earliest report in the corpus on
each new topic, as well as some number of additional (subsequent)
on-topic stories and a number of off-topic stories.  Only a fairly
small number of stories was judged for each new topic.
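The ID renumbering described in item 3 amounts to a one-line
conversion.  A minimal Python sketch (the function name is ours, not
part of any corpus tool):

```python
def tdt2_topic_id(orig_id):
    """Map a Version 2 sequential topic number (1-100) to its
    Version 3 fixed-length five-digit form (20001-20100)."""
    if not 1 <= orig_id <= 100:
        raise ValueError("original TDT2 topic IDs run from 1 to 100")
    return "%05d" % (orig_id + 20000)

print(tdt2_topic_id(7))    # original topic 7 becomes "20007"
```

Because all phases share the five-digit width (1xxxx for TDT1, 2xxxx
for TDT2, 3xxxx for TDT3), topic IDs from different phases sort cleanly
as plain strings.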
This "First Story" annotation has led to the inclusion of two
additional topic tables:

 - "tdt2_topic_rel.partial_annot" contains records for all the stories
   that were judged against each of the new topics (in this table, the
   "level" attribute can have a value of "YES", "BRIEF" or "NO" --
   stories that are not listed with a given topic in this table have
   NOT been judged against that topic)

 - "tdt2_topic_rel.first_story" contains a listing of just those
   stories which are chronologically first for each of the 193 defined
   topics (the original 96 plus the newly added 97); this table can be
   derived from the other two tables, and does not represent any new
   information -- it is provided simply as a convenience

5. Format of *.as1 files

In Version 2, the token records (<W> elements) of BBN *.as1 files
contained only "recid", "Bsec" and "Dur" attributes, whereas the Dragon
*.asr files contained these attributes plus "Clust" and "Conf" (speaker
cluster and recognition confidence score information) for each word.
In Version 3, the same attributes are used in all <W> elements of all
*.as0 and *.as1 files.  In the *.as1 files, because the BBN system does
not currently provide speaker cluster or confidence information in its
output, the "Clust" and "Conf" attributes are always assigned the
constant value "NA".

6. Format of *.mttkn and *.mtas0 files

The SYSTRAN machine translation program, which is used by the LDC to
provide English renditions of Mandarin data files, has the property
that it fails to translate some strings of Mandarin text; when this
happens, it simply includes the untranslated string as part of the
translated output.  As a result, the English output file may contain a
scattering of "word" tokens that consist of unmodified 16-bit GB
encoded characters intermixed among the English words.  In Version 2,
these GB strings were simply treated as word tokens just like the
English words, and were not explicitly marked in any way as being
untranslated.
(They were distinct from English words, in terms of being composed of
pairs of bytes in which all bytes had the 8th bit set.)  In Version 3,
an attribute has been added to each <W> element to indicate whether the
corresponding token represents a "successful" translation to English.
The attribute is "tr", and it receives a value of "Y" if the
corresponding token is English, or "N" if the token is an untranslated
GB Mandarin string.  For example:

    <W tr=Y>Is</W> ... <W recid=53 tr=N>[GB string]</W> <W tr=Y>healthy</W> ...

(The character data for recid=53 consists of two bytes: 0xCE 0xBE)

7. Tokenization of Mandarin *.sgm files into *.tkn

There were three issues affecting the tokenization of reference texts
in Mandarin that were not properly dealt with in Version 2:

 (a) newswire articles contained "dateline" strings, "end-of-story"
     strings, and various "pictorial" characters (symbols used to
     provide "bullet" highlighting of certain paragraphs) that should
     have been eliminated from the tokenized output, but were not.

 (b) newswire articles (particularly Xinhua) contained regions of
     corrupted data, yielding byte codes that were uninterpretable as
     either GB or ASCII characters; either the corrupted bytes, or
     whole stories that contained them, should have been excluded from
     the tokenized output, but were not.

 (c) often (especially in Xinhua), there were 16-bit codes in the text
     that mapped to a portion of the GB character table used to
     replicate the standard ASCII characters -- in other words, the
     text contained strings of digits and roman-alphabet letters (even
     spaces) that were rendered using 16-bit codes; these should have
     been replaced by the corresponding 7-bit ASCII characters, but
     were not.

For Version 3, the tokenization function was improved to eliminate
dateline, byline and end-of-story strings from the newswire sources, as
well as "highlighting" characters (this made Mandarin newswire
tokenization comparable to the treatment of NYT and APW in English).
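As an aside, the greedy-match behavior of the old byline-removal step
(whose fix is described in part III of the version 4.0 notes above) is
easy to illustrate.  The sketch below uses Python regular expressions
purely for illustration; the actual tokenizer is a separate script, and
the patterns shown are our simplification:

```python
import re

line = "(Byline) The initial sentence (in some stories) contains parens..."

# Old (buggy) behavior: a greedy ".*" runs through the LAST close-paren,
# deleting part of the story text along with the byline.
old = re.sub(r"^\(.*\)\s*", "", line)

# Fixed behavior: a non-greedy ".*?" stops at the FIRST close-paren,
# removing only the parenthesized byline string.
new = re.sub(r"^\(.*?\)\s*", "", line)

print(old)  # "contains parens..."
print(new)  # "The initial sentence (in some stories) contains parens..."
```

The single `?` is the whole difference: it changes the quantifier from
"match as much as possible" to "match as little as possible".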
Extra care was taken to isolate byte sequences that were untreatable as
GB or printable ASCII character sequences, and to produce only valid,
printable tokens as output (in some cases, stories were manually
inspected, and deleted from the corpus if the data corruption was
severe).  Also, the new method identified GB characters with 7-bit
ASCII equivalents, made sure that these alphanumerics and punctuation
marks were rendered invariantly in ASCII form, and structured the
tokenized output so that each <W> element contains either a single GB
character or a string of one or more contiguous ASCII characters.

8. Derivation of machine-translated text

In Version 2, the machine translation of Mandarin reference text data
was affected by the presence of dateline, byline and end-of-story
strings (as well as data corruption) in the Mandarin newswires, as
described in the previous section.  In Version 3, the machine
translation used the newly tokenized reference data files (*.tkn) as
input, to assure that the translations would be of better general
quality, and that there would be proper equivalence of content between
corresponding "native" and "translated" token stream files.

9. Consistency among various boundary tables

In Version 2, there were a number of cases in which a comparison of
different boundary tables for the same file-id (e.g. comparing the
"bndtkn" file to the "bndasr" file) showed different inventories of
stories; e.g. the "bndasr" table may have included fewer story entries
than the "bndtkn" table, or the "doctype" of a given story might have
differed in the two files.  Also, the treatment of story boundaries in
ASR data sometimes involved the addition of an extra entry at the end
of the "bndasr" table, with "docno=UNASSIGNED".
In Version 3, the creation of boundary tables was modified to assure
that all boundary tables sharing a given file-id would have the same
set of story entries, that there would only be entries for identified
stories, and that the doctype of each story would be constant across
all tables referring to that story.

For example, there are four distinct boundary tables for each VOA_MAN
program (for the tkn, mttkn, as0 and mtas0 forms of the data); in this
version, the four tables for a given file-id will have the same number
of lines and the same set of docno and doctype values.  (The "Brecid"
and "Erecid" values will of course differ across tables; in fact, a
story may lack these values in one table and not in another, e.g. if an
ASR system produced words where the human transcriber or closed-caption
service did not.  Also, the "Esec" value of the final story in a file
may differ when comparing the tkn_bnd to the as0_bnd or as1_bnd file,
because time stamps on the ASR tokens may have extended beyond those of
the manual transcription; it is still the case that all time spans and
all tokens are accounted for in each boundary table.)

10. Miscellaneous bug fixes

 - Version 2 contained a set of files for 19980209_2000_2100_PRI_TWD;
   these were derived from an incorrect audio recording, which was
   actually a duplication of 19980216_2000_2100_PRI_TWD.  The former
   file set has been deleted from the corpus.

 - Version 2 had bad asr and as1 data for 19980528_1600_1630_CNN_HDL,
   again due to a bad audio recording; a correct recording was used for
   closed-caption text and topic annotation, and NIST has provided a
   corrected version of the as1 data for this file; the as0 file for
   this broadcast has been deleted.

 - The first-story annotation and recent work at the JHU summer
   workshop turned up a small number of incorrect topic labels in the
   Version 2 topic_relevance.table; these have been corrected.
 - The Version 2 topic_relevance.table contained a number of on-topic
   stories collected in the first three days of July 1998, even though
   text data for these dates were not part of the corpus; these unused
   topic labels have been removed.

 - All of the *.as1 files in Version 2 were lacking a final line-feed
   character at the end of the last line (after the final SGML
   end-tag); this has been corrected.

 - Some boundary tables in Version 2 (and in version 3.0) did not
   tabulate all word or character tokens in the corresponding token
   stream files -- i.e. if tokens were extracted from the token stream
   on a story-by-story basis using the boundary table entries, some
   tokens from the stream would not be retrieved; this has been fixed
   in version 3.1 (and in this release, version 3.2) -- every boundary
   table accounts for every token identified in the corresponding token
   stream file.

David Graff
LDC
September 7, 1999