==========================================================
      SUMMARY OF CONTENTS IN THE TDT2 MULTI-LANGUAGE TEXT CORPUS
      ==========================================================

Release date:	April 25, 2001
Version:	4.0


I.  Data Sources
================

The TDT2 corpus contains news data collected daily from 9 news sources in two
languages (American English and Mandarin Chinese), over a period of six months
(January - June, 1998).  The sources and their frequency of sampling are as
follows:

  English sources
  ---------------
  NYT_NYT (1)	New York Times Newswire Service (excluding non-NYT sources)
  APW_ENG (1)	Associated Press Worldstream Service (English content only)
  CNN_HDL (1)	Cable News Network, "Headline News"
  ABC_WNT (2)	American Broadcasting Company, "World News Tonight"
  PRI_TWD (3)	Public Radio International, "The World"
  VOA_ENG (4)	Voice of America, English news programs

  Mandarin sources
  ----------------
  XIN_MAN (5)	Xinhua News Agency
  ZBN_MAN (6)	Zaobao News Agency
  VOA_MAN (7)	Voice of America, Mandarin Chinese news programs

  Daily sampling
  --------------
  (1) about 80 stories, in four sample files, per day
  (2) about 15 stories, in one sample file, per day
  (3) about 20 stories, in one sample file, per day, 5 days/week
  (4) about 40 stories, in two sample files, per day
  (5) about 60 stories, in three sample files, per day
  (6) about 50 stories, in two sample files, per day (starting Feb.26)
  (7) irregular:
        - no samples Jan.1 - Feb.19;
        - 40 to 80 stories, in one or two sample files, per day
                but with some gaps in the collection, Feb.20 - Apr.4
        - 10 to 40 stories, in up to three sample files, per day
                again with some gaps, Apr.5 - Jun.30

The quantities indicated above for sampling frequencies are
approximate; all sources were prone to occasional failures in the data
collection process.  A more detailed summary of data quantities by
source and month is provided in the file "tdt2_stats.tables".  A
complete listing of all stories and sample files is provided in the
file "tdt2_docno.table".


II.  Corpus Structure
=====================

The organization of data in the corpus is intended to provide direct
support for the research tasks defined in the yearly TDT evaluation
plans (available at http://www.nist.gov/speech/tests/tdt/index.htm),
while also providing a data format compatible with other research
projects involving information extraction.


II.A.  Basic units of data
--------------------------

The basic units of the corpus are news stories and sample files.  Each
news story is uniquely identified by a "DOCNO" (story-id) that
indicates the source and date of the story; e.g.:

        XIN19980101.0001

identifies a story from Xinhua collected on Jan. 1, 1998; the final
four digits distinguish this story from all other stories collected on
the same date from the same source.  In the case of broadcast sources
(as opposed to newswire sources), the DOCNO also contains four digits
to indicate the start time of the broadcast; e.g.:

        VOM19980220.0700.0221

identifies a Voice of America Mandarin story from the Feb. 20
broadcast that began at 7:00am (EST); again, the final four digits
distinguish this story from others in the same broadcast.

Each sample file represents a contiguous collection of stories from a
given source on a given date over a specific period of time; the file
name of the sample (file-id) provides all this information; e.g.:

        19980101_0016_1116_XIN_MAN
        19980220_0700_0800_VOA_MAN

These are the file names that happen to contain the example story-ids
mentioned above.  The XIN file spans a period of collection from
12:16am to 11:16am on Jan.1, and the VOA file covers a 1-hour
broadcast starting at 7:00am on Feb.20.

Each sample from broadcast sources was manually segmented into story
units, and each story unit was manually classified as either a "news
story" or as "miscellaneous text": a unit was classified as "news" if
it was judged by annotators to contain informative content about any
topic or event.  Miscellaneous text units include commercial breaks,
music interludes, and "introductory" portions of broadcasts where an
anchor person is providing a list of "upcoming stories" (typically by
making a single statement about each event to be reported on during
the broadcast).  Only the "news story" units underwent topic relevance
annotation, but the content and time stamps of the "miscellaneous
text" units have been retained in the data files.


II.B.  List of data types
-------------------------

Each data sample is presented in a variety of forms, with each form
placed in a separate directory under "tdt2_em".  In this cdrom
release, two forms of data ("tkn_sgm" and "asr_sgm") are directly
accessible as uncompressed files.  All other data forms have been
packed into the compressed unix tar file "tdt2proj.tgz"; unpacking
this tar file will create the additional directories under "tdt2_em".

The forms of data in this release (and their directory names) are:

   src_sgm -- original source text data (newswire, manual transcript
		or closed caption text) from which reference texts are
		derived, in an SGML markup format similar to the
		TIPSTER text corpora

   tkn     -- reference text data in tokenized ("token stream") form:
		story boundaries and other descriptive markup in the
		tkn_sgm format are removed, each English word or
		Mandarin GB character is assigned a unique identifier
		(a sequential "recid" number) and presented on a
		separate line with an SGML tag: "<W recid=N> word"

   as0     -- for the Mandarin broadcast data (VOA_MAN), output of the
		Dragon Systems speech recognizer, in token stream
		form, without story boundaries or punctuation; each
		Mandarin word is assigned a unique recid, with
		information on starting time and duration (in sec),
		speaker cluster, and asr confidence score (some
		Mandarin words from the recognizer comprise multiple
		GB characters)

   as1     -- for English broadcast sources, output of the BBN Byblos
		speech recognizer, same format as as0, (token stream,
		one word per line) except that speaker cluster and asr
		confidence score information is not available ("NA")

   mttkn   -- for all Mandarin sources, output of SYSTRAN machine
		translation from "tkn" reference text data into
		English, tokenized, without story boundaries; some
		strings of Mandarin characters have been left
		untranslated by SYSTRAN, and these are included in the
		file in unmodified form (using the GB character set);
		each token is provided with a tag to indicate whether
		or not it is a translated token ("tr=Y" or "tr=N")

   mtas0   -- for the Mandarin broadcast data (VOA_MAN), output of
		SYSTRAN translation from as0 text data into English;
		same format as mttkn.

   tkn_sgm -- reference text data derived from "tkn" files, in an SGML
		markup format similar to the TIPSTER text corpora

   asr_sgm -- ASR text data derived from "as0" and "as1" (Mandarin and
		English) broadcast files, in an SGML markup format
		similar to the TIPSTER text corpora


II.C  The "TIPSTER-style" data types
------------------------------------

The "src_sgm", "tkn_sgm" and "asr_sgm" data sets are the only ones in
which story boundary information is included as part of the text
stream of each file.  Of these three, the "tkn_sgm" and "asr_sgm" data
sets provide the simplest, most compact formatting of the data, and
are the most consistent and useful forms in terms of data content.

The tkn_sgm and asr_sgm files use the following SGML tag structure for
each story unit:

 <DOC>
 <DOCNO> SRC19980... </DOCNO>
 <DOCTYPE> (NEWS or MISCELLANEOUS) </DOCTYPE>
 <TXTTYPE> (NEWSWIRE, CAPTION, TRANSCRIPT or ASRTEXT) </TXTTYPE>
 <TEXT>
 This region, between the TEXT tags, provides the full content of the
 story, which has been drawn from the corresponding "tkn", "as0" or
 "as1" data file.
 </TEXT>
 </DOC>

Note the following properties of text content in these two data sets:

 - In English files, all word tokens are space separated.  In tkn_sgm
   files, word tokens may include adjacent punctuation, brackets and
   quotes; in asr_sgm files, punctuation, brackets and quotes are not
   present at all, since these are not produced by the ASR system.

 - In Mandarin tkn_sgm files, there is space separation only among
   tokens that consist of ASCII content (i.e. digits, punctuation,
   occasional names), since the original GB Mandarin text content was
   not segmented into words.  In Mandarin asr_sgm files, there is
   space-separation between word tokens (because the ASR system
   produced word-segmented output), but there is no punctuation.

 - In all files, story text is presented with a consistent pattern of
   line-wrapping, but without paragraph breaks (which exist only in
   the src_sgm data format).

 - In files from broadcast sources, the TEXT elements of some story
   units may be completely empty, because there was no speech in the
   corresponding segment of audio, or because no transcript or
   captioning was provided for that segment (in such units, the
   "DOCTYPE" is always "MISCELLANEOUS" or "UNTRANSCRIBED").

In the src_sgm data, there is a variable amount of additional SGML
markup within each <DOC>, but outside of the <TEXT> element, providing
extra information associated with each story:

 - "<DATE_TIME>" provides a time stamp of when the story was broadcast
   or transmitted.

 - "<BODY>" is used as a bracketing element around "<TEXT>", to
   contain other elements and information about the story that are not
   part of the actual story text.

 - "<HEADER>", "<SLUG>", "<HEADLINE>" and "<TRAILER>" appear within
   the "<BODY>" portion of most newswire files, providing keywords,
   headline strings and other data about stories that are provided as
   part of the wire transmission but are external to the story text.

 - "<END_TIME>" appears in broadcast files, providing a time stamp for
   the end of the story.

The <TEXT> portion of each story in src_sgm files may also contain
additional tagging, to convey "meta-information" about the story
content; in particular:

 - "<TURN>" tags are used in broadcast sources to mark speaker
   changes, when these are known from the original transcription.

 - "<ANNOTATION> ... </ANNOTATION>" tags are used to bracket other
   information about the story content, in both broadcast and newswire
   sources (e.g. comments about noise in the audio, or instructions to
   editors in longer newswire stories) -- in other words, material
   enclosed between these tags is NOT part of the actual story
   content; note that the opening tag, enclosed commentary and closing
   tag are usually on separate lines in the data files.

 - "<P>" is used in some sources to mark paragraph breaks.


II.D.  The "token stream" data types
------------------------------------

These five data sets (tkn,as0,as1,mttkn,mtas0) have all been packaged
together in the compressed tar file "tdt2proj.tgz"; these directories
will be created within "tdt2_em" when the tar file is unpacked (see
the top-level "index.html" file on the cdrom for instructions to
unpack the tar file).  The content in these data files all share the
same basic SGML markup strategy:

 <DOCSET type=... fileid=... (other attributes...) >
 <W recid=1 (other attributes...)> A
 <W recid=2 (other attributes...)> word
 ...
 </DOCSET>

For each of these data sets, there is a separate directory containing
a set of "boundary table" files, one boundary table for each sample
file, which provides the mapping of story boundaries to the
corresponding token stream in terms of the "recid" values assigned to
the tokens.  A boundary table contains one SGML "<BOUNDARY>" tag for
each story unit, and the attributes in this tag identify the DOCNO,
the DOCTYPE, the beginning and ending "recid" numbers in the token
stream file that make up the token content of the story (if any), and
for broadcast sources, the beginning and ending time offsets for the
story in the corresponding audio file (to be found in the TDT2 Speech
corpora, which are distributed separately); for example:

 <BOUNDSET type=ASRTEXT fileid=19981001_0800_0900_VOA_MAN ... >
 <BOUNDARY docno=VOM19981001.0800.0000 doctype=MISCELLANEOUS Bsec=0.00 Esec=122.25 Brecid=1 Erecid=262>
 <BOUNDARY docno=VOM19981001.0800.0122 doctype=NEWS Bsec=122.25 Esec=155.14 Brecid=263 Erecid=335>
 ...
 <BOUNDARY docno=VOM19981001.0800.1651 doctype=NEWS Bsec=1651.67 Esec=1920.34 Brecid=3714 Erecid=4353>
 <BOUNDARY docno=VOM19981001.0800.1920 doctype=MISCELLANEOUS Bsec=1920.34 Esec=1932.49>
 <BOUNDARY docno=VOM19981001.0800.1932 doctype=NEWS Bsec=1932.49 Esec=2010.84 Brecid=4354 Erecid=4552>
 ...
 </BOUNDSET>

Note that broadcast files may contain "MISCELLANEOUS TEXT" story units
in which nothing is spoken or transcribed; the boundary table entries
for such units will lack the "Brecid" and "Erecid" attributes.  Also,
the "Bsec" and "Esec" attributes apply only to broadcast sources --
they are present in all boundary entries for these sources, and are
lacking in all boundary tables for newswire sources.


II.E.  Summary of data type distributions
-----------------------------------------

So, for each data sample in the corpus (i.e. each contiguous recording
from a given source on a given date covering a specific period of
time), there are several files, stored in separate directories,
containing different versions of data or information about the data
derived from that sample.

For example, a VOA_MAN broadcast has the reference text with
TIPSTER-style markup, a tokenized version of the reference text, the
output of an ASR system, a "TIPSTER-ized" markup version of the ASR
output, machine-translated versions of both the reference text and ASR
token streams, and boundary tables for all the various token stream
files; their various path names are as follows:

     asr_sgm/19980220_0700_0800_VOA_MAN.asr_sgm
     tkn_sgm/19980220_0700_0800_VOA_MAN.tkn_sgm
         tkn/19980220_0700_0800_VOA_MAN.tkn
     tkn_bnd/19980220_0700_0800_VOA_MAN.tkn_bnd
         as0/19980220_0700_0800_VOA_MAN.as0
     as0_bnd/19980220_0700_0800_VOA_MAN.as0_bnd
       mttkn/19980220_0700_0800_VOA_MAN.mttkn
   mttkn_bnd/19980220_0700_0800_VOA_MAN.mttkn_bnd
       mtas0/19980220_0700_0800_VOA_MAN.mtas0
   mtas0_bnd/19980220_0700_0800_VOA_MAN.mtas0_bnd

In each case, the file name extension string is identical to the name
of the directory containing the file.  The file-id is common to all
versions of data derived from the one sample.

The number of files present for a given sample depends on the
particular source, as follows:

 Source	   tkn_sgm  tkn asr_sgm as0  as1  mttkn  mtas0
 -----------------------------------------------------
 ABC_WNT	x    x     x     x    x
 CNN_HDL	x    x     x     x    x
 PRI_TWD	x    x     x     x    x
 VOA_ENG	x    x     x     x    x
 APW_ENG	x    x
 NYT_NYT	x    x
 VOA_MAN	x    x     x     x           x     x
 XIN_MAN	x    x                       x
 ZBN_MAN	x    x                       x


II.F  Differences in content among data types
---------------------------------------------

Naturally, when there are two or more distinct token streams drawn
from the same data sample, the number of tokens in each story will
vary depending on how the token stream was produced.  For example,
here are the various boundary table entries for one VOA_MAN story:

 as0_bnd/19980220_0700_0800_VOA_MAN.as0_bnd:
  <BOUNDARY docno=VOM19980220.0700.0221 doctype=NEWS Bsec=221.13 Esec=265.48 Brecid=358 Erecid=460>

 mtas0_bnd/19980220_0700_0800_VOA_MAN.mtas0_bnd:
  <BOUNDARY docno=VOM19980220.0700.0221 doctype=NEWS Bsec=221.13 Esec=265.48 Brecid=386 Erecid=500>

 mttkn_bnd/19980220_0700_0800_VOA_MAN.mttkn_bnd:
  <BOUNDARY docno=VOM19980220.0700.0221 doctype=NEWS Bsec=221.13 Esec=265.48 Brecid=369 Erecid=479>

 tkn_bnd/19980220_0700_0800_VOA_MAN.tkn_bnd:
  <BOUNDARY docno=VOM19980220.0700.0221 doctype=NEWS Bsec=221.13 Esec=265.48 Brecid=647 Erecid=842>

Apart from these obvious differences among the token streams, there
are also more subtle differences between "src_sgm" data and the
corresponding "tkn" token stream and "tkn_sgm" ("tipsterized") sets,
particularly in the case of newswire sources.  These differences are
created by the "tokenize" perl scripts, and are intended to assure
that the "tkn" and "tkn_sgm" data sets contain only the narrative
content of each story, in the most consistent form possible.  The
tokenization process addressed the following issues:

 - The content of <ANNOTATION> tags in all src_sgm files is removed.

 - In newswire sources, each story typically begins with a "dateline"
   at the start of the first paragraph (usually a place name, a date,
   an abbreviation of the newswire service, and/or an author's name);
   the dateline is removed.

 - In English newswires, the text often includes special "typesetting"
   codes; these are removed.

 - Mandarin newswires occasionally use "dingbat" characters (circles,
   X's or other special marks, typically intended as paragraph
   "bullets"); these are removed.

 - Xinhua news always ends each story with a single GB character
   enclosed in parentheses, and this is always the same character;
   this is removed.

 - Xinhua uses only 16-bit GB character encoding in its transmission,
   even when the story content includes alphanumeric or other ASCII
   symbols (i.e. for digits, proper names, acronyms, bracketing and
   some punctuation); the GB character set provides 16-bit codes for
   rendering these symbols, and all "XIN_MAN.src_sgm" files use these
   codes, whereas the other Mandarin sources (ZBN and VOA) use
   single-byte ASCII values; the tokenization recognizes the GB codes
   for ASCII symbols, and converts them to single-byte ASCII values.


III. Origins of reference and ASR text data for broadcast sources
=================================================================

The following sections provide more detailed information about the
creation and properties of reference and ASR text data; these issues
varied depending on the data source.


III.A. Closed-caption text -- ABC and CNN broadcasts
----------------------------------------------------

All sample files from these two television sources were accompanied by
closed-caption signal, which was converted to ASCII text for capture
via a standard serial port on a workstation.  The text content may be
relatively "telegraphic" in nature, because it is often the case that
closed-caption text tends to simplify or reduce the spoken content.
We have also observed that closed captions sometimes contain errors
(misspellings or misinterpretations of what is spoken).


III.B. FDCH transcripts -- ABC broadcasts only
----------------------------------------------

In order to support calibration of differences between closed caption
text and more careful, accurate human transcription, the LDC collected
commercially-produced transcripts, created by Federal Documents
Clearing House (FDCH), for 155 of the 162 ABC sample files.  FDCH was
operating under a contract with ABC to provide verbatim transcripts of
"World News Tonight" broadcasts, for general distribution and archival
records.  The accuracy and quality of the transcripts is quite high,
omitting only difluencies in the speech and non-news content
(commercials, etc).

These transcripts are included as alternate versions of the reference
text data for the ABC broadcasts, and they are distinguished from the
corresponding closed-caption data files by having the addition string
".fdch" as part of the file name.

Because the closed-caption text is taken to be the "common" (default)
form of reference text, none of the FDCH data is included among the
"tkn_sgm" files that are directly accessible on this cdrom.  Instead
all of the FDCH files are stored in the "tdt2proj" tar file.  When
fully unpacked according to the directions given in the top-level
"index.html" documentation, the array of reference text data for ABC
samples will appear as shown in the following example:

        src_sgm/19980106_1830_1900_ABC_WNT.src_sgm  
        src_sgm/19980106_1830_1900_ABC_WNT.fdch.src_sgm

            tkn/19980106_1830_1900_ABC_WNT.tkn
            tkn/19980106_1830_1900_ABC_WNT.fdch.tkn

        tkn_bnd/19980106_1830_1900_ABC_WNT.tkn_bnd
        tkn_bnd/19980106_1830_1900_ABC_WNT.fdch.tkn_bnd

        tkn_sgm/19980106_1830_1900_ABC_WNT.tkn_sgm  
        tkn_sgm/19980106_1830_1900_ABC_WNT.fdch.tkn_sgm

The seven ABC samples that lack FDCH transcripts (i.e. for which we
have only closed caption text) are:

19980104_1830_1900_ABC_WNT
19980111_1830_1900_ABC_WNT
19980125_1830_1900_ABC_WNT
19980322_1830_1900_ABC_WNT
19980414_1830_1900_ABC_WNT
19980509_1830_1900_ABC_WNT
19980523_1830_1900_ABC_WNT


III.C. Transcripts from other commercial services -- PRI, VOA
-------------------------------------------------------------

The radio broadcasts from PRI and VOA required manual transcription by
commercial services which were specifically contracted by the LDC for
this purpose.  Three different services were employed, one to handle
the complete set of VOA_MAN broadcasts, and two to share the load of
the PRI and VOA_ENG broadcasts.  Because of the large quantity of
audio material involved, along with budget and schedule limitations
for the initial production of the TDT2 text corpus, it was agreed that
these services should perform only limited quality control on the text
they produced.  The expectation was that the overall quality of the
resulting transcripts would be roughly equivalent to that of
closed-caption text.

To date, no careful assessment has been made of the VOA_MAN
transcripts to clarify their accuracy, though we believe their overall
quality is quite good -- close or comparable to that of FDCH texts.

The output of the two English transcription services was checked on
the basis of careful transcriptions, created later by the LDC, over a
4-hour sample of news stories from VOA_ENG and PRI; a similar check
was also done for the FDCH and closed-caption texts, on the basis of
careful transcriptions over a 6-hour sample of ABC and CNN stories.

Overall, the FDCH texts showed the best quality -- 5.9% word-error
rate (WER, counting insertions, deletions and substitutions, compared
to our most careful transcription standards); most of these "errors"
in the FDCH texts were presumably related to disfluencies in the
speech (e.g. when speakers stuttered or repeated portions of phrases);
the two English transcription services were fairly close to this level
of quality, each with about 7.5% WER; the closed caption texts showed
about 14.6% WER, on average.


III.D. The Dragon ASR System (as0)
----------------------------------

Dragon Systems used a streamlined version of its research-grade speech
recognizer on most of the English broadcast files and all the VOA
Mandarin broadcast files.  The output of this system included not only
the hypothesized text in word-tokenized form, but also, for each word:

 - the starting time offset and word duration
 - a confidence score for the word, between 0 and 1
 - a label for the particular "speaker cluster" that was selected as
   the best-performing speaker model in the recognition at that point
   in the file

We do not have "as0" data for the following sample files -- in most
cases, this was due to problems in tracking the sample files at the
LDC and conveying them to Dragon while collection, manual
transcription and other annotations were in progress:

19980222_1830_1900_ABC_WNT
19980424_1600_1630_CNN_HDL
19980528_1600_1630_CNN_HDL
19980528_2000_2100_PRI_TWD
19980611_0130_0200_CNN_HDL
19980615_2000_2100_PRI_TWD
19980617_2000_2100_PRI_TWD
19980618_1600_1630_CNN_HDL
19980619_2000_2100_PRI_TWD
19980622_2000_2100_PRI_TWD
19980628_1600_1630_CNN_HDL
19980629_2000_2100_PRI_TWD


III.E. The BBN "Byblos" ASR System (as1)
----------------------------------------

NIST used a streamlined version of the BBN Byblos English speech
recognizer on most of the English broadcast files.  The output of this
system consisted only of the hypothesized text in word-tokenized form,
and the starting time offset and duration for each word.  We do not
have "as1" data for the following sample files:

19980109_2000_2100_PRI_TWD
19980128_1130_1200_CNN_HDL

The "asr_sgm" version of the data uses "as0" as the source for all
VOA_MAN files, and for the two English broadcast files listed above.
All other English broadcast files in "asr_sgm" originate from the
"as1" data set.

In both sets of ASR data ("as0" and "as1"), the LDC post-processed the
token stream files produced by the ASR systems; we explicitly labeled
time gaps between successive words when these exceeded 0.1 sec, and
inserted "place-holder" attributes in the "as1" for confidence score
and speaker cluster (assigning a value of "NA" to these attributes for
all words), so that both ASR data streams would have equivalent
markup.  


IV. Supporting Materials
========================

In addition to the data directories cited above, this release contains
the following additional directories:

 tdt2_em/dtd -- contains SGML Document Type Definition files to
  specify the markup format of the boundary table files, token stream
  files, and the topic tables; the dtd files are necessary for using
  an SGML parsing utility (e.g. nsgmls) to process the various data
  files.  The functions of the dtd files are:

  - boundset.dtd -- for all "boundary table" files
  - docset.dtd   -- for all "token stream" files (as0,as1,tkn,mt*)
  - srctext.dtd  -- for all "src_sgm" files
  - tiptext.dtd  -- for all "tipsterized sgm" files (asr_sgm,tkn_sgm)
  - topicset.dtd -- for all "topic table" files (available from the
			    LDC and/or NIST web sites)

 doc -- tables and listings that describe the corpus content:

  - pub_file_list.txt  -- list of all files on the cdrom release
  - tarset_file_list.txt -- list of all files in "tdt2proj.tgz"
  - tdt2_stats_tables.txt -- summary of quantities by source and month
  - tdt2_docno_table.txt  -- list of all stories (DOCNO, file, DOCTYPE)
  - tdt2_release_notes.txt -- description of differences relative to v1.0
  - voa_names.tbl -- list of older names for VOA_ENG files (see
		     release notes for explanation)
  - tokenize_*_src.perl -- scripts that were used to create "tkn" and
			   "tkn_bnd" files from "src_sgm" data
  - tipsterize_tdt.perl -- used to create "tkn_sgm" and "asr_sgm" from
			   "tkn", "as0" and "as1" data, respectively

The "tipsterize_tdt" script can be used as follows, to create
"TIPSTER-style" SGML format for the machine-translated data sets
"mttkn" and "mtas0", assuming that you have copied the full corpus
onto a writable disk:

   cd  tdt2_em
   mkdir  mttkn_sgm  mkas0_sgm

   ../corpus_info/tipsterize_tdt.perl  -i mttkn  -o mttkn_sgm

   ../corpus_info/tipsterize_tdt.perl  -i mtas0  -o mtas0_sgm

In each case, the script processes every file from the given input
directory, and produces a corresponding "tipsterized" file in the
output directory; each output file has the same file name, except that
the extension is changed to match the name of the output directory.

Another use of this script is to create "tipsterized" data files using
an alternative set of boundary tables.  By default, "tipsterize_tdt"
will use the "ground truth" boundary tables included in this corpus
release (i.e. "tkn_bnd" for "tkn" data, etc); since one of the tasks
in TDT evaluations is automatic story boundary detection, there can be
an alternative set of boundary tables, generated by a detection
system.  You can create "tipsterized" files from any token stream data
set using an alternative set of boundary tables, as follows:

  - make sure the automatic story boundary information is rendered in
   a manner equivalent to the original boundary tables, and place the
   set of new tables in a separate directory under tdt2_em (next to
   the associated token stream directory), e.g. "alt_tkn_bnd"

  - use the "tipsterize" script with "-t table_dir" on the command
   line, in addition to the other arguments described above; e.g.:

 mkdir alt_tkn_sgm
 ../corpus_info/tipsterize_tdt.perl  -i tkn  -o alt_tkn_sgm  -t alt_tkn_bnd


Topic annotations that were produced by LDC to support the 1998 TDT
evaluations are provided at the LDC web site mentioned below.

Additional information about TDT is available at the following web
sites:

	http://www.ldc.upenn.edu/Projects/TDT2/
	http://www.nist.gov/speech/tests/tdt/

Both web sites also provide additional information and resources for
the TDT project: the LDC site includes the archives of email
discussions among TDT participants via <tdt-distrib@ldc.upenn.edu>,
and access to related resources, such as English/Mandarin glossing
lexicons and parallel text collections.  The NIST site includes
complete documentation and software resources for running TDT system
evaluations, and papers presented at TDT workshops by participants.