" tags are used in broadcast sources to mark speaker
changes, when these are known from the original transcription.
- " ... " tags are used to bracket other
information about the story content, in both broadcast and newswire
sources (e.g. comments about noise in the audio, or instructions to
editors in longer newswire stories) -- in other words, material
enclosed between these tags is NOT part of the actual story
content; note that the opening tag, enclosed commentary and closing
tag are usually on separate lines in the data files.
- "" is used in some sources to mark paragraph breaks.
II.D. The "token stream" data types
------------------------------------
These five data sets (tkn, as0, as1, mttkn, mtas0) have all been packaged
together in the compressed tar file "tdt3proj.tgz"; these directories
will be created within "tdt3_em" when the tar file is unpacked (see
the top-level "index.html" file on the cdrom for instructions to
unpack the tar file). The files in all of these data sets share the
same basic SGML markup strategy: each word token appears on its own
line, enclosed in an SGML tag that assigns the token a "recid" value
that is unique within the file.
For each of these data sets, there is a separate directory containing
a set of "boundary table" files, one boundary table for each sample
file, which provides the mapping of story boundaries to the
corresponding token stream in terms of the "recid" values assigned to
the tokens. A boundary table contains one SGML tag for
each story unit, and the attributes in this tag identify the DOCNO,
the DOCTYPE, the beginning and ending "recid" numbers in the token
stream file that make up the token content of the story (if any), and
for broadcast sources, the beginning and ending time offsets for the
story in the corresponding audio file (to be found in the TDT3 Speech
corpora, which are distributed separately).
Note that broadcast files may contain "MISCELLANEOUS TEXT" story units
in which nothing is spoken or transcribed; the boundary table entries
for such units will lack the "Brecid" and "Erecid" attributes. Also,
the "Bsec" and "Esec" attributes apply only to broadcast sources --
they are present in all boundary entries for these sources, and are
lacking in all boundary tables for newswire sources.
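Given this layout, a story's token content can be recovered with very
little machinery. The sketch below is a hypothetical Python fragment
(it is not part of the corpus tools); it relies only on what is
described above, namely that each boundary entry is one SGML tag
carrying Brecid/Erecid attributes (absent for untranscribed units) and
that each token line in the stream file carries its own recid
attribute:

  import re

  BRECID_RE = re.compile(r'Brecid="?(\d+)')
  ERECID_RE = re.compile(r'Erecid="?(\d+)')
  TOKEN_RE  = re.compile(r'\brecid="?(\d+)"?[^>]*>\s*([^<\s]+)')

  def story_recid_ranges(bnd_file, encoding="utf-8"):
      # Yield (Brecid, Erecid) for each story unit that has token content;
      # entries without these attributes (e.g. untranscribed MISCELLANEOUS
      # TEXT units) are skipped.
      with open(bnd_file, encoding=encoding, errors="replace") as f:
          for line in f:
              b, e = BRECID_RE.search(line), ERECID_RE.search(line)
              if b and e:
                  yield int(b.group(1)), int(e.group(1))

  def story_tokens(tkn_file, brecid, erecid, encoding="utf-8"):
      # Collect the tokens whose recid falls within [brecid, erecid].
      tokens = []
      with open(tkn_file, encoding=encoding, errors="replace") as f:
          for line in f:
              m = TOKEN_RE.search(line)
              if m and brecid <= int(m.group(1)) <= erecid:
                  tokens.append(m.group(2))
      return tokens

  # e.g., tokens of the first story in the VOA_MAN sample listed in II.E:
  # b, e = next(story_recid_ranges("tkn_bnd/19981220_0700_0800_VOA_MAN.tkn_bnd"))
  # words = story_tokens("tkn/19981220_0700_0800_VOA_MAN.tkn", b, e)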
II.E. Summary of data type distributions
-----------------------------------------
So, for each data sample in the corpus (i.e. each contiguous recording
from a given source on a given date covering a specific period of
time), there are several files, stored in separate directories,
containing different versions of data or information about the data
derived from that sample.
For example, a VOA_MAN broadcast has the reference text with
TIPSTER-style markup, a tokenized version of the reference text, the
output of an ASR system, a "TIPSTER-ized" markup version of the ASR
output, machine-translated versions of both the reference text and ASR
token streams, and boundary tables for all the various token stream
files; their various path names are as follows:
asr_sgm/19981220_0700_0800_VOA_MAN.asr_sgm
tkn_sgm/19981220_0700_0800_VOA_MAN.tkn_sgm
tkn/19981220_0700_0800_VOA_MAN.tkn
tkn_bnd/19981220_0700_0800_VOA_MAN.tkn_bnd
as0/19981220_0700_0800_VOA_MAN.as0
as0_bnd/19981220_0700_0800_VOA_MAN.as0_bnd
mttkn/19981220_0700_0800_VOA_MAN.mttkn
mttkn_bnd/19981220_0700_0800_VOA_MAN.mttkn_bnd
mtas0/19981220_0700_0800_VOA_MAN.mtas0
mtas0_bnd/19981220_0700_0800_VOA_MAN.mtas0_bnd
In each case, the file name extension string is identical to the name
of the directory containing the file. The file-id is common to all
versions of data derived from the same sample.
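Because of this convention, the path of any derived file can be built
mechanically. The helper below is a hypothetical Python sketch (it is
not part of the release); "corpus_root" stands for whichever directory
holds the data type directories (e.g. "tdt3_em" for the token stream
sets):

  from pathlib import Path

  def derived_path(corpus_root, file_id, data_type):
      # <corpus_root>/<data_type>/<file-id>.<data_type>, as described above
      return Path(corpus_root) / data_type / (file_id + "." + data_type)

  # e.g. derived_path("tdt3_em", "19981220_0700_0800_VOA_MAN", "mttkn_bnd")
  # -> tdt3_em/mttkn_bnd/19981220_0700_0800_VOA_MAN.mttkn_bnd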
The number of files present for a given sample depends on the
particular source, as follows:
Source     tkn   asr   mttkn  mtasr
------------------------------------
APW_ENG     x
NYT_NYT     x
ABC_WNT     x     x
CNN_HDL     x     x
MSN_NBW     x     x
NBC_NNW     x     x
PRI_TWD     x     x
VOA_ENG     x     x
XIN_MAN     x            x
ZBN_MAN     x            x
AFP_ARB     x            x
ALH_ARB     x            x
ANN_ARB     x            x
CBS_MAN     x     x      x      x
CNR_MAN     x     x      x      x
CTS_MAN     x     x      x      x
CTV_MAN     x     x      x      x
VOA_MAN     x     x      x      x
NTV_ARB     x     x      x      x
VOA_ARB     x     x      x      x
II.F. Differences in content among data types
---------------------------------------------
Naturally, when there are two or more distinct token streams drawn
from the same data sample, the number of tokens in each story will
vary depending on how the token stream was produced. For example,
the entry for a single VOA_MAN story appears in each of the following
boundary table files, with Brecid/Erecid values that reflect the token
count of the corresponding stream:
as0_bnd/19981220_0700_0800_VOA_MAN.as0_bnd
mtas0_bnd/19981220_0700_0800_VOA_MAN.mtas0_bnd
mttkn_bnd/19981220_0700_0800_VOA_MAN.mttkn_bnd
tkn_bnd/19981220_0700_0800_VOA_MAN.tkn_bnd
Apart from these obvious differences among the token streams, there
are also more subtle differences between "src_sgm" data and the
corresponding "tkn" token stream and "tkn_sgm" ("tipsterized") sets,
particularly in the case of newswire sources. These differences are
created by the "tokenize" perl scripts, and are intended to assure
that the "tkn" and "tkn_sgm" data sets contain only the narrative
content of each story, in the most consistent form possible. The
tokenization process addressed the following issues:
- The content of "<ANNOTATION>" tags in all src_sgm files is removed.
- In newswire sources, each story typically begins with a "dateline"
at the start of the first paragraph (usually a place name, a date,
an abbreviation of the newswire service, and/or an author's name);
the dateline is removed.
- In English newswires, the text often includes special "typesetting"
codes; these are removed.
- Mandarin newswires occasionally use "dingbat" characters (circles,
X's or other special marks, typically intended as paragraph
"bullets"); these are removed.
- Xinhua news always ends each story with a single GB character
enclosed in parentheses, and this is always the same character;
this is removed.
- Xinhua uses only 16-bit GB character encoding in its transmission,
even when the story content includes alphanumeric or other ASCII
symbols (i.e. for digits, proper names, acronyms, bracketing and
some punctuation); the GB character set provides 16-bit codes for
rendering these symbols, and all "XIN_MAN.src_sgm" files use these
codes, whereas the other Mandarin sources (ZBN and VOA) use
single-byte ASCII values; the tokenization recognizes the GB codes
for ASCII symbols, and converts them to single-byte ASCII values (a
sketch of this kind of mapping follows this list).
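The following Python fragment is a minimal sketch of that kind of
full-width-to-ASCII mapping -- it is not the LDC "tokenize" script
itself -- and it assumes the Mandarin text has already been decoded
from GB to Unicode (e.g. with Python's "gb2312" codec), where the
full-width ASCII symbols occupy the range U+FF01 through U+FF5E:

  def normalize_fullwidth(text):
      out = []
      for ch in text:
          code = ord(ch)
          if 0xFF01 <= code <= 0xFF5E:
              # full-width ASCII symbol: shift down into the single-byte range
              out.append(chr(code - 0xFEE0))
          elif code == 0x3000:
              # ideographic (full-width) space
              out.append(" ")
          else:
              out.append(ch)
      return "".join(out)

  # e.g. normalize_fullwidth("ＡＢＣ１２３（）") == "ABC123()"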
III. Origins of reference text data for broadcast sources
=========================================================
The broadcast sources fall into two distinct groups in terms of
broadcast media:
Television sources with closed captions: ABC, CNN, NBC, MNB
Radio and other video sources: PRI, VOA, CBS, CNR, CTS, CTV, NTV
These two groups are distinct in terms of how the reference text data
were created, and this affects the relative quality of the reference
text with respect to word accuracy (i.e. faithfulness to what was
actually spoken in the audio signal).
For the first group of television sources (that is, the ones in
English), all reference text has been drawn from the closed-caption
signal that accompanied the video broadcast. As a result, the text
may be relatively "telegraphic" in nature, because closed-caption text
often simplifies or condenses the spoken content. We have also
observed that closed captions sometimes contain
errors (misspellings or misinterpretations of what is spoken).
For the radio sources, and for all non-English video sources, all
reference text has been manually transcribed from digital recordings
by professional transcription services. In general, the quality of
these transcripts is quite good in terms of lexical accuracy, and the
English data are virtually free of spelling errors.
IV. Supporting Materials
========================
In addition to the data directories cited above, this release contains
the following additional directories:
dtd -- contains SGML Document Type Definition files to specify the
markup format of the boundary table files, token stream files, and
the topic tables; the dtd files are necessary for using an SGML
parsing utility (e.g. nsgmls) to process the various data files.
The functions of the dtd files are:
- boundset.dtd -- for all "boundary table" files
- docset.dtd -- for all "token stream" files (as0,as1,tkn,mt*)
- tiptext.dtd -- for all "tipsterized sgm" files (asr_sgm,tkn_sgm)
- srctext.dtd -- for all "src_sgm" files
doc -- tables and listings that describe the corpus content:
- tdt4_stats.tables -- summary of quantities by source and month
- tdt4_docno.table -- list of all stories (DOCNO, file, DOCTYPE)
- content_summary.txt -- this file
- tdt4guidelines_v1_5.pdf -- details of annotation procedures
Topic annotations that were produced by LDC to support the 1999 and
2000 TDT evaluations are being made available separately, via both LDC
and NIST web sites:
http://www.ldc.upenn.edu/Projects/TDT4/
http://www.nist.gov/speech/tests/tdt/
Both web sites also provide additional information and resources for
the TDT project: the LDC site includes the archives of email
discussions among TDT participants,
and access to related resources, such as English/Mandarin glossing
lexicons and parallel text collections. The NIST site includes
complete documentation and software resources for running TDT system
evaluations, and papers presented at TDT workshops by participants.