TDT5 Multilingual News Text
LDC2006T18
December 4, 2006
I. Introduction
This file contains documentation for TDT5 Multilingual News Text,
Linguistic Data Consortium (LDC) catalog number LDC2006T18, ISBN
1-58563-417-4.
The TDT5 corpora were created by the Linguistic Data Consortium with
support from the DARPA TIDES (Translingual Information Detection,
Extraction and Summarization) Program. This release contains the complete
set of English,
Arabic and Chinese newswire text used in the 2004 Topic Detection and
Tracking technology evaluations. The topic relevance annotations
corresponding to this publication can be found in LDC Publication LDC2006T19,
TDT5 Topics and Annotations.
Topic Detection and Tracking (TDT) refers to automatic techniques for
finding topically related material in streams of data such as newswire and
broadcast news.
There were four TDT tasks defined for the 2004 evaluation: the tracking of
known topics, the detection of unknown topics, the detection of initial
stories on unknown topics, and the detection of pairs of stories on the
same topic (links). Of these four tasks, the topic tracking task and the
link detection task are considered to be "primary." Previous TDT
evaluations also included a story segmentation task. This task applied
only to broadcast news. Since TDT5 does not include broadcast news, there
is no story segmentation task in the 2004 TDT Evaluation.
Complete documentation on the TDT evaluation program can be found on NIST's
TDT website.
For further information about corpora and annotations supporting the TDT
Program, visit LDC's TDT information pages.
II. Data Profile
The TDT5 corpus spans collections from April-September 2003 and consists of
English, Chinese, and Arabic news text. A total of 15 distinct news "sources"
are included (where a "source" comprises data from a given news agency in a
particular language; when an agency publishes in multiple languages, each
language is considered a different "source"). In contrast to earlier TDT
corpora, TDT5 contains no broadcast/audio content, only news text from
newswire and web sources.
Language  Code  Source
--------  ----  ----------------------------
Arabic    AFA   Agence France Presse
          ANN   An-Nahar
          UMM   Ummah Press
          XIA   Xinhua News Agency
English   AFE   Agence France Presse
          APE   Associated Press
          CNE   Central News Agency - Taiwan
          LAT   LA Times/Washington Post
          NYT   New York Times
          UME   Ummah Press
          XIE   Xinhua News Agency
Chinese   AFC   Agence France Presse
          CNA   Central News Agency - Taiwan
          XIN   Xinhua News Agency
          ZBN   Zaobao News Agency
TDT5 comprises a subset of an earlier LDC release, HARD 2004 Text
(LDC2005T28). In the HARD 2004 Text release, data is organized into one
file per day per source. For the current TDT5 corpus, roughly half of the
daily files are partitioned into chunks of bounded size (on average around
30,000 words each), in order to support the "look-ahead" condition in the
TDT5 evaluation plan and to keep sample file sizes relatively consistent.
Summary statistics about the volume of data by source are available in the
tdt5_stats_tables.txt document found in the /docs directory of this release.
III. Notes on Time/Date Properties of Corpus
Most of the sources included in TDT5 are web sites from which we typically
download all content at daily intervals, or other types of electronic
archives that we receive in bulk, as opposed to wire services that run
continuously over dedicated modems. This difference affects one of the
underlying assumptions about TDT data.
Modem/wire collections tend to behave like 24-hour news channels, giving a
sequence of reports on a given event with details being added over time;
each story comes with a time-stamp, and stories are written to data files
in chronological order.
The web/bulk sources behave differently: for each date, we get a snapshot
of information that the source asserts is current at a given moment. The
web sources often do not provide anything like time-stamps on the stories
(or LDC's download/conditioning may have failed to locate or retain
time-stamps); in any case, the sequence in which stories for a given date
are received -- and the order in which they are stored to a given daily
collection file -- may be random relative to when the story was posted on
the web site. We regard this timing variation as unrecoverable, but
expect it to have little practical effect.
Of the 15 sources in TDT5, we have time-stamp data and chronological
ordering for only 5. These 5 all happen to be sources whose daily volume
requires partitioning of the data into two or more files per day. There
are an additional 5 sources, without time-stamps and with "indeterminate"
ordering of stories within each day's collection, that must be split up as
well, into as many as 4 partitions per day. To make these fit within the
TDT framework, we needed to "invent" time-stamps for their stories and
place them into files that occupy particular time periods each day.
(The remaining 5 sources with no time-stamps are low-volume, and their
files will not be partitioned; their data files will be assigned to
arbitrarily chosen times of day, such that each source will always occupy
the same given time-slot.)
On balance, it's unlikely that the partitioning of the non-time-stamped
sources will produce anomalies where a more detailed follow-up story
appears earlier in the day than a shorter, "first-on-topic" type of story
within a given source. The archival nature of these sources tends to
eliminate multiple, time-ordered versions of a given story within a single
daily snapshot.
It is conceivable that a given event might show up in AP (a time-stamped
modem feed) at 15:50 with "first-on-topic" brevity, while the same event
might be reported with follow-up detail in a Zaobao file with a time-stamp
of 08:10. But this sort of variance has always been a feature in TDT.
Here are some general observations/rules for splitting up the original
TDT5 data files (one file per source per day) in order to produce
sample units that are more appropriate for TDT. Since most of the
variance in file size correlates with source and language, the rules
are stated in those terms.
afc -> AFP_CHN m : all files remain unsplit
cna -> CNA_CHN m : all files remain unsplit
cne -> CNA_ENG m : all files remain unsplit
ume -> UMM_ENG m : all files remain unsplit
umm -> UMM_ARB m : all files remain unsplit
ann -> ANN_ARB m : most files remain unsplit, a few split in half
xia -> XIN_ARB m : most files remain unsplit, a few split in half
nyt -> NYT_ENG t : files either split in half or remain unsplit
zbn -> ZBN_CHN m : files either split in half or remain unsplit
afa -> AFP_ARB i : most files split in half, a few remain unsplit
lat -> LAT_ENG t : most files split into 2 to 4 partitions each
xie -> XIN_ENG m : all files split into 2 to 3 partitions each
xin -> XIN_CHN m : all files split into 3 or 4 partitions each
afe -> AFP_ENG i : files split into 5 to 10 partitions each
ape -> APW_ENG t : files split into 6 to 12 partitions each
The lower-case letter just before each colon shows the status of
time stamps for each source:
t = true time stamps already exist in src_sgm markup
i = time stamps can be imported from raw data
m = must invent "made-up" time stamps for stories
The splitting strategy is based on the following token-count thresholds
(see the code sketch after this list):
< 36000 tokens -- do not split
36000 - 70000 -- split into 2
70000 - 100000 -- split into 3
100000 - 130000 -- split into 4
... and so on at intervals of 30000.
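For concreteness, here is a minimal Python sketch of that threshold rule
(the function name and boundary handling are our own illustration; the
LDC's actual partitioning tools are not part of this release):

    def num_partitions(token_count):
        """Map a daily file's token count to a partition count:
        < 36000 -> 1 (do not split), 36000-70000 -> 2, and one more
        partition for each additional 30000-token interval."""
        if token_count < 36000:
            return 1
        if token_count < 70000:
            return 2
        # 70000-100000 -> 3, 100000-130000 -> 4, and so on
        return 3 + (token_count - 70000) // 30000

For example, num_partitions(95000) returns 3, matching the 70000-100000
range above.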
Regarding the 10 sources that come with no time-stamp data (marked with
"m" above), times are allocated as follows (a code sketch of this
allocation follows the list):
afc -> AFP_CHN : 0300_0500
ann -> ANN_ARB : 1100_1230,1230_1400
cna -> CNA_CHN : 0530_0700
cne -> CNA_ENG : 1300_1430
ume -> UMM_ENG : 2100_2300
umm -> UMM_ARB : 1900_2100
xia -> XIN_ARB : 0600_0800,1500_1700
xie -> XIN_ENG : 0900_1100,1300_1500,1700_1900
xin -> XIN_CHN : 0130_0330,0730_0930,1130_1330,1530_1730
zbn -> ZBN_CHN : 1000_1200,1900_2100
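The allocation above amounts to a simple lookup table. The following
Python sketch (our own illustration, not the LDC's processing code)
returns the invented time slot for a given partition of a day's file:

    # Invented daily time slots for the "m" sources, copied from the
    # allocation table above; slot strings are HHMM_HHMM.
    TIME_SLOTS = {
        "afc": ["0300_0500"],
        "ann": ["1100_1230", "1230_1400"],
        "cna": ["0530_0700"],
        "cne": ["1300_1430"],
        "ume": ["2100_2300"],
        "umm": ["1900_2100"],
        "xia": ["0600_0800", "1500_1700"],
        "xie": ["0900_1100", "1300_1500", "1700_1900"],
        "xin": ["0130_0330", "0730_0930", "1130_1330", "1530_1730"],
        "zbn": ["1000_1200", "1900_2100"],
    }

    def invented_slot(source, partition_index):
        """Return the time slot for the given 0-based partition of a
        day's file from one of the non-time-stamped sources."""
        return TIME_SLOTS[source][partition_index]

Each source always occupies the same slots, so the invented time-stamps
are stable from day to day.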
IV. Annotations of the Corpus
The TDT5 Corpus has been annotated in multiple ways, including topic
relevance judgments, link detection, and adjudication of site submissions.
All annotations are available in the TDT5 Topics and Annotations Corpus,
LDC2006T19.
Additional information about annotation of the TDT5 corpus is available at
http://www.ldc.upenn.edu/Projects/TDT5/Annotation/TDT2004V1.2.pdf
V. Corpus Structure
The organization of data in the corpus is intended to provide direct
support for the research tasks defined in the yearly TDT evaluation
plans (available at http://www.nist.gov/speech/tests/tdt/index.htm),
while also providing a data format compatible with other research
projects including information extraction, information retrieval,
summarization and other technologies.
Each data sample is presented in a variety of forms, with each form
placed in a separate directory under /data.
The forms of data in this release (and their directory names) are:
tkn_sgm -- Reference text data derived from "tkn" files, in an SGML
markup format similar to the TIPSTER text corpora
mttkn_sgm -- Machine translation output from ISI, in an
SGML markup format similar to the TIPSTER text
corpora.
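For readers who want to process the tkn_sgm or mttkn_sgm files directly,
here is a minimal Python sketch that pulls documents out of one file. It
assumes TIPSTER-style <DOC>, <DOCNO>, and <TEXT> elements and a UTF-8
encoding; verify both against tiptext.dtd (in the /dtd directory) and the
corpus documentation before relying on it:

    import re

    DOC_RE   = re.compile(r"<DOC>(.*?)</DOC>", re.DOTALL)
    DOCNO_RE = re.compile(r"<DOCNO>\s*(\S+)\s*</DOCNO>")
    TEXT_RE  = re.compile(r"<TEXT>(.*?)</TEXT>", re.DOTALL)

    def read_docs(path):
        """Yield (docno, text) pairs from one tkn_sgm/mttkn_sgm file."""
        with open(path, encoding="utf-8") as f:
            data = f.read()
        for doc in DOC_RE.finditer(data):
            body = doc.group(1)
            docno = DOCNO_RE.search(body)
            text = TEXT_RE.search(body)
            if docno and text:
                yield docno.group(1), text.group(1).strip()

A full SGML parser (e.g. nsgmls with the DTDs under /dtd) is the more
robust route; the regular expressions above are only a convenience for
well-formed files.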
The other data formats used in the NIST TDT5 evaluation plan involve
"token-stream" data, originally designed to support the
story-segmentation task for broadcast data. Because these
formats are much bulkier, and are useful only for replicating the NIST
TDT5 evaluation procedures, they are being provided in the form of a
compressed unix "tar" file: tdt5proj.tgz contains the token stream and
boundary table files for the "tkn" and "mttkn" data as used in the NIST
TDT5 evaluation. The tar file also contains the "pre-tokenized" versions
of all source text data (files identified as "src_sgm").
Users of the GNU "tar" utility (or an equivalent command line tool) can
unpack the tar file contents as follows:
# copy tdt5proj.tgz to the current working directory, then:
tar xzf tdt5proj.tgz
Some users may need to uncompress the file before extracting the
contents:
gunzip < tdt5proj.tgz > tdt5proj.tar
tar xf tdt5proj.tar
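The same extraction can also be done programmatically, for example with
Python's standard tarfile module:

    import tarfile

    # Unpack the token-stream and src_sgm data into the current directory.
    with tarfile.open("tdt5proj.tgz", "r:gz") as tf:
        tf.extractall()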
A complete listing of the contents of the tar file (including names and
sizes of all data files) is provided in "docs/proj_filelist.txt".
Our thanks to ISI (especially Ignacio Thayer & Kevin Knight) for providing
MT output for the corpus.
VI. Supporting Materials
In addition to the data directories cited above, this release contains
the following additional directories:
dtd -- contains SGML Document Type Definition files to specify the
markup format of the boundary table files, token stream files, and
the topic tables; the dtd files are necessary for using an SGML
parsing utility (e.g. nsgmls) to process the various data files.
The functions of the dtd files are:
- boundset.dtd -- for all "boundary table" files
- docset.dtd -- for all "token stream" files (tkn,mttkn)
- tiptext.dtd -- for all "tipsterized sgm" files (tkn_sgm,mttkn_sgm)
- srctext.dtd -- for all "src_sgm" files
- topicset.dtd -- for results of topic annotations (provided in the
TDT5 Topics and Annotations Corpus, LDC2006T19)
docs -- tables and listings that describe the corpus content:
- tdt5_stats_tables.txt -- summary of quantities by source and month
- content_summary.txt -- this file
- tdt5proj_filelist.txt -- list of contents in the release
-----------
README Created Stephanie Strassel 12/4/2006
Updated David Graff 12/11/2006