README File for the GIGAWORD ENGLISH TEXT CORPUS
================================================
INTRODUCTION
------------
The Gigaword English Corpus is a comprehensive archive of newswire
text data acquired over several years by the Linguistic Data
Consortium (LDC) at the University of Pennsylvania.
Four distinct international sources of English newswire are
represented here:
- Agence France Presse English Service (afe)
- Associated Press Worldstream English Service (apw)
- The New York Times Newswire Service (nyt)
- The Xinhua News Agency English Service (xie)
The three-character abbreviations shown above serve both as the names
of the directories where the data files are found and as the prefix
that appears at the beginning of every file name.
Much of the content in this collection has been published previously
by the LDC in a variety of other, older corpora, particularly the
North American News text corpora, the various TDT corpora, and the
AQUAINT text corpus. But there is a significant amount of material
that is being released here for the first time: all of the Agence
France Presse content, the 1995 and 2001 Xinhua content, and the
portions of NYT and APW dating from February 2001 forward.
DATA FORMAT AND SGML MARKUP
---------------------------
Each data file name consists of the 3-letter prefix, followed by a
6-digit date (representing the year and month during which the file
contents were generated by the respective news source), followed by a
".gz" file extension, indicating that the file contents have been
compressed using the GNU "gzip" compression utility (RFC 1952). So,
each file contains all the usable data received by LDC for the given
month from the given news source.
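
For example, "afe/afe199501.gz" would hold the AFE data for January
1995. As a minimal sketch (the corpus root and file path here are
illustrative), one monthly file can be read in Python like this:

  import gzip

  # Illustrative path: AFE data for January 1995, relative to the
  # corpus root.
  path = "afe/afe199501.gz"

  # The corpus content is plain ASCII, so a text-mode read is safe.
  with gzip.open(path, mode="rt", encoding="ascii") as f:
      n_lines = sum(1 for _ in f)

  print(path, "contains", n_lines, "lines")
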
All text data are presented in SGML form, using a very simple, minimal
markup structure; all text consists of printable ASCII and whitespace.
The file "gigaword_e.dtd" in the "docs" directory provides the formal
"Document Type Declaration" for parsing the SGML content. The corpus
has been fully validated by a standard SGML parser utility (nsgmls),
using this DTD file.
The markup structure, common to all data files, can be summarized as
follows:

  <DOC id="..." type="...">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

Paragraph tags ("<P>") are only used if the 'type' attribute of the
DOC happens to be "story" -- more on the 'type' attribute below...

Note that all data files use the UNIX-standard "\n" form of line
termination, and text lines are generally wrapped to a width of 80
characters or less.

The 'type' attribute of each DOC takes one of four values:

* story : This is by far the most frequent type, and it represents the
most typical newswire item: a coherent report on a particular topic or
event, consisting of paragraphs and full sentences. The paragraph tag
"<P>" is found only in DOCs of this type; in the other types described
below, the text content is rendered with no additional tags or special
characters -- just lines of ASCII tokens separated by whitespace.
* multi : This type of DOC contains a series of unrelated "blurbs",
each of which briefly describes a particular topic or event; this is
typically applied to DOCs that contain "summaries of today's news",
"news briefs in ... (some general area like finance or sports)", and
so on. Each paragraph-like blurb by itself is coherent, but it does
not bear any necessary relation of topicality or continuity relative
to its neighbors.
* advis : (short for "advisory") These are DOCs that the news service
addresses to news editors -- they are not intended for publication to
the "end users" (the populations who read the news); as a result, DOCs
of this type tend to contain obscure abbreviations and phrases, which
are familiar to news editors but may be meaningless to the general
public. We also find a lot of formulaic, repetitive content in DOCs
of this type (contact phone numbers, etc.).
* other : This represents DOCs that clearly do not fall into any of
the above types -- in general, items of this type are intended for
broad circulation (they are not advisories), they may be topically
coherent (unlike "multi" type DOCs), and they typically do not
contain paragraphs or sentences (they aren't really "stories");
these are things like lists of sports scores, stock prices,
temperatures around the world, and so on.
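
Because the markup is this simple and line-oriented, the DOC types
can be tallied without a full SGML parse. The following sketch counts
the 'type' values in one monthly file; the file path is illustrative,
and it assumes each <DOC ...> tag occupies its own line with the
attributes in the order shown above:

  import gzip
  import re
  from collections import Counter

  # Assumes 'id' precedes 'type', as in the template shown earlier.
  DOC_RE = re.compile(r'<DOC id="[^"]*" type="([^"]*)"')

  def count_types(path):
      """Tally the DOC 'type' values in one gzipped data file."""
      counts = Counter()
      with gzip.open(path, mode="rt", encoding="ascii") as f:
          for line in f:
              m = DOC_RE.match(line)
              if m:
                  counts[m.group(1)] += 1
      return counts

  print(count_types("afe/afe199501.gz"))
  # Expected keys: "story", "multi", "advis", "other"
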
The general strategy for categorizing DOCs into these four classes
was, for each source, to identify the most common and reliable clues
in the text stream that correlated with the three "non-story" types,
and to apply the appropriate label for the ``type=...'' attribute
whenever the DOC displayed one of these specific clues. When none of
the known clues was in evidence, the DOC was classified as a "story".
This means that the most frequent classification error will tend to be
the use of `` type="story" '' on DOCs that are actually some other
type. But the number of such errors should be fairly small, compared
to the number of "non-story" DOCs that are correctly tagged as such.
Note that the markup was applied algorithmically, using logic that was
based on less-than-complete knowledge of the data. For the most part,
the HEADLINE, DATELINE and TEXT tags have their intended content; but
due to the inherent variability (and the inevitable source errors) in
the data, users may find occasional mishaps where the headline and/or
dateline were not successfully identified (hence show up within TEXT),
or where an initial sentence or paragraph has been mistakenly tagged
as the headline or dateline.
DATA QUANTITIES
---------------
The "docs" directory contains a set of plain-text tables (datastats.*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the four "type" categories. The
overall totals for each source are summarized below. Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. nearly 12 gigabytes, total); the "Gzip-MB" column
shows totals for compressed file sizes as stored on the DVD-ROM; the
"K-wrds" numbers are simply the number of whitespace-separated tokens
(of all types) after all SGML tags are eliminated.
Source   #Files   Gzip-MB   Totl-MB    K-wrds     #DOCs
AFE          44       417      1216    170969    656269
APW          91      1213      3647    539665   1477466
NYT          96      2104      5906    914159   1298498
XIE          83       320       940    131711    679007
TOTAL       314      4054     11709   1756504   4111240
The following tables present "Text-MB", "K-wrds" and "#DOCs" broken
down by source and DOC type; "Text-MB" represents the total number of
characters (including whitespace) after SGML tags are eliminated.

          Text-MB     K-wrds      #DOCs
type="advis":
  AFE          33       3748      16788
  APW         115      17292      29628
  NYT         446      69453     126812
  XIE          12       1885       6473
  TOTAL       606      92378     179701
type="multi":
  AFE          27       4032      12072
  APW         212      33934      50143
  NYT         110      17773      28455
  XIE          58       9125      41367
  TOTAL       407      64864     132037
type="other":
  AFE          25       3575      36279
  APW         235      33751     214710
  NYT         109      16195      23867
  XIE          33       4981      44776
  TOTAL       402      58502     319632
type="story":
  AFE         992     159614     591130
  APW        2791     454693    1182985
  NYT        4904     810744    1119364
  XIE         728     115717     586391
  TOTAL      9415    1540768    3479870
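
For reference, per-type figures of this kind could be re-derived from
the data files along the following lines. This is only a sketch (not
the script the LDC used, and the path is illustrative); it strips tags
with a regular expression rather than a true SGML parse:

  import gzip
  import re
  from collections import defaultdict

  TAG_RE = re.compile(r"<[^>]*>")
  TYPE_RE = re.compile(r'type="([^"]*)"')

  def per_type_quantities(path):
      """Characters and tokens per DOC type, SGML tags excluded."""
      chars = defaultdict(int)   # feeds a "Text-MB" figure
      words = defaultdict(int)   # feeds a "K-wrds" figure
      doc_type = None
      with gzip.open(path, mode="rt", encoding="ascii") as f:
          for line in f:
              if line.startswith("<DOC "):
                  m = TYPE_RE.search(line)
                  doc_type = m.group(1) if m else None
                  continue
              if line.startswith("</DOC>"):
                  doc_type = None
                  continue
              if doc_type is not None:
                  text = TAG_RE.sub("", line)
                  chars[doc_type] += len(text)
                  words[doc_type] += len(text.split())
      return chars, words
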
GENERAL AND SOURCE-SPECIFIC PROPERTIES OF THE DATA
--------------------------------------------------
Most of the text data (all of AFE and NYT, most of APW) were received
at LDC via dedicated, 24-hour/day electronic feeds (leased phone lines
in the case of APW and NYT, a local satellite dish for AFE). These
24-hour transmission services were all susceptible to "line noise"
(occasional corruption of text content), as well as service outages
both at the data source and at our receiving computers. Usually, the
various disruptions of a newswire data stream would leave tell-tale
evidence in the form of byte values falling outside the range of
printable ASCII characters, or recognizable patterns of anomalous
ASCII strings.
All XIE data and a two-year portion of APW data were received as bulk
electronic text archives via internet retrieval. As such, they were
not susceptible to modem line-noise or related disruptions, though
this does not guarantee that the source data are free of mishaps.
We can say for certain that all the data, including the internet bulk
archives, have undergone the same consistent quality control, which
eliminates non-ASCII content and other obvious forms of corruption.
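
That claim is easy to re-check: after decompression, every byte should
be printable ASCII or whitespace. A minimal sketch (the path is
illustrative, and "whitespace" is assumed to mean tab and the UNIX
newline):

  import gzip

  # Printable ASCII (space through tilde) plus tab and newline.
  ALLOWED = set(range(0x20, 0x7f)) | {0x09, 0x0a}

  def scan_for_bad_bytes(path):
      """Return (offset, byte) pairs outside the expected range."""
      with gzip.open(path, mode="rb") as f:
          data = f.read()
      return [(i, b) for i, b in enumerate(data) if b not in ALLOWED]

  # An empty list means the file is clean.
  print(scan_for_bad_bytes("xie/xie199501.gz"))
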
Naturally, since the source data are all generated manually on a daily
basis, there will be a small percentage of human errors common to all
sources: missing whitespace, incorrect or variant spellings, badly
formed sentences, and so on, as are normally seen in newspapers. No
attempt has been made to address this property of the data.
Another common feature to be noted is that stories may be repeated in
the course of daily transmissions (or daily archiving). Sometimes a
later transmission of a story comes with minor alterations (fixed
spelling, one or more paragraphs added or removed); but just as often,
the collection ends up with two or more DOCs that are fully identical.
In general, though, this practice affects a relatively small minority
of the overall content. (NYT is perhaps the worst offender in this
regard, sometimes sending as many as six copies of some featured
story.) No attempt has been made to eliminate these duplications.
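
Fully identical DOCs are straightforward to detect after the fact,
for example by hashing the content between each pair of DOC tags;
near-duplicates that differ by minor edits require fuzzier matching
and are left to the user. A minimal exact-match sketch (file path
illustrative):

  import gzip
  import hashlib
  import re
  from collections import defaultdict

  ID_RE = re.compile(r'<DOC id="([^"]*)"')

  def exact_duplicates(path):
      """Group DOC ids whose bodies are byte-for-byte identical."""
      by_digest = defaultdict(list)
      doc_id, body = None, []
      with gzip.open(path, mode="rt", encoding="ascii") as f:
          for line in f:
              if line.startswith("<DOC "):
                  doc_id, body = ID_RE.search(line).group(1), []
              elif line.startswith("</DOC>"):
                  digest = hashlib.md5("".join(body).encode()).hexdigest()
                  by_digest[digest].append(doc_id)
              else:
                  body.append(line)
      return [ids for ids in by_digest.values() if len(ids) > 1]

  print(exact_duplicates("nyt/nyt199501.gz"))
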
Finally, the 24-hour services typically break long stories into
chunks and send the chunks as separate DOC units, each of which has
the normal structural features of a full story. (This is especially
prevalent in NYT, which has the longest average story length of all
the sources.) Normally, when this sort of
splitting is done, cues are provided in the text of each chunk that
allow editors to reconstruct the full report; but these cues tend to
rely heavily on editorial skills -- it is taken for granted by each
news service that the stories will be reassembled manually as needed
-- so the process of combining the pieces into a full story is not
amenable to an algorithmic solution, and no attempt has been made to
do this.
The following sections explain data properties that are particular to
each source.
AFE:
There is a gap of 54 months in the AFE collection (about four and a
half years), spanning from May 1997 to December 2001; the LDC had
discontinued its subscription to the AFP English wire service during
this period, and at the point where we restored the subscription near
the end of 2001, there was no practical means for recovering the
portion that was missed. Apart from this, the AFE content shows a
high degree of internal consistency (relative to APW and NYT), in
terms of day-to-day content and typographic conventions.
APW:
This service provides up to six other languages besides English on the
same modem connection, with DOCs in all languages interleaved at
random; of course, we have extracted just the English content for
publication here. The service draws news from quasi-independent
offices around the world, so there tends to be more variability here
in terms of typographic conventions; there is also a noticeably higher
percentage of non-story content, especially in the "other" category:
tables of sports results, stocks, weather, etc.
During the period between August 1999 and August 2001, the modem
service failed to deliver English content, while data in other
languages continued to flow in. (LDC was spooling the data
automatically, and during this period, alarms would be raised only if
the data flow stopped completely -- so the absence of English went
unnoticed.) On learning of this gap in the data, we were able to
recover much of the missing content with help from AP's New York City
office and from Richard Sproat at AT&T Labs -- we gratefully
acknowledge their assistance. Both were able to supply bulk archives
that covered most of the period that we had missed. In particular,
August - November 1999 and January - September 2000 were retrieved
from USENET/ClariNet and web archives that AT&T had collected for
its own research use, while the October 2000 - August 2001 data were
supplied by AP directly from their own web service archive. As a
result of the varying sources, these sub-parts of APW data tend to
differ from the rest of the collection (and from each other), in terms
of daily quantity, extent of typographic variance, and possibly the
breadth of subject matter being reported.
NYT:
There have been only a few scattered service interruptions for NYT,
and these typically involve gaps of a few days (the longest was about
two weeks). The NYT service provides not only the content that is
specific to the New York Times daily newspaper publication, but also a
wide and varied sampling of news and features from other urban and
regional newspapers around the U.S., including:
Albany Times Union
Arizona Republic
Atlanta Constitution
Bloomberg Business News
Boston Globe
Casper (Wyo.) Star-Tribune
Chicago Sun-Times
Columbia News Service
Cox News Service
Fort Worth Star-Telegram
Hearst Newspapers
Houston Chronicle
International Herald Tribune
Kansas City Star
Los Angeles Daily News
San Antonio Express-News
San Francisco Chronicle
Seattle Post-Intelligencer
States News Service
Typically, the actual source of a given DOC was indicated in the raw
data via an abbreviation (e.g. AZR, BLOOM, COX, LADN, NYT, SPI, etc.)
at the end of the "slug" line that accompanies every story. (The
"slug" is a short string, usually less than 40 characters, that news
editors use to tag and sort stories and topics over the course of a
day.) Because this feature of NYT slug lines is quite consistent and
informative, the markup strategy was adapted to make sure that the
full slug line would always be included as part of the content of the
"DATELINE" tag in every DOC. (Slugs were either not present or not
retained in the other three newswire sources.) Some examples: