README File for the ENGLISH GIGAWORD TEXT CORPUS
================================================
Second Edition
==============
INTRODUCTION
------------
The English Gigaword Corpus is a comprehensive archive of newswire
text data that has been acquired over several years by the Linguistic
Data Consortium (LDC) at the University of Pennsylvania. This is the
second edition of the English Gigaword Corpus.
This edition includes all of the contents in the first edition of the
English Gigaword corpus (LDC2003T05) as well as new data from July
2002 through December 2004. Also, a new newswire source (the Central
News
Agency of Taiwan, English Service) has been added in this edition.
The five distinct international sources of English newswire included
in this edition are the following:
- Agence France-Presse, English Service (afp_eng)
- Associated Press Worldstream, English Service (apw_eng)
- Central News Agency of Taiwan, English Service (cna_eng)
- New York Times Newswire Service (nyt_eng)
- Xinhua News Agency, English Service (xin_eng)
The seven-letter codes in parentheses above consist of the
three-character source name abbreviation and the three-character
language code ("eng"), separated by an underscore ("_") character.
The three-letter language code conforms to LDC's new internal
convention, based on the ISO 639-3 standard. In the first edition of the
English Gigaword corpus and other previous LDC corpora, a simpler
three-character-code scheme was used to identify both the source and
the language. The new convention allows us to distinguish data sets
by source and language more naturally when a single newswire provider
distributes data in multiple languages. The following table shows the
correspondence between the old codes and the new codes.
      new          old
    --------------------
    afp_eng        afe
    apw_eng        apw
    cna_eng        cne
    nyt_eng        nyt
    xin_eng        xie
The new seven-letter codes are used both in the directory names where
the data files are found and in the prefix that appears at the
beginning of every data file name.
As with the first English Gigaword release, some of the content in
this corpus has been published previously by the LDC in a variety of
other, older corpora, particularly the North American News text
corpora, the various TDT corpora, and the AQUAINT text corpus.
WHAT'S NEW IN THE SECOND EDITION
--------------------------------
o New newswire data from July 2002 through December 2004 have been
added for all four of the newswire sources that were represented in
the first edition.
o A new source, the Central News Agency of Taiwan English Service
(CNA_ENG), has been added.
o We have adopted a new naming scheme for file names and DOC IDs. The
new scheme represents the source name as a three-letter code and the
language as a three-letter code, joined by an underscore.
o Minor formatting improvements (mostly line-wrapping) have been
applied to some of the data contents originally published in the
first edition.
MAPPING OF DOCUMENTS IN FIRST EDITION TO DOCUMENTS IN SECOND EDITION
--------------------------------------------------------------------
All of the documents in the first edition of the English Gigaword
corpus can be mapped to the same documents in this edition by changing
the prefix of DOC IDs and file names as shown below.
o DOC IDs (The "id" attribute of the DOC tags)
AFE -> AFP_ENG_ (e.g., AFE20020101.0001 -> AFP_ENG_20020101.0001)
APW -> APW_ENG_ (e.g., APW20020101.0001 -> APW_ENG_20020101.0001)
NYT -> NYT_ENG_ (e.g., NYT20020101.0001 -> NYT_ENG_20020101.0001)
XIE -> XIN_ENG_ (e.g., XIE20020101.0001 -> XIN_ENG_20020101.0001)
o File names
afe -> afp_eng_ (e.g., afe200201.gz -> afp_eng_200201.gz)
apw -> apw_eng_ (e.g., apw200201.gz -> apw_eng_200201.gz)
nyt -> nyt_eng_ (e.g., nyt200201.gz -> nyt_eng_200201.gz)
xie -> xin_eng_ (e.g., xie200201.gz -> xin_eng_200201.gz)
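As an illustration, the conversion is easy to automate; the following
is a minimal Python sketch (the helper names are ours, for
illustration only, and are not part of the corpus distribution):

  # Map old three-letter source codes to the new seven-letter codes.
  OLD_TO_NEW = {
      "AFE": "AFP_ENG",
      "APW": "APW_ENG",
      "NYT": "NYT_ENG",
      "XIE": "XIN_ENG",
  }

  def convert_doc_id(old_id):
      """Convert a first-edition DOC ID such as "AFE20020101.0001"
      to its second-edition form, "AFP_ENG_20020101.0001"."""
      return OLD_TO_NEW[old_id[:3]] + "_" + old_id[3:]

  def convert_file_name(old_name):
      """Convert a first-edition file name such as "afe200201.gz"
      to its second-edition form, "afp_eng_200201.gz"."""
      return OLD_TO_NEW[old_name[:3].upper()].lower() + "_" + old_name[3:]

  assert convert_doc_id("XIE20020101.0001") == "XIN_ENG_20020101.0001"
  assert convert_file_name("afe200201.gz") == "afp_eng_200201.gz"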
The data from the following time periods were included in the first
edition; the rest of the data in this edition are new material.
afp_eng : May 1994 - June 2002
apw_eng : November 1994 - June 2002
nyt_eng : July 1994 - June 2002
xin_eng : January 1995 - June 2002
DATA FORMAT AND SGML MARKUP
---------------------------
Each data file name consists of the seven-letter prefix plus an
additional underscore character, followed by a 6-digit date
(representing the
year and month during which the file contents were generated by the
respective news source), followed by a ".gz" file extension,
indicating that the file contents have been compressed using the GNU
"gzip" compression utility (RFC 1952). So, each file contains all the
usable data received by LDC for the given month from the given news
source.
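For example, a monthly file can be processed directly in compressed
form; this minimal sketch (with a hypothetical file path) counts the
lines in one month of AFP_ENG data:

  import gzip

  # Hypothetical path: data files live under per-source directories
  # named with the seven-letter codes (e.g. "afp_eng").
  path = "afp_eng/afp_eng_200201.gz"

  # The data are plain ASCII text, so the file can be read in text
  # mode straight out of the gzip container.
  with gzip.open(path, mode="rt", encoding="ascii") as f:
      n_lines = sum(1 for _ in f)
  print(path, "contains", n_lines, "lines")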
All text data are presented in SGML form, using a very simple, minimal
markup structure; all text consists of printable ASCII and whitespace.
The file "gigaword_e.dtd" in the "docs" directory provides the formal
"Document Type Declaration" for parsing the SGML content. The corpus
has been fully validated by a standard SGML parser utility (nsgmls),
using this DTD file.
The markup structure, common to all data files, can be summarized as
follows:

  <DOC id="AFP_ENG_20020101.0001" type="story" >
  <HEADLINE>
  (One or more lines of text)
  </HEADLINE>
  <DATELINE>
  (One line of text)
  </DATELINE>
  <TEXT>
  <P>
  (One or more lines of text)
  </P>
  ...
  </TEXT>
  </DOC>

Paragraph tags are only used if the 'type' attribute of the DOC
happens to be "story" -- more on the 'type' attribute below...
Note that all data files use the UNIX-standard "\n" form of line
termination, and text lines are generally wrapped to a width of 80
characters or less.
" is found only in DOCs of this type;
in the other types described below, the text content is rendered
with no additional tags or special characters -- just lines of ASCII
tokens separated by whitespace.
* multi : This type of DOC contains a series of unrelated "blurbs",
  each of which briefly describes a particular topic or event; this is
  typically applied to DOCs that contain "summaries of today's news",
  "news briefs in ... (some general area like finance or sports)", and
  so on. Each paragraph-like blurb by itself is coherent, but it does
  not bear any necessary relation of topicality or continuity relative
  to its neighboring sections.
* advis : (short for "advisory") These are DOCs which the news service
addresses to news editors -- they are not intended for publication
to the "end users" (the populations who read the news); as a result,
DOCs of this type tend to contain obscure abbreviations and phrases,
which are familiar to news editors, but may be meaningless to the
general public. We also find a lot of formulaic, repetitive content
in DOCs of this type (contact phone numbers, etc).
* other : This represents DOCs that clearly do not fall into any of
  the above types -- in general, items of this type are intended for
  broad circulation (they are not advisories), they may be topically
  coherent (unlike "multi" type DOCs), and they typically do not
  contain paragraphs or sentences (they aren't really "stories");
  these are things like lists of sports scores, stock prices,
  temperatures around the world, and so on.
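Since each tag appears on a line by itself (as in the skeleton shown
earlier), a DOC stream can be parsed with very simple logic. The
sketch below is ours, not part of the corpus tools, and the file path
is hypothetical:

  import gzip
  import re

  DOC_OPEN = re.compile(r'<DOC id="([^"]+)" type="(\w+)" ?>')

  def iter_docs(path):
      """Yield (doc_id, doc_type, text_lines) for each DOC in one
      data file; tag lines are assumed to begin with "<"."""
      doc_id = doc_type = None
      lines = []
      with gzip.open(path, mode="rt", encoding="ascii") as f:
          for line in f:
              m = DOC_OPEN.match(line)
              if m:
                  doc_id, doc_type = m.group(1), m.group(2)
                  lines = []
              elif line.startswith("</DOC>"):
                  yield doc_id, doc_type, lines
              elif not line.startswith("<"):
                  lines.append(line)  # text content; other tags skipped

  # Example: count the "story" DOCs in one (hypothetical) month.
  n_stories = sum(1 for _, t, _ in iter_docs("xin_eng/xin_eng_200201.gz")
                  if t == "story")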
The general strategy for categorizing DOCs into these four classes
was, for each source, to discover the most common clues in the text
stream that correlated with the three "non-story" types, and to apply
the appropriate label for the ``type=...'' attribute whenever the DOC
displayed one of these specific clues. When none of the known clues
was in evidence, the DOC was classified as a "story".
This means that the most frequent classification error will tend to be
the use of `` type="story" '' on DOCs that are actually some other
type. But the number of such errors should be fairly small, compared
to the number of "non-story" DOCs that are correctly tagged as such.
Note that the markup was applied algorithmically, using logic that was
based on less-than-complete knowledge of the data. For the most part,
the HEADLINE, DATELINE and TEXT tags have their intended content; but
due to the inherent variability (and the inevitable source errors) in
the data, users may find occasional mishaps where the headline and/or
dateline were not successfully identified (hence show up within TEXT),
or where an initial sentence or paragraph has been mistakenly tagged
as the headline or dateline.
DATA QUANTITIES
---------------
The "docs" directory contains a set of plain-text tables (datastats_*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the four "type" categories. The
overall totals for each source are summarized below. Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. approximately 15 gigabytes, total); the "Gzip-MB"
column shows totals for compressed file sizes as stored on the
DVD-ROM; the "K-wrds" numbers are simply the number of
whitespace-separated tokens (of all types) after all SGML tags are
eliminated.
  Source    #Files   Gzip-MB   Totl-MB     K-wrds      #DOCs
  AFP_ENG       74       796      2270     337792    1202139
  APW_ENG      121      1648      4908     736518    1975456
  CNA_ENG       71        43       104      15039      57999
  NYT_ENG      125      2318      6479    1026533    1446256
  XIN_ENG      119       474      1411     201346    1017150
  TOTAL        510      5279     15170    2317228    5699000
The following tables present "Text-MB", "K-wrds" and "#DOCs" broken
down by source and DOC type; "Text-MB" represents the total number of
characters (including whitespace) after SGML tags are eliminated.

             Text-MB     K-wrds      #DOCs
 type="advis":
  AFP_ENG         70       9392      27008
  APW_ENG        172      25917      37543
  CNA_ENG          0         24        112
  NYT_ENG        463      73955     134282
  XIN_ENG         12       1920       7522
  TOTAL          718     111208     206467

 type="multi":
  AFP_ENG         50       7717      21394
  APW_ENG        229      37477      55376
  CNA_ENG          9       1402       6253
  NYT_ENG        124      20469      33216
  XIN_ENG         95      15151      63955
  TOTAL          508      82216     180194

 type="other":
  AFP_ENG         53       7834      65607
  APW_ENG        327      45643     266799
  CNA_ENG          2        170       1463
  NYT_ENG        108      16322      24605
  XIN_ENG         65       9497      83689
  TOTAL          554      79466     442163

 type="story":
  AFP_ENG       1872     312848    1088130
  APW_ENG       3782     627484    1615738
  CNA_ENG         83      13450      50171
  NYT_ENG       5408     915792    1254153
  XIN_ENG       1076     174770     861984
  TOTAL        12221    2044344    4870176
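These figures can be re-derived, at least approximately, from the
data files themselves. A minimal sketch (the path is hypothetical,
and tag lines are assumed to begin with "<"):

  import gzip

  def text_stats(path):
      """Return (n_chars, n_tokens) for one data file, counting only
      text content (lines that are not SGML tags)."""
      n_chars = n_tokens = 0
      with gzip.open(path, mode="rt", encoding="ascii") as f:
          for line in f:
              if line.startswith("<"):  # skip tag lines
                  continue
              n_chars += len(line)
              n_tokens += len(line.split())
      return n_chars, n_tokens

  chars, tokens = text_stats("cna_eng/cna_eng_200401.gz")
  print("Text-MB: %.1f   K-wrds: %.1f" % (chars / 1e6, tokens / 1e3))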
GENERAL AND SOURCE-SPECIFIC PROPERTIES OF THE DATA
--------------------------------------------------
Most of the text data (all of AFP_ENG, most of APW_ENG and NYT_ENG)
were received at LDC via dedicated, 24-hour/day electronic feeds
(leased phone lines in the case of APW_ENG and NYT_ENG, a local
satellite dish for AFP_ENG). These 24-hour transmission services were
all susceptible to "line noise" (occasional corruption of text
content), as well as service outages both at the data source and at
our receiving computers. Usually, the various disruptions of a
newswire data stream would leave tell-tale evidence in the form of
byte values falling outside the range of printable ASCII characters,
or recognizable patterns of anomalous ASCII strings.
All XIN_ENG data, all CNA_ENG data, and a two-year portion of APW_ENG
data were received as bulk electronic text archives via internet
retrieval. As such, they were not susceptible to modem line-noise or
related disruptions, though this does not guarantee that the source
data are free of mishaps.
The more recent NYT_ENG and APW_ENG data have been received via
internet-based subscription systems, whereby first-issue stories and
editing updates are sent throughout the day to a dedicated client
process running at the LDC; this process maintains a local database
and story cache that holds the latest version of each distinct story
for a limited number of days (in contrast to the older
modem-based service, where updated versions and editing directives
simply accumulated in an ever-growing data stream). In the new setup,
the harvesting of stories into the growing archive is simply a matter
of taking a daily snapshot of the client-program's story cache,
removing stories from the snapshot if they had been captured on a
previous day, and adding the remainder to the archive. As a result,
the data collected in this manner tends to include less duplication of
story content (because repeated transmissions of a given story, with
or without minor edits, are generally not retained in the final
archive).
All the data have undergone a consistent level of quality control, to
eliminate non-ASCII content and other obvious forms of corruption.
Naturally, since the source data are all generated manually on a daily
basis, there will be a small percentage of human errors common to all
sources: missing whitespace, incorrect or variant spellings, badly
formed sentences, and so on, as are normally seen in newspapers. No
attempt has been made to address this property of the data.
As indicated above, a common feature of the modem-based archives is
that stories may be repeated in the course of daily transmissions (or
daily archiving). Sometimes a later transmission of a story comes
with minor alterations (fixed spelling, one or more paragraphs added
or removed); but just as often, the collection ends up with two or
more DOCs that are fully identical. In general, though, this practice
affects a relatively small minority of the overall content. (NYT_ENG
is perhaps the worst offender in this regard, sometimes sending as
many as six copies of some featured story.) We have not attempted to
eliminate these duplications; however, we plan to make information
about duplicate and similar articles available on our web site as
supplemental information for this corpus. (See the "ADDITIONAL
INFORMATION and UPDATES" section below.)
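In the meantime, exact duplicates are easy to detect locally. The
sketch below reuses the hypothetical iter_docs() helper from the
markup section above, hashing each DOC's whitespace-normalized text;
note that it only finds identical texts -- stories re-sent with minor
edits would require fuzzier matching:

  import hashlib

  def exact_duplicates(paths):
      """Group DOC ids by a hash of their normalized text content;
      any group with more than one id is a set of exact duplicates."""
      groups = {}
      for path in paths:
          for doc_id, doc_type, lines in iter_docs(path):
              text = " ".join("".join(lines).split())
              key = hashlib.md5(text.encode("ascii")).hexdigest()
              groups.setdefault(key, []).append(doc_id)
      return {k: ids for k, ids in groups.items() if len(ids) > 1}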
Finally, some of the modem services routinely break long stories into
chunks, sending the chunks as separate DOC units, with each unit
having the normal structural features of a full story.
full story. (This is especially prevalent in NYT_ENG, which has the
longest average story length of all the sources.) Normally, when this
sort of splitting is done, cues are provided in the text of each chunk
that allow editors to reconstruct the full report; but these cues tend
to rely heavily on editorial skills -- it is taken for granted by each
news service that the stories will be reassembled manually as needed
-- so the process of combining the pieces into a full story is not
amenable to an algorithmic solution, and no attempt has been made to
do this.
The following sections explain data properties that are particular to
each source.
AFP_ENG:
There is a gap of 54 months in the AFP_ENG collection (about four and
a half years), spanning from May 1997 to December 2001; the LDC had
discontinued its subscription to the AFP English wire service during
this period, and at the point where we restored the subscription near
the end of 2001, there was no practical means for recovering the
portion that was missed. There is also a gap spanning from September
20, 2002 to October 2, 2002 and another gap spanning from August 6,
2003 to September 10, 2003.
Apart from these, the AFP_ENG content shows a high degree of internal
consistency (relative to APW_ENG and NYT_ENG), in terms of day-to-day
content and typographic conventions.
APW_ENG:
This service provides up to six other languages besides English on the
same modem connection, with DOCs in all languages interleaved at
random; of course, we have extracted just the English content for
publication here. The service draws news from quasi-independent
offices around the world, so there tends to be more variability here
in terms of typographic conventions; there is also a noticeably higher
percentage of non-story content, especially in the "other" category:
tables of sports results, stocks, weather, etc.
During the period between August 1999 and August 2001, the modem
service failed to deliver English content, while data in other
languages continued to flow in. (LDC was spooling the data
automatically, and during this period, alarms would be raised only if
the data flow stopped completely -- so the absence of English went
unnoticed.) On learning of this gap in the data, we were able to
recover much of the missing content with help from AP's New York City
office and from Richard Sproat at AT&T Labs -- we gratefully
acknowledge their assistance. Both were able to supply bulk archives
that covered most of the period that we had missed. In particular,
August - November 1999 and January - September 2000 were retrieved
from USENET/ClariNet and web archives that AT&T had collected for its
own research use, while the October 2000 - August 2001 data were
supplied by AP directly from their own web service archive. As a
result of the varying sources, these sub-parts of APW_ENG data tend to
differ from the rest of the collection (and from each other), in terms
of daily quantity, extent of typographic variance, and possibly the
breadth of subject matter being reported.
Among the data added in this edition, the data from January 2004 were
particularly noisy due to transmission errors. We have removed the
documents from this month that showed explicit evidence of noise.
CNA_ENG:
The amount of data for this source is relatively small compared to
other sources. This data set has been delivered to the LDC via
internet transfer. As a result, we avoided many of the problems that
commonly afflict newswire data collected over modems. There is a
large gap of 16 months from April 2002 to July 2003 in this data set.
NYT_ENG:
Prior to 2003, there had been only a few scattered service
interruptions for NYT_ENG, and these typically involved gaps of a few
days (the longest was about two weeks). However, there was a time
period, from February 2003 to June 2004, in which pervasive modem
noise induced a significant amount of character data corruption,
affecting the control-character story-boundary markers as well as the
text content of the stories themselves. We have filtered out
documents that showed explicit evidence of corruption. As a result,
there is a smaller number of documents in this time period; in
particular, this release includes no data from June 2004 and very
little from May 2004. Also, even after
filtering out stories that showed explicit evidence of corruption
(invalid sequences of story-boundary control codes, occurrences of
inappropriate byte values), there are still likely to be
"non-explicit" cases of data corruption in the stories that remain for
this time period. On July 1, 2004, we switched to an internet-based
file transfer method to receive NYT_ENG articles, and the NYT_ENG data
after this date was not susceptible to modem line-noise.
It should be noted that NYT_ENG documents from 16 days in July 2002 --
all odd-numbered days -- have been intentionally excluded from this
collection in order to satisfy a contractual agreement with a
partner site.
The NYT_ENG service provides not only the content that is specific to
the New York Times daily newspaper publication, but also a wide and
varied sampling of news and features from other urban and regional
newspapers around the U.S., including:
Albany Times Union
Arizona Republic
Atlanta Constitution
Bloomberg Business News
Boston Globe
Casper (Wyo.) Star-Tribune
Chicago Sun-Times
Columbia News Service
Cox News Service
Fort Worth Star-Telegram
Hearst Newspapers
Houston Chronicle
International Herald Tribune
Kansas City Star
Los Angeles Daily News
San Antonio Express-News
San Francisco Chronicle
Seattle Post-Intelligencer
States News Service
Typically, the actual source of a given DOC was indicated in the raw
data via an abbreviation (e.g. AZR, BLOOM, COX, LADN, NYT, SPI, etc)
at the end of the "slug" line that accompanies every story. (The
"slug" is a short string, usually less than 40 characters, that news
editors use to tag and sort stories and topics over the course of a
day.) Because this feature of NYT_ENG slug lines is quite consistent
and informative, the markup strategy was adapted to make sure that the
full slug line would be included as part of the content of the
"DATELINE" tag whenever possible. (Slugs were either not present or
not retained in the other three newswire sources.) Some examples: