README File for the ENGLISH GIGAWORD TEXT CORPUS
	   ================================================

			    Third Edition
			    =============

INTRODUCTION
------------

The English Gigaword Corpus is a comprehensive archive of newswire
text data that has been acquired over several years by the Linguistic
Data Consortium (LDC) at the University of Pennsylvania.  This is the
third edition of the English Gigaword Corpus.

This edition includes all of the contents of the previous edition
(LDC2005T12) as well as new data from the same five sources presented
there, covering the 24-month period of January 2005 through December
2006.  Also, a sixth data source (the Los Angeles Times/Washington
Post newswire service) has been added in this edition.

The six distinct international sources of English newswire included
in this edition are the following:

 - Agence France-Presse, English Service              (afp_eng)
 - Associated Press Worldstream, English Service      (apw_eng)
 - Central News Agency of Taiwan, English Service     (cna_eng)
 - Los Angeles Times/Washington Post Newswire Service (ltw_eng)
 - New York Times Newswire Service                    (nyt_eng)
 - Xinhua News Agency, English Service                (xin_eng)

The seven-letter codes in the parentheses above include the
three-character source name abbreviations and the three-character
language code ("eng") separated by an underscore ("_") character.  The
three-letter language code conforms to LDC's internal convention
based on the new ISO 639-3 standard.

The seven-letter codes are used in both the directory names where
the data files are found, and in the prefix that appears at the
beginning of every data file name.

As with other Gigaword releases, some of the content in this corpus
has been published previously by the LDC in a variety of other, older
corpora, particularly the North American News text corpora, the
various TDT corpora, and the AQUAINT text corpus, as well as earlier
editions of Gigaword English.

WHAT'S NEW IN THE THIRD EDITION
--------------------------------

o New newswire data from January 2005 to December 2006 have been
  added for all five of the newswire sources that were represented in
  the previous (second) edition.

o A new source, the Los Angeles Times/Washington Post newswire
  service, has been added.

o A small handful of corrections to older APW data have been made to
  remove a few non-English stories, clean up some character "noise",
  and rectify the encoding for a few non-ASCII characters.

o The CNA content introduced in Gigaword English 2nd Edition has been
  completely updated to repair data corruptions caused by occasional
  character encoding problems; as a result of the update, there may be
  differences in the inventory and/or ID strings of DOC elements in
  this portion of the corpus, relative to the previous edition.  (The
  nature of the encoding problems is explained below, in the CNA_ENG
  part of "GENERAL AND SOURCE-SPECIFIC PROPERTIES OF THE DATA".)

o Many of the files (141 out of 722) include a small number of UTF-8
  "wide" characters, typically accented letters found in proper names
  and borrowed words (some sources also use special punctuation marks,
  non-breaking spaces, etc).

Apart from the APW corrections noted above and the replacement/update
of all CNA files, the data content of the 2nd edition has been
included in the present release without modification.


DATA FORMAT AND SGML MARKUP
---------------------------

Each data file name consists of the 7-letter prefix plus another
underscore character, followed by a 6-digit date representing the year
and month during which the file contents were generated by the
respective news source, followed by a ".gz" file extension indicating
that the file contents have been compressed using the GNU "gzip"
compression utility (RFC 1952).  So, each file contains all the usable
data received by LDC for the given month from the given news source.
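
The naming scheme above can be unpacked mechanically.  The following
is a minimal Python sketch, not part of the corpus tooling; the names
FILE_NAME_RE, FileInfo and parse_file_name are introduced here purely
for illustration.

```python
import re
from typing import NamedTuple

# <3-letter source>_<3-letter language>_<YYYYMM>.gz, as described above.
FILE_NAME_RE = re.compile(r"^([a-z]{3})_([a-z]{3})_(\d{4})(\d{2})\.gz$")


class FileInfo(NamedTuple):
    source: str    # e.g. "nyt"
    language: str  # e.g. "eng"
    year: int
    month: int


def parse_file_name(name: str) -> FileInfo:
    """Split a data file name into source, language, year and month."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a Gigaword-style file name: {name!r}")
    source, language, year, month = match.groups()
    return FileInfo(source, language, int(year), int(month))
```

For example, parse_file_name("nyt_eng_199501.gz") yields the source
"nyt", the language "eng", the year 1995 and the month 1.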

All text data are presented in SGML form, using a very simple, minimal
markup structure.  The text consists almost entirely of printable
ASCII and whitespace; as noted above, 141 of the 722 files also
contain a small number of UTF-8 multi-byte characters.  The file
"gigaword_e.dtd" in the "docs" directory provides the formal "Document
Type Declaration" for parsing the SGML content.  The corpus has been
fully validated by a standard SGML parser utility (nsgmls), using this
DTD file.

The markup structure, common to all data files, can be summarized as
follows:

<DOC id="..." type="..." >
<HEADLINE>
The Headline Element is Optional -- not all DOCs have one
</HEADLINE>
<DATELINE>
The Dateline Element is Optional -- not all DOCs have one
</DATELINE>
<TEXT>
<P>
Paragraph tags are only used if the 'type' attribute of the DOC
happens to be "story" -- more on the 'type' attribute below...
</P>
<P>
Note that all data files use the UNIX-standard "\n" form of line
termination, and text lines are generally wrapped to a width of 80
characters or less.
</P>
</TEXT>
</DOC>

For every "opening" tag (DOC, HEADLINE, DATELINE, TEXT, P), there is a
corresponding "closing" tag -- always.  The attribute values in the
DOC tag are always presented within double-quotes; the "id=" attribute
of DOC consists of the 7-letter source/language abbreviation (in
CAPS), an underscore, an 8-digit date string representing the date of
the story (YYYYMMDD), a period, and a 4-digit sequence number starting
at "0001" for each date (e.g. "NYT_ENG_19950101.0001"); in this way,
every DOC in the corpus is uniquely identifiable by the id string.
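
As an illustration of the id layout just described (source/language
code, underscore, YYYYMMDD date, period, 4-digit sequence number), here
is a small Python sketch.  The function name parse_doc_id and the ID
used in the usage note are made up for this example.

```python
import re
from datetime import date

# 7-letter prefix in caps, "_", 8-digit date, ".", 4-digit sequence number.
DOC_ID_RE = re.compile(r"^([A-Z]{3}_[A-Z]{3})_(\d{4})(\d{2})(\d{2})\.(\d{4})$")


def parse_doc_id(doc_id):
    """Return (prefix, story date, sequence number) for a DOC id string."""
    match = DOC_ID_RE.match(doc_id)
    if match is None:
        raise ValueError(f"unexpected DOC id: {doc_id!r}")
    prefix, year, month, day, seq = match.groups()
    return prefix, date(int(year), int(month), int(day)), int(seq)
```

Calling parse_doc_id("NYT_ENG_19950101.0001") gives the prefix
"NYT_ENG", the date 1995-01-01 and the sequence number 1.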

In some cases we assigned a sequence number to a document and later
found that the document was empty or very noisy.  We removed such
documents from the collection, but did not reassign sequence numbers
to the remaining documents for the same day.  As a result there may be
some gaps in sequence numbers.

Every SGML tag is presented alone on one line, separate from other
tags, and from the text content (so a simple process like the UNIX
"grep -v '<'" will eliminate all tags, and retain all the text
content).
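
For users who are not working in a UNIX shell, the same filtering can
be sketched in Python.  This is only an equivalent of the grep command
above; the function name strip_tag_lines is introduced here for
illustration.

```python
def strip_tag_lines(lines):
    """Yield only the lines that contain no '<', like grep -v '<'.

    This relies on the property stated above: every tag sits alone on
    its own line, and a literal '<' in text is written as an entity.
    """
    for line in lines:
        if "<" not in line:
            yield line
```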

The structure shown above represents some notable differences relative
to the markup strategy employed in previous LDC text corpora; these
are intended to facilitate bulk processing of the present corpus.  The
major differences are:

 - Earlier corpora usually organized the data as one file per day, or
   limited the average file size to one megabyte (MB).

Typical compressed file sizes in the current corpus range from about 3
MB (1995 Xinhua data) to about 30 MB (1996-7 NYT data); this equates
to a range of about 9 to 90 MB when the data are uncompressed.  In
general, these files are not intended for use with interactive text
editors or word processing software (though many such programs are
likely to work reasonably well with these files).  Rather, it's
expected that the files will be used as input to programs that are
geared to dealing with data in such quantities, for filtering,
conditioning, indexing, statistical summary, etc.  (The LDC can
provide open source software, mostly written in Perl, for extracting
DOCs from such data files, using the "id" string or other search
criteria for story selection; see http://www.ldc.upenn.edu/Using/ .)
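
As a rough illustration of this kind of bulk processing, the Python
sketch below streams DOC units out of one compressed file without
loading the whole file into memory.  It is not the LDC's Perl
software; the names iter_docs and iter_docs_in_file are assumptions of
this example, as is the choice to decode the text as UTF-8, and the
regular expression assumes the attribute order shown in the markup
summary above (id first, then type).

```python
import gzip
import re

DOC_OPEN_RE = re.compile(r'<DOC id="([^"]*)" type="([^"]*)"\s*>')


def iter_docs(lines):
    """Yield (doc_id, doc_type, body_lines) for each DOC in a line stream."""
    doc_id = doc_type = None
    body = []
    for line in lines:
        line = line.rstrip("\n")
        match = DOC_OPEN_RE.match(line)
        if match:
            doc_id, doc_type = match.groups()
            body = []
        elif line == "</DOC>":
            if doc_id is not None:
                yield doc_id, doc_type, body
            doc_id = doc_type = None
        elif doc_id is not None:
            body.append(line)


def iter_docs_in_file(path):
    """Stream DOCs from one gzip-compressed data file."""
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        yield from iter_docs(handle)
```

Because both functions are generators, a caller can select stories by
id or type while reading, and stop early once the wanted DOC is found.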

 - Earlier corpora tended to use different markup outlines (different
   tag sets) depending on the source of the data, because different
   sources came to us with different structural properties, and we had
   chosen to preserve these as much as possible (even though many
   elements of the delivered structure may have been meaningless for
   research use).

The present corpus uses only the information structure that is common
to all sources and serves a clear function: headline, dateline, and
core news content (usually containing paragraphs).  The "dateline" is
a brief string typically found at the beginning of the first paragraph
in each news story, giving the location the report is coming from, and
sometimes the news service and/or date; since this content is not part
of the initial sentence, we separate it from the first paragraph (this
was not done in previous corpora).

 - Earlier corpora tended to include "custom" SGML entity references,
   which were intended to preserve things like special punctuation or
   typesetting instructions (e.g. "&QL;", "&UR;", "&MD;", etc).

The present corpus uses only three SGML entity references: ``&amp;'',
which represents the literal ampersand "&" character; ``&lt;'', which
represents the "left/open angle bracket"; ``&gt;'', which represents
the "right/close angle bracket".  All other specialized control
characters have been filtered out, and unusual punctuation (such as
the underscore character, used in NYT_ENG and APW_ENG to represent an
"em-dash" character) has been left as-is, or converted to simple
equivalents (e.g. hyphens).
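
Since only these three entity references occur, restoring the literal
characters needs no SGML library.  A minimal Python sketch follows
(the function name decode_entities is introduced here for
illustration); the one subtlety is the order of replacement.

```python
def decode_entities(text):
    """Replace the three entity references with their literal characters.

    "&amp;" is handled last, so that a sequence such as "&amp;lt;"
    decodes to the text "&lt;" rather than being decoded twice.
    """
    return (
        text.replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&amp;", "&")
    )
```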

 - In earlier corpora, newswire data were presented as streams of
   undifferentiated "DOC" units; depending on the source and corpus,
   varying amounts of quality checking and filtering were done to
   eliminate noisy or unsuitable content (e.g. test messages).

The portions of this corpus that were included in the first edition of
the English Gigaword corpus have received a uniform treatment in terms
of quality control.  The new material added in this edition was first
processed by LDC's daily newswire processing pipeline to create
initial mark-up, and was then re-processed following the design used
in the first edition of the Gigaword corpus.  The same level of
quality control has been applied to the new material.  However, there
may be cases where some treatments of data, such as the categorization
of DOC units, have changed.

For all of the documents in this corpus, we have applied a rudimentary
(and _approximate_) categorization of DOC units into four distinct
"types".  The classification is indicated by the `` type="string" ''
attribute that is included in each opening ``DOC'' tag.  The four
types are:

* story : This is by far the most frequent type, and it represents the
  most typical newswire item: a coherent report on a particular topic
  or event, consisting of paragraphs and full sentences.  As indicated
  above, the paragraph tag "<P>" is found only in DOCs of this type;
  in the other types described below, the text content is rendered
  with no additional tags or special characters -- just lines of ASCII
  tokens separated by whitespace.

* multi : This type of DOC contains a series of unrelated "blurbs",
  each of which briefly describes a particular topic or event; this is
  typically applied to DOCs that contain "summaries of today's news",
  "news briefs in ... (some general area like finance or sports)", and
  so on.  Each paragraph-like blurb by itself is coherent, but it does
  not bear any necessary relation of topicality or continuity relative
  to its neighboring sections.

* advis : (short for "advisory") These are DOCs which the news service
  addresses to news editors -- they are not intended for publication
  to the "end users" (the populations who read the news); as a result,
  DOCs of this type tend to contain obscure abbreviations and phrases,
  which are familiar to news editors, but may be meaningless to the
  general public.  We also find a lot of formulaic, repetitive content
  in DOCs of this type (contact phone numbers, etc).

* other : This represents DOCs that clearly do not fall into any of
  the above types -- in general, items of this type are intended for
  broad circulation (they are not advisories), they may be topically
  coherent (unlike "multi" type DOCs), and they typically do not
  contain paragraphs or sentences (they aren't really "stories");
  these are things like lists of sports scores, stock prices,
  temperatures around the world, and so on.
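
To see how the four types are distributed in a given file, the "type"
attribute can be tallied directly from the opening DOC tags.  The
Python sketch below is one way to do this; the name count_doc_types is
an assumption of this example, and the input is any iterable of lines
from an uncompressed data file.

```python
import re
from collections import Counter

# Matches the type attribute inside an opening DOC tag.
TYPE_RE = re.compile(r'^<DOC [^>]*type="([^"]*)"')


def count_doc_types(lines):
    """Count DOC elements by the value of their type attribute."""
    counts = Counter()
    for line in lines:
        match = TYPE_RE.match(line)
        if match:
            counts[match.group(1)] += 1
    return counts
```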

The general strategy for categorizing DOCs into these four classes
was, for each source, to discover the most common and frequent clues
in the text stream that correlated with the three "non-story" types,
and to apply the appropriate label for the ``type=...'' attribute
whenever the DOC displayed one of these specific clues.  When none of
the known clues was in evidence, the DOC was classified as a "story".
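
The shape of this strategy (test for known non-story clues, and fall
back to "story") can be sketched in Python as follows.  This is not
the LDC's classifier: the actual clues were source-specific and are
not listed in this document, so the patterns in EXAMPLE_CLUES are
HYPOTHETICAL and were invented only to make the sketch runnable.

```python
import re


def classify_doc(text, clues):
    """Return the first non-story type whose clue matches, else "story".

    `clues` is an ordered list of (type, compiled pattern) pairs.
    """
    for doc_type, pattern in clues:
        if pattern.search(text):
            return doc_type
    return "story"


# HYPOTHETICAL clue patterns, invented for this sketch only.
EXAMPLE_CLUES = [
    ("advis", re.compile(r"^EDITORS:", re.MULTILINE)),
    ("multi", re.compile(r"news briefs", re.IGNORECASE)),
    ("other", re.compile(r"^\s*\d+\s+\d+\s+\d+\s*$", re.MULTILINE)),
]
```

The default branch is what makes "story" the catch-all label: any
non-story DOC that shows none of the known clues ends up there.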

This means that the most frequent classification error will tend to be
the use of `` type="story" '' on DOCs that are actually some other
type.  But the number of such errors should be fairly small, compared
to the number of "non-story" DOCs that are correctly tagged as such.

Also, since some sources tended to change their delivery methods or
format over time, the distribution of non-story types can be seen to
vary significantly by epoch and source.  The various "datastats"
tables may be helpful in tracking changes in the nature of the source
data (and LDC's ability to adapt to those changes).

Note that the markup was applied algorithmically, using logic that was
based on less-than-complete knowledge of the data.  For the most part,
the HEADLINE, DATELINE and TEXT tags have their intended content; but
due to the inherent variability (and the inevitable source errors) in
the data, users may find occasional mishaps where the headline and/or
dateline were not successfully identified (hence show up within TEXT),
or where an initial sentence or paragraph has been mistakenly tagged
as the headline or dateline.


DATA QUANTITIES
---------------

The "docs" directory contains a set of plain-text tables (datastats_*)
that describe the quantities of data by source and month (i.e. by
file), broken down according to the four "type" categories.  The
overall totals for each source are summarized below.  Note that the
"Totl-MB" numbers show the amount of data you get when the files are
uncompressed (i.e. approximately 19 gigabytes, total); the "Gzip-MB"
column shows totals for compressed file sizes as stored on the
DVD-ROM; the "K-wrds" numbers are simply the number of
whitespace-separated tokens (of all types) after all SGML tags are
eliminated.

Source  #Files  Gzip-MB Totl-MB  K-wrds   #DOCs
-----------------------------------------------
afp_eng     98    1101    3135   466718 1592309
apw_eng    145    1924    5657   849435 2272995
cna_eng     96      50     148    21657   85600
ltw_eng     91     467    1220   192650  295224
nyt_eng    149    2713    7481  1188494 1655279
xin_eng    143     586    1750   249521 1247039

TOTAL      722    6841   19391  2968475 7148446 
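
The TOTAL row can be checked against the per-source rows with a few
lines of Python.  The figures below are copied from the table above;
the names ROWS and column_totals are introduced here for illustration.

```python
# Columns: #Files, Gzip-MB, Totl-MB, K-wrds, #DOCs (copied from the table).
ROWS = {
    "afp_eng": (98, 1101, 3135, 466718, 1592309),
    "apw_eng": (145, 1924, 5657, 849435, 2272995),
    "cna_eng": (96, 50, 148, 21657, 85600),
    "ltw_eng": (91, 467, 1220, 192650, 295224),
    "nyt_eng": (149, 2713, 7481, 1188494, 1655279),
    "xin_eng": (143, 586, 1750, 249521, 1247039),
}


def column_totals(rows):
    """Sum each column across all sources."""
    return tuple(sum(column) for column in zip(*rows.values()))


print(column_totals(ROWS))  # (722, 6841, 19391, 2968475, 7148446)
```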

The following tables present "Text-MB", "K-wrds" and "#DOCS" broken
down by source and DOC type; "Text-MB" represents the total number of
characters (including whitespace) after SGML tags are eliminated.

        Text-MB  K-wrds   #DOCs
advis
  afp_eng    99   13542   34587
  apw_eng   173   26110   37778
  cna_eng     0      12      52
  ltw_eng    81   13053   27081
  nyt_eng   512   81988  142193
  xin_eng    12    1920    7522
  TOTAL     879  136625  249213

multi
  afp_eng    62    9414   26243
  apw_eng   238   38936   57249
  cna_eng    14    2351   12491
  ltw_eng    17    2852    6342
  nyt_eng   124   20435   33183
  xin_eng   109   17427   74491
  TOTAL     565   91415  209999

other
  afp_eng    78   11561   87946
  apw_eng   337   47208  273377
  cna_eng     2     213    1935
  ltw_eng     1     209     995
  nyt_eng   109   16664   25597
  xin_eng    88   12645  110694
  TOTAL     617   88500  500544

story
  afp_eng  2591  432204 1443533
  apw_eng  4443  737185 1904591
  cna_eng   117   19086   71122
  ltw_eng  1053  176543  260806
  nyt_eng  6304 1069406 1454306
  xin_eng  1339  217519 1054332
  TOTAL   15849 2651943 6188690


GENERAL AND SOURCE-SPECIFIC PROPERTIES OF THE DATA
--------------------------------------------------

Much of the text data (all of AFP_ENG, most of APW_ENG, LTW_ENG and
NYT_ENG) were received at LDC via dedicated, 24-hour/day electronic
feeds (leased phone lines in the case of APW_ENG, LTW_ENG and NYT_ENG,
a local satellite dish for AFP_ENG).  These 24-hour transmission
services were all susceptible to "line noise" (occasional corruption
of text content), as well as service outages both at the data source
and at our receiving computers.  Usually, the various disruptions of a
newswire data stream would leave tell-tale evidence in the form of
byte values falling outside the range of printable characters, or
recognizable patterns of anomalous ASCII strings.

All XIN_ENG data, all CNA_ENG data, and a 2-year portion of APW_ENG
were received as bulk electronic text archives via internet retrieval.
As such, they were not susceptible to modem line-noise or related
disruptions, though this does not guarantee that the source data are
free of mishaps.  Also, the more recent portions of APW_ENG, LTW_ENG
and NYT_ENG have been delivered by various internet-based subscription
systems (explained in more detail in the source-specific sections
below); again, this has eliminated the various problems with modem
noise, but does not assure "perfect" data.

All the data have undergone a consistent level of quality control, to
remove improper characters and other obvious forms of corruption.

Naturally, since the source data are all generated manually on a daily
basis, there will be a small percentage of human errors common to all
sources: missing whitespace, incorrect or variant spellings, badly
formed sentences, and so on, as are normally seen in newspapers.  No
attempt has been made to address this property of the data.

A common feature of the modem-based archives is that stories may be
repeated in the course of daily transmissions (or daily archiving).
Sometimes a later transmission of a story comes
with minor alterations (fixed spelling, one or more paragraphs added
or removed); but just as often, the collection ends up with two or
more DOCs that are fully identical.  In general, though, this practice
affects a relatively small minority of the overall content.  (NYT_ENG
is perhaps the worst offender in this regard, sometimes sending as
many as six copies of some featured story.)  We have not attempted to
eliminate these duplications; however, we plan to make information
about duplicate and similar articles available on our web site as
supplemental information for this corpus.  (See the "ADDITIONAL
INFORMATION and UPDATES" section below.)
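
Users who need to set aside the fully identical copies themselves can
do so by hashing the text of each DOC.  The Python sketch below is one
simple approach, not the method behind any LDC duplicate lists; it
finds exact matches only, so copies with minor alterations are not
detected.  The name find_exact_duplicates is an assumption of this
example.

```python
import hashlib
from collections import defaultdict


def find_exact_duplicates(docs):
    """Group DOC ids whose text is byte-for-byte identical.

    `docs` is an iterable of (doc_id, text) pairs.  Returns a list of
    id lists, one per group containing two or more identical DOCs.
    """
    groups = defaultdict(list)
    for doc_id, text in docs:
        digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
        groups[digest].append(doc_id)
    return [ids for ids in groups.values() if len(ids) > 1]
```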

Finally, some of the modem services typically show a practice of
breaking long stories into chunks, and sending the chunks as separate
DOC units, with each unit having the normal structural features of a
full story.  (This is especially prevalent in NYT_ENG, which has the
longest average story length of all the sources.)  Normally, when this
sort of splitting is done, cues are provided in the text of each chunk
that allow editors to reconstruct the full report; but these cues tend
to rely heavily on editorial skills -- it is taken for granted by each
news service that the stories will be reassembled manually as needed
-- so the process of combining the pieces into a full story is not
amenable to an algorithmic solution, and no attempt has been made to
do this.  Also, some sources (especially NYT and LTW) include advisory
annotations in the longer stories, providing guidance on how such
stories can be abridged (e.g. "(STORY CAN END HERE, OPTIONAL MATERIAL
FOLLOWS)", and other such phrases, typically parenthesized and in all
caps).
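
Because these annotations are typically parenthesized and in all caps,
a simple pattern can flag most candidates.  The Python sketch below is
a heuristic only: the pattern is an assumption of this example, it
requires at least two words so that short items such as "(AP)" are
skipped, and it will both miss some annotations and flag some
ordinary text.

```python
import re

# A parenthesized run of two or more ALL-CAPS words.
ANNOTATION_RE = re.compile(r"\([A-Z][A-Z0-9,.'-]*(?: [A-Z0-9,.'-]+)+\)")


def find_annotations(text):
    """Return the parenthesized all-caps phrases found in `text`."""
    return ANNOTATION_RE.findall(text)
```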

The following sections explain data properties that are particular to
each source.

AFP_ENG:

There is a gap of 54 months in the AFP_ENG collection (about four and
a half years), spanning from May 1997 to December 2001; the LDC had
discontinued its subscription to the AFP English wire service during
this period, and at the point where we restored the subscription near
the end of 2001, there was no practical means for recovering the
portion that was missed.  There is also a gap spanning from September
20, 2002 to October 2, 2002 and another gap spanning from August 6,
2003 to September 10, 2003.

Apart from these, the AFP_ENG content shows a high degree of internal
consistency (relative to APW_ENG and NYT_ENG), in terms of day-to-day
content and typographic conventions.

APW_ENG:

This service provides up to six other languages besides English on the
same modem connection, with DOCs in all languages interleaved at
random; of course, we have extracted just the English content for
publication here.  The service draws news from quasi-independent
offices around the world, so there tends to be more variability here
in terms of typographic conventions; there is also a noticeably higher
percentage of non-story content, especially in the "other" category:
tables of sports results, stocks, weather, etc.

During the period between August 1999 and August 2001, the modem
service failed to deliver English content, while data in other
languages continued to flow in.  (LDC was spooling the data
automatically, and during this period, alarms would be raised only if
the data flow stopped completely -- so the absence of English went
unnoticed.)  On learning of this gap in the data, we were able to
recover much of the missing content with help from AP's New York City
office and from Richard Sproat at AT&T Labs -- we gratefully
acknowledge their assistance.  Both were able to supply bulk archives
that covered most of the period that we had missed.  In particular,
August - November 1999 and January - September 2000 were retrieved
from USENET/ClariNet and web archives that AT&T had collected for its
own research use, while the October 2000 - August 2001 data were
supplied by AP directly from their own web service archive.  As a
result of the varying sources, these sub-parts of APW_ENG data tend to
differ from the rest of the collection (and from each other), in terms
of daily quantity, extent of typographic variance, and possibly the
breadth of subject matter being reported.

Among the data added in this edition, the January 2004 data were
particularly noisy due to transmission errors.  We have removed the
documents from this month that contained explicit noise.

Starting in May 2004, APW switched to a dedicated internet delivery
system, eliminating the problems of modem noise and also creating a
much better environment for limiting or avoiding duplicate content in
stories.  This system of collection continued to operate until the
end of August, 2006.  At that point, there was a brief lapse in the
collection (roughly the first half of September 2006 is missing from
our archives), and then data reception switched to a "Network News
Transfer Protocol" (NNTP, related to Usenet transmission).  Under this
delivery method, we found that many stories were being delivered two
or three times each, but it has proven to be fairly easy to remove
these duplications.

CNA_ENG:

The amount of data for this source is relatively small compared to
other sources.  This data set has been delivered to the LDC via
internet transfer.  As a result, we avoided many of the problems that
commonly afflict newswire data collected over modems.  There is a
large gap of 16 months from April 2002 to July 2003 in this data set.

When this source was first released in Gigaword English II, the data
had been incorrectly assumed to be ASCII only, and when non-ASCII
bytes were found, they were simply removed.  In preparing the current
release, we found that the CNA source data actually used the Big-5
("Traditional Chinese") character set in various irregular ways,
usually to render "full-width" variants of ASCII letters, digits and
punctuation.  The approach taken in the previous release caused many
of these "wide" characters to end up as data corruption, particularly
when the second byte of the Big-5 wide character happened to fall in
the ASCII range (which is common for the Big-5 "full-width" versions
of ASCII characters).

For the current release, all the CNA data has been reprocessed from
original sources and correctly converted from Big-5 to UTF-8; where
appropriate, we have normalized the "full-width character" variants to
their corresponding ASCII equivalents.
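
The kind of conversion described above can be sketched in Python as
follows.  This is not the LDC's conversion code: the function names
are introduced here for illustration, and the sketch assumes that the
"full-width" variants decode to the Unicode range U+FF01 through
U+FF5E (plus the ideographic space U+3000), which sit at a fixed
offset from their ASCII counterparts.

```python
# Full-width forms U+FF01..U+FF5E map onto ASCII U+0021..U+007E;
# the ideographic space U+3000 maps onto a plain space.
FULLWIDTH_TO_ASCII = {code: code - 0xFEE0 for code in range(0xFF01, 0xFF5F)}
FULLWIDTH_TO_ASCII[0x3000] = 0x20


def normalize_fullwidth(text):
    """Replace full-width letters, digits and punctuation with ASCII."""
    return text.translate(FULLWIDTH_TO_ASCII)


def big5_to_utf8(raw):
    """Decode Big-5 bytes, normalize full-width forms, re-encode as UTF-8.

    Undecodable bytes are replaced rather than silently dropped.
    """
    text = raw.decode("big5", errors="replace")
    return normalize_fullwidth(text).encode("utf-8")
```

Decoding before any byte-level filtering is the important step:
removing non-ASCII bytes first would strand the second byte of a
two-byte character whenever that byte falls in the ASCII range.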

LTW_ENG:

There is a gap of about 62 months (mid-June 1998 through early August
2003) during which the LDC had dropped its subscription.  The data
were collected via dedicated modem up until March 2004, at which point
the delivery was switched to E-mail transmission, eliminating data
loss due to modem noise.  The effect of the transmission change on
duplicated material has not been determined, but this source has
tended to show a relatively low degree of duplication.

LTW provides not only the content that is specific to the daily
newspapers published in Los Angeles and Washington, D.C., but also a
sampling of newspaper content from other papers in other cities.

NYT_ENG:

Prior to 2003, there had been only a few scattered service
interruptions for NYT_ENG, and these typically involved gaps of a few
days (the longest was about two weeks).  However, there was a time
period, from February 2003 to June 2004, in which pervasive modem
noise induced a significant amount of character data corruption,
affecting the control-character story-boundary markers as well as the
text content of the stories themselves.  We have filtered out
documents that showed explicit evidence of corruption.  As a result,
there are fewer documents in this time period.  In particular, no
data from June 2004 and very little data from May 2004 are included
in this release.  Also, even after
filtering out stories that showed explicit evidence of corruption
(invalid sequences of story-boundary control codes, occurrences of
inappropriate byte values), there are still likely to be
"non-explicit" cases of data corruption in the stories that remain for
this time period.  On July 1, 2004, we switched to an internet-based
file transfer method to receive NYT_ENG articles, and the NYT_ENG data
after this date was not susceptible to modem line-noise.

It should be noted that NYT_ENG documents from 16 days in July 2002 --
all odd numbered days -- have been intentionally excluded from this
collection in order to satisfy a contractual agreement with a
partner site. 

The NYT_ENG service provides not only the content that is specific to
the New York Times daily newspaper publication, but also a wide and
varied sampling of news and features from other urban and regional
newspapers around the U.S., including:

Albany Times Union
Arizona Republic
Atlanta Constitution
Bloomberg Business News
Boston Globe
Casper (Wyo.) Star-Tribune
Chicago Sun-Times
Columbia News Service
Cox News Service
Fort Worth Star-Telegram
Hearst Newspapers
Houston Chronicle
International Herald Tribune
Kansas City Star
Los Angeles Daily News
San Antonio Express-News
San Francisco Chronicle
Seattle Post-Intelligencer
States News Service

Typically, the actual source of a given DOC was indicated in the raw
data via an abbreviation (e.g. AZR, BLOOM, COX, LADN, NYT, SPI, etc)
at the end of the "slug" line that accompanies every story.  (The
"slug" is a short string, usually less than 40 characters, that news
editors use to tag and sort stories and topics over the course of a
day.)  Because this feature of NYT_ENG slug lines is quite consistent
and informative, the markup strategy was adapted to make sure that the
full slug line would be included as part of the content of the
"DATELINE" tag whenever possible.  (Slugs were either not present or
not retained in the other newswire sources.)  Some examples:

<DATELINE>
TEMPE, Ariz. (BC-FIESTA-BLOCK-AZR)
</DATELINE>

<DATELINE>
LOS ANGELES (BC-BKN-LAKERS-ONEAL-LADN)
</DATELINE>

<DATELINE>
NEW YORK (BC-NY-NEWYEAR-ART-1STLD-WRITETHRU-675&AMP;ADD-NYT)
</DATELINE>

<DATELINE>
 (BC-OBIT-KENNEDY-NYT)
</DATELINE>

The first three examples are cases where the opening paragraph had a
dateline string; in the fourth, the opening paragraph had no dateline.
The slug is normally ALL-CAPS-AND-HYPHENS (this is how it is presented
by the newswire service -- there are some exceptions, of course, and
the occasional glitch); it is always preceded by a space and an open
parenthesis, and always followed by a close parenthesis.  Meanwhile,
the dateline string taken from the first paragraph (when present) is
always presented first on the line, with no initial space; it can be
mixed-case, may have multiple word tokens, and may have punctuation.
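
Given the layout just described (optional dateline string, then a
space, then the slug in parentheses), the parts of an NYT_ENG DATELINE
line can be separated with one regular expression.  The Python sketch
below is illustrative only; the name parse_dateline is an assumption
of this example, and it treats the last hyphen-separated piece of the
slug as the source abbreviation, which will be wrong for the
occasional glitch.

```python
import re

DATELINE_RE = re.compile(r"^(?P<place>.*?) \((?P<slug>[^()]+)\)\s*$")


def parse_dateline(line):
    """Return (dateline string, slug, source abbreviation).

    When no parenthesized slug is found, the slug and source are None.
    """
    match = DATELINE_RE.match(line)
    if match is None:
        return line.strip(), None, None
    slug = match.group("slug")
    source = slug.rsplit("-", 1)[-1]
    return match.group("place"), slug, source
```

For the first example above this returns "TEMPE, Ariz.",
"BC-FIESTA-BLOCK-AZR" and "AZR"; for the fourth, the dateline string
is empty.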

Features of text formatting, style and subject matter may vary
somewhat according to the original source.  Overall, NYT_ENG shows the
largest amount of "advisory" content, both in terms of how many DOCs
are addressed specifically to the receiving news editors, and in terms
of additional "advice" included within regular news stories, e.g.
"(STORY CAN END HERE. OPTIONAL MATERIAL FOLLOWS)".

XIN_ENG:

The Xinhua English news archive provided fairly consistent formatting
and coverage spanning 1995 through 2006, making it fairly easy to
prepare for research use.  Many stories have the distinct flavor of an
official government information source, in contrast to the other news
services represented here.  The material is otherwise unremarkable.

Gigaword English Release 3 (LDC2007T07) -- UPDATE -- 2008-07-01:
====================================================================
It was recently brought to our attention that the New York Times newswire
text archive in this corpus contained some articles in Spanish.  After
doing a complete scan of the 149 monthly data files under "nyt_eng", we
identified 2517 DOC elements with the 'type="story"' attribute where the
story content was in Spanish.

In the process, we also found 421 DOC elements with the 'type="story"'
attribute where the text content was in fact not a news story.

We have added two additional files to the LDC's Online Documentation set 
for this corpus (available from the "Online documentation: yes" link on 
the catalog web page for LDC2007T07):

	other.file-doc.map
	spanish.file-doc.map

The first map file lists the file names and DOC ID strings for the 421 DOC
elements that were incorrectly labeled as 'type="story"'; the second lists
the file names and DOC ID strings for the 2517 DOC elements containing
Spanish text.  Users of the corpus who focus on the "story" classification
for their work may want to eliminate the listed DOC IDs from consideration
in their processing of the data.
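
A possible way to apply these lists is sketched below in Python.  The
exact layout of the two map files is not described here, so this
sketch ASSUMES one entry per line, with whitespace-separated fields
and the DOC ID as the last field; check the files themselves before
relying on it.  The function names are likewise introduced only for
this example.

```python
def load_excluded_ids(map_lines):
    """Collect DOC IDs from map-file lines.

    ASSUMPTION: each non-empty line ends with the DOC ID as its last
    whitespace-separated field.
    """
    excluded = set()
    for line in map_lines:
        fields = line.split()
        if fields:
            excluded.add(fields[-1])
    return excluded


def keep_docs(docs, excluded):
    """Drop (doc_id, text) pairs whose id is in the excluded set."""
    return [(doc_id, text) for doc_id, text in docs if doc_id not in excluded]
```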

The affected DOC elements were also present in previous releases of
Gigaword English, to the extent that the dates of the affected DOCs fall
within the time spans covered by the earlier versions of the corpus.

We would like to express our gratitude to Paul Cook of the Computer
Science Department  at the University of Toronto for bringing this problem 
to our attention.


ADDITIONAL INFORMATION AND UPDATES
----------------------------------

Additional information, updates, and bug fixes may be available in the
LDC catalog entry for this corpus (LDC2007T07) at:

http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2007T07

--------------------------------------------------------------------------
Original README file written by David Graff, January 2003

Updated by Junbo Kong and Kazuaki Maeda for the Second Edition, June 2005

Linguistic Data Consortium