English Gigaword was produced by the Linguistic Data Consortium (LDC), with catalog number LDC2003T05 and ISBN 1-58563-260-0, and is distributed on DVD. It is a comprehensive archive of English newswire text data that the LDC has acquired over several years.
Four distinct international sources of English newswire are represented here:
Agence France Presse English Service
Associated Press Worldstream English Service
The New York Times Newswire Service
The Xinhua News Agency English Service
Much of the content in this collection has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora (LDC95T21, LDC98T30), the various TDT corpora and the AQUAINT text corpus (LDC2002T31). But there is a significant amount of material that is being released here for the first time: all of the Agence France Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.
Each data file name consists of the three-letter prefix, followed by a six-digit date (representing the year and month during which the file contents were delivered by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
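The naming convention above can be sketched in Python. This is a minimal illustration, not an LDC tool: the function names `parse_file_name` and `read_text` are hypothetical, and the sample name "nyt200102.gz" is only an example built from the pattern described (three letters, a six-digit year and month, then ".gz").

```python
import gzip
import re
from typing import NamedTuple

# Pattern assumed from the description: three letters, YYYYMM, then ".gz".
FILE_NAME = re.compile(r"^([a-z]{3})(\d{4})(\d{2})\.gz$", re.IGNORECASE)


class DataFile(NamedTuple):
    source: str  # three-letter source prefix, lower-cased
    year: int
    month: int


def parse_file_name(name: str) -> DataFile:
    """Split a data file name into source prefix, year and month."""
    match = FILE_NAME.match(name)
    if match is None:
        raise ValueError(f"unexpected file name: {name!r}")
    source = match.group(1).lower()
    year, month = int(match.group(2)), int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return DataFile(source, year, month)


def read_text(path: str) -> str:
    """Decompress one gzip data file and return its contents as text."""
    with gzip.open(path, "rt", encoding="ascii") as handle:
        return handle.read()
```

For example, `parse_file_name("nyt200102.gz")` yields the source "nyt", year 2001 and month 2. Reading with `encoding="ascii"` relies on the statement that all text is printable ASCII and whitespace; a decoding error would indicate a file that departs from that description.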
All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication.
The markup structure, common to all data files, can be summarized as follows:
The Headline Element is Optional -- not all DOCs have one
The Dateline Element is Optional -- not all DOCs have one
Paragraph tags are only used if the "type" attribute of the DOC happens to be "story"
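As a rough illustration of how this structure might be read, the Python sketch below extracts DOC units with regular expressions. The tag names `HEADLINE`, `DATELINE` and `P` and the `id` attribute are assumptions made for the example (the text above names only the DOC element, its "type" attribute, and the headline, dateline and paragraph elements), and the sample document is invented. It is not a substitute for a real SGML parser such as nsgmls.

```python
import re
from typing import Dict, Iterator, Optional

# Assumed layout: <DOC ...attributes...> body </DOC>
DOC_RE = re.compile(r"<DOC\s+([^>]*)>(.*?)</DOC>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)\s*=\s*"([^"]*)"')
PARA_RE = re.compile(r"<P>(.*?)</P>", re.DOTALL)


def _element(body: str, tag: str) -> Optional[str]:
    """Return the whitespace-normalized content of an optional element."""
    match = re.search(rf"<{tag}>(.*?)</{tag}>", body, re.DOTALL)
    return " ".join(match.group(1).split()) if match else None


def iter_docs(sgml: str) -> Iterator[Dict[str, object]]:
    """Yield one dictionary per DOC unit found in the text."""
    for match in DOC_RE.finditer(sgml):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        body = match.group(2)
        yield {
            "id": attrs.get("id"),
            "type": attrs.get("type"),
            "headline": _element(body, "HEADLINE"),  # None if absent
            "dateline": _element(body, "DATELINE"),  # None if absent
            "paragraphs": [" ".join(p.split()) for p in PARA_RE.findall(body)],
        }


# Invented sample, used only to exercise the sketch.
SAMPLE = """<DOC id="EXAMPLE0001" type="story">
<HEADLINE>
Example headline
</HEADLINE>
<TEXT>
<P>
First paragraph of
the story.
</P>
</TEXT>
</DOC>
"""
```

Because the headline and dateline are optional, the sketch returns `None` for a missing element rather than failing, and the paragraph list is simply empty for DOCs that carry no paragraph tags.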
Note that all data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.
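A small check of these text conventions could look like the following Python sketch. The function name `check_text_conventions` and the report keys are hypothetical. Since lines are only "generally" wrapped to 80 characters, the sketch counts exceptions instead of treating them as errors.

```python
from typing import Dict


def check_text_conventions(data: bytes, max_width: int = 80) -> Dict[str, int]:
    """Count lines that depart from the stated text conventions."""
    report = {
        "lines": 0,
        "carriage_returns": 0,  # lines containing "\r", i.e. not plain "\n" endings
        "over_width": 0,        # lines longer than max_width bytes
        "unexpected_bytes": 0,  # lines with bytes outside printable ASCII, tab or "\r"
    }
    lines = data.split(b"\n")
    if lines and lines[-1] == b"":
        lines.pop()  # drop the empty piece after the final newline
    for line in lines:
        report["lines"] += 1
        if b"\r" in line:
            report["carriage_returns"] += 1
        if len(line) > max_width:
            report["over_width"] += 1
        if any(b > 126 or (b < 32 and b not in (9, 13)) for b in line):
            report["unexpected_bytes"] += 1
    return report
```

The choice to accept only space, tab and newline as whitespace is an assumption of this sketch; other ASCII whitespace such as form feed is reported under `unexpected_bytes`.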
For this release, all sources have received a uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types." The classification is indicated by the type="string" attribute that is included in each opening DOC tag. The four types are: story, multi, advis and other.
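To see how the DOC units of a file are distributed over these types, one could tally the "type" attribute of every opening DOC tag. The Python sketch below is illustrative only; the function name `count_doc_types` is hypothetical, and the regular expression assumes the attribute value is written in double quotes.

```python
import re
from collections import Counter

KNOWN_TYPES = ("story", "multi", "advis", "other")

# Captures the value of the type attribute in an opening DOC tag.
TYPE_RE = re.compile(r'<DOC\b[^>]*\btype\s*=\s*"([^"]*)"')


def count_doc_types(sgml: str) -> Counter:
    """Tally DOC units by their type attribute."""
    counts = Counter({doc_type: 0 for doc_type in KNOWN_TYPES})
    counts.update(TYPE_RE.findall(sgml))
    return counts
```

Any value outside the four listed types would show up under its own key, which makes unexpected labels easy to spot.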
Statistics regarding the quantities of data for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are not compressed (i.e. nearly 12 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.
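The "K-wrds" measure described above (whitespace-separated tokens after all SGML tags are eliminated) can be approximated with a short Python sketch. The function names are hypothetical, and the handling of details such as SGML entities is an assumption, so the result may not match the published figures exactly.

```python
import re

# Anything between "<" and ">" is treated as an SGML tag.
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml: str) -> int:
    """Count whitespace-separated tokens once the tags are removed."""
    return len(TAG_RE.sub(" ", sgml).split())


def k_words(sgml: str) -> float:
    """Express the token count in thousands of words."""
    return count_tokens(sgml) / 1000
```

Tags are replaced by a space instead of being deleted outright, so that text on either side of a tag is never fused into a single token.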
There are no updates available at this time.
Portions © 1994-1997 and 2001-2002 Agence France-Presse, © 1994-2002 Associated Press, © 1994-2002 New York Times, © 1995-2001 Xinhua News Agency, © 2002 Trustees of the University of Pennsylvania