English Gigaword

Item Name: English Gigaword
Author(s): David Graff, Christopher Cieri
LDC Catalog No.: LDC2003T05
ISBN: 1-58563-260-0
ISLRN: 953-543-425-922-6
Release Date: January 28, 2003
Member Year(s): 2003
DCMI Type(s): Text
Data Source(s): newswire
Project(s): TIDES, GALE, EARS
Application(s): natural language processing, language modeling, information retrieval
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2003T05 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Graff, David, and Christopher Cieri. English Gigaword LDC2003T05. Web Download. Philadelphia: Linguistic Data Consortium, 2003.

Introduction

English Gigaword, Linguistic Data Consortium (LDC) catalog number LDC2003T05 and ISBN 1-58563-260-0, was produced by LDC and is distributed on DVD. It is a comprehensive archive of English newswire text that LDC has acquired over several years.

Four distinct international sources of English newswire are represented here:

Agence France Presse English Service (afe)
Associated Press Worldstream English Service (apw)
The New York Times Newswire Service (nyt)
The Xinhua News Agency English Service (xie)

Data

Much of the content in this collection has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora (LDC95T21, LDC98T30), the various TDT corpora and the AQUAINT text corpus (LDC2002T31). But there is a significant amount of material that is being released here for the first time: all of the Agence France Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.

Each data file name consists of the three-letter prefix, followed by a six-digit date (representing the year and month during which the file contents were delivered by the respective news source), followed by a ".gz" file extension, indicating that the file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). So, each file contains all the usable data received by LDC for the given month from the given news source.
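The naming convention above can be sketched in Python as follows. This is a minimal illustration, not part of the LDC distribution: the names `parse_file_name`, `GigawordFile` and `open_gigaword` are hypothetical, the example name `afe199405.gz` is made up, and the sketch assumes the prefix is lowercase and the six digits are ordered as year then month (YYYYMM).

```python
import gzip
import re
from typing import NamedTuple, TextIO

# Assumed pattern: three lowercase letters, four-digit year, two-digit month, ".gz".
FILE_NAME_RE = re.compile(r"^([a-z]{3})(\d{4})(\d{2})\.gz$")


class GigawordFile(NamedTuple):
    source: str  # three-letter source prefix, e.g. "afe"
    year: int
    month: int


def parse_file_name(name: str) -> GigawordFile:
    """Split a data file name into source prefix, year and month."""
    match = FILE_NAME_RE.match(name)
    if match is None:
        raise ValueError(f"not a recognized data file name: {name!r}")
    source = match.group(1)
    year = int(match.group(2))
    month = int(match.group(3))
    if not 1 <= month <= 12:
        raise ValueError(f"month out of range in {name!r}")
    return GigawordFile(source, year, month)


def open_gigaword(path: str) -> TextIO:
    """Open a gzip-compressed data file as text.

    Undecodable bytes are replaced rather than raising an error.
    """
    return gzip.open(path, "rt", encoding="ascii", errors="replace")
```

For example, `parse_file_name("afe199405.gz")` yields the source `afe`, the year 1994 and the month 5.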

All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication.

Please follow this link for a sample file.

The markup structure, common to all data files, can be summarized as follows:

The headline element is optional; not all DOCs have one.
The dateline element is optional; not all DOCs have one.
Paragraph tags are used only if the "type" attribute of the DOC is "story".
All data files use the UNIX-standard "\n" form of line termination, and text lines are generally wrapped to a width of 80 characters or less.

For this release, all sources have received a uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types." The classification is indicated by the type="string" attribute that is included in each opening DOC tag. The four types are: story, multi, advis and other.
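As a rough illustration of how the type attribute might be used, the following Python sketch tallies DOC units by type in the decompressed text of one data file. It is a regular-expression shortcut rather than a real SGML parse, and it assumes only what is stated above: that each opening DOC tag carries a double-quoted type attribute. The function name `count_doc_types` and the sample text, including its `id` values, are invented for the example.

```python
import re
from collections import Counter

# Matches an opening DOC tag and captures the value of its type attribute.
# Any other attributes that may be present in the tag are skipped over.
DOC_OPEN_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"[^>]*>')


def count_doc_types(sgml_text: str) -> Counter:
    """Count DOC units per type value in a string of SGML text."""
    return Counter(DOC_OPEN_RE.findall(sgml_text))


# Invented sample, not taken from the corpus.
sample = (
    '<DOC id="SAMPLE.0001" type="story">\n'
    "First body.\n"
    "</DOC>\n"
    '<DOC id="SAMPLE.0002" type="advis">\n'
    "Second body.\n"
    "</DOC>\n"
    '<DOC id="SAMPLE.0003" type="story">\n'
    "Third body.\n"
    "</DOC>\n"
)

type_counts = count_doc_types(sample)
```

Because the categorization is described as approximate, counts obtained this way are best treated as a guide to the mix of content in a file rather than as exact figures.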

Statistics regarding the quantities of data for each source are summarized below. Note that the "Totl-MB" numbers show the amount of data you get when the files are not compressed (i.e. nearly 12 gigabytes, total); the "Gzip-MB" column shows totals for compressed file sizes as stored on the DVD-ROM; the "K-wrds" numbers are simply the number of whitespace-separated tokens (of all types) after all SGML tags are eliminated.

Source  #Files  Gzip-MB  Totl-MB   K-wrds    #DOCs
AFE         44      417     1216   170969   656269
APW         91     1213     3647   539665  1477466
NYT         96     2104     5906   914159  1298498
XIE         83      320      940   131711   679007
TOTAL      314     4054    11709  1756504  4111240
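The "K-wrds" measure described above (whitespace-separated tokens after all SGML tags are eliminated) can be approximated with a few lines of Python. This is a sketch under stated assumptions, not the tool LDC used: it treats anything between "<" and ">" as a tag, so it assumes no literal angle brackets occur in the text itself, and the function name `count_tokens` is hypothetical.

```python
import re

# Anything between "<" and ">" is treated as an SGML tag (an assumption).
TAG_RE = re.compile(r"<[^>]+>")


def count_tokens(sgml_text: str) -> int:
    """Count whitespace-separated tokens after removing SGML tags."""
    # Replace each tag with a space so adjacent words are not joined.
    without_tags = TAG_RE.sub(" ", sgml_text)
    return len(without_tags.split())
```

Summing this count over every file from a source and dividing by 1000 should give a figure comparable to the corresponding "K-wrds" entry, though small differences are possible if the original counts were computed differently.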

Updates

There are no updates available at this time.
