English Gigaword

Item Name:	English Gigaword
Author(s):	David Graff, Christopher Cieri
LDC Catalog No.:	LDC2003T05
ISBN:	1-58563-260-0
ISLRN:	953-543-425-922-6
DOI:	https://doi.org/10.35111/0z6y-q265
Release Date:	January 28, 2003
Member Year(s):	2003
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	EARS, GALE, TIDES
Application(s):	information retrieval, language modeling, natural language processing
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2003T05 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Graff, David, and Christopher Cieri. English Gigaword LDC2003T05. Web Download. Philadelphia: Linguistic Data Consortium, 2003.
Related Works:	hasVersion: LDC2005T12 English Gigaword Second Edition; LDC2007T07 English Gigaword Third Edition; LDC2009T13 English Gigaword Fourth Edition; LDC2011T07 English Gigaword Fifth Edition. isOutcomeOf: LDC95T21 North American News Text Corpus; LDC98T30 North American News Text Supplement; LDC2002T31 The AQUAINT Corpus of English News Text.

Introduction

English Gigaword was produced by the Linguistic Data Consortium (LDC) and contains approximately 1.8 billion words of English news text. It is a comprehensive archive of English newswire text that LDC has acquired over several years.

Four distinct international sources of English newswire are represented here:

Agence France Presse English Service (AFE)
Associated Press Worldstream English Service (APW)
The New York Times Newswire Service (NYT)
The Xinhua News Agency English Service (XIE)

Data

Much of the content in this collection has been published previously by LDC in a variety of other, older corpora, particularly the North American News Text Corpus (LDC95T21), the North American News Text Supplement (LDC98T30), the various TDT corpora, and The AQUAINT Corpus of English News Text (LDC2002T31). However, a significant amount of material is being released here for the first time: all of the Agence France Presse content, the 1995 and 2001 Xinhua content, and the portions of NYT and APW dating from February 2001 forward.

The file contents have been compressed using the GNU "gzip" compression utility (RFC 1952). All text data are presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. The corpus has been fully validated by a standard SGML parser utility (nsgmls), using a DTD file which is provided as part of this publication.
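
As a rough illustration of the format described above, the following Python sketch decompresses gzip data and reports any line containing characters other than printable ASCII and whitespace. The sample content and the helper names (`non_ascii_lines`, `check_gzip_bytes`) are hypothetical and are not part of the publication; the sketch works on in-memory data so that it runs without the corpus.

```python
import gzip
import io
import string

# Printable ASCII plus whitespace, matching the description of the text data.
ALLOWED = set(string.printable)


def non_ascii_lines(stream):
    """Yield (line_number, line) for lines with characters outside ALLOWED."""
    for number, line in enumerate(stream, start=1):
        if not set(line) <= ALLOWED:
            yield number, line


def check_gzip_bytes(data):
    """Decompress gzip data held in memory and return the offending lines."""
    # latin-1 decodes every byte value, so unexpected bytes are reported
    # by the check instead of raising a decoding error.
    with gzip.open(io.BytesIO(data), mode="rt", encoding="latin-1") as stream:
        return list(non_ascii_lines(stream))


# Hypothetical sample content; not taken from the corpus.
sample = "<DOC>\nA short example line.\n</DOC>\n"
clean = gzip.compress(sample.encode("ascii"))
dirty = gzip.compress((sample + "caf\u00e9\n").encode("latin-1"))

clean_result = check_gzip_bytes(clean)
dirty_result = check_gzip_bytes(dirty)
print(clean_result)
print(dirty_result)
```

For a file on disk, the same check could be applied by passing a path to `gzip.open` in place of the `io.BytesIO` object.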

There are 314 files, totaling approximately 4 GB in compressed form (12 GB uncompressed).

The table below presents the following categories of information: source of the data, number of files per source, K-words (thousands of words), and number of documents.

Source	#Files	K-words	#DOCs
AFE	44	170,969	656,269
APW	91	539,665	1,477,466
NYT	96	914,159	1,298,498
XIE	83	131,711	679,007
TOTAL	314	1,756,504	4,111,240
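
The TOTAL row can be checked against the per-source rows with a few lines of Python. The sketch below copies the figures from the table, confirms the column sums, and derives an average document length per source; that average is an illustrative calculation added here, not a figure reported in the table.

```python
# Rows copied from the table: source -> (files, K-words, documents).
rows = {
    "AFE": (44, 170_969, 656_269),
    "APW": (91, 539_665, 1_477_466),
    "NYT": (96, 914_159, 1_298_498),
    "XIE": (83, 131_711, 679_007),
}
reported_total = (314, 1_756_504, 4_111_240)

# Sum each column across the four sources and compare with the TOTAL row.
computed_total = tuple(sum(column) for column in zip(*rows.values()))
assert computed_total == reported_total

# K-words are thousands of words, so multiply by 1000 before dividing.
avg_words = {
    source: kwords * 1000 / docs for source, (_, kwords, docs) in rows.items()
}
for source, value in avg_words.items():
    print(f"{source}: about {value:.0f} words per document")
```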

For this release, all sources have received a uniform treatment in terms of quality control, and we have applied a rudimentary (and _approximate_) categorization of DOC units into four distinct "types":

story	This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
multi	This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event: "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
advis	These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
other	These DOCs clearly do not fall into any of the above types; these are things like lists of sports scores, stock prices, temperatures around the world, and so on.
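
A tally of DOC units by type might look like the Python sketch below. It assumes that each DOC opening tag sits on its own line and carries `id` and `type` attributes in that order; this layout is an assumption for illustration and is not confirmed by the description above, so the DTD supplied with the publication should be treated as the authority. The sample lines and identifiers are hypothetical.

```python
import re
from collections import Counter

# Assumed form of a DOC opening tag; the attribute names and their order
# are assumptions, not taken from the DTD.
DOC_TAG = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>')


def count_doc_types(lines):
    """Count DOC opening tags by the value of their (assumed) type attribute."""
    counts = Counter()
    for line in lines:
        match = DOC_TAG.match(line)
        if match:
            counts[match.group(2)] += 1
    return counts


# Hypothetical sample lines; the identifiers are made up.
sample_lines = [
    '<DOC id="EXAMPLE-0001" type="story">',
    "Body text of the first document.",
    "</DOC>",
    '<DOC id="EXAMPLE-0002" type="advis">',
    "</DOC>",
    '<DOC id="EXAMPLE-0003" type="story">',
    "</DOC>",
]

type_counts = count_doc_types(sample_lines)
print(dict(type_counts))
```

Because the categorization is described as approximate, counts of this kind are best read as a rough guide to the mix of material rather than as exact figures.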
