
English Gigaword Second Edition

Item Name:	English Gigaword Second Edition
Author(s):	David Graff, Junbo Kong, Ke Chen, Kazuaki Maeda
LDC Catalog No.:	LDC2005T12
ISBN:	1-58563-350-X
ISLRN:	274-788-133-216-1
DOI:	https://doi.org/10.35111/stcf-4x49
Release Date:	July 15, 2005
Member Year(s):	2005
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	EARS, GALE, TIDES
Application(s):	information retrieval, language modeling, natural language processing
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2005T12 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Graff, David, et al. English Gigaword Second Edition LDC2005T12. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:
isVersionOf:	LDC2003T05 English Gigaword
hasVersion:	LDC2007T07 English Gigaword Third Edition; LDC2009T13 English Gigaword Fourth Edition; LDC2011T07 English Gigaword Fifth Edition
isOutcomeOf:	LDC95T21 North American News Text Corpus; LDC98T30 North American News Text Supplement
hasOutcome:	LDC2007T08 ISI Arabic-English Automatically Extracted Parallel Text; LDC2007T09 ISI Chinese-English Automatically Extracted Parallel Text

Introduction

English Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC) and contains approximately 2.3 billion words of English newswire text that LDC acquired over several years.

This edition includes all of the contents of the first edition, English Gigaword (LDC2003T05), as well as new data from July 2002 through December 2004 from all four sources in the first edition and from a new source, the Central News Agency of Taiwan, English Service. This second edition also adds a three-letter language code to the source abbreviations and makes minor formatting improvements (mostly line-wrapping).

Data

The following table lists the five international sources of English newswire included in this release, along with the number of documents and K-words (thousands of words) from each:

Source	Abbreviation	Documents	K-words
Agence France-Presse, English Service	(afp_eng)	1,202,139	337,792
Associated Press Worldstream, English Service	(apw_eng)	1,975,456	736,518
Central News Agency of Taiwan, English Service	(cna_eng)	57,999	15,039
The New York Times Newswire Service	(nyt_eng)	1,446,256	1,026,533
The Xinhua News Agency, English Service	(xin_eng)	1,017,150	201,346
Totals		5,699,000	2,317,228
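The totals row can be checked against the per-source rows. The following is a minimal Python sketch, not part of the release: the figures are copied from the table above, and the names `SOURCES`, `totals`, and `mean_words_per_doc` are illustrative only.

```python
# Per-source figures copied from the table above: (documents, K-words).
SOURCES = {
    "afp_eng": (1_202_139, 337_792),
    "apw_eng": (1_975_456, 736_518),
    "cna_eng": (57_999, 15_039),
    "nyt_eng": (1_446_256, 1_026_533),
    "xin_eng": (1_017_150, 201_346),
}


def totals(sources):
    """Return (total documents, total K-words) across all sources."""
    docs = sum(d for d, _ in sources.values())
    kwords = sum(k for _, k in sources.values())
    return docs, kwords


def mean_words_per_doc(sources):
    """Return the approximate mean document length in words per source."""
    return {name: round(k * 1000 / d) for name, (d, k) in sources.items()}


if __name__ == "__main__":
    print(totals(SOURCES))
    for name, mean in mean_words_per_doc(SOURCES).items():
        print(name, mean)
```

The sums agree with the totals row (5,699,000 documents and 2,317,228 K-words). The mean lengths are rough, since K-words are rounded to the nearest thousand.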

All of the data is organized into zipped files. All text data is presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. Documents are sorted into four different types:

Story: a report composed of paragraphs and full sentences; most common
Multi: unrelated "blurbs" of several news items
Advis: advisories directed at news editors and not intended for publication/general audience
Other: intended for publication but not paragraphs or sentences; these are things like lists of sports scores, stock prices, temperatures around the world, etc.
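As an illustration of working with the four document types, the sketch below tallies documents by type in a piece of SGML text. It rests on an assumption that this page does not state: that each document opens with a `<DOC>` tag carrying `id` and `type` attributes. The sample text, the document IDs, and the names `DOC_TAG` and `count_types` are hypothetical; check the corpus documentation for the actual markup before relying on this.

```python
import re
from collections import Counter

# ASSUMPTION: each document starts with <DOC id="..." type="...">.
# The attribute names and their order are not given on this page.
DOC_TAG = re.compile(r'<DOC\s+id="([^"]+)"\s+type="([^"]+)"\s*>')

# Hypothetical sample; the IDs and text are made up for illustration.
SAMPLE = """<DOC id="xxx_eng_20040101.0001" type="story">
<TEXT>
<P>
First example paragraph.
</P>
</TEXT>
</DOC>
<DOC id="xxx_eng_20040101.0002" type="advis">
<TEXT>
Example advisory for editors.
</TEXT>
</DOC>
<DOC id="xxx_eng_20040101.0003" type="story">
<TEXT>
<P>
Second example paragraph.
</P>
</TEXT>
</DOC>
"""


def count_types(sgml_text):
    """Count documents per type, based on the assumed <DOC> opening tag."""
    return Counter(doc_type.lower() for _, doc_type in DOC_TAG.findall(sgml_text))


if __name__ == "__main__":
    for doc_type, n in count_types(SAMPLE).items():
        print(doc_type, n)
```

A regular expression is enough here because the markup is described as very simple and minimal; a full SGML parser would be the safer choice if the real files turn out to vary more than this sketch assumes.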

Samples

For an example of the data in this corpus, please view this sample (SGML).

Updates

None at this time.

Copyright

Portions © 1994-1997 and 2001-2004 Agence France-Presse, © 1994-2004 Associated Press, © 1997-2004 Central News Agency of Taiwan, © 1994-2004 New York Times, © 1995-2004 Xinhua News Agency, © 2005 Trustees of the University of Pennsylvania