English Gigaword Third Edition


Item Name: English Gigaword Third Edition
Authors: David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda
LDC Catalog No.: LDC2007T07
ISBN: 1-58563-416-6
Release Date: May 17, 2007
Data Type: text
Data Source(s): newswire
Project(s): GALE
Application(s): information retrieval, language modeling, natural language processing
Language(s): English
Language ID(s): eng
Distribution: 2 DVD
Member fee: $0 for 2007 members
Non-member Fee: US $4000.00
Reduced-License Fee: US $2000.00
Extra-Copy Fee: US $400.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: David Graff, Junbo Kong, Ke Chen, and Kazuaki Maeda
2007
English Gigaword Third Edition
Linguistic Data Consortium, Philadelphia

Introduction

The English Gigaword Corpus is a comprehensive archive of newswire text data that has been acquired over several years by the Linguistic Data Consortium (LDC) at the University of Pennsylvania. This is the third edition of the English Gigaword Corpus.

This edition includes all of the contents of the previous edition (LDC2005T12), as well as new data from the same five sources presented there, covering the 24-month period from January 2005 through December 2006. In addition, a sixth data source (the Los Angeles Times/Washington Post newswire service) has been added in this edition.

The six distinct international sources of English newswire included in this edition are the following:

Agence France-Presse, English Service (afp_eng)
Associated Press Worldstream, English Service (apw_eng)
Central News Agency of Taiwan, English Service (cna_eng)
Los Angeles Times/Washington Post Newswire Service (ltw_eng)
New York Times Newswire Service (nyt_eng)
Xinhua News Agency, English Service (xin_eng)

The seven-character codes in the parentheses above consist of a three-character source name abbreviation and the three-character language code ("eng"), separated by an underscore ("_") character. The three-letter language code conforms to LDC's internal convention based on the new ISO 639-3 standard.

The seven-character codes are used both in the names of the directories where the data files are found and in the prefix that appears at the beginning of every data file name.
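
As a rough illustration of the naming convention described above, the following Python sketch splits a directory or file name into its source abbreviation and language code. The function name split_source_code and the sample file name are hypothetical; only the six codes and the "source_language" prefix layout come from this description, and the remainder of the file naming scheme is not assumed.

```python
import re

# The six source codes listed in this description.
SOURCE_CODES = ["afp_eng", "apw_eng", "cna_eng", "ltw_eng", "nyt_eng", "xin_eng"]

# Three lowercase letters, an underscore, three lowercase letters, at the start of a name.
CODE_PATTERN = re.compile(r"^([a-z]{3})_([a-z]{3})")


def split_source_code(name):
    """Return (source, language) for a name that begins with a source code, else None."""
    match = CODE_PATTERN.match(name)
    if match is None:
        return None
    return match.group(1), match.group(2)


if __name__ == "__main__":
    for code in SOURCE_CODES:
        print(code, split_source_code(code))
    # Hypothetical file name: only the seven-character prefix is described in the text.
    print(split_source_code("nyt_eng_EXAMPLE"))
```

Because the same code appears in both the directory name and the file name prefix, a helper like this could be applied to either.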

As with other Gigaword releases, some of the content in this corpus has been published previously by the LDC in a variety of other, older corpora, particularly the North American News text corpora, the various TDT corpora, and the AQUAINT text corpus, as well as earlier editions of Gigaword English.

New in the Third Edition

  • New newswire data from January 2005 to December 2006 have been added for all five newswire sources that were represented in the previous edition.
  • A new source, the Los Angeles Times/Washington Post newswire service, has been added.
  • A small handful of corrections to older APW data have been made to remove a few non-English stories, clean up some character "noise", and rectify the encoding for a few non-ASCII characters.
  • The CNA content introduced in Gigaword English 2nd Edition has been completely updated to repair data corruptions caused by occasional character encoding problems; as a result of the update, there may be differences in the inventory and/or ID strings of DOC elements in this portion of the corpus, relative to the previous edition. (The nature of encoding problems is explained below under "SOURCE SPECIFIC PROPERTIES".)
  • Many of the files (141 out of 722) include a small number of UTF-8 "wide" characters, typically accented letters found in proper names and borrowed words (some sources also use special punctuation marks, non-breaking spaces, etc.).
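
Users who need to know which non-ASCII characters occur in a given file could tally them with something like the Python sketch below. This is only an illustration: the function name non_ascii_counts and the sample string are made up, and the sketch assumes the text has already been read and decoded as UTF-8.

```python
from collections import Counter


def non_ascii_counts(text):
    """Count each character that falls outside the 7-bit ASCII range."""
    return Counter(ch for ch in text if ord(ch) > 127)


if __name__ == "__main__":
    # Made-up sample text, not taken from the corpus.
    sample = "S\u00e3o Paulo and M\u00e1laga"
    for char, count in non_ascii_counts(sample).items():
        print("U+%04X" % ord(char), repr(char), count)
```

Accented letters, special punctuation marks, and non-breaking spaces would all be reported by this check, since each lies outside the ASCII range.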

Apart from the replacement/update of all CNA files, the data content of the 2nd edition has been included in the present release without modification.

Samples

For an example of the data in this corpus, please review this text file.

Update

The New York Times newswire text archive in this corpus contains some articles in Spanish. A scan of the 149 monthly data files under "nyt_eng" yielded 2517 DOC elements with the 'type="story"' attribute where the story content was in Spanish.

The scan also disclosed 421 DOC elements with the 'type="story"' attribute where the text content was in fact not a news story.

Two files added to the online documentation for this corpus identify those occurrences.
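
A scan of this kind might begin by tallying DOC elements by their type attribute, as in the Python sketch below. It assumes that each DOC element opens with a tag of the form <DOC id="..." type="...">. That tag layout, the function name count_doc_types, the sample IDs, and the "other" type value are illustrative assumptions rather than details confirmed here. Deciding whether a story is in Spanish or is not a news story requires further checks that this sketch does not attempt.

```python
import re
from collections import Counter

# Assumed layout of an opening DOC tag: <DOC ... type="VALUE" ...>
DOC_TAG = re.compile(r'<DOC\b[^>]*\btype="([^"]*)"[^>]*>')


def count_doc_types(text):
    """Tally opening DOC tags by the value of their type attribute."""
    return Counter(DOC_TAG.findall(text))


if __name__ == "__main__":
    # Made-up sample with hypothetical IDs; not taken from the corpus.
    sample = (
        '<DOC id="EXAMPLE_0001" type="story">\n'
        "text of a story\n"
        "</DOC>\n"
        '<DOC id="EXAMPLE_0002" type="other">\n'
        "text of another document\n"
        "</DOC>\n"
    )
    print(count_doc_types(sample))
```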

Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Content Copyright

Portions © 1994-2006 Agence France Presse, © 1994-2006 The Associated Press, © 1997-2006 Central News Agency (Taiwan), © 1994-1998, 2003-2006 Los Angeles Times-Washington Post News Service, Inc., © 1994-2006 New York Times, © 1995-2006 Xinhua News Agency, © 2007 Trustees of the University of Pennsylvania