English Gigaword Second Edition
|Item Name:||English Gigaword Second Edition|
|Author(s):||David Graff, Junbo Kong, Ke Chen, Kazuaki Maeda|
|LDC Catalog No.:||LDC2005T12|
|Release Date:||July 15, 2005|
|Project(s):||EARS, GALE, TIDES|
|Application(s):||information retrieval, language modeling, natural language processing|
LDC User Agreement for Non-Members
|Online Documentation:||LDC2005T12 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Graff, David, et al. English Gigaword Second Edition LDC2005T12. Web Download. Philadelphia: Linguistic Data Consortium, 2005.|
English Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC) and contains approximately 2.3 billion words of English newswire text that LDC acquired over several years.
This edition includes all of the contents of the first edition, English Gigaword (LDC2003T05), as well as new data from July 2002 through December 2004 from all four sources in the first edition and from a new source, the Central News Agency of Taiwan, English Service. This second edition also adds a three-letter language code to the source abbreviations and makes minor formatting improvements (mostly line-wrapping).
The following table lists the five distinct international sources of English newswire included in this release, with the size of each in documents and K-words (thousands of words):
|Source||Abbreviation||Documents||K-words|
|Agence France-Presse, English Service||(afp_eng)||1,202,139||337,792|
|Associated Press Worldstream, English Service||(apw_eng)||1,975,456||736,518|
|Central News Agency of Taiwan, English Service||(cna_eng)||57,999||15,039|
|The New York Times Newswire Service||(nyt_eng)||1,446,256||1,026,533|
|The Xinhua News Agency, English Service||(xin_eng)||1,017,150||201,346|
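As a quick consistency check, the per-source figures in the table can be summed and compared with the "approximately 2.3 billion words" stated above. The following is a minimal sketch; the dictionary layout and the `totals` helper are illustrative names, not part of the corpus distribution.

```python
# (documents, K-words) per source, copied from the table above.
SOURCES = {
    "afp_eng": (1_202_139, 337_792),
    "apw_eng": (1_975_456, 736_518),
    "cna_eng": (57_999, 15_039),
    "nyt_eng": (1_446_256, 1_026_533),
    "xin_eng": (1_017_150, 201_346),
}


def totals(sources):
    """Return (total documents, total K-words) across all sources."""
    docs = sum(d for d, _ in sources.values())
    kwords = sum(k for _, k in sources.values())
    return docs, kwords


total_docs, total_kwords = totals(SOURCES)
# K-words are thousands of words, so multiply by 1,000 for a word count.
print(f"{total_docs:,} documents, about {total_kwords * 1000:,} words")
```

The sums come to 5,699,000 documents and 2,317,228 K-words, which agrees with the rounded figure of 2.3 billion words.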
All data are organized into zipped files. The text is presented in SGML form, using a very simple, minimal markup structure; all text consists of printable ASCII and whitespace. Documents are sorted into four different types:
- Story: a report composed of paragraphs and full sentences; most common
- Multi: a collection of unrelated "blurbs" covering several news items
- Advis: advisories directed at news editors and not intended for publication/general audience
- Other: intended for publication but not paragraphs or sentences; these are things like lists of sports scores, stock prices, temperatures around the world, etc.
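To illustrate how the four document types might be tallied, here is a minimal sketch that counts documents by type in a string of SGML. It assumes each document opens with a tag of the form `<DOC id="..." type="story">`; that tag shape, the placeholder ids, and the `count_doc_types` helper are assumptions made for illustration, so check the corpus documentation and the sample file for the actual markup before relying on it.

```python
import re
from collections import Counter

# Assumed shape of the opening tag: <DOC id="..." type="story">.
# This is an assumption for illustration; verify against the corpus docs.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"', re.IGNORECASE)


def count_doc_types(sgml_text):
    """Count documents per type attribute found in opening DOC tags."""
    return Counter(m.group(1).lower() for m in DOC_OPEN.finditer(sgml_text))


# Hypothetical two-document input with placeholder ids and text.
SAMPLE = """<DOC id="EXAMPLE_ENG_20040101.0001" type="story">
<TEXT>
<P>
Placeholder paragraph.
</P>
</TEXT>
</DOC>
<DOC id="EXAMPLE_ENG_20040101.0002" type="advis">
<TEXT>
Placeholder advisory.
</TEXT>
</DOC>
"""

counts = count_doc_types(SAMPLE)
print(counts["story"], counts["advis"])
```

A regular expression is enough here because only the opening tags are inspected; a task that needs the document bodies would call for a proper SGML-aware reader.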
For an example of the data in this corpus, please view this sample (SGML).