Chinese Gigaword Second Edition

Item Name:	Chinese Gigaword Second Edition
Author(s):	David Graff, Ke Chen, Junbo Kong, Kazuaki Maeda
LDC Catalog No.:	LDC2005T14
ISBN:	1-58563-353-4
ISLRN:	292-607-460-859-8
DOI:	https://doi.org/10.35111/vr0r-sb06
Release Date:	August 17, 2005
Member Year(s):	2005
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	EARS, GALE, TIDES
Application(s):	information retrieval, language modeling, natural language processing
Language(s):	Mandarin Chinese
Language ID(s):	cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2005T14 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Graff, David, et al. Chinese Gigaword Second Edition LDC2005T14. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:
isVersionOf:	LDC2003T09 Chinese Gigaword
hasVersion:	LDC2007T38 Chinese Gigaword Third Edition; LDC2009T27 Chinese Gigaword Fourth Edition; LDC2011T13 Chinese Gigaword Fifth Edition
hasAnnotation:	LDC2007T03 Tagged Chinese Gigaword; LDC2009T14 Tagged Chinese Gigaword Version 2.0
isOutcomeOf:	LDC95T13 Mandarin Chinese News Text
hasOutcome:	LDC2007T09 ISI Chinese-English Automatically Extracted Parallel Text; LDC2010T12 NIST 2004 Open Machine Translation (OpenMT) Evaluation; LDC2010T14 NIST 2005 Open Machine Translation (OpenMT) Evaluation

Introduction

Chinese Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC). It is a comprehensive archive of Chinese newswire text, totalling approximately 1.3 billion words, that LDC acquired over several years.

This edition includes all of the contents in the first release, Chinese Gigaword (LDC2003T09), as well as new data collected after the publication of the first edition, specifically Xinhua from October 2002 through December 2004 and CNA from January 2003 through December 2004. Also, a limited number of articles from a new newspaper source (Lianhe Zaobao) have been added in this edition.

Data

The table below lists the three distinct international sources of Chinese newswire included in this edition, with the number of documents and K-words (thousands of words) for each:

Source	Abbreviation	Documents	K-words
Central News Agency, Taiwan	(cna_cmn)	1,769,952	792,195
Xinhua News Agency	(xin_cmn)	992,261	471,110
Zaobao Newspaper	(zbn_cmn)	41,418	28,066
Totals		2,803,632	1,291,371
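
As a quick arithmetic check on the K-words column, the short Python sketch below sums the per-source figures and computes each source's share of the total. The figures are copied from the table. The variable names are illustrative only and are not part of the corpus.

```python
# K-words (thousands of words) per source, copied from the table above.
k_words = {
    "cna_cmn": 792_195,
    "xin_cmn": 471_110,
    "zbn_cmn": 28_066,
}

# Sum the column and express each source as a fraction of the total.
total_k = sum(k_words.values())
shares = {src: kw / total_k for src, kw in k_words.items()}

print(f"total: {total_k:,} K-words")
for src, share in shares.items():
    print(f"{src}: {share:.1%} of all K-words")
```

The sum comes to 1,291,371 K-words, which matches the Totals row and the "approximately 1.3 billion words" stated in the introduction.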

The seven-character abbreviations shown above represent both the source name and the language ID ("cmn" for Mandarin Chinese). The files are distributed in zipped format and contain SGML-formatted text, with multiple documents per file. Documents fall into four categories:

Story: a report composed of paragraphs and full sentences; most common
Multi: a series of unrelated "blurbs", each covering a different news item
Advis: advisories directed at news editors and not intended for publication to a general audience
Other: intended for publication but not paragraphs or sentences; these are things like lists of sports scores, stock prices, temperatures around the world, etc.
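
The sketch below shows one way to tally documents by category in the text of a single file. It rests on an assumption not stated above: that each document opens with a `<DOC ...>` tag carrying a `type` attribute whose value names the category. The tag name, the attribute name, the helper `count_doc_types`, and the sample text are all illustrative, so check them against the corpus documentation before relying on this.

```python
import re
from collections import Counter

# ASSUMPTION: each document starts with a <DOC ...> tag that has a
# type="..." attribute. This layout is not described in the text above.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"', re.IGNORECASE)


def count_doc_types(sgml_text: str) -> Counter:
    """Count documents per category in the text of one SGML file."""
    return Counter(t.lower() for t in DOC_OPEN.findall(sgml_text))


# Made-up sample for illustration only; not taken from the corpus.
sample = """<DOC id="example_0001" type="story">
<TEXT>
placeholder
</TEXT>
</DOC>
<DOC id="example_0002" type="advis">
<TEXT>
placeholder
</TEXT>
</DOC>
<DOC id="example_0003" type="story">
<TEXT>
placeholder
</TEXT>
</DOC>
"""

counts = count_doc_types(sample)
print(counts)
```

A regular expression is used here instead of an XML parser because SGML files that hold many documents generally have no single root element. A tally like this makes it straightforward to keep only the Story documents, for example when building a language model.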
