Chinese Gigaword Second Edition
Item Name: | Chinese Gigaword Second Edition |
Author(s): | David Graff, Ke Chen, Junbo Kong, Kazuaki Maeda |
LDC Catalog No.: | LDC2005T14 |
ISBN: | 1-58563-353-4 |
ISLRN: | 292-607-460-859-8 |
DOI: | https://doi.org/10.35111/vr0r-sb06 |
Release Date: | August 17, 2005 |
Member Year(s): | 2005 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Project(s): | EARS, GALE, TIDES |
Application(s): | information retrieval, language modeling, natural language processing |
Language(s): | Mandarin Chinese |
Language ID(s): | cmn |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2005T14 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Graff, David, et al. Chinese Gigaword Second Edition LDC2005T14. Web Download. Philadelphia: Linguistic Data Consortium, 2005. |
Introduction
Chinese Gigaword Second Edition was produced by the Linguistic Data Consortium (LDC). It is a comprehensive archive of Chinese newswire text, totalling approximately 1.3 billion words, that LDC acquired over several years.
This edition includes all of the content of the first release, Chinese Gigaword (LDC2003T09), plus data collected after that release: Xinhua from October 2002 through December 2004 and CNA from January 2003 through December 2004. It also adds a limited number of articles from a new newspaper source, Lianhe Zaobao.
Data
The table below lists the three distinct international sources of Chinese newswire in this edition, with the number of documents and K-words (thousands of words) for each:
Source | Abbreviation | Documents | K-words |
Central News Agency, Taiwan | (cna_cmn) | 1,769,952 | 792,195 |
Xinhua News Agency | (xin_cmn) | 992,261 | 471,110 |
Zaobao Newspaper | (zbn_cmn) | 41,418 | 28,066 |
Totals | | 2,803,632 | 1,291,371 |
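To make the relative size of each source easier to see, the K-word column can be turned into shares of the total. The following is a minimal sketch, not part of the corpus tooling; the counts are copied from the table, and the names `K_WORDS` and `kword_shares` are made up for this illustration.

```python
# K-word counts (thousands of words) copied from the table above.
# The dictionary and function names are illustrative, not part of the corpus.
K_WORDS = {
    "cna_cmn": 792_195,
    "xin_cmn": 471_110,
    "zbn_cmn": 28_066,
}


def kword_shares(counts):
    """Return each source's fraction of the summed K-word count."""
    total = sum(counts.values())
    return {source: n / total for source, n in counts.items()}


if __name__ == "__main__":
    for source, share in kword_shares(K_WORDS).items():
        print(f"{source}: {share:.1%}")
```

Under these figures CNA accounts for roughly three fifths of the text, Xinhua for a little over a third, and Zaobao for the small remainder.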
The seven-character abbreviations shown above represent both the source name and the language ID ("cmn" for Mandarin Chinese). The files are presented in zipped format; each contains SGML-formatted text with multiple documents. Documents fall into four categories:
- Story: a report composed of paragraphs and full sentences; most common
- Multi: several unrelated news "blurbs" collected in a single document
- Advis: advisories directed at news editors and not intended for publication/general audience
- Other: intended for publication but not composed of paragraphs or sentences, such as lists of sports scores, stock prices, or temperatures around the world
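Because each file holds many documents of different categories, a common first step is to tally documents by category. The sketch below is illustrative only. It assumes that each document opens with a tag of the form `<DOC id="..." type="story">`, which this page does not spell out, so check the markup against the corpus documentation. The document IDs in `SAMPLE` are hypothetical, and decompressing the files is left out.

```python
import re
from collections import Counter

# Assumption: every document starts with an opening tag carrying a
# type attribute, e.g. <DOC id="..." type="story">. Verify against the
# corpus documentation before relying on this pattern.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"', re.IGNORECASE)


def count_doc_types(sgml_text):
    """Count documents per category in one block of SGML text."""
    return Counter(m.group(1).lower() for m in DOC_OPEN.finditer(sgml_text))


# Hypothetical sample; IDs and body text are placeholders, not corpus data.
SAMPLE = """<DOC id="EXAMPLE_0001" type="story">
<TEXT>placeholder</TEXT>
</DOC>
<DOC id="EXAMPLE_0002" type="advis">
<TEXT>placeholder</TEXT>
</DOC>
<DOC id="EXAMPLE_0003" type="story">
<TEXT>placeholder</TEXT>
</DOC>
"""

if __name__ == "__main__":
    for doc_type, n in count_doc_types(SAMPLE).most_common():
        print(doc_type, n)
```

A regular expression is used here instead of an XML parser because SGML files of this kind are not guaranteed to be well-formed XML; filtering to `story` documents, for example, then only requires checking the captured category.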
Samples
For an example of the data in this corpus, please view this sample (SGML).
Updates
None at this time.