Tagged Chinese Gigaword Version 2.0

Item Name:	Tagged Chinese Gigaword Version 2.0
Author(s):	Chu-Ren Huang
LDC Catalog No.:	LDC2009T14
ISBN:	1-58563-516-2
ISLRN:	247-043-830-464-8
DOI:	https://doi.org/10.35111/9bhh-2s82
Release Date:	June 18, 2009
Member Year(s):	2009
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	parsing, natural language processing, language modeling, information retrieval, information extraction
Language(s):	Mandarin Chinese, Chinese
Language ID(s):	cmn, zho
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2009T14 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Huang, Chu-Ren. Tagged Chinese Gigaword Version 2.0 LDC2009T14. Web Download. Philadelphia: Linguistic Data Consortium, 2009.
Related Works:	isVersionOf LDC2007T03 Tagged Chinese Gigaword; isAnnotationOf LDC2005T14 Chinese Gigaword Second Edition

Introduction

Tagged Chinese Gigaword Version 2.0, created by scholars at Academia Sinica, Taipei, Taiwan, is a part-of-speech tagged version of LDC's Chinese Gigaword Second Edition (LDC2005T14). Like the original release, Version 2.0 contains all of the data in Chinese Gigaword Second Edition, drawn from Central News Agency, Xinhua News Agency and Lianhe Zaobao, annotated with full part-of-speech tags. In addition, this release removes residual noise found in the original and improves tagging accuracy by incorporating lexica of unknown words. Version 2.0 includes the following changes:

A single-width space is used consistently between two segmented words.
The position of the newline character remains fixed, better reflecting the source files from Chinese Gigaword Second Edition (LDC2005T14).
The original coding of partial Latin letters or Arabic numerals is preserved.
1,192 documents from Central News Agency (Taiwan) and 13 documents from Xinhua News Agency that were missing from the first publication are included.
A set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality of very large corpora is incorporated.
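
The consistent single-space separator makes tokenization a plain split. The following is a minimal sketch only: it assumes each token is written as word(TAG), which this page does not state, so the token shape, the function name parse_tagged_line and the sample line are all illustrative assumptions to be checked against the corpus documentation.

```python
import re

# ASSUMPTION: tokens look like word(TAG); this page does not specify the format.
TOKEN = re.compile(r"^(.+)\(([A-Za-z0-9_]+)\)$")

def parse_tagged_line(line):
    """Split a line on single spaces and return (word, tag) pairs.

    Tokens that do not match the assumed word(TAG) shape are returned
    with tag None instead of being dropped.
    """
    pairs = []
    for token in line.rstrip("\n").split(" "):
        if not token:
            continue  # skip empty strings produced by stray spaces
        match = TOKEN.match(token)
        if match:
            pairs.append((match.group(1), match.group(2)))
        else:
            pairs.append((token, None))
    return pairs

# Invented sample line, not taken from the corpus.
print(parse_tagged_line("台北(Nc) 是(SHI) 城市(Na)"))
```

Keeping unmatched tokens with a None tag is a deliberate choice here, so that markup lines or unexpected tokens remain visible to the caller.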

Documents in the corpus were assigned one of the following categories:

story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
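
To see how the four categories are distributed in a file, one could count the type attribute on each DOC header. The sketch below assumes the header follows the usual Gigaword shape, <DOC id="..." type="...">; that shape, the function name count_doc_types and the sample text are assumptions for illustration and should be verified against the corpus documentation.

```python
import re
from collections import Counter

# ASSUMPTION: headers follow the Gigaword convention <DOC id="..." type="...">.
DOC_HEADER = re.compile(
    r'<DOC\s+id="([^"]+)"\s+type="(story|multi|advis|other)"\s*>'
)

def count_doc_types(text):
    """Count documents per category in a chunk of corpus text."""
    return Counter(match.group(2) for match in DOC_HEADER.finditer(text))

# Hypothetical sample; the IDs and bodies are invented placeholders.
sample = '''<DOC id="XXX_CMN_00000000.0001" type="story">
...
</DOC>
<DOC id="XXX_CMN_00000000.0002" type="advis">
...
</DOC>
'''

print(count_doc_types(sample))
```

A filter that keeps only story documents is a common first step for language modeling, since the other three types contain little running prose.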

Data

Basic statistics of data from each source are summarized below.

Source	No. Files	Compressed Size(MB)	Total Size(MB)	No. Words(thousands)	No. Documents
CNA_CMN	168	1520	6136	501456	1769953
XIN_CMN	168	898	3755	311660	992261
ZBN_CMN	10	55	214	18632	41418
TOTAL	346	2473	10105	831748	2803632
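
As a quick consistency check, the per-source rows can be summed and compared with the TOTAL row, and an average document length derived from the word and document counts. This is a small sketch using only the figures printed in the table; the variable and function names are illustrative.

```python
# Columns: files, compressed MB, total MB, words (thousands), documents.
rows = {
    "CNA_CMN": (168, 1520, 6136, 501456, 1769953),
    "XIN_CMN": (168, 898, 3755, 311660, 992261),
    "ZBN_CMN": (10, 55, 214, 18632, 41418),
}
total = (346, 2473, 10105, 831748, 2803632)

# Sum each column across the three sources and compare with the TOTAL row.
column_sums = tuple(sum(column) for column in zip(*rows.values()))
assert column_sums == total

def words_per_document(thousand_words, documents):
    """Average words per document, given a word count in thousands."""
    return thousand_words * 1000 / documents

for source, (_, _, _, kwords, docs) in rows.items():
    print(f"{source}: {words_per_document(kwords, docs):.0f} words per document")
```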

The POS tags and their corresponding explanations are listed below:

Tag	Explanation_Chinese	Explanation_English
A	非謂形容詞	Non-predicative adjective
Caa	對等連接詞，如：和、跟	Conjunctive Conjunction
Cab	連接詞，如：等等	Conjunction, e.g. deng3deng3
Cba	連接詞，如：的話	Conjunction, e.g. de5hua4
Cbb	關聯連接詞	Correlative Conjunction
D	副詞	Adverb
Da	數量副詞	Quantitative Adverb
DE	的, 之, 得, 地	Particle DE and its functional equivalents
Dfa	動詞前程度副詞	Pre-verbal Adverb of degree
Dfb	動詞後程度副詞	Post-verbal Adverb of degree
Di	時態標記	Aspectual Adverb
Dk	句副詞	Sentential Adverb
FW	外文標記	Foreign Word
I	感嘆詞	Interjection
Na	普通名詞	Common Noun
Nb	專有名稱	Proper Noun
Nc	地方詞	Place Noun
Ncd	位置詞	Localizer
Nd	時間詞	Time Noun
Nep	指代定詞	Demonstrative Determinatives
Neqa	數量定詞	Quantitative Determinatives
Neqb	後置數量定詞	Post-quantitative Determinatives
Nes	特指定詞	Specific Determinatives
Neu	數詞定詞	Numeral Determinatives
Nf	量詞	Measure
Ng	後置詞	Postposition
Nh	代名詞	Pronoun
P	介詞	Preposition
SHI	是	shi4 (to be)
T	語助詞	Particle
VA	動作不及物動詞	Active Intransitive Verb
VAC	動作使動動詞	Active Causative Verb
VB	動作類及物動詞	Active Pseudo-transitive Verb
VC	動作及物動詞	Active Transitive Verb
VCL	動作接地方賓語動詞	Active Verb with a Locative Object
VD	雙賓動詞	Ditransitive Verb
VE	動作句賓動詞	Active Verb with a Sentential Object
VF	動作謂賓動詞	Active Verb with a Verbal Object
VG	分類動詞	Classificatory Verb
VH	狀態不及物動詞	Stative Intransitive Verb
VHC	狀態使動動詞	Stative Causative Verb
VI	狀態類及物動詞	Stative Pseudo-transitive Verb
VJ	狀態及物動詞	Stative Transitive Verb
VK	狀態句賓動詞	Stative Verb with a Sentential Object
VL	狀態謂賓動詞	Stative Verb with a Verbal Object
V_2	有	you3 (to have)
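
For tasks that need only broad word classes, the fine-grained tags can be collapsed by prefix. The grouping below is an illustrative assumption, not part of the tagset definition: it treats all N tags as nominal, all V tags plus SHI and V_2 as verbs, and DE and T as particles. The names EXACT, PREFIX and coarse_category are hypothetical.

```python
# ASSUMPTION: this coarse grouping is for illustration only.
EXACT = {
    "A": "adjective",
    "DE": "particle",
    "FW": "foreign word",
    "I": "interjection",
    "P": "preposition",
    "SHI": "verb",
    "T": "particle",
    "V_2": "verb",
}
PREFIX = {
    "C": "conjunction",
    "D": "adverb",
    "N": "nominal",
    "V": "verb",
}

def coarse_category(tag):
    """Map a fine-grained tag to a coarse class; exact matches win over prefixes."""
    if tag in EXACT:
        return EXACT[tag]
    return PREFIX.get(tag[:1], "unknown")

for tag in ("Na", "VCL", "DE", "Dfa", "Cbb", "SHI"):
    print(tag, "->", coarse_category(tag))
```

Checking exact matches first matters for DE, which would otherwise fall into the adverb group through its D prefix.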

Since neither manual checking nor automatic checking against a gold standard is feasible for gigaword-size corpora, the authors proposed a quality assurance method for automatic annotation of very large corpora based on two heterogeneous tagging systems, CKIP and ICTCLAS (Huang et al., 2008). By comparing against word lists generated from an ICTCLAS-tagged version of the Xinhua portion of Chinese Gigaword, they proposed a set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality. Randomly selected texts were then manually checked to evaluate the effect of these out-of-vocabulary dictionaries. The results indicate that 30,562 of the 31,421 tested words (about 97.3%) were correct. The quality control results follow:

Corpora	Thousands of words	No. Test words	No. Correct Words
CNA	501459	42,695	41,449
XIN	311718	28,744	27,967
ZBN	18632	22,825	22,270
Total	831809	31,421	30,562
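
The accuracy behind each row is simply correct words divided by tested words. The sketch below works this out from the figures as printed, taking the Total row as given rather than recomputing it from the rows above; the names qa_rows and accuracy are illustrative.

```python
# (tested words, correct words) for each row, copied from the table.
qa_rows = {
    "CNA": (42695, 41449),
    "XIN": (28744, 27967),
    "ZBN": (22825, 22270),
    "Total": (31421, 30562),
}

def accuracy(tested, correct):
    """Fraction of tested words judged correct."""
    return correct / tested

for name, (tested, correct) in qa_rows.items():
    print(f"{name}: {accuracy(tested, correct):.1%}")
```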

Samples

Please view this sample.

Copyright

Portions © 2005-2009 Academia Sinica, © 1991-1994 Central News Agency (Taiwan), © 2000-2003 SPH AsiaOne, Ltd., © 1990-2004 Xinhua News Agency, © 2005, 2007, 2009 Trustees of the University of Pennsylvania
