Tagged Chinese Gigaword Version 2.0
Item Name: | Tagged Chinese Gigaword Version 2.0 |
Author(s): | Chu-Ren Huang |
LDC Catalog No.: | LDC2009T14 |
ISBN: | 1-58563-516-2 |
ISLRN: | 247-043-830-464-8 |
DOI: | https://doi.org/10.35111/9bhh-2s82 |
Release Date: | June 18, 2009 |
Member Year(s): | 2009 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Application(s): | parsing, natural language processing, language modeling, information retrieval, information extraction |
Language(s): | Mandarin Chinese, Chinese |
Language ID(s): | cmn, zho |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2009T14 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Huang, Chu-Ren. Tagged Chinese Gigaword Version 2.0 LDC2009T14. Web Download. Philadelphia: Linguistic Data Consortium, 2009. |
Related Works: | View |
Introduction
Tagged Chinese Gigaword Version 2.0, created by scholars at Academia Sinica, Taipei, Taiwan, is a part-of-speech tagged version of LDC's Chinese Gigaword Second Edition (LDC2005T14). Like the original release, Version 2.0 contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency, Xinhua News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags. In addition, this new release removes residual noise present in the original and improves tagging accuracy by incorporating lexica of unknown words. The changes represented in Version 2.0 include the following:
- A single-width space is used consistently between two segmented words.
- The position of the newline character remains fixed, better reflecting the source files from Chinese Gigaword Second Edition (LDC2005T14).
- The original coding of partial Latin letters or Arabic numerals is preserved.
- 1,192 documents from Central News Agency (Taiwan) and 13 documents from Xinhua News Agency that were missing from the first publication are included.
- A set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality of very large corpora is incorporated.
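Because a single space consistently separates segmented words, a line of the corpus can be tokenized with a plain split. The sketch below illustrates this under one assumption that is not stated on this page: that each token is written as the surface form followed by its tag in parentheses, for example `是(SHI)`. The function name `parse_tagged_line` and the sample line are hypothetical; check the corpus documentation for the actual token layout before relying on it.

```python
import re

# ASSUMPTION: each token looks like "surface(TAG)", e.g. "是(SHI)".
# This layout is not confirmed by the catalog page; adjust as needed.
TOKEN_RE = re.compile(r"^(?P<word>.+)\((?P<tag>[A-Za-z_0-9]+)\)$")


def parse_tagged_line(line):
    """Split one line on single spaces and return (word, tag) pairs.

    Tokens that do not match the assumed layout are kept with tag None
    so that they stay visible instead of being dropped silently.
    """
    pairs = []
    for token in line.rstrip("\n").split(" "):
        if not token:
            continue  # tolerate an unexpected empty field
        match = TOKEN_RE.match(token)
        if match:
            pairs.append((match.group("word"), match.group("tag")))
        else:
            pairs.append((token, None))
    return pairs


if __name__ == "__main__":
    # Hypothetical sample line, not taken from the corpus.
    print(parse_tagged_line("這(Nep) 是(SHI) 新聞(Na)\n"))
```

Keeping the newline positions untouched, as the release does, means line numbers in the tagged files can be compared directly with the untagged source files.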
Documents in the corpus were assigned one of the following categories:
- story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
- multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
- advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
- other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
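The four categories lend themselves to a quick tally, for instance to keep only `story` documents for language modeling. The sketch below assumes that each document opens with an SGML-style tag carrying a `type` attribute, such as `<DOC id="..." type="story">`; that markup is an assumption here, not something this page specifies. The function name `count_doc_types` and the sample text are hypothetical.

```python
import re
from collections import Counter

# The four categories listed above.
DOC_TYPES = ("story", "multi", "advis", "other")

# ASSUMPTION: documents open with a tag like <DOC id="..." type="story">.
DOC_OPEN_RE = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"')


def count_doc_types(text):
    """Count documents per category in a chunk of corpus text."""
    counts = Counter({doc_type: 0 for doc_type in DOC_TYPES})
    for doc_type in DOC_OPEN_RE.findall(text):
        counts[doc_type] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical sample, not taken from the corpus.
    sample = (
        '<DOC id="X_0001" type="story">\n</DOC>\n'
        '<DOC id="X_0002" type="multi">\n</DOC>\n'
        '<DOC id="X_0003" type="story">\n</DOC>\n'
    )
    print(count_doc_types(sample))
```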
Data
Basic statistics of data from each source are summarized below.
Source | No. Files | Compressed Size (MB) | Total Size (MB) | No. Words (thousands) | No. Documents |
CNA_CMN | 168 | 1520 | 6136 | 501456 | 1769953 |
XIN_CMN | 168 | 898 | 3755 | 311660 | 992261 |
ZBN_CMN | 10 | 55 | 214 | 18632 | 41418 |
TOTAL | 346 | 2473 | 10105 | 831748 | 2803632 |
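The per-source figures can be cross-checked against the TOTAL row and used to derive a few planning numbers, such as average document length and how much disk space the uncompressed data needs. The following is a minimal sketch using only the values from the table above; the function names are illustrative.

```python
# Per-source figures copied from the table above:
# (files, compressed MB, total MB, words in thousands, documents)
STATS = {
    "CNA_CMN": (168, 1520, 6136, 501456, 1769953),
    "XIN_CMN": (168, 898, 3755, 311660, 992261),
    "ZBN_CMN": (10, 55, 214, 18632, 41418),
}


def column_totals(stats):
    """Sum every column across sources."""
    return tuple(sum(column) for column in zip(*stats.values()))


def words_per_document(row):
    """Average number of words per document for one table row."""
    _, _, _, kilo_words, documents = row
    return kilo_words * 1000 / documents


def compression_ratio(row):
    """Uncompressed size divided by compressed size for one table row."""
    _, compressed_mb, total_mb, _, _ = row
    return total_mb / compressed_mb


if __name__ == "__main__":
    rows = dict(STATS, TOTAL=column_totals(STATS))
    for source, row in rows.items():
        print(source, round(words_per_document(row)),
              round(compression_ratio(row), 2))
```

The column sums computed this way match the TOTAL row of the table.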
The POS tags and their corresponding explanations are listed below:
Tag | Explanation_Chinese | Explanation_English |
A | 非謂形容詞 | Non-predicative adjective |
Caa | 對等連接詞,如:和、跟 | Conjunctive conjunction |
Cab | 連接詞,如:等等 | Conjunction, e.g. deng3deng3 |
Cba | 連接詞,如:的話 | Conjunction, e.g. de5hua4 |
Cbb | 關聯連接詞 | Correlative Conjunction |
D | 副詞 | Adverb |
Da | 數量副詞 | Quantitative Adverb |
DE | 的, 之, 得, 地 | Particle DE and its functional equivalents |
Dfa | 動詞前程度副詞 | Pre-verbal Adverb of degree |
Dfb | 動詞後程度副詞 | Post-verbal Adverb of degree |
Di | 時態標記 | Aspectual Adverb |
Dk | 句副詞 | Sentential Adverb |
FW | 外文標記 | Foreign Word |
I | 感嘆詞 | Interjection |
Na | 普通名詞 | Common Noun |
Nb | 專有名稱 | Proper Noun |
Nc | 地方詞 | Place Noun |
Ncd | 位置詞 | Localizer |
Nd | 時間詞 | Time Noun |
Nep | 指代定詞 | Demonstrative Determinatives |
Neqa | 數量定詞 | Quantitative Determinatives |
Neqb | 後置數量定詞 | Post-quantitative Determinatives |
Nes | 特指定詞 | Specific Determinatives |
Neu | 數詞定詞 | Numeral Determinatives |
Nf | 量詞 | Measure |
Ng | 後置詞 | Postposition |
Nh | 代名詞 | Pronoun |
P | 介詞 | Preposition |
SHI | 是 | shi4 (to be) |
T | 語助詞 | Particle |
VA | 動作不及物動詞 | Active Intransitive Verb |
VAC | 動作使動動詞 | Active Causative Verb |
VB | 動作類及物動詞 | Active Pseudo-transitive Verb |
VC | 動作及物動詞 | Active Transitive Verb |
VCL | 動作接地方賓語動詞 | Active Verb with a Locative Object |
VD | 雙賓動詞 | Ditransitive Verb |
VE | 動作句賓動詞 | Active Verb with a Sentential Object |
VF | 動作謂賓動詞 | Active Verb with a Verbal Object |
VG | 分類動詞 | Classificatory Verb |
VH | 狀態不及物動詞 | Stative Intransitive Verb |
VHC | 狀態使動動詞 | Stative Causative Verb |
VI | 狀態類及物動詞 | Stative Pseudo-transitive Verb |
VJ | 狀態及物動詞 | Stative Transitive Verb |
VK | 狀態句賓動詞 | Stative Verb with a Sentential Object |
VL | 狀態謂賓動詞 | Stative Verb with a Verbal Object |
V_2 | 有 | you3 (to have) |
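Applications that do not need the full tag inventory often collapse it into a handful of broad classes. The sketch below shows one possible grouping that follows the naming pattern visible in the tag list (for instance, the verb tags begin with `V` and the adverb tags with `D`). The grouping, the class labels and the function name `coarse_class` are simplifications introduced here for illustration; they are not part of the corpus or its documentation.

```python
# Tags handled individually because a prefix rule would misclassify them
# (e.g. "DE" is a particle, not an adverb) or because they stand alone.
SPECIAL = {
    "A": "adjective",
    "DE": "particle",
    "FW": "foreign",
    "I": "interjection",
    "P": "preposition",
    "SHI": "verb",
    "T": "particle",
}

# Prefix rules for the remaining tag families (an illustrative grouping).
PREFIX_CLASSES = (
    ("V", "verb"),          # VA ... VL, V_2
    ("N", "nominal"),       # nouns, determinatives, measures, pronouns
    ("C", "conjunction"),   # Caa, Cab, Cba, Cbb
    ("D", "adverb"),        # D, Da, Dfa, Dfb, Di, Dk
)


def coarse_class(tag):
    """Map a fine-grained tag to a broad class, or "unknown"."""
    if tag in SPECIAL:
        return SPECIAL[tag]
    for prefix, label in PREFIX_CLASSES:
        if tag.startswith(prefix):
            return label
    return "unknown"


if __name__ == "__main__":
    for tag in ("Na", "Neqa", "VHC", "V_2", "DE", "Dfa", "Cbb"):
        print(tag, coarse_class(tag))
```

The `N` family is labeled "nominal" rather than "noun" because it also covers determinatives, measures, postpositions and pronouns.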
Since neither manual checking nor automatic checking against a gold standard is feasible for gigaword-size corpora, the authors proposed a quality assurance method for the automatic annotation of very large corpora based on the heterogeneous CKIP and ICTCLAS tagging systems (Huang et al., 2008). By comparing against word lists generated from an ICTCLAS-tagged version of the Xinhua portion of Chinese Gigaword, they derived a set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality. To evaluate the effect of these out-of-vocabulary dictionaries, randomly selected texts were checked manually. The results indicate that 30,562 of the 31,421 tested words (about 97.3%) were correct. The quality control test results follow:
Corpora | Thousands of words | No. Test words | No. Correct Words |
CNA | 501459 | 42,695 | 41,449 |
XIN | 311718 | 28,744 | 27,967 |
ZBN | 18632 | 22,825 | 22,270 |
Total | 831809 | 31,421 | 30,562 |
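The accuracy figures can be recomputed from the quality control table. One detail worth noting when reading it: the test-word and correct-word counts in the Total row match the per-source means (94,264 / 3 and 91,686 / 3, rounded) rather than the sums, while the word count in thousands is a sum. The sketch below recomputes per-source and pooled accuracy from the three source rows; the function name `accuracy` is illustrative.

```python
# (tested words, correct words) per source, copied from the table above.
QC = {
    "CNA": (42695, 41449),
    "XIN": (28744, 27967),
    "ZBN": (22825, 22270),
}


def accuracy(tested, correct):
    """Fraction of tested words that were judged correct."""
    return correct / tested


if __name__ == "__main__":
    for source, (tested, correct) in QC.items():
        print(f"{source}: {accuracy(tested, correct):.1%}")

    # Pooled over all three sources (sums, not the table's Total row).
    total_tested = sum(tested for tested, _ in QC.values())
    total_correct = sum(correct for _, correct in QC.values())
    print(f"pooled: {accuracy(total_tested, total_correct):.1%}")

    # Per-source means, which is what the Total row appears to report.
    print(round(total_tested / len(QC)), round(total_correct / len(QC)))
```

Both readings of the Total row, pooled sums and per-source means, round to the 97.3% quoted in the text.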
Samples
Please view this sample.