Tagged Chinese Gigaword Version 2.0

Item Name: Tagged Chinese Gigaword Version 2.0
Author(s): Chu-Ren Huang
LDC Catalog No.: LDC2009T14
ISBN: 1-58563-516-2
ISLRN: 247-043-830-464-8
Release Date: June 18, 2009
Member Year(s): 2009
DCMI Type(s): Text
Data Source(s): newswire
Application(s): parsing, natural language processing, language modeling, information retrieval, information extraction
Language(s): Mandarin Chinese, Chinese
Language ID(s): cmn, zho
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2009T14 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Huang, Chu-Ren. Tagged Chinese Gigaword Version 2.0 LDC2009T14. Web Download. Philadelphia: Linguistic Data Consortium, 2009.

Introduction

Tagged Chinese Gigaword Version 2.0, created by scholars at Academia Sinica, Taipei, Taiwan, is a part-of-speech tagged version of LDC's Chinese Gigaword Second Edition (LDC2005T14). Like the original release, Version 2.0 contains all of the data in Chinese Gigaword Second Edition (from Central News Agency, Xinhua News Agency and Lianhe Zaobao), annotated with full part-of-speech tags. In addition, this release removes residual noise found in the original and improves tagging accuracy by incorporating lexica of unknown words. The changes in Version 2.0 include the following:

  • A single-width space is used consistently between two segmented words.
  • The position of the newline character remains fixed, better reflecting the source files from Chinese Gigaword Second Edition (LDC2005T14).
  • The original coding of partial Latin letters or Arabic numerals is preserved.
  • 1,192 documents from Central News Agency (Taiwan) and 13 documents from Xinhua News Agency that were missing from the first publication are included.
  • A set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality of very large corpora is incorporated.

Documents in the corpus were assigned one of the following categories:

  • story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
  • multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
  • advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
  • other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
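
As a minimal sketch of how these categories might be tallied, the snippet below counts documents per category. It assumes that each document opens with a Gigaword-style tag of the form <DOC id="..." type="...">; this page does not describe the markup, so treat that as an assumption to check against the corpus documentation. The function name and the sample IDs are hypothetical.

```python
import re
from collections import Counter

# The four document categories listed above.
DOC_TYPES = ("story", "multi", "advis", "other")

# Assumed markup: <DOC id="..." type="...">. Verify against the corpus documentation.
DOC_OPEN = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"', re.IGNORECASE)


def count_doc_types(text):
    """Return a Counter of document categories found in one file's text."""
    counts = Counter({doc_type: 0 for doc_type in DOC_TYPES})
    for match in DOC_OPEN.finditer(text):
        counts[match.group(1).lower()] += 1
    return counts


# Made-up sample with hypothetical document IDs, for illustration only.
sample = (
    '<DOC id="SRC_0001" type="story">\n</DOC>\n'
    '<DOC id="SRC_0002" type="multi">\n</DOC>\n'
    '<DOC id="SRC_0003" type="story">\n</DOC>\n'
)
doc_counts = count_doc_types(sample)
print(dict(doc_counts))
```

Pre-filling the counter with all four categories keeps categories that never occur visible as zero counts.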

Data

Basic statistics of data from each source are summarized below.

Source   | No. Files | Compressed Size (MB) | Total Size (MB) | No. Words (thousands) | No. Documents
CNA_CMN  | 168       | 1520                 | 6136            | 501456                | 1769953
XIN_CMN  | 168       | 898                  | 3755            | 311660                | 992261
ZBN_CMN  | 10        | 55                   | 214             | 18632                 | 41418
TOTAL    | 346       | 2473                 | 10105           | 831748                | 2803632
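
The figures in the statistics table can be cross-checked with a few lines of arithmetic. The sketch below recomputes the TOTAL row from the three sources and derives two quantities that the table does not state: average words per document and the compression ratio. The variable names are illustrative, and the derived values are simple divisions of the published numbers, not figures supplied by LDC.

```python
# Per-source figures copied from the statistics table.
FIELDS = ("files", "compressed_mb", "total_mb", "kwords", "docs")
stats = {
    "CNA_CMN": (168, 1520, 6136, 501456, 1769953),
    "XIN_CMN": (168, 898, 3755, 311660, 992261),
    "ZBN_CMN": (10, 55, 214, 18632, 41418),
}
reported_total = (346, 2473, 10105, 831748, 2803632)

# Column-wise sums over the three sources.
computed_total = tuple(sum(column) for column in zip(*stats.values()))
print("totals match:", computed_total == reported_total)

# Derived quantities (not part of the published table).
words_per_doc = {}
compression_ratio = {}
for source, row in stats.items():
    record = dict(zip(FIELDS, row))
    # Word counts are given in thousands, so scale before dividing.
    words_per_doc[source] = record["kwords"] * 1000 / record["docs"]
    compression_ratio[source] = record["total_mb"] / record["compressed_mb"]
    print(f"{source}: {words_per_doc[source]:.0f} words/doc, "
          f"compression ratio {compression_ratio[source]:.1f}")
```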

The POS tags and their corresponding explanations are listed below:

Tag | Explanation_Chinese | Explanation_English
A | 非謂形容詞 | Non-predicative adjective
Caa | 對等連接詞,如:和、跟 | Conjunctive conjunction
Cab | 連接詞,如:等等 | Conjunction, e.g. deng3deng3
Cba | 連接詞,如:的話 | Conjunction, e.g. de5hua4
Cbb | 關聯連接詞 | Correlative Conjunction
D | 副詞 | Adverb
Da | 數量副詞 | Quantitative Adverb
DE | 的, 之, 得, 地 | Particle DE and its functional equivalents
Dfa | 動詞前程度副詞 | Pre-verbal Adverb of degree
Dfb | 動詞後程度副詞 | Post-verbal Adverb of degree
Di | 時態標記 | Aspectual Adverb
Dk | 句副詞 | Sentential Adverb
FW | 外文標記 | Foreign Word
I | 感嘆詞 | Interjection
Na | 普通名詞 | Common Noun
Nb | 專有名稱 | Proper Noun
Nc | 地方詞 | Place Noun
Ncd | 位置詞 | Localizer
Nd | 時間詞 | Time Noun
Nep | 指代定詞 | Demonstrative Determinatives
Neqa | 數量定詞 | Quantitative Determinatives
Neqb | 後置數量定詞 | Post-quantitative Determinatives
Nes | 特指定詞 | Specific Determinatives
Neu | 數詞定詞 | Numeral Determinatives
Nf | 量詞 | Measure
Ng | 後置詞 | Postposition
Nh | 代名詞 | Pronoun
P | 介詞 | Preposition
SHI | 是 | shi4 (to be)
T | 語助詞 | Particle
VA | 動作不及物動詞 | Active Intransitive Verb
VAC | 動作使動動詞 | Active Causative Verb
VB | 動作類及物動詞 | Active Pseudo-transitive Verb
VC | 動作及物動詞 | Active Transitive Verb
VCL | 動作接地方賓語動詞 | Active Verb with a Locative Object
VD | 雙賓動詞 | Ditransitive Verb
VE | 動作句賓動詞 | Active Verb with a Sentential Object
VF | 動作謂賓動詞 | Active Verb with a Verbal Object
VG | 分類動詞 | Classificatory Verb
VH | 狀態不及物動詞 | Stative Intransitive Verb
VHC | 狀態使動動詞 | Stative Causative Verb
VI | 狀態類及物動詞 | Stative Pseudo-transitive Verb
VJ | 狀態及物動詞 | Stative Transitive Verb
VK | 狀態句賓動詞 | Stative Verb with a Sentential Object
VL | 狀態謂賓動詞 | Stative Verb with a Verbal Object
V_2 | 有 | you3 (to have)
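
For downstream processing it can help to collapse this fine-grained tagset into a few coarse classes. The sketch below is one possible grouping, not part of the corpus: the class names and the helper functions coarse_class and parse_line are hypothetical. It also assumes tokens are written as word(TAG) and separated by single spaces; this page does not specify the token format, so confirm it against the corpus documentation before relying on the parser.

```python
import re

# Tags that need an exact match, checked before the prefix rules.
# The grouping itself is an illustrative choice, not defined by the corpus.
EXACT = {
    "A": "adjective",
    "DE": "particle",   # must not fall through to the "D" adverb prefix
    "FW": "foreign",
    "I": "interjection",
    "P": "preposition",
    "T": "particle",
    "SHI": "verb",
    "V_2": "verb",
}
PREFIX = {"C": "conjunction", "D": "adverb", "N": "nominal", "V": "verb"}

# Assumed token shape: word(TAG). The greedy word group lets the word itself
# contain parentheses; only the final parenthesised group is read as the tag.
TOKEN = re.compile(r"(.+)\(([A-Za-z0-9_]+)\)")


def coarse_class(tag):
    """Map a fine-grained tag from the table to a coarse class."""
    if tag in EXACT:
        return EXACT[tag]
    return PREFIX.get(tag[:1], "unknown")


def parse_line(line):
    """Split a tagged line into (word, tag) pairs, skipping malformed tokens."""
    pairs = []
    for token in line.split():
        match = TOKEN.fullmatch(token)
        if match:
            pairs.append((match.group(1), match.group(2)))
    return pairs


# Made-up sample line, not taken from the corpus.
parsed = parse_line("台北(Nc) 的(DE) 記者(Na)")
print([(word, tag, coarse_class(tag)) for word, tag in parsed])
```

Checking exact tags first matters because DE shares its first letter with the adverb tags (D, Da, Dfa, and so on) but marks a particle.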

Since neither manual checking nor automatic checking against a gold standard is feasible for gigaword-size corpora, the authors proposed a quality assurance method for the automatic annotation of very large corpora, based on the heterogeneous CKIP and ICTCLAS tagging systems (Huang et al., 2008). By comparing their output against word lists generated from an ICTCLAS-tagged version of the Xinhua portion of Chinese Gigaword, they proposed a set of heuristics for building out-of-vocabulary dictionaries to improve annotation quality. Randomly selected texts were then manually checked to evaluate the effect of these out-of-vocabulary dictionaries. The results indicate that 30,562 of the tested words (about 97.3%) were correct. The quality control test results follow:

Corpora | Thousands of words | No. Test Words | No. Correct Words
CNA     | 501459             | 42,695         | 41,449
XIN     | 311718             | 28,744         | 27,967
ZBN     | 18632              | 22,825         | 22,270
Total   | 831809             | 31,421         | 30,562
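
In the quality control table, the Total figures for test words and correct words match the mean of the three per-source values rather than their sum. The sketch below recomputes the per-source accuracy, those means, and the pooled accuracy from the published counts. The variable names are illustrative, and the per-source percentages are derived here rather than stated in the table.

```python
# (tested, correct) word counts per source, copied from the quality control table.
qc = {
    "CNA": (42695, 41449),
    "XIN": (28744, 27967),
    "ZBN": (22825, 22270),
}

# Accuracy of each source taken separately.
per_source = {src: correct / tested for src, (tested, correct) in qc.items()}

total_tested = sum(tested for tested, _ in qc.values())
total_correct = sum(correct for _, correct in qc.values())

# Means across the three sources; these reproduce the table's Total row.
mean_tested = total_tested / len(qc)
mean_correct = total_correct / len(qc)

# Accuracy over all tested words pooled together.
pooled_accuracy = total_correct / total_tested

for src, accuracy in per_source.items():
    print(f"{src}: {accuracy:.1%}")
print(f"mean tested = {mean_tested:.0f}, mean correct = {mean_correct:.0f}, "
      f"pooled accuracy = {pooled_accuracy:.1%}")
```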
