Tagged Chinese Gigaword

Item Name:	Tagged Chinese Gigaword
Author(s):	Chu-Ren Huang
LDC Catalog No.:	LDC2007T03
ISBN:	1-58563-409-3
ISLRN:	614-675-002-053-4
DOI:	https://doi.org/10.35111/ckna-1h68
Release Date:	June 20, 2007
Member Year(s):	2007
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	information retrieval, language modeling, natural language processing
Language(s):	Mandarin Chinese
Language ID(s):	cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2007T03 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Huang, Chu-Ren. Tagged Chinese Gigaword LDC2007T03. Web Download. Philadelphia: Linguistic Data Consortium, 2007.
Related Works:	hasVersion LDC2009T14 Tagged Chinese Gigaword Version 2.0; isAnnotationOf LDC2005T14 Chinese Gigaword Second Edition

Introduction

Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition LDC2005T14. It contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags.

In order to avoid any problems or confusion that could result from differences in character-set specifications in the source data, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the readme file, all characters in the text are either single-byte ASCII or multi-byte Chinese.
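As a rough illustration of the encoding statement above, the following sketch counts how many characters in a string take one byte in UTF-8 (ASCII) and how many take several bytes. The function name `byte_width_histogram` and the sample string are hypothetical and are not taken from the corpus or its readme.

```python
from collections import Counter


def byte_width_histogram(text: str) -> Counter:
    """Map UTF-8 encoded width in bytes to the number of characters of that width."""
    return Counter(len(ch.encode("utf-8")) for ch in text)


# Made-up sample: four ASCII characters followed by two Chinese characters.
sample = "LDC 中文"
hist = byte_width_histogram(sample)

for width, count in sorted(hist.items()):
    print(f"{width}-byte characters: {count}")
```

Widths other than one byte (ASCII) and three bytes (most common Chinese characters) would point to the exceptions that the readme describes.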

All sources have been categorized into four distinct "types":

story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
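A minimal sketch of tallying documents by type follows. It assumes each document opens with a tag of the form `<DOC id="..." type="...">`, which is an assumption based on the usual Gigaword markup rather than something stated on this page; the ids and the helper name `count_doc_types` are made up for illustration.

```python
import re
from collections import Counter

# Assumed tag shape: <DOC id="..." type="story">. Check the corpus readme before relying on it.
DOC_TAG = re.compile(r'<DOC\b[^>]*\btype="([^"]+)"')


def count_doc_types(text: str) -> Counter:
    """Count opening DOC tags by the value of their type attribute."""
    return Counter(DOC_TAG.findall(text))


# Hypothetical sample with placeholder ids and no real corpus content.
sample = """<DOC id="EXAMPLE_0001" type="story">
</DOC>
<DOC id="EXAMPLE_0002" type="advis">
</DOC>
<DOC id="EXAMPLE_0003" type="story">
</DOC>"""

counts = count_doc_types(sample)
for doc_type, n in sorted(counts.items()):
    print(doc_type, n)
```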

Data

The table below lists the number of files, their compressed and uncompressed sizes, and the number of words and documents for each source. #Files = number of files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes. K-wrds = number of words in thousands. #DOCs = number of documents.

Source	#Files	Rzip-MB	Totl-MB	K-wrds	#DOCs
CNA_CMN	168	994	7363	792195	1769953
XIN_CMN	168	615	4535	471110	992261
ZBN_CMN			223	28066	41418
TOTAL	346	1648	12121	1291371	2803632
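The ZBN_CMN row gives only three values. The sketch below checks which columns they belong to by subtracting the CNA_CMN and XIN_CMN rows from TOTAL. It assumes TOTAL is a plain column sum, so the first two results are inferences from that assumption, not figures stated in the source.

```python
COLUMNS = ["#Files", "Rzip-MB", "Totl-MB", "K-wrds", "#DOCs"]

# Rows copied from the table above.
cna = [168, 994, 7363, 792195, 1769953]
xin = [168, 615, 4535, 471110, 992261]
total = [346, 1648, 12121, 1291371, 2803632]

# What ZBN_CMN would have to contribute if TOTAL is a plain column sum (assumption).
implied_zbn = [t - c - x for t, c, x in zip(total, cna, xin)]

for name, value in zip(COLUMNS, implied_zbn):
    print(f"{name}: {value}")
```

The last three implied values equal the three figures listed for ZBN_CMN (223, 28066, 41418), which places them under Totl-MB, K-wrds and #DOCs.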

The following table gives the number of documents ("#DOCs") and the number of words in thousands ("K-wrds"), broken down by source and DOC type:

	#DOCs	K-wrds
type="advis":
CNA_CMN	8160	751
XIN_CMN	6553	711
ZBN_CMN
TOTAL	14713	1462
type="multi":
CNA_CMN	30552	23429
XIN_CMN	11329	7516
ZBN_CMN
TOTAL	41936	30986
type="other":
CNA_CMN	100758	40258
XIN_CMN	31255	9999
ZBN_CMN	279	130
TOTAL	132292	50387
type="story":
CNA_CMN	1630483	727748
XIN_CMN	943132	452878
ZBN_CMN	41084	27898
TOTAL	2614691	1208524
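As a quick sanity check on the per-type figures, this sketch confirms that the four per-type TOTAL document counts add up to the overall #DOCs total and prints each type's share of the corpus. The variable names are illustrative only.

```python
# Per-type TOTAL rows (#DOCs) copied from the table above.
docs_by_type = {
    "advis": 14713,
    "multi": 41936,
    "other": 132292,
    "story": 2614691,
}
overall_docs = 2803632  # TOTAL #DOCs from the per-source table

type_sum = sum(docs_by_type.values())
print("per-type sum matches overall total:", type_sum == overall_docs)

# Share of all documents contributed by each DOC type, in percent.
share = {t: 100 * n / overall_docs for t, n in docs_by_type.items()}
for doc_type, pct in share.items():
    print(f"{doc_type}: {pct:.2f}%")
```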

The performance of the CKIP segmentation and POS tagging system was tested in Bakeoff 2005 and Bakeoff 2006.

The test results are as follows:

	Doc#	RefWord#	TestWord#	MatchWord#	Recall (%)	Precision (%)	F-Score (%)
Bakeoff 2005	190	116509	116443	112091	96.2	96.3	96.2
Bakeoff 2006	148	90405	90327	87332	96.6	96.7	96.6

Note:

Recall = MatchWord# / RefWord#

Precision = MatchWord# / TestWord#

F-Score = 2 * Recall * Precision / (Recall + Precision)
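The formulas in the note can be applied directly to the word counts in the results table. The sketch below does so; the function name `prf` and the dictionary layout are illustrative choices, while the counts are the ones reported above.

```python
def prf(ref_words: int, test_words: int, match_words: int):
    """Return (recall, precision, f_score) as fractions, following the note's formulas."""
    recall = match_words / ref_words
    precision = match_words / test_words
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


# (RefWord#, TestWord#, MatchWord#) from the results table.
results = {
    "Bakeoff 2005": (116509, 116443, 112091),
    "Bakeoff 2006": (90405, 90327, 87332),
}

scores = {name: prf(*counts) for name, counts in results.items()}
for name, (r, p, f) in scores.items():
    print(f"{name}: Recall={100 * r:.1f} Precision={100 * p:.1f} F-Score={100 * f:.1f}")
```

Rounded to one decimal place, the computed values agree with the percentages in the table.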

Samples

For an example of the data contained in this corpus, see the screen capture (jpg) of the annotated text.
