Tagged Chinese Gigaword
Item Name: | Tagged Chinese Gigaword |
Author(s): | Chu-Ren Huang |
LDC Catalog No.: | LDC2007T03 |
ISBN: | 1-58563-409-3 |
ISLRN: | 614-675-002-053-4 |
DOI: | https://doi.org/10.35111/ckna-1h68 |
Release Date: | June 20, 2007 |
Member Year(s): | 2007 |
DCMI Type(s): | Text |
Data Source(s): | newswire |
Application(s): | information retrieval, language modeling, natural language processing |
Language(s): | Mandarin Chinese |
Language ID(s): | cmn |
License(s): |
LDC User Agreement for Non-Members |
Online Documentation: | LDC2007T03 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Huang, Chu-Ren. Tagged Chinese Gigaword LDC2007T03. Web Download. Philadelphia: Linguistic Data Consortium, 2007. |
Related Works: | View |
Introduction
Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition LDC2005T14. It contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao -- annotated with full part of speech tags.
In order to avoid any problems or confusion that could result from differences in character-set specifications in the source data, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the readme file, all characters in the text are either single-byte ASCII or multi-byte Chinese.
All sources have been categorized into four distinct "types":
- story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
- multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
- advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
- other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
Data
The table below lists the number files, their compressed and uncompressed size, number of words and number of documents divided by source. #Files = number of files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes. K-words = number of words in thousands. #DOCs = number of documents.
Source | #Files | Rzip-MB | Totl-MB | K-wrds | #DOCs | CNA_CMN | 168 | 994 | 7363 | 792195 | 1769953 | XIN_CMN | 168 | 615 | 4535 | 471110 | 992261 | ZBN_CMN | 10 | 40 | 223 | 28066 | 41418 | TOTAL | 346 | 1648 | 12121 | 1291371 | 2803632 |
The following tables present the quantity of "K-wrds" and "#DOCS", divided by source and DOC type:
#DOCs | K-wrds | type="advis": | CNA_CMN | 8160 | 751 | XIN_CMN | 6553 | 711 | ZBN_CMN | 0 | 0 | TOTAL | 14713 | 1462 |
type="multi": | CNA_CMN | 30552 | 23429 | XIN_CMN | 11329 | 7516 | ZBN_CMN | 55 | 41 | TOTAL | 41936 | 30986 |
type="other": | CNA_CMN | 100758 | 40258 | XIN_CMN | 31255 | 9999 | ZBN_CMN | 279 | 130 | TOTAL | 132292 | 50387 |
type="story": | CNA_CMN | 1630483 | 727748 | XIN_CMN | 943132 | 452878 | ZBN_CMN | 41084 | 27898 | TOTAL | 2614691 | 1208524 |
The performance of CKIP Segmentation and POS tagging system has been tested in Bakeoff 2005 and Bakeoff 2006.
The test result is shown as follows:
Doc# | RefWord# | TestWord# | MatchWord# | Recall (%) | Precision (%) | F-Score (%) | Bakeoff 2005 | 190 | 116509 | 116443 | 112091 | 96.2 | 96.3 | 96.2 | Bakeoff 2006 | 148 | 90405 | 90327 | 87332 | 96.6 | 96.7 | 96.6 |
Note:
Recall=MatchWord# / RefWord#
Precision=MatchWord# / TestWord#
F-Score=2 * Recall * Precision / (Recall + Precision)
Samples
For an example of the data contained in this corpus, please view this screen capture(jpg) of the annotated text.