Tagged Chinese Gigaword

Item Name: Tagged Chinese Gigaword
Author(s): Chu-Ren Huang
LDC Catalog No.: LDC2007T03
ISBN: 1-58563-409-3
ISLRN: 614-675-002-053-4
DOI: https://doi.org/10.35111/ckna-1h68
Release Date: June 20, 2007
Member Year(s): 2007
DCMI Type(s): Text
Data Source(s): newswire
Application(s): information retrieval, language modeling, natural language processing
Language(s): Mandarin Chinese
Language ID(s): cmn
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2007T03 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Huang, Chu-Ren. Tagged Chinese Gigaword LDC2007T03. Web Download. Philadelphia: Linguistic Data Consortium, 2007.

Introduction

Tagged Chinese Gigaword, created by scholars at Academia Sinica, Taipei, Taiwan, is the part-of-speech tagged version of the LDC's Chinese Gigaword Second Edition (LDC2005T14). It contains all of the data in Chinese Gigaword Second Edition -- from Central News Agency (Taiwan), Xinhua News Agency and Lianhe Zaobao -- annotated with full part-of-speech tags.

In order to avoid any problems or confusion that could result from differences in character-set specifications in the source data, all text files in this corpus have been converted to UTF-8 character encoding. With some exceptions described in the readme file, all characters in the text are either single-byte ASCII or multi-byte Chinese.
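The split between single-byte ASCII and multi-byte Chinese characters can be checked directly on any decoded text. The following is a minimal sketch, not part of the corpus tooling: the function name and the sample string are made up for illustration, and it simply counts characters by the number of bytes each one occupies in UTF-8.

```python
from collections import Counter


def utf8_width_histogram(text):
    """Count characters by the number of bytes each needs in UTF-8."""
    return Counter(len(ch.encode("utf-8")) for ch in text)


# Hypothetical sample: three Chinese characters followed by ASCII text.
sample = "中央社 CNA 2007"
hist = utf8_width_histogram(sample)

# ASCII characters take 1 byte; these Chinese characters take 3 bytes each.
print(dict(hist))
print(len(sample.encode("utf-8")), "bytes in total")
```

Any width other than 1 or 3 in such a histogram would point at one of the exceptions described in the readme file, or at characters outside the common Chinese range.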

Every DOC, regardless of source, has been assigned to one of four distinct "types":

  • story: This type of DOC represents a coherent report on a particular topic or event, consisting of paragraphs and full sentences.
  • multi: This type of DOC contains a series of unrelated "blurbs," each of which briefly describes a particular topic or event; examples include "summaries of today's news," "news briefs in ..." (some general area like finance or sports), and so on.
  • advis: These are DOCs which the news service addresses to news editors; they are not intended for publication to the "end users."
  • other: These DOCs clearly do not fall into any of the above types; they include items such as lists of sports scores, stock prices, temperatures around the world, and so on.
Data

The table below lists, for each source, the number of files, their compressed and uncompressed sizes, the number of words and the number of documents. #Files = number of files. Rzip-MB = compressed size in megabytes. Totl-MB = uncompressed size in megabytes. K-wrds = number of words in thousands. #DOCs = number of documents.

Source    #Files  Rzip-MB  Totl-MB     K-wrds     #DOCs
CNA_CMN      168      994     7363     792195   1769953
XIN_CMN      168      615     4535     471110    992261
ZBN_CMN       10       40      223      28066     41418
TOTAL        346     1648    12121    1291371   2803632

The following tables present the quantity of "#DOCs" and "K-wrds", divided by source and DOC type:

                   #DOCs    K-wrds
type="advis":
  CNA_CMN           8160       751
  XIN_CMN           6553       711
  ZBN_CMN              0         0
  TOTAL            14713      1462
type="multi":
  CNA_CMN          30552     23429
  XIN_CMN          11329      7516
  ZBN_CMN             55        41
  TOTAL            41936     30986
type="other":
  CNA_CMN         100758     40258
  XIN_CMN          31255      9999
  ZBN_CMN            279       130
  TOTAL           132292     50387
type="story":
  CNA_CMN        1630483    727748
  XIN_CMN         943132    452878
  ZBN_CMN          41084     27898
  TOTAL          2614691   1208524
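Per-type document counts like those above could be tallied with a few lines of Python. This is only a sketch under a stated assumption: that each document opens with a tag of the form <DOC id="..." type="...">, as the type="..." labels suggest. The readme file is the authority on the actual markup, and the document IDs in the sample are invented for illustration.

```python
import re
from collections import Counter

# Assumed tag shape: <DOC id="..." type="...">; verify against the readme.
DOC_TAG = re.compile(r'<DOC\b[^>]*\btype="(\w+)"')


def count_doc_types(text):
    """Return a Counter mapping each DOC type to its number of documents."""
    return Counter(DOC_TAG.findall(text))


# Hypothetical sample; the IDs and bodies are placeholders, not corpus data.
sample = """<DOC id="XIN_CMN_20040101.0001" type="story">
...
</DOC>
<DOC id="XIN_CMN_20040101.0002" type="advis">
...
</DOC>
<DOC id="CNA_CMN_20040101.0001" type="story">
...
</DOC>"""

counts = count_doc_types(sample)
for doc_type, n in sorted(counts.items()):
    print(doc_type, n)
```

A regular expression is enough here because only the opening tag is inspected; closing </DOC> tags do not match the pattern.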

The performance of the CKIP segmentation and POS tagging system was tested in Bakeoff 2005 and Bakeoff 2006.

The test results are as follows:

              Doc#  RefWord#  TestWord#  MatchWord#  Recall (%)  Precision (%)  F-Score (%)
Bakeoff 2005   190    116509     116443      112091        96.2           96.3         96.2
Bakeoff 2006   148     90405      90327       87332        96.6           96.7         96.6

Note:

  Recall = MatchWord# / RefWord#

  Precision = MatchWord# / TestWord#

  F-Score = 2 * Recall * Precision / (Recall + Precision)
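As a quick check, the three formulas can be applied to the word counts reported for the two bakeoffs. The sketch below is illustrative only; the function and variable names are not from the CKIP system, and the counts are the RefWord#, TestWord# and MatchWord# values from the results table.

```python
def segmentation_scores(ref_words, test_words, match_words):
    """Return (recall, precision, f_score) as fractions between 0 and 1."""
    recall = match_words / ref_words
    precision = match_words / test_words
    f_score = 2 * recall * precision / (recall + precision)
    return recall, precision, f_score


# (RefWord#, TestWord#, MatchWord#) as reported for each bakeoff.
reported = {
    "Bakeoff 2005": (116509, 116443, 112091),
    "Bakeoff 2006": (90405, 90327, 87332),
}

# Convert each score to a percentage rounded to one decimal place.
results = {
    name: tuple(round(100 * score, 1) for score in segmentation_scores(*counts))
    for name, counts in reported.items()
}

for name, (recall, precision, f_score) in results.items():
    print(f"{name}: recall={recall} precision={precision} F-score={f_score}")
```

Rounded to one decimal place, the computed values agree with the percentages in the table: 96.2, 96.3 and 96.2 for Bakeoff 2005, and 96.6, 96.7 and 96.6 for Bakeoff 2006.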

Samples

For an example of the data contained in this corpus, please view this screen capture (jpg) of the annotated text.
