
Chinese Treebank 6.0

Item Name:	Chinese Treebank 6.0
Author(s):	Martha Palmer, Nianwen Xue, Fei Xia, Fu-Dong Chiou, Zixin Jiang, Meiyu Chang
LDC Catalog No.:	LDC2007T36
ISBN:	1-58563-450-6
ISLRN:	616-484-921-813-1
DOI:	https://doi.org/10.35111/bfb8-gt03
Release Date:	December 20, 2007
Member Year(s):	2007
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	GALE, TIDES
Application(s):	natural language processing, parsing, machine translation, linguistic analysis, information extraction
Language(s):	Mandarin Chinese
Language ID(s):	cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2007T36 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Palmer, Martha, et al. Chinese Treebank 6.0 LDC2007T36. Web Download. Philadelphia: Linguistic Data Consortium, 2007.
Related Works:
	isVersionOf: LDC2001T11 Chinese Treebank 2.0; LDC2004T05 Chinese Treebank 4.0; LDC2005T01 Chinese Treebank 5.0
	hasVersion: LDC2010T07 Chinese Treebank 7.0; LDC2013T21 Chinese Treebank 8.0; LDC2016T13 Chinese Treebank 9.0
	hasAnnotation: LDC2008T07 Chinese Proposition Bank 2.0; LDC2012T04 2009 CoNLL Shared Task Part 2
	hasOutcome: LDC2015T06 GALE Chinese-English Parallel Aligned Treebank -- Training

Introduction

This file contains documentation for Chinese Treebank 6.0, Linguistic Data Consortium (LDC) catalog number LDC2007T36 and ISBN 1-58563-450-6.

The Chinese Treebank project began at the University of Pennsylvania in 1998 and continues at Penn and the University of Colorado. Chinese Treebank 6.0 is the latest version produced from this effort, consisting of roughly 780,000 words (over 1.28 million Chinese characters) that are segmented, part-of-speech tagged and fully bracketed. The data sources include newswire from Xinhua News Agency, articles from Sinorama Magazine, news from the website of the Hong Kong Special Administrative Region and transcripts from various broadcast news programs.

The LDC published Chinese Treebank 1.0 in 2000; a corrected version, consisting of approximately 100,000 words, was released in 2001 as Chinese Treebank 2.0 (LDC2001T11). The LDC released Chinese Treebank 4.0 (LDC2004T05), an updated version containing roughly 400,000 words, in 2004. A year later, the LDC published the 500,000-word Chinese Treebank 5.0 (LDC2005T01).

For information about Chinese Treebank methodology and guidelines, consult the attached documentation files and the Chinese Treebank Project website.

This release comprises 2,036 text files containing 28,295 sentences, 781,351 words and 1,285,149 hanzi (Chinese characters). The data is provided in two encodings, GBK and UTF-8, and in four formats: raw text, word segmented, word segmented and POS-tagged, and syntactically bracketed. The bracketed annotation uses Penn Treebank-style labeled brackets. Details of the annotation standard can be found in the enclosed segmentation, POS-tagging and bracketing guidelines.
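To illustrate how the four formats relate to one another, the following minimal Python sketch parses a single Penn Treebank-style bracketed string and derives raw, segmented and POS-tagged views from it. The sample tree is a hypothetical example written for this illustration, not a sentence taken from the corpus, and the `word_TAG` layout used for the tagged view is an assumption; consult the enclosed guidelines for the actual file conventions.

```python
def parse_bracketed(text):
    """Parse one bracketed string into nested (label, children) tuples."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node():
        nonlocal pos
        assert tokens[pos] == "(", "expected an opening bracket"
        pos += 1
        label = tokens[pos]          # phrase label or POS tag
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())      # nested constituent
            else:
                children.append(tokens[pos])  # leaf word
                pos += 1
        pos += 1                     # consume the closing bracket
        return (label, children)

    return node()


def leaves(tree):
    """Yield (word, pos_tag) pairs in sentence order."""
    label, children = tree
    for child in children:
        if isinstance(child, str):
            yield child, label
        else:
            yield from leaves(child)


# Hypothetical example tree, NOT taken from the corpus.
SAMPLE = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"

tree = parse_bracketed(SAMPLE)
pairs = list(leaves(tree))

raw = "".join(word for word, _ in pairs)              # raw text
segmented = " ".join(word for word, _ in pairs)       # word segmented
tagged = " ".join(f"{w}_{t}" for w, t in pairs)       # assumed word_TAG layout

# The same text can be serialized in either of the two encodings.
gbk_bytes = raw.encode("gbk")
utf8_bytes = raw.encode("utf-8")

print(raw)
print(segmented)
print(tagged)
```

The sketch assumes one well-formed tree per input string. Penn-style treebanks typically mark empty categories with a `-NONE-` tag; this sketch does not filter them, so they would appear in the derived views.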

Samples

For an example of the data in this publication, please examine this sample of the bracketed data.

Copyright

Portions © 2000-2001 China Broadcasting System, © 2000-2001 China Central TV, © 2000-2001 China National Radio, © 2000-2001 China Television System, © 1997 The Government of the Hong Kong Special Administrative Region, © 1996-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 2001, 2004, 2005, 2007 Trustees of the University of Pennsylvania
