Chinese Treebank 5.0

Item Name:	Chinese Treebank 5.0
Author(s):	Martha Palmer, Fu-Dong Chiou, Nianwen Xue, Tsan-Kuang Lee
LDC Catalog No.:	LDC2005T01
ISBN:	1-58563-323-2
ISLRN:	426-628-131-806-1
DOI:	https://doi.org/10.35111/j3yz-jw79
Release Date:	January 15, 2005
Member Year(s):	2005
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	GALE, TIDES
Application(s):	natural language processing, parsing, tagging
Language(s):	Mandarin Chinese
Language ID(s):	cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2005T01 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Palmer, Martha, et al. Chinese Treebank 5.0 LDC2005T01. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:
isVersionOf:	LDC2001T11 Chinese Treebank 2.0; LDC2004T05 Chinese Treebank 4.0
hasVersion:	LDC2007T36 Chinese Treebank 6.0; LDC2010T07 Chinese Treebank 7.0; LDC2013T21 Chinese Treebank 8.0; LDC2016T13 Chinese Treebank 9.0
hasAnnotation:	LDC2005T23 Chinese Proposition Bank 1.0
hasOutcome:	LDC2007T02 English Chinese Translation Treebank v 1.0

Introduction

Chinese Treebank 5.0 was developed by the Linguistic Data Consortium (LDC) and contains approximately 500,000 words of Chinese newswire text annotated in the manner of the Penn English Treebank.

The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000, and it was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11). Another updated version was released in 2004 as Chinese Treebank 4.0 (LDC2004T05). More information about the project is available on the Chinese Treebank website.

For this release, 52 new Sinorama files have been added and some of the errors that existed in earlier releases have been corrected.

The content used in this corpus comes from the following newswire sources:

698 articles	Xinhua (1994-1998)
55 articles	Information Services Department of HKSAR (1997)
132 articles	Sinorama magazine, Taiwan (1996-1998; 2000-2001)

Data

Chinese Treebank 5.0 contains 890 data files, 18,782 sentences, 507,222 words, and 824,983 characters.

All files are GB encoded. The format of Chinese Treebank 5.0 is the same as that of the Penn English Treebank. All files have been annotated at least twice. The first pass was done by one annotator, and the resulting files were checked by a second annotator (second pass). Some files were also double-blind annotated and then adjudicated to create gold standard files.
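As an illustration of the encoding and format described above, the following minimal Python sketch decodes GB-encoded bytes and parses Penn-style bracketing into nested lists. Several things here are assumptions and not taken from the corpus documentation: the codec name (the text says only "GB encoded"; gb18030 is used because it also decodes GB2312 and GBK text), the helper names decode_gb and parse_brackets, and the sample sentence, which is invented for the example. Any markup surrounding the bracketing in the actual files would need to be stripped first.

```python
def decode_gb(data: bytes) -> str:
    # Assumption: gb18030 is used as a superset codec for "GB encoded" text.
    return data.decode("gb18030")


def parse_brackets(text: str) -> list:
    """Parse Penn-style bracketing into nested lists of the form [label, child, ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def node() -> list:
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '(' at token %d" % pos)
        pos += 1
        items = []
        while pos < len(tokens) and tokens[pos] != ")":
            if tokens[pos] == "(":
                items.append(node())      # nested constituent
            else:
                items.append(tokens[pos])  # label or word
                pos += 1
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets")
        pos += 1  # consume ')'
        return items

    trees = []
    while pos < len(tokens):
        trees.append(node())
    return trees


# Invented sample, encoded the way a corpus file is assumed to be stored.
sample_bytes = "(IP (NP (NR 中国)) (VP (VV 发展)))".encode("gb18030")
trees = parse_brackets(decode_gb(sample_bytes))
```

The nested-list representation keeps the sketch free of dependencies; a dedicated tree class would be a reasonable alternative for real processing.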

The corpus provides four versions of files: bracketed, raw, segmented, and postagged. The raw, segmented, and postagged versions are generated from the bracketed version and so do not reflect the previous annotation stages. The bracketed files are sequentially named as follows: chtb_nnnn.fid, where nnnn is a sequential file number.
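The relationship between the four versions can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure LDC used: the word_TAG separator in the postagged output, the decision to skip empty categories tagged -NONE-, the function names, and the sample sentence are all hypothetical. The file-name pattern accepts any number of digits, because the text does not fix the width of nnnn.

```python
import re

# A preterminal is a bracket holding exactly a tag and a word: (TAG word)
PRETERMINAL = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")
# Assumption: the sequential number may have any number of digits.
FILE_NAME = re.compile(r"chtb_(\d+)\.fid")


def tagged_words(bracketed: str, skip_tags=("-NONE-",)) -> list:
    """Return (word, tag) pairs in sentence order, skipping the given tags."""
    return [(word, tag)
            for tag, word in PRETERMINAL.findall(bracketed)
            if tag not in skip_tags]


def derive_versions(bracketed: str) -> dict:
    """Derive raw, segmented, and postagged strings from one bracketed sentence."""
    pairs = tagged_words(bracketed)
    words = [word for word, _ in pairs]
    return {
        "raw": "".join(words),
        "segmented": " ".join(words),
        # Assumption: words and tags are joined with an underscore.
        "postagged": " ".join(f"{word}_{tag}" for word, tag in pairs),
    }


def file_number(name: str):
    """Return the sequential number from a bracketed file name, or None."""
    match = FILE_NAME.fullmatch(name)
    return int(match.group(1)) if match else None


# Invented sample sentence with one empty category.
versions = derive_versions("(IP (NP (NR 中国)) (VP (VV 发展) (NP (-NONE- *pro*))))")
```

Because the three derived versions are computed from the bracketing alone, they carry no trace of earlier annotation passes, which matches the description above.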

Samples

For an example of the data in this corpus, please view this gold standard sample (TXT).

Updates

The 5.1 update contains corrections to errors found in the earlier version. Specifically, sentences that had more than one top-level node have been modified. Additionally, some GB-encoded white spaces have been converted to ASCII. The 5.1 package is available as an additional download to all those who have licensed Chinese Treebank 5.0.
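The two kinds of correction mentioned above can be illustrated with a short Python sketch. It is not the script used to produce the update. It assumes that a sentence's bracketing is passed in as a decoded string without any outer wrapper, and that the "GB-encoded white spaces" are ideographic spaces (U+3000); the text names neither detail. The function names are hypothetical.

```python
# Assumption: the GB-encoded white space in question is the ideographic space.
IDEOGRAPHIC_SPACE = "\u3000"


def count_top_level_nodes(bracketed: str) -> int:
    """Count bracket groups that open at depth 0 in one sentence's bracketing."""
    depth = 0
    count = 0
    for ch in bracketed:
        if ch == "(":
            if depth == 0:
                count += 1  # a new top-level node starts here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced brackets")
    if depth != 0:
        raise ValueError("unbalanced brackets")
    return count


def normalize_spaces(text: str) -> str:
    """Replace ideographic spaces with ASCII spaces."""
    return text.replace(IDEOGRAPHIC_SPACE, " ")
```

A sentence for which count_top_level_nodes returns a value greater than 1 is the kind of case the update describes as having been modified.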

Copyright

Portions © 1994-1998 Xinhua News Agency, © 1996-2001 Sinorama Magazine, © 1997 The Government of the Hong Kong Special Administrative Region, © 2001, 2004, 2005 Trustees of the University of Pennsylvania