Chinese Treebank 4.0

Item Name: Chinese Treebank 4.0
Authors: Martha Palmer, Fu-Dong Chiou, Nianwen Xue, and Tsan-Kuang Lee
LDC Catalog No.: LDC2004T05
ISBN: 1-58563-287-2
Release Date: Mar 15, 2004
Data Type: text
Data Source(s): newswire
Project(s): GALE, TIDES
Application(s): natural language processing, parsing, tagging
Language(s): Mandarin Chinese
Language ID(s): cmn
Distribution: Web Download
Member fee: $0 for 2004 members
Non-member Fee: US $225.00
Reduced-License Fee: US $225.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Martha Palmer, et al.
Chinese Treebank 4.0
Linguistic Data Consortium, Philadelphia


Chinese Treebank 4.0 was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2004T05 and its ISBN is 1-58563-287-2.

The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000. It was later corrected and released in 2001 as Chinese Treebank 2.0. More information about the project is available on the Penn Chinese Treebank website.

The content used in this corpus comes from the following newswire sources:

698 articles from Xinhua (1994-1998)
55 articles from the Information Services Department of HKSAR (1997)
80 articles from Sinorama magazine, Taiwan (1996-1998 and 2000-2001)


Chinese Treebank 4.0 contains 404,156 words, 664,633 Hanzi, 15,162 sentences, and 838 data files.

All files are GB encoded. The format of Chinese Treebank 4.0 is the same as that of the Penn English Treebank. All files have been annotated at least twice: the first pass was done by one annotator, and the resulting files were then checked by a second annotator (second pass). The corpus also provides seven files intended to serve as the gold standard annotation.
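As a rough illustration of what GB-encoded, Penn-style bracketed data involves, the following Python sketch decodes a byte string and parses one bracketed tree into nested tuples. Several things here are assumptions rather than facts from this release: that Python's gb18030 codec (a superset of GB2312 and GBK) reads the files correctly, that trees use plain parenthesized labels with an optional unlabeled outer bracket, and the sample sentence itself, which is made up for illustration and not taken from the corpus. The function names decode_gb and parse_tree are hypothetical.

```python
import re

# A token is an opening bracket, a closing bracket, or a run of other non-space characters.
TOKEN = re.compile(r"\(|\)|[^\s()]+")


def decode_gb(data: bytes) -> str:
    """Decode GB-encoded bytes (assumption: the gb18030 codec covers the files)."""
    return data.decode("gb18030")


def parse_tree(text: str):
    """Parse one bracketed tree into (label, children) tuples.

    Children are either nested (label, children) tuples or word strings.
    An unlabeled outer bracket, as in "( (IP ...) )", gets the label "".
    """
    tokens = TOKEN.findall(text)
    if not tokens or tokens[0] != "(":
        raise ValueError("expected a tree starting with '('")
    pos = 0

    def node():
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]
            pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(node())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return node()


# Illustrative sentence only; it is not drawn from the corpus.
sample = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
gb_bytes = sample.encode("gb18030")  # stand-in for bytes read from a data file
tree = parse_tree(decode_gb(gb_bytes))
print(tree)
```

When reading from disk, passing encoding="gb18030" to open() would serve the same purpose as decode_gb, under the same encoding assumption.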

The corpus provides four versions of files: bracketed, raw, segmented and postagged. The raw, segmented and postagged versions are generated from the bracketed version and so do not reflect the previous annotation stages.
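The relationship between the four versions can be sketched in Python: starting from a bracketed tree, the leaves give the segmented text, the leaves paired with their tags give a POS-tagged line, and the concatenated leaves give the raw text. This is only a sketch under stated assumptions, not the procedure actually used to build the release. The word_TAG output layout, the skipping of leaves tagged -NONE- (empty categories), the function names, and the sample sentence are all assumptions made for illustration.

```python
import re

# A leaf is a bracket holding exactly a tag and a word, e.g. "(NN 经济)".
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(bracketed: str):
    """Return (word, tag) pairs in sentence order, skipping -NONE- leaves (assumed empty categories)."""
    return [(word, tag) for tag, word in LEAF.findall(bracketed) if tag != "-NONE-"]


def to_segmented(bracketed: str) -> str:
    """Words separated by single spaces."""
    return " ".join(word for word, _ in leaves(bracketed))


def to_postagged(bracketed: str) -> str:
    """Words joined to their tags as word_TAG (assumed layout)."""
    return " ".join(f"{word}_{tag}" for word, tag in leaves(bracketed))


def to_raw(bracketed: str) -> str:
    """Unsegmented text: the words concatenated with no separator."""
    return "".join(word for word, _ in leaves(bracketed))


# Illustrative sentence only; it is not drawn from the corpus.
sample = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
print(to_segmented(sample))
print(to_postagged(sample))
print(to_raw(sample))
```

Because all three outputs are computed from the final tree alone, they carry no trace of earlier segmentation or tagging passes, which matches the statement above that the derived versions do not reflect previous annotation stages.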


Additional information, updates, and bug fixes will be posted on the Penn Chinese Treebank website.


This corpus was funded in part through the DARPA-TIDES grant number N66001-00-1-8915.

Content Copyright

Portions © 1997 The Government of the Hong Kong Special Administrative Region, © 1996-1998, 2000-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 2001, 2004 Trustees of the University of Pennsylvania