Chinese Treebank 4.0

Item Name:	Chinese Treebank 4.0
Author(s):	Martha Palmer, Fu-Dong Chiou, Nianwen Xue, Tsan-Kuang Lee
LDC Catalog No.:	LDC2004T05
ISBN:	1-58563-287-2
ISLRN:	191-685-030-898-8
DOI:	https://doi.org/10.35111/0qv2-1916
Release Date:	March 15, 2004
Member Year(s):	2004
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	GALE, TIDES
Application(s):	natural language processing, parsing, tagging
Language(s):	Mandarin Chinese
Language ID(s):	cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2004T05 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Palmer, Martha, et al. Chinese Treebank 4.0 LDC2004T05. Web Download. Philadelphia: Linguistic Data Consortium, 2004.
Related Works:	isVersionOf LDC2001T11 Chinese Treebank 2.0; hasVersion LDC2005T01 Chinese Treebank 5.0, LDC2007T36 Chinese Treebank 6.0, LDC2010T07 Chinese Treebank 7.0, LDC2013T21 Chinese Treebank 8.0, LDC2016T13 Chinese Treebank 9.0

Introduction

Chinese Treebank 4.0 was developed by the Linguistic Data Consortium (LDC) and contains approximately 400,000 words of Chinese newswire text annotated in the manner of the Penn English Treebank.

The Penn Chinese Treebank is an ongoing project that started in the summer of 1998. The goal of the project is to create a 500,000-word corpus of Chinese text with syntactic bracketing. Chinese Treebank 1.0 was first published in 2000. It was later corrected and released in 2001 as Chinese Treebank 2.0 (LDC2001T11). More information about the project is available on the Chinese Treebank website.

Data

The content used in this corpus comes from the following newswire sources:

Articles	Source
698	Xinhua (1994-1998)
55	Information Services Department of HKSAR (1997)
80	Sinorama magazine, Taiwan (1996-1998 & 2000-2001)

Here is the breakdown of the content:

Words	Hanzi	Sentences	Files
404,156	664,633	15,162	838

All files are GB encoded. The format of Chinese Treebank 4.0 is the same as that of the Penn English Treebank. All files have been annotated at least twice: the first pass was done by one annotator, and the resulting files were checked by a second annotator (second pass). The corpus also provides seven files intended to serve as the gold standard annotation.
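Because the files are GB encoded, they usually need to be decoded explicitly before use with UTF-8 tooling. The following is a minimal sketch, not part of the corpus distribution. The description above says only "GB encoded", so the choice of Python's gb18030 codec (which also reads GB2312 and GBK byte sequences) is an assumption, and the function names are hypothetical.

```python
def decode_gb(data: bytes) -> str:
    """Decode GB-encoded bytes to a Python string.

    Assumption: the exact GB variant is not stated in the corpus description,
    so gb18030 is used because it is a superset of GB2312 and GBK.
    """
    return data.decode("gb18030")


def gb_to_utf8(data: bytes) -> bytes:
    """Re-encode GB-encoded bytes as UTF-8."""
    return decode_gb(data).encode("utf-8")


if __name__ == "__main__":
    # Illustrative input only; these bytes are not taken from the corpus.
    sample = "中文树库".encode("gb2312")
    print(decode_gb(sample))
```

To convert a file on disk, its contents can be read in binary mode and passed to gb_to_utf8 before being written back out.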

The corpus provides four versions of files: bracketed, raw, segmented, and part-of-speech tagged. The raw, segmented, and part-of-speech tagged versions are generated from the bracketed version and so do not reflect the previous annotation stages.
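The relationship between the four versions can be illustrated with a short sketch that derives raw, segmented, and part-of-speech tagged text from one Penn-style bracketed tree. This is not the tool used to build the corpus. The example tree is made up, the function names are hypothetical, and the word_TAG separator and the skipping of -NONE- empty-category leaves are assumptions about the output conventions.

```python
import re

# A leaf looks like "(NN 经济)": a tag, whitespace, and a token with no brackets.
LEAF = re.compile(r"\(([^\s()]+)\s+([^\s()]+)\)")


def leaves(bracketed: str) -> list:
    """Return (tag, word) pairs in surface order.

    Assumption: leaves tagged -NONE- are empty categories (traces) and
    are left out of the derived versions.
    """
    return [(tag, word) for tag, word in LEAF.findall(bracketed) if tag != "-NONE-"]


def derive_versions(bracketed: str, sep: str = "_") -> dict:
    """Build raw, segmented, and tagged strings from one bracketed tree."""
    pairs = leaves(bracketed)
    words = [word for _, word in pairs]
    return {
        "raw": "".join(words),          # characters with no word boundaries
        "segmented": " ".join(words),   # words separated by spaces
        "tagged": " ".join(f"{word}{sep}{tag}" for tag, word in pairs),
    }


if __name__ == "__main__":
    # Hypothetical tree, not taken from the corpus.
    tree = "(IP (NP (NR 中国)) (VP (VV 发展) (NP (NN 经济))))"
    for name, text in derive_versions(tree).items():
        print(name, text)
```

Since the derived versions depend only on the leaves of the final tree, they carry no trace of earlier annotation passes, which matches the note above.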

Updates

None at this time.

Sponsorship

This corpus was funded in part through the DARPA-TIDES grant number N66001-00-1-8915.

Copyright

Portions © 1997 The Government of the Hong Kong Special Administrative Region, © 1996-1998, 2000-2001 Sinorama Magazine, © 1994-1998 Xinhua News Agency, © 2001, 2004 Trustees of the University of Pennsylvania