
Chinese Treebank 2.0

Item Name:	Chinese Treebank 2.0
Author(s):	Martha Palmer, Mitch Marcus, Anthony Kroch, Fei Xia, Nianwen Xue, Fu-Dong Chiou
LDC Catalog No.:	LDC2001T11
ISBN:	1-58563-204-X
ISLRN:	324-683-461-517-1
DOI:	https://doi.org/10.35111/jfkh-w176
Member Year(s):	2001
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	TIDES, GALE
Application(s):	parsing, natural language processing, tagging
Language(s):	Mandarin Chinese
Language ID(s):	cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2001T11 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Palmer, Martha, et al. Chinese Treebank 2.0 LDC2001T11. Web Download. Philadelphia: Linguistic Data Consortium, 2001.
Related Works:	hasVersion: LDC2004T05 Chinese Treebank 4.0; LDC2005T01 Chinese Treebank 5.0; LDC2007T36 Chinese Treebank 6.0; LDC2010T07 Chinese Treebank 7.0; LDC2013T21 Chinese Treebank 8.0; LDC2016T13 Chinese Treebank 9.0. hasOutcome: LDC2002T01 Multiple-Translation Chinese Corpus.

The Chinese Treebank 2.0 was produced by:

Principal Investigators: Martha Palmer, Mitch Marcus, Tony Kroch

Consultants: Martha Palmer, Mitch Marcus, Tony Kroch, Shizhe Huang, Mary Ellen Okurowski, John Kovarik, Boyan A. Onyshkevych

Project Managers and Guideline Designers: Fei Xia, Nianwen Xue

Annotators: Fu-Dong Chiou, Nianwen Xue

Programming support: Zhibiao Wu

Introduction

Chinese Treebank 2.0 was published by the Linguistic Data Consortium (LDC) under catalog number LDC2001T11 and ISBN 1-58563-204-X.

The Chinese Penn Treebank Project started in summer 1998. Its goal is the creation of a 100,000-word corpus of Chinese with syntactic bracketing. More information is available at The Chinese Treebank Project. Chinese Treebank 2.0 supersedes and replaces the Chinese Penn Treebank Final Release (LDC2000T48, ISBN 1-58563-187-6).

Data

Size:	About 100K words, 325 data files
Source:	325 articles from Xinhua newswire between 1994 and 1998
Coding:	GB code
Format:	Same as the UPenn English Treebank, except that some original file information, such as "SRCID" and "DATE", was retained in the data file.
Annotation:	All files are annotated at least twice: the first pass is done by one annotator, and the resulting files are checked by a second annotator (second pass).
SGML:	All data files validate against chtb.dtd using nsgmls.
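
Because the format follows the UPenn English Treebank, each sentence can be read as a nested bracketing. The following is a minimal sketch of a reader for one bracketed sentence, not the project's own tooling. The function names `parse_bracketed` and `words` are hypothetical, the sample tree is invented for illustration, and the sketch assumes the SGML tags around the sentence have already been stripped.

```python
def parse_bracketed(text):
    """Parse one Penn-style bracketed tree into nested Python lists.

    Assumption: `text` holds exactly one tree with no SGML markup left in it.
    """
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unbalanced brackets: unexpected end of input")
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            node = []
            while pos < len(tokens) and tokens[pos] != ")":
                node.append(read())
            if pos >= len(tokens):
                raise ValueError("unbalanced brackets: missing ')'")
            pos += 1  # consume the closing bracket
            return node
        if tok == ")":
            raise ValueError("unbalanced brackets: unexpected ')'")
        return tok

    tree = read()
    if pos != len(tokens):
        raise ValueError("trailing tokens after the tree")
    return tree


def words(node):
    """Collect the terminal words of a parsed tree, left to right."""
    if isinstance(node, str):
        return [node]
    # A labelled node is [label, child, ...]; an unlabelled wrapper is [child, ...].
    children = node[1:] if node and isinstance(node[0], str) else node
    result = []
    for child in children:
        result.extend(words(child))
    return result


# Invented example sentence, not taken from the corpus.
sample = "(IP (NP (NR 中国)) (VP (VV 发展)))"
tree = parse_bracketed(sample)
print(tree)
print(words(tree))
```

Nested lists keep the sketch free of dependencies; a real pipeline would more likely use an established treebank reader.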

The files are located in the data subdirectory and are sequentially named as follows: chtb_nnn.fid where nnn is the sequential file number. There is a cross reference in file.tbl which provides some annotator and historical information.
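
The naming scheme and encoding above can be sketched in a few lines. This is an illustration under stated assumptions rather than a documented interface: it assumes that nnn is a three-digit, zero-padded number running from 1 to 325, and that "GB code" can be decoded with Python's `gb2312` codec. The helper names `chtb_filename`, `decode_gb`, and `read_chtb_file` are hypothetical.

```python
from pathlib import Path

NUM_FILES = 325  # number of data files stated in the Data section


def chtb_filename(n: int) -> str:
    """Build a name of the form chtb_nnn.fid.

    Assumption: nnn is zero-padded to three digits and numbering starts at 1.
    """
    if not 1 <= n <= NUM_FILES:
        raise ValueError(f"file number must be between 1 and {NUM_FILES}")
    return f"chtb_{n:03d}.fid"


def decode_gb(raw: bytes) -> str:
    """Decode GB-encoded bytes to a Python string.

    Assumption: the 'GB code' of the corpus is compatible with the gb2312 codec.
    Undecodable bytes are replaced rather than raising an error.
    """
    return raw.decode("gb2312", errors="replace")


def read_chtb_file(data_dir: str, n: int) -> str:
    """Read file number n from the data subdirectory and return its text."""
    path = Path(data_dir) / chtb_filename(n)
    return decode_gb(path.read_bytes())


print(chtb_filename(1))
print(chtb_filename(325))
```

For example, `read_chtb_file("data", 1)` would return the decoded contents of the first file, provided the corpus has been unpacked so that `data` is the data subdirectory.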
