
Chinese Dependency Treebank 1.0

Item Name:	Chinese Dependency Treebank 1.0
Author(s):	Wanxiang Che, Zhenghua Li, Ting Liu
LDC Catalog No.:	LDC2012T05
ISBN:	1-58563-612-6
ISLRN:	475-765-099-443-8
DOI:	https://doi.org/10.35111/69ts-ey63
Release Date:	May 16, 2012
Member Year(s):	2012
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	information extraction, information retrieval, language modeling, language teaching, machine translation, parsing, part of speech tagging, tagging
Language(s):	Mandarin Chinese, Chinese
Language ID(s):	cmn, zho
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2012T05 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Che, Wanxiang, Zhenghua Li, and Ting Liu. Chinese Dependency Treebank 1.0 LDC2012T05. Web Download. Philadelphia: Linguistic Data Consortium, 2012.
Related Works:	relatesTo LDC2025T06 Chinese Sentence Pattern Structure Treebank

Introduction

Chinese Dependency Treebank 1.0 was developed by the Harbin Institute of Technology's Research Center for Social Computing and Information Retrieval (HIT-SCIR). It contains 49,996 Chinese sentences (902,191 words) randomly selected from People's Daily newswire stories published between 1992 and 1996 and annotated with syntactic dependency structures.

Data

Ill-formed or short sentences were eliminated from the randomly selected sentences prior to annotation. The data was segmented and annotated for part of speech (POS), syntactic structures, verb subclasses and noun compounds. Word segmentation and POS tagging were performed automatically using statistical models trained on a larger, annotated corpus of People's Daily newswire stories. Human annotators manually annotated the syntactic structures and corrected word segmentation errors. POS tags were not corrected.

The data is provided in CoNLL-X format and encoded in UTF-8. Each line presents information for one word, and an empty line indicates the end of a sentence. Each line contains 10 tab-separated columns.
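As an illustration of the layout described above, the following is a minimal Python sketch of a reader that splits such a file into sentences and words. It is not part of the release. The field names follow the standard CoNLL-X column order (ID, FORM, LEMMA, CPOSTAG, POSTAG, FEATS, HEAD, DEPREL, PHEAD, PDEPREL), and it is an assumption that this corpus uses the columns in exactly that order. The function name read_conllx, the Token type, and the sample rows are hypothetical; the sample rows are invented placeholders, not corpus data.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """One word line. Field names follow the standard CoNLL-X column order
    (assumed here, not confirmed by the corpus description)."""
    id: str
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: str
    head: str
    deprel: str
    phead: str
    pdeprel: str


def read_conllx(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one sentence (a list of Token) per block of non-empty lines."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line marks the end of a sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}")
        sentence.append(Token(*cols))
    # Emit the last sentence if the input does not end with an empty line.
    if sentence:
        yield sentence


# Invented placeholder rows (not taken from the corpus), two sentences.
sample = (
    "1\tA\t_\t_\tX\t_\t2\tdep\t_\t_\n"
    "2\tB\t_\t_\tX\t_\t0\troot\t_\t_\n"
    "\n"
    "1\tC\t_\t_\tX\t_\t0\troot\t_\t_\n"
)

for sent in read_conllx(sample.splitlines()):
    print([(tok.form, tok.head, tok.deprel) for tok in sent])
```

To read an actual file, the same function can be given an open file handle, for example open(path, encoding="utf-8"), where path is whatever location the release was unpacked to.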
