1. Publication title: Chinese Dependency Treebank (CDT) 1.0 

-----------

2. Authors:
- Wanxiang Che 
- Zhenghua Li 
- Ting Liu 

- Contact: Wanxiang Che 
- Organization: Research Center for Social Computing and Information Retrieval of Harbin Institute of Technology (HIT-SCIR), China

-----------

3. Data type: text

-----------

4. Introduction

The Chinese Dependency Treebank (CDT) was developed by Harbin Institute of Technology's Research Center 
for Social Computing and Information Retrieval (HIT-SCIR). It contains 49,996 Chinese sentences 
with 902,191 words, annotated with syntactic dependency structures. 
All sentences were randomly selected from People's Daily between 1992 and 1996, after removing 
ill-formed or short sentences.

Word segmentation and part-of-speech (POS) tagging were performed automatically using statistical models trained 
on the People's Daily corpus (PD), a large-scale corpus annotated with word segmentation and POS tags.

The syntactic structures of the sentences were annotated by human annotators.
The annotators were also required to correct word segmentation errors.
However, the POS tags were not corrected.

-----------

5. Format:

The data is provided in the CoNLL-X format.
Each line presents the information for one word.
An empty line indicates the end of a sentence.
Each line contains 10 columns separated by a tab (\t).


======= An example sentence ========
1	双方	_	n	_	_	2	ATT	_	_
2	间	_	nd	_	_	4	ATT	_	_
3	的	_	u	_	_	2	RAD	_	_
4	谈判	_	n	_	_	6	SBV	_	_
5	已经	_	d	_	_	6	ADV	_	_
6	破裂	_	v	_	_	0	HED	_	_
7	。	_	wp	_	_	6	WP	_	_
[An empty line indicates the end of the sentence.]
==============

Description of each column (EMPTY columns contain an underscore, "_", as in the example above):
1:	id, starting from 1
2:	word form
3:	lemma (EMPTY for Chinese)
4:	coarse-grained POS tag 
5:	fine-grained POS tag (EMPTY)
6:	other-features (EMPTY)
7:	syntactic head (id of the head word; 0 for the root, as in the example above)
8:	syntactic relation
9:	predicted head (EMPTY)
10:	predicted relation (EMPTY)
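
The column layout above can be read with a few lines of code. The following is a minimal sketch, not an official loader for this treebank: the names Token and read_conll are hypothetical, only the five non-empty columns are kept, and head 0 is taken to mark the root, as in the example sentence.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """The five non-empty columns of one word."""
    id: int      # column 1: position in the sentence, starting from 1
    form: str    # column 2: word form
    pos: str     # column 4: coarse-grained POS tag
    head: int    # column 7: id of the syntactic head
    rel: str     # column 8: syntactic relation


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        sentence.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    if sentence:
        # Tolerate input whose last sentence lacks a trailing empty line.
        yield sentence


# The example sentence from this section, rebuilt as tab-separated lines.
ROWS = [("双方", "n", 2, "ATT"), ("间", "nd", 4, "ATT"), ("的", "u", 2, "RAD"),
        ("谈判", "n", 6, "SBV"), ("已经", "d", 6, "ADV"), ("破裂", "v", 0, "HED"),
        ("。", "wp", 6, "WP")]
EXAMPLE = ["\t".join([str(i), form, "_", pos, "_", "_", str(head), rel, "_", "_"])
           for i, (form, pos, head, rel) in enumerate(ROWS, start=1)] + [""]

sentences = list(read_conll(EXAMPLE))
root = next(tok for tok in sentences[0] if tok.head == 0)
print(len(sentences), len(sentences[0]), root.form)  # 1 7 破裂
```

To read one of the distributed files, pass an open file object to read_conll instead of the in-memory list. The file encoding is not stated in this document, so it has to be checked before opening the file.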

-----------

6. Data split

To facilitate future comparison of different parsing models on this dataset, 
we randomly split the data into training/development/test sets.

train.conll06:	46,996 sentences, 847,994 words
dev.conll06:	 1,000 sentences,  18,293 words
test.conll06: 	 2,000 sentences,  35,904 words
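
The split sizes above add up to the totals given in the introduction (49,996 sentences and 902,191 words). The sketch below checks that arithmetic and shows one way a downloaded file could be counted; the function name count_sentences_and_words and the EXPECTED table are illustrative, and the counting rule (one word per non-empty line, one sentence per block of non-empty lines) is an assumption based on the format description.

```python
from typing import Dict, Iterable, Tuple

# (sentences, words) per file, copied from the table above.
EXPECTED: Dict[str, Tuple[int, int]] = {
    "train.conll06": (46996, 847994),
    "dev.conll06": (1000, 18293),
    "test.conll06": (2000, 35904),
}


def count_sentences_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Count sentences and words in CoNLL-X formatted lines."""
    sentences = 0
    words = 0
    in_sentence = False
    for raw in lines:
        if raw.strip():
            # Each non-empty line describes exactly one word.
            words += 1
            in_sentence = True
        elif in_sentence:
            # The first empty line after some words ends a sentence.
            sentences += 1
            in_sentence = False
    if in_sentence:
        # Count a final sentence that has no trailing empty line.
        sentences += 1
    return sentences, words


total_sentences = sum(s for s, _ in EXPECTED.values())
total_words = sum(w for _, w in EXPECTED.values())
print(total_sentences, total_words)  # 49996 902191
```

Calling count_sentences_and_words on an opened split file and comparing the result with the matching EXPECTED entry is a quick integrity check after download.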

-----------

7. Description of the syntactic relations:

[1]	SBV:	subject of verb
[2]	VOB:	object of verb
[3]	IOB:	indirect object
[4]	FOB:	fronting object
[5]	DBL:	double roles: subject & object
[6]	ATT:	attribute
[7]	ADV:	adverbial
[8]	CMP:	complement
[9]	COO:	coordinate
[10]	POB:	preposition-object
[11]	LAD:	left adjunct
[12]	RAD:	right adjunct
[13]	IS:	independent structure
[14]	HED:	head
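
A simple sanity check on the head column is to verify that each sentence forms a tree. The sketch below is illustrative only: the function name is_well_formed is hypothetical, and it assumes exactly one word per sentence has head 0 (the HED word in the example sentence), which the example suggests but this document does not state as a rule.

```python
from typing import Sequence


def is_well_formed(heads: Sequence[int]) -> bool:
    """Check that heads describe a single-rooted tree.

    heads[i] is the head id of word i + 1; 0 stands for the root.
    """
    n = len(heads)
    # Every head must be 0 or the id of a word in the sentence.
    if any(h < 0 or h > n for h in heads):
        return False
    # Assumption: exactly one word attaches to the root.
    if sum(1 for h in heads if h == 0) != 1:
        return False
    # Following head links from any word must reach the root without a cycle.
    for start in range(1, n + 1):
        seen = set()
        node = start
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


# Head column of the example sentence in the Format section.
example_heads = [2, 4, 2, 6, 6, 0, 6]
print(is_well_formed(example_heads))  # True
```

The cycle check walks from every word to the root, which is quadratic in the worst case but adequate for sentence-length inputs.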

-----------

8. Description of the Part-of-speech tags:

The POS tags follow the national 863 standard and include 26 different tags.

[1]	a	adjective			美丽
[2]	b	other noun-modifier		大型, 西式
[3]	c	conjunction			和, 虽然
[4]	d	adverb			很
[5]	e	exclamation			哎
[6]	h	prefix			阿, 伪
[7]	i	idiom			百花齐放
[8]	j	abbreviation		公检法
[9]	k	suffix			界, 率
[10]	m	number			一, 第一
[11]	n	general noun		苹果
[12]	nd	direction noun		右侧
[13]	nh	person name			杜甫, 汤姆
[14]	ni	organization name		保险公司
[15]	nl	location noun		城郊
[16]	ns	geographical name		北京
[17]	nt	temporal noun		近日, 明代
[18]	nz	other proper noun		诺贝尔奖
[19]	o	onomatopoeia		哗啦
[20]	p	preposition			在, 把
[21]	q	quantity			个
[22]	r	pronoun			我们
[23]	u	auxiliary			的, 地
[24]	v	verb			跑, 学习
[25]	wp	punctuation			,。!
[26]	ws	foreign words		CPU


-----------

9. References:

People’s Daily corpus (PD) web site: http://icl.pku.edu.cn/icl_groups/corpustagging.asp

Ting Liu, Jinshan Ma, and Sheng Li. 2006. 
Building a dependency treebank for improving Chinese parser.
In Journal of Chinese Language and Computing, volume 16, pages 207–224.