1. Publication title: Chinese Dependency Treebank (CDT) 1.0
-----------
2. Authors:
- Wanxiang Che
- Zhenghua Li
- Ting Liu
- Contact: Wanxiang Che
- Organization: Research Center for Social Computing and Information Retrieval of Harbin Institute of Technology (HIT-SCIR), China
-----------
3. Data type: text
-----------
4. Introduction

The Chinese Dependency Treebank (CDT) was developed by the Research Center for Social Computing and Information Retrieval at Harbin Institute of Technology (HIT-SCIR). It contains 49,996 Chinese sentences (902,191 words) annotated with syntactic dependency structures.

All sentences were randomly selected from People's Daily between 1992 and 1996, after ill-formed or short sentences were eliminated. Word segmentation and part-of-speech (POS) tagging were performed automatically using statistical models trained on the People's Daily corpus (PD), a large-scale corpus annotated with word segmentation and POS tags.

The syntactic structures of the sentences were annotated by human annotators, who were also required to correct word segmentation errors. The POS tags, however, were not corrected.
-----------
5. Format:

The data is provided in CoNLL-X format. Each line gives the information for one word, and an empty line marks the end of a sentence. Each line contains 10 columns separated by a tab (\t).

======= An example sentence ========
1	双方	_	n	_	_	2	ATT	_	_
2	间	_	nd	_	_	4	ATT	_	_
3	的	_	u	_	_	2	RAD	_	_
4	谈判	_	n	_	_	6	SBV	_	_
5	已经	_	d	_	_	6	ADV	_	_
6	破裂	_	v	_	_	0	HED	_	_
7	。	_	wp	_	_	6	WP	_	_
[An empty line indicates the end of the sentence.]
==============

Description of each column:
1: id, starting from 1
2: word form
3: lemma (EMPTY for Chinese)
4: coarse-grained POS tag
5: fine-grained POS tag (EMPTY)
6: other features (EMPTY)
7: syntactic head
8: syntactic relation
9: predicted head (EMPTY)
10: predicted relation (EMPTY)

Columns marked EMPTY contain an underscore (_), as in the example sentence.
-----------
6. Data split

To facilitate future comparison of different parsing models on this dataset, we randomly split the data into training, development, and test sets.

train.conll06: 46,996 sentences, 847,994 words
dev.conll06:    1,000 sentences,  18,293 words
test.conll06:   2,000 sentences,  35,904 words
-----------
7. Description of the syntactic relations:

[1]  SBV: subject of verb
[2]  VOB: object of verb
[3]  IOB: indirect object
[4]  FOB: fronting object
[5]  DBL: double roles: subject & object
[6]  ATT: attribute
[7]  ADV: adverbial
[8]  CMP: complement
[9]  COO: coordinate
[10] POB: preposition-object
[11] LAD: left adjunct
[12] RAD: right adjunct
[13] IS:  independent structure
[14] HED: head
-----------
8. Description of the part-of-speech tags:

The POS tags follow the national 863 standard and include 26 different tags.

[1]  a   adjective            美丽
[2]  b   other noun-modifier  大型, 西式
[3]  c   conjunction          和, 虽然
[4]  d   adverb               很
[5]  e   exclamation          哎
[6]  h   prefix               阿, 伪
[7]  i   idiom                百花齐放
[8]  j   abbreviation         公检法
[9]  k   suffix               界, 率
[10] m   number               一, 第一
[11] n   general noun         苹果
[12] nd  direction noun       右侧
[13] nh  person name          杜甫, 汤姆
[14] ni  organization name    保险公司
[15] nl  location noun        城郊
[16] ns  geographical name    北京
[17] nt  temporal noun        近日, 明代
[18] nz  other proper noun    诺贝尔奖
[19] o   onomatopoeia         哗啦
[20] p   preposition          在, 把
[21] q   quantity             个
[22] r   pronoun              我们
[23] u   auxiliary            的, 地
[24] v   verb                 跑, 学习
[25] wp  punctuation          ,。!
[26] ws  foreign words        CPU
-----------
9. References:

People's Daily corpus (PD) web site: http://icl.pku.edu.cn/icl_groups/corpustagging.asp

Ting Liu, Jinshan Ma, and Sheng Li. 2006. Building a dependency treebank for improving Chinese parser. In Journal of Chinese Language and Computing, volume 16, pages 207–224.
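-----------
The 10-column layout described in Section 5 can be read with a few lines of Python. The following is only a minimal sketch and is not part of the distributed data or tools: the names Token, read_conll, and EXAMPLE are hypothetical, and the UTF-8 encoding mentioned in the usage note is an assumption, since the file encoding is not stated in this document. The sketch keeps the five non-empty columns (id, word form, POS tag, syntactic head, syntactic relation) and treats a blank line as the end of a sentence.

```python
from typing import Iterable, Iterator, List, NamedTuple


class Token(NamedTuple):
    """The five columns that are filled in this treebank (hypothetical helper type)."""
    index: int    # column 1: id, starting from 1
    form: str     # column 2: word form
    pos: str      # column 4: coarse-grained POS tag
    head: int     # column 7: syntactic head (0 marks the root)
    deprel: str   # column 8: syntactic relation


def read_conll(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield one list of Token objects per sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # An empty line indicates the end of a sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError(f"expected 10 tab-separated columns, got {len(cols)}: {line!r}")
        sentence.append(Token(int(cols[0]), cols[1], cols[3], int(cols[6]), cols[7]))
    if sentence:
        # Flush the last sentence if the input does not end with an empty line.
        yield sentence


# The example sentence from Section 5, with tabs written as \t.
EXAMPLE = "\n".join([
    "1\t双方\t_\tn\t_\t_\t2\tATT\t_\t_",
    "2\t间\t_\tnd\t_\t_\t4\tATT\t_\t_",
    "3\t的\t_\tu\t_\t_\t2\tRAD\t_\t_",
    "4\t谈判\t_\tn\t_\t_\t6\tSBV\t_\t_",
    "5\t已经\t_\td\t_\t_\t6\tADV\t_\t_",
    "6\t破裂\t_\tv\t_\t_\t0\tHED\t_\t_",
    "7\t。\t_\twp\t_\t_\t6\tWP\t_\t_",
])

for sent in read_conll(EXAMPLE.splitlines()):
    # The root is the token whose syntactic head is 0 (relation HED).
    root = next(tok for tok in sent if tok.head == 0)
    print(len(sent), root.form, root.deprel)
```

To read one of the split files instead of the inline example, a file object can be passed directly, for instance read_conll(open("train.conll06", encoding="utf-8")), assuming the file is UTF-8 encoded.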