==============================================================
Chinese corpus for the CoNLL-2009 shared task
"Syntactic and Semantic Dependencies in Multiple Languages"
http://ufal.mff.cuni.cz/conll2009-st/

Version 1.1: February 11, 2009

Organizer of this corpus:
Nianwen Xue, Brandeis University, xuen@cs.brandeis.edu
==============================================================

This document provides basic information about this corpus.

WARNING

This data is a subset of the Chinese Treebank 6.0 released by the
Linguistic Data Consortium (catalog number LDC2007T36). The semantic
annotation in this data is released by the LDC as the Chinese
Proposition Bank 2.0 (catalog number LDC2008T07). The use of this data
during and after the CoNLL shared task is conditional on your agreement
to the LDC special license for the 2009 CoNLL shared task.

(1) GENERAL

This distribution includes the data sets for the Chinese language, part
of the CoNLL-2009 shared task. The corpus includes syntactic
dependencies and semantic dependencies.

(2) CONTENTS OF THE DISTRIBUTION 1.0

The following files are included in this distribution:

  * README.TXT - this file
  * Chinese-conll09-train.txt - the training corpus
  * Chinese-conll09-dev.txt - the development corpus
  * frames - the frame files used in the annotation of the Chinese
    Proposition Bank. Each frame file is named with a unique ID plus
    pinyin; the corresponding Chinese character(s) can be found inside
    the frame file.
  * tagsets.txt - the document describing the tagsets for parts of
    speech (POS), syntactic dependencies and semantic dependencies.

(3) SPECIFICS OF THE CHINESE DATA SET

The special features of this corpus are:

Encoding:

  * The encoding of the corpus files is UTF-8.

Segmentation and POS-tagging:

  * Each line corresponds to a word. Words are already correctly
    segmented; the CoNLL-2009 shared task does not address word
    segmentation.
  * The POS tags in the PPOS field are generated by MXPOST retrained on
    the Chinese Treebank. The POS tagger is trained on one half of the
    data to tag the other half.
  * The (P)FEAT field is left empty because Chinese lacks inflectional
    morphology.

Syntactic dependency:

  * The Chinese Treebank is a phrase structure corpus, and the
    dependency structure is automatically converted from the phrase
    structure representation. The conversion is based on the
    identification of phrases that have four different kinds of
    syntactic relations: modification, predication, complementation
    and coordination. Note that the head of a coordination structure
    is the first conjunct, and the head of each other conjunct is the
    conjunction immediately preceding it. The conjunctions all
    "modify" the first conjunct. Some punctuation marks are treated as
    coordinating conjunctions.

Semantic dependency (predicates and arguments):

  * Only verbal predicates are annotated.
  * The same word can be an argument to multiple predicates.
  * The predicate senses correspond to those in the Chinese Propbank
    frame files.

The following tools have been used to generate the Predicted (P-)
columns:

  * The PPOS field is generated by MXPOST
    (ftp://ftp.cis.upenn.edu/pub/adwait/jmx/jmx.tar.gz) retrained on
    the Chinese Treebank.
  * PHEAD and PDEPREL are generated using the MALT Parser
    (http://maltparser.org/). The parser was trained on one half of
    the data to parse the other half.

REFERENCES

Nianwen Xue, Fei Xia, Fu-Dong Chiou and Martha Palmer. 2005. The Penn
Chinese Treebank: Phrase Structure Annotation of a Large Corpus.
Natural Language Engineering, 11(2):207-238.

Nianwen Xue and Martha Palmer. 2009. Adding semantic roles to the
Chinese Treebank. Natural Language Engineering, 15(1):143-172.

CONTRIBUTORS

The provider thanks everyone who has contributed to the Chinese
Treebank and the Chinese Proposition Bank.
The following individuals directly contributed to the Chinese Treebank
(in alphabetic order):

  Mitch Marcus, Meiyu Chang, Fu-Dong Chiou, Shizhe Huang, Zixin Jiang,
  Tony Kroch, Martha Palmer, Fei Xia, Nianwen Xue.

The contributors to the Chinese Proposition Bank include (in alphabetic
order):

  Meiyu Chang, Gang Chen, Helen Chen, Zixin Jiang, Martha Palmer,
  Zhiyi Song, Nianwen Xue, Ping Yu, Hua Zhong.
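
The file layout described in section (3) (UTF-8, one word per line, with
predicted P- columns next to the gold ones) can be read with a few lines
of Python. The following is a minimal sketch and not part of the
official distribution. It assumes the column order given in the
CoNLL-2009 shared task format description (ID, FORM, LEMMA, PLEMMA, POS,
PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED, PRED, then one
argument column per predicate), tab-separated fields, and blank lines
between sentences; none of these are stated in this README. The
three-word sample sentence, its dependency labels and its sense ID are
invented for illustration and are not taken from the corpus.

```python
import io

# Column names are an assumption based on the CoNLL-2009 shared task
# format description; this README does not list them.
COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
           "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED"]


def read_sentences(lines):
    """Yield sentences as lists of token dicts (blank line = sentence end)."""
    sentence = []
    for raw in lines:
        line = raw.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        token = dict(zip(COLUMNS, fields))
        # Remaining columns hold one argument label per predicate.
        token["APREDS"] = fields[len(COLUMNS):]
        sentence.append(token)
    if sentence:
        yield sentence


def predicates(sentence):
    """Return the tokens marked as predicates (FILLPRED == 'Y')."""
    return [tok for tok in sentence if tok.get("FILLPRED") == "Y"]


def arguments_of(sentence, k):
    """Return (form, label) pairs for the k-th predicate (0-based)."""
    return [(tok["FORM"], tok["APREDS"][k])
            for tok in sentence
            if k < len(tok["APREDS"]) and tok["APREDS"][k] != "_"]


# Invented toy sentence; labels and sense ID are hypothetical.
ROWS = [
    ["1", "他", "他", "他", "PN", "PN", "_", "_", "2", "2",
     "SBJ", "SBJ", "_", "_", "A0"],
    ["2", "来", "来", "来", "VV", "VV", "_", "_", "0", "0",
     "ROOT", "ROOT", "Y", "来.01", "_"],
    ["3", "了", "了", "了", "AS", "AS", "_", "_", "2", "2",
     "UNK", "UNK", "_", "_", "_"],
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS) + "\n\n"

sentences = list(read_sentences(io.StringIO(SAMPLE)))
for sent in sentences:
    for k, pred in enumerate(predicates(sent)):
        print(pred["FORM"], pred["PRED"], arguments_of(sent, k))
```

To read the actual corpus, pass an open file instead of the in-memory
sample, for example open("Chinese-conll09-train.txt", encoding="utf-8").
Because the same word can be an argument to multiple predicates, the
sketch keeps every argument column per token rather than a single label.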