==============================================================================
Czech data for the CoNLL-2009 shared task
"Syntactic and Semantic Dependencies in Multiple Languages"
Version 1.3: January 13, 2012
==============================================================================

(1) LIST OF VERSIONS

v1.3 [2012/01/13]: OOD data description
v1.2 [2009/02/07]: missing Y added to FILLPRED at PRED='0'
v1.1 [2009/02/05]: details on parser added to README.TXT
v1.0 [2009/01/19]: initial distribution of the trial, train and development
                   data, valency lexicon and this README.TXT file.

(2) CONTENTS

* README.TXT
    this file
* CoNLL2009-ST-Czech-development.txt
    development data set for Czech, 5228 sentences
* CoNLL2009-ST-Czech-train.txt
    train data set for Czech, 38727 sentences
* CoNLL2009-ST-Czech-trial.txt
    trial data set (part of train data set) for Czech, 193 sentences
* CoNLL2009-ST-evaluation-Czech.txt
    Czech evaluation data, 4213 sentences
* CoNLL2009-ST-evaluation-Czech-ood.txt
    Czech out-of-domain data, 1184 sentences
* Czech.vallex
    Czech valency lexicon, 14979 entries. See below for details.

(3) ON THE DATA SET

The Czech corpus was converted from PDT 2.0 (http://ufal.mff.cuni.cz/pdt2.0/)
following these rules:

- LEMMA contains only the word itself, without any additional details or
  explanations.

- The detailed part of speech is part of FEAT (SubPOS).

- HEAD is taken from the analytical layer of PDT 2.0.

- DEPREL is taken from the afun attribute of the analytical layer of PDT 2.0.

- Every analytical node referenced by a lexical reference (a/lex.rf) from the
  tectogrammatical layer of PDT 2.0 has a PRED value filled. If the referring
  non-generated tectogrammatical node has a valency frame assigned, the value
  of PRED is the identifier of the frame. Otherwise, it is set to the lemma.

- For every tectogrammatical node, a corresponding analytical node is found:

    1. If the tectogrammatical node is not generated and has a lexical
       reference, the referenced node is taken.
    2. Otherwise, if the tectogrammatical node has a coreference or complement
       reference to a node that has an analytical node assigned (by 1. or 2.),
       the assigned node is taken.

  APRED columns are filled with respect to this correspondence: for a
  tectogrammatical node P and its effective child C with functor F, the column
  for P's corresponding analytical node at the row for C's corresponding
  analytical node is filled with F. (Some nodes can thus have several functors
  in one APRED column; they are separated by a vertical bar.)

- PLEMMA, PPOS and PFEAT were generated by the cross-trained tagger MORCE
  (http://ufal.mff.cuni.cz/morce/) by Jan Raab.

- PHEAD and PDEPREL were generated by the cross-trained MST parser
  [McDonald et al., 2005] (Chu-Liu algorithm) by Drahomira Johanka Spoustova.

- The valency lexicon has four columns:

    1. lemma (can occur several times in the lexicon, with different frames)
    2. frame identifier
    3. list of space-separated actants and obligatory members of the frame
    4. example(s)

The out-of-domain data were created in the same way from a small part of the
unfinished Prague Czech-English Dependency Treebank 2.0.

(4) ORGANIZATION

Jan Hajic, Pavel Stranak, Jan Stepanek
UFAL MFF UK
Malostranske namesti 25
11800 Praha
Czech Republic

Phone:  +420-221 914 257
Fax:    +420-221 914 309
E-mail: [lastname][at]ufal.mff.cuni.cz
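
ILLUSTRATIVE SKETCH (not part of the original distribution)

The APRED rule in section (3) can be illustrated with a short Python sketch
that reads rows and splits APRED cells holding several functors at the
vertical bar. It rests on assumptions that this README does not state: that
rows are tab-separated with the usual CoNLL-2009 column order, that "_" marks
an empty cell, that sentences are separated by blank lines, and that the k-th
APRED column belongs to the k-th row whose PRED is filled. The function names
and the three toy rows (including the frame identifier) are invented for
illustration and are not taken from the corpus.

```python
# Fixed columns of the CoNLL-2009 format (assumed; not listed in this README).
FIXED_COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT",
                 "PFEAT", "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED",
                 "PRED"]


def parse_row(line):
    """Split one tab-separated row; APRED cells become lists of functors."""
    fields = line.rstrip("\n").split("\t")
    row = dict(zip(FIXED_COLUMNS, fields))
    apred_cells = fields[len(FIXED_COLUMNS):]
    # "_" is assumed to mark an empty cell; "|" separates several functors.
    row["APREDS"] = [[] if cell == "_" else cell.split("|")
                     for cell in apred_cells]
    return row


def read_sentences(lines):
    """Yield sentences as lists of rows; blank lines end a sentence."""
    sentence = []
    for line in lines:
        if line.strip():
            sentence.append(parse_row(line))
        elif sentence:
            yield sentence
            sentence = []
    if sentence:
        yield sentence


def predicate_arguments(sentence):
    """Map each PRED row ID to a list of (argument row ID, functor) pairs.

    Assumes the k-th APRED column belongs to the k-th row with PRED filled.
    """
    predicate_ids = [row["ID"] for row in sentence if row["PRED"] != "_"]
    result = {pid: [] for pid in predicate_ids}
    for row in sentence:
        for pid, functors in zip(predicate_ids, row["APREDS"]):
            for functor in functors:
                result[pid].append((row["ID"], functor))
    return result


# Invented toy sentence, NOT corpus data: three rows, three APRED columns.
SAMPLE = "\n".join("\t".join(fields) for fields in [
    ["1", "A", "a", "a", "N", "N", "_", "_", "2", "2", "Sb", "Sb",
     "_", "a", "_", "ACT|PAT", "_"],
    ["2", "B", "b", "b", "V", "V", "_", "_", "0", "0", "Pred", "Pred",
     "Y", "hypothetical-frame-id", "_", "_", "_"],
    ["3", "C", "c", "c", "N", "N", "_", "_", "2", "2", "Obj", "Obj",
     "_", "c", "_", "ADDR", "_"],
])

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        print(predicate_arguments(sent))
```

To read a real file, pass an open file object (for example one opened with
encoding="utf-8") to read_sentences instead of the toy sample. Keeping every
APRED cell as a list means that rows with one functor and rows with several
functors are handled the same way.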