This is the readme file for the Czech part of the CoNLL-X Shared Task.

Version: $Id: README,v 1.2 2006/01/08 02:24:04 sabine Exp $

1. Preamble

1.1 Source

The Prague Dependency Treebank 1.0, see http://ufal.mff.cuni.cz/pdt/

The PDT 1.0 has been developed by the Institute of Formal and Applied Linguistics and the Center for Computational Linguistics, Charles University, Prague (see http://ufal.mff.cuni.cz/).

The PDT 1.0 is available from LDC, catalog number LDC2001T10.

Please note that the training-test data split is NOT the same as the official split of PDT 1.0! PDT 1.0 contains 73,088 non-empty training sentences, 7,319 non-empty development-test sentences and a similar number of evaluation-test sentences annotated on the analytical layer. In contrast, the CoNLL-X subset contains 72,703 training sentences and 365 test sentences, and it is not guaranteed that these do not overlap with the PDT 1.0 training set.

1.2 License

See license.htm

2. Documentation

2.1 Data format

Data adheres to the following rules:

* Data files contain one or more sentences separated by a blank line.
* A sentence consists of one or more tokens, each one starting on a new line.
* A token consists of ten fields described in the list below. Fields are separated by one tab.
* All data files will contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to contain non-underscore values for all languages.
* Data files are UTF-8 encoded (Unicode).

"Manual" in the list below refers to:

  Jiří Hana, Hana Hanová
  Manual for Morphological Annotation
  CKL Technical Report TR-2002-14,
  Charles University in Prague, Czech Republic, 2002.

available online at:
http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html
and attached in this folder as: mman.html

Field 1: ID
Token counter, starting at 1 for each new sentence.

Field 2: FORM
Word form or punctuation mark.

Field 3: LEMMA
The lemma of the FORM. This is the "lemma proper" of PDT 1.0, see section 2.1.
"Lemma structure" of the "Manual" or
http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#lemma
Note that we used the manually corrected lemmas and morphological tags as the basis for the CoNLL-X data, not the automatically predicted ones that are also available in PDT 1.0.

Field 4: CPOSTAG
Coarse-grained part-of-speech tag. This is the first character of the PDT 1.0 morphological tag (positional tag), see section 2.2.1.1. "Part of speech" of the "Manual" or
http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tags

Field 5: POSTAG
Fine-grained part-of-speech tag. This is the second character of the PDT 1.0 morphological tag (positional tag), see section 2.2.1.2. "Detailed part of speech" of the "Manual" or
http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tags

Field 6: FEATS
List of set-valued syntactic and/or morphological features. These come from the 3rd to 15th characters of the PDT 1.0 morphological tag (positional tag), see sections 2.2.1.3 to 2.2.1.13 of the "Manual" or
http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tags

The correspondence between the positions in the PDT 1.0 morphological tag (03..15) and the feature names:

  03 = Gen (gender)
  04 = Num (number)
  05 = Cas (case)
  06 = PGe (possessor's gender)
  07 = PNu (possessor's number)
  08 = Per (person)
  09 = Ten (tense)
  10 = Gra (degree of comparison)
  11 = Neg (negativeness)
  12 = Voi (voice)
  13 ... empty
  14 ... empty
  15 = Var (stylistic variation)

In addition, we put the semantic information (originally in the second part of the lemma) in FEATS. See section 2.1.2. "Semantic Information" of the "Manual" or
http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#sem-info
Most of the time, this semantic feature encodes the type of a proper name / named entity (geographical name, given name, surname etc.). The value is stored in the "Sem" feature.
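The positional mapping described for CPOSTAG, POSTAG and FEATS can be sketched in Python roughly as follows. This is only an illustration, not the conversion script actually used for the CoNLL-X data. The names split_pdt_tag and FEATURE_NAMES are hypothetical. The sketch also assumes, without confirmation from this README, that FEATS is written as Name=Value pairs joined by "|", that positions holding "-" are skipped, and that an empty feature list becomes "_".

```python
# Feature names for positions 3..15 of the 15-character PDT positional tag,
# taken from the list in this README (positions 13 and 14 are empty).
FEATURE_NAMES = {
    3: "Gen", 4: "Num", 5: "Cas", 6: "PGe", 7: "PNu", 8: "Per",
    9: "Ten", 10: "Gra", 11: "Neg", 12: "Voi", 15: "Var",
}


def split_pdt_tag(tag, sem=None):
    """Split a PDT 1.0 positional tag into (CPOSTAG, POSTAG, FEATS).

    `sem` is the optional semantic code taken from the PDT lemma; it is
    not part of the tag itself.
    """
    if len(tag) != 15:
        raise ValueError("expected a 15-character positional tag: %r" % tag)

    cpostag = tag[0]  # position 1: part of speech
    postag = tag[1]   # position 2: detailed part of speech

    feats = []
    for position in sorted(FEATURE_NAMES):
        value = tag[position - 1]  # tag positions are 1-based
        if value != "-":           # assumption: "-" means "not applicable"
            feats.append("%s=%s" % (FEATURE_NAMES[position], value))
    if sem:
        feats.append("Sem=%s" % sem)

    # Assumption: "|" joins features and "_" marks an empty FEATS field.
    return cpostag, postag, "|".join(feats) if feats else "_"


if __name__ == "__main__":
    print(split_pdt_tag("NNFS1-----A----"))
    print(split_pdt_tag("NNFS1-----A----", sem="G"))
```

The tag and the lemma-derived semantic code are passed separately because, as noted in this README, the "Sem" value comes from the PDT lemma and never from the PDT tag.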
The attached file pdt2conll_tag_conversion_table.txt lists 4294 known PDT tags together with their equivalents in the CoNLL format. Out of these, 1941 actually occur in the training part of the CoNLL data; for these tags, up to 5 of the most frequent examples are also given. The file has the following columns: PDT tag - CoNLL CPOSTAG - CoNLL POSTAG - CoNLL FEATS - examples. Note that the "Sem" feature never occurs in this list because it comes from the PDT lemma, not the PDT tag.

Field 7: HEAD
Head of current token, which is either a value of ID or zero ('0'). A value of zero means the token attaches to the virtual root node. The dependency structure resulting from the HEAD information can be non-projective.

Field 8: DEPREL
Dependency relation to the HEAD. See "List of analytical functions" attached as afun.html or online at
http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/aman-en/ch03.html#s1-list-anal-func
or

  Jan Hajič, Jarmila Panevová, Eva Buráňová, Zdeňka Urešová, Alla Bémová, Jan Štěpánek, Petr Pajas, Jiří Kárník:
  Anotace na analytické rovině. Návod pro anotátory
  (A Manual for Analytic Layer Tagging of the Prague Dependency Treebank (in Czech)).
  ÚFAL Technical Report TR-1997-03,
  Charles University in Prague, Czech Republic, 1997-1999

or its English translation, both available at
http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/index.html

Field 9: PHEAD
Projective head of current token, which is always an underscore because it is not available from the Czech treebank.

Field 10: PDEPREL
Dependency relation to projective head, which is always an underscore because it is not available from the Czech treebank.

2.2 Text

The electronic text sources have been provided by the Institute of the Czech National Corpus.
The text material contains samples from the following sources:

  Lidové noviny (daily newspapers), 1991, 1994, 1995
  Mladá fronta Dnes (daily newspapers), 1992
  Českomoravský Profit (business weekly), 1994
  Vesmír (scientific magazine), Academia Publishers, 1992, 1993

2.3 Conversion

The conversion process started from the PDT specific SGML format called CSTS.

3. Acknowledgements

The PDT people for making the treebank.
Jan Hajič for granting the special license for CoNLL-X and talking to LDC about it.
Christopher Cieri, Executive Director of LDC, for arranging distribution through LDC.
Tony Castelletto, Publications Programmer at LDC, for handling the distribution.