This is the README file for the Czech part of the CoNLL 2007 Shared Task.

Version: $Id: README,v 1.2 2006/01/08 02:24:04 sabine Exp $
Modified for CoNLL 2007 by Zdeněk Žabokrtský, Mon Jan 8 17:18:18 CET 2007
Modified for LDC publication by Daniel Zeman, Thu Apr 15 2010


1. Preamble

1.1 Source

  The Prague Dependency Treebank 2.0, see http://ufal.mff.cuni.cz/pdt2.0/

  The PDT 2.0 has been developed by the Institute of Formal and Applied
  Linguistics and the Center for Computational Linguistics, Charles
  University, Prague (see http://ufal.mff.cuni.cz/).

  The PDT 2.0 is available from LDC, catalog number LDC2006T01.

  Please note that the data selected for CoNLL 2007 is not the whole
  PDT 2.0. PDT 2.0 contains 68562 training sentences, 9270 dev-test
  sentences and 10148 eval-test sentences annotated on the analytical
  level. In contrast, only 25364 training sentences and 286 test sentences
  have been selected for the CoNLL 2007 data. See also Section 2.2 below.

1.2 License

  See http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch07.html


2. Documentation

2.1 Data format

  The data adhere to the following rules:

  * Data files contain one or more sentences separated by a blank line.
  * A sentence consists of one or more tokens, each one starting on a new
    line.
  * A token consists of ten fields described in the list below. Fields are
    separated by one tab.
  * All data files contain these ten fields, although only the ID, FORM,
    CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to contain
    non-underscore values for all languages.
  * Data files are UTF-8 encoded (Unicode).

  "Manual" in the list below refers to:

  Jiří Hana, Daniel Zeman
  Manual for Morphological Annotation
  Revision for the Prague Dependency Treebank 2.0
  UFAL Technical Report No. 2005-27, Charles University, Czech Republic, 2002.
  also available at:
  http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/index.html
  or
  http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf
  the PDF version being also attached here as m-man-en.pdf

  Field 1: ID
    Token counter, starting at 1 for each new sentence (unrelated to
    unique identifiers within the PDT 2.0).

  Field 2: FORM
    Word form or punctuation mark.

  Field 3: LEMMA
    The lemma of the FORM. This is the "lemma proper" (without technical
    suffixes) of PDT 2.0, see section 2.1. "Lemma structure" of the
    "Manual" or
    http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s01.html
    Note that we used the manually corrected lemmas and morphological tags
    as the basis for the CoNLL data, not the automatically predicted ones
    that were available in PDT 1.0.

  Field 4: CPOSTAG
    Coarse-grained part-of-speech tag. This is the first character of the
    PDT 2.0 morphological tag (positional tag), see section 2.2.1.1.
    "Part of speech" of the "Manual" or
    http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html#POS

  Field 5: POSTAG
    Fine-grained part-of-speech tag. This is the second character of the
    PDT 2.0 morphological tag (positional tag), see section 2.2.1.2.
    "Detailed part of speech" of the "Manual" or
    http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html#SubPOS

  Field 6: FEATS
    List of set-valued syntactic and/or morphological features. These come
    from the 3rd to 15th characters of the PDT 2.0 positional
    morphological tag, see sections 2.2.1.3 to 2.2.1.13 of the "Manual" or
    http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html#gendert

    The correspondence between the positions in the PDT 2.0 morphological
    tag (03..15) and the feature names:

      03 = Gen (gender)
      04 = Num (number)
      05 = Cas (case)
      06 = PGe (possessor's gender)
      07 = PNu (possessor's number)
      08 = Per (person)
      09 = Ten (tense)
      10 = Gra (degree of comparison)
      11 = Neg (negativeness)
      12 = Voi (voice)
      13 ... empty
      14 ... empty
      15 = Var (stylistic variation)

    In addition, we put the semantic information (originally in the second
    part of the lemma) in FEATS. See section 2.1.4. "Term" of the "Manual"
    or
    http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s01s04.html
    The value is stored in the "Sem" feature.

    The attached file pdt2conll_tag_conversion_table.txt lists 4294 known
    PDT tags together with their equivalents in the CoNLL format. Out of
    these, 1739 actually occur in the training part of the CoNLL data; for
    these tags, up to the 5 most frequent examples are also given. The file
    has the following columns:

      PDT tag - CoNLL CPOSTAG - CoNLL POSTAG - CoNLL FEATS - examples

    Note that the "Sem" feature never occurs in this list because it comes
    from the PDT lemma, not the PDT tag.

  Field 7: HEAD
    Head of the current token, which is either a value of ID or zero
    ('0'). A value of zero means the token attaches to the virtual root
    node. The dependency structure resulting from the HEAD information can
    be non-projective.

  Field 8: DEPREL
    Type of the dependency relation to the HEAD. See "List of analytical
    functions" in

    Jan Hajič et al.
    A Manual for Analytic Layer Tagging of the Prague Dependency Treebank
    (in Czech).
    UFAL Technical Report TR-1997-03, Charles University, Czech Republic,
    1997

    or at
    http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03.html#s1-list-anal-func
    or attached as afun.html

    Note that besides the analytical function itself, the value of DEPREL
    may also contain the suffix '_M', corresponding to the is_member
    attribute in PDT 2.0. It distinguishes members of
    coordination/apposition constructions from shared modifiers of such
    constructions (illustration: "We.Sb sell.Pred fresh.Atr
    vegetable.Obj_M and.Coord fruits.Obj_M").
    It is used for exactly the same purpose as the suffixes _Co and _Ap in
    PDT 1.0, as described at
    http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s04.html#s2-coord

  Field 9: PHEAD
    Projective head of the current token, which is always an underscore
    because it is not available from the PDT 2.0.

  Field 10: PDEPREL
    Dependency relation to the projective head, which is always an
    underscore because it is not available from the PDT 2.0.

2.2 Text

  The electronic text sources have been provided by the Institute of the
  Czech National Corpus. The text material contains samples from the
  following sources:

    Lidové noviny (daily newspapers), 1991, 1994, 1995
    Mladá fronta Dnes (daily newspapers), 1992
    Českomoravský Profit (business weekly), 1994
    Vesmír (scientific magazine), Academia Publishers, 1992, 1993

  Due to size limitations, only 40 % of the PDT 2.0 analytical data are
  used for CoNLL 2007. The respective subsets are expressed as masks in
  the PDT 2.0 CD-ROM directory structure as follows:

  * train123.conll corresponds to data/full/*amw/train-{1,2,3}/*a.gz
  * dtest.conll corresponds to data/full/*amw/dtest/*a.gz

2.3 Conversion

  The conversion process started from the PDT 2.0 specific XML-based file
  format called PML (Prague Markup Language). The btred conversion script
  is available at
  http://ufal.mff.cuni.cz/~zabokrtsky/tools/pdt20-to-CoNLL.btred


3. Acknowledgements

  The PDT people for making the treebank.
  Jan Hajič for granting the special license for CoNLL and talking to LDC
  about it.
  Christopher Cieri, Executive Director of LDC, for arranging distribution
  through LDC.
  Tony Castelletto, Publications Programmer at LDC, for handling the
  distribution.
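
Example: reading the data format of Section 2.1

  The following Python sketch illustrates how the ten-field format, the
  '_M' suffix of DEPREL and the tag positions 03..15 described in
  Section 2.1 might be handled. It is not part of the distribution and is
  not the btred conversion script. All names in it (Token,
  read_sentences, parse_feats, split_deprel, split_pdt_tag) are
  hypothetical. It assumes that FEATS holds "Name=value" pairs separated
  by "|" and that tag positions filled with "-" yield no feature; neither
  detail is stated in this README, so check them against the data. The
  sample sentence reuses the English illustration from Field 8 with
  made-up lemmas and tags.

```python
from typing import Dict, Iterable, Iterator, List, NamedTuple, Tuple

# Positions 03..15 of the positional tag and their feature names, as listed
# under Field 6. Positions 13 and 14 are empty and have no feature name.
FEATURE_NAMES = {3: "Gen", 4: "Num", 5: "Cas", 6: "PGe", 7: "PNu", 8: "Per",
                 9: "Ten", 10: "Gra", 11: "Neg", 12: "Voi", 15: "Var"}


class Token(NamedTuple):
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: Dict[str, str]
    head: int          # 0 means attachment to the virtual root node
    afun: str          # DEPREL without the '_M' suffix
    is_member: bool    # True if DEPREL carried the '_M' suffix


def parse_feats(feats: str) -> Dict[str, str]:
    """Split FEATS. Assumption: 'Name=value' pairs joined by '|'."""
    if feats == "_":
        return {}
    result = {}
    for pair in feats.split("|"):
        name, _, value = pair.partition("=")
        result[name] = value
    return result


def split_deprel(deprel: str) -> Tuple[str, bool]:
    """Separate the analytical function from the '_M' member suffix."""
    if deprel.endswith("_M"):
        return deprel[:-2], True
    return deprel, False


def split_pdt_tag(tag: str) -> Tuple[str, str, Dict[str, str]]:
    """Map a 15-character positional tag to (CPOSTAG, POSTAG, features).

    Assumption: a '-' at a position means the feature is not set.
    """
    if len(tag) != 15:
        raise ValueError("expected a 15-character positional tag: %r" % tag)
    feats = {name: tag[pos - 1] for pos, name in FEATURE_NAMES.items()
             if tag[pos - 1] != "-"}
    return tag[0], tag[1], feats


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences (lists of tokens); blank lines separate sentences."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        cols = line.split("\t")
        if len(cols) != 10:
            raise ValueError("expected 10 tab-separated fields: %r" % line)
        afun, is_member = split_deprel(cols[7])
        sentence.append(Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                              parse_feats(cols[5]), int(cols[6]),
                              afun, is_member))
    if sentence:
        yield sentence


# Made-up sample based on the illustration under Field 8 (tags are invented).
SAMPLE = "\n".join("\t".join(row) for row in [
    ("1", "We", "we", "P", "P", "_", "2", "Sb", "_", "_"),
    ("2", "sell", "sell", "V", "B", "_", "0", "Pred", "_", "_"),
    ("3", "fresh", "fresh", "A", "A", "_", "5", "Atr", "_", "_"),
    ("4", "vegetable", "vegetable", "N", "N", "Num=S", "5", "Obj_M", "_", "_"),
    ("5", "and", "and", "J", "^", "_", "2", "Coord", "_", "_"),
    ("6", "fruits", "fruit", "N", "N", "Num=P", "5", "Obj_M", "_", "_"),
]) + "\n"

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        for tok in sent:
            print(tok.id, tok.form, tok.head, tok.afun, tok.is_member)
```

  To read an actual data file, pass an open file object instead of the
  sample, for example read_sentences(open("train123.conll",
  encoding="utf-8")). In the sample, "fresh" attaches to the coordination
  without '_M' and is therefore a shared modifier, while "vegetable" and
  "fruits" are marked as members.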