This is the readme file for the Czech part of the CoNLL 2007 Shared Task.

Version: $Id: README,v 1.2 2006/01/08 02:24:04 sabine Exp $
Modified for CoNLL 2007 by Zdeněk Žabokrtský, Mon Jan  8 17:18:18 CET 2007
Modified for LDC publication by Daniel Zeman, Thu Apr 15 2010

1. Preamble

    1.1 Source

        The Prague Dependency Treebank 2.0, see http://ufal.mff.cuni.cz/pdt2.0/

        The PDT 2.0 has been developed by the Institute of Formal and Applied 
        Linguistics and the Center for Computational Linguistics, Charles 
        University, Prague (see http://ufal.mff.cuni.cz/).

        The PDT 2.0 is available from LDC, catalog number LDC2006T01.
        
        Please note that the data selected for CoNLL 2007 is not the whole
        PDT 2.0. PDT 2.0 contains 68562 training sentences, 9270 dev-test
        sentences and 10148 eval-test sentences annotated on the analytical
        level. In contrast, only 25364 training sentences and 286 test
        sentences have been selected for the CoNLL 2007 data. See also Section
        2.2 below.

    1.2 License

        See http://ufal.mff.cuni.cz/pdt2.0/doc/pdt-guide/en/html/ch07.html

2. Documentation

    2.1 Data format

        Data adheres to the following rules:

        * Data files contain one or more sentences separated by a
          blank line.

        * A sentence consists of one or more tokens, each one starting on a
          new line.

        * A token consists of ten fields described in the list
          below. Fields are separated by one tab.

        * All data files contain these ten fields, although only
          the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are
          guaranteed to contain non-underscore values for all
          languages.

        * Data files are UTF-8 encoded (Unicode).
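
        The rules above can be sketched as a small reader. This is a
        minimal illustration, not the tool used to produce the data. The
        function name read_conll and the sample tokens are invented for
        this example, and the sketch assumes that ID and HEAD are always
        integers.

```python
import io

# The ten field names, in column order, as described in this README.
FIELDS = ("ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL")


def read_conll(lines):
    """Yield sentences as lists of dicts keyed by the ten field names."""
    sentence = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():              # a blank line closes a sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")         # fields are separated by one tab
        if len(values) != len(FIELDS):
            raise ValueError(f"expected 10 fields, got {len(values)}")
        token = dict(zip(FIELDS, values))
        token["ID"] = int(token["ID"])      # assumption: always an integer
        token["HEAD"] = int(token["HEAD"])  # assumption: always an integer
        sentence.append(token)
    if sentence:                          # the last sentence may lack a blank line
        yield sentence


# Invented sample (two sentences); the values are NOT taken from the data.
SAMPLE = (
    "1\tPes\tpes\tN\tN\t_\t2\tSb\t_\t_\n"
    "2\tspí\tspát\tV\tB\t_\t0\tPred\t_\t_\n"
    "\n"
    "1\tAno\tano\tT\tT\t_\t0\tExD\t_\t_\n"
)
sentences = list(read_conll(io.StringIO(SAMPLE)))
```

        To read a real file, pass an open file object instead, opened
        with encoding="utf-8" because the data files are UTF-8 encoded.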

        "Manual" in the list below refers to:

                Jiří Hana, Daniel Zeman
                Manual for Morphological Annotation
                Revision for the Prague Dependency Treebank 2.0
                UFAL Technical Report No. 2005-27, Charles University,
                Czech Republic, 2002.

        It is also available at
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/index.html
        or
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/pdf/m-man-en.pdf

        The PDF version is also attached here as
          m-man-en.pdf


        Field 1: ID         

          Token counter, starting at 1 for each new sentence
          (unrelated to unique identifiers within the PDT 2.0).
         

        Field 2: FORM

          Word form or punctuation mark.

        Field 3: LEMMA         

          The lemma of the FORM. This is the "lemma proper" (without
          technical suffixes) of PDT 2.0, see section 2.1. "Lemma
          structure" of the "Manual" or
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s01.html

          Note that we used the manually corrected lemmas and morphological tags
          as the basis for the CoNLL data, not the automatically predicted ones
          that were available in PDT 1.0.

        Field 4: CPOSTAG 

          Coarse-grained part-of-speech tag. This is the first character of 
          the PDT 2.0 morphological tag (positional tag), see section
          2.2.1.1. "Part of speech" of the "Manual" or
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html#POS

        Field 5: POSTAG         

          Fine-grained part-of-speech tag. This is the second character of
          the PDT 2.0 morphological tag (positional tag), see section
          2.2.1.2. "Detailed part of speech" of the "Manual" or
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html#SubPOS

        Field 6: FEATS         

          List of set-valued syntactic and/or morphological features. These
          come from the 3rd to 15th characters of the PDT 2.0 positional
          morphological tag, see sections 2.2.1.3. - 2.2.1.13. of the
          "Manual" or
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s02s01.html#gendert
          
          The correspondence between the positions in the PDT 2.0
          morphological tag (03..15) and the feature names:
          03 = Gen (gender)
          04 = Num (number)
          05 = Cas (case)
          06 = PGe (possessor's gender)
          07 = PNu (possessor's number)
          08 = Per (person)
          09 = Ten (tense)
          10 = Gra (degree of comparison)
          11 = Neg (negativeness)
          12 = Voi (voice)
          13 ... empty
          14 ... empty
          15 = Var (stylistic variation)

          In addition, we put the semantic information (originally in the
          second part of the lemma) in FEATS. See section 2.1.4. "Term" of
          the "Manual" or
          http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/m-layer/html/ch02s01s04.html
          
          The value is stored in the "Sem" feature.
          
          The attached file pdt2conll_tag_conversion_table.txt lists 4294 known
          PDT tags together with their equivalents in the CoNLL format. Out of
          these, 1739 actually occur in the training part of the CoNLL data;
          for these tags, up to five of the most frequent examples are also
          given. The file has the following columns: PDT tag - CoNLL
          CPOSTAG - CoNLL POSTAG - CoNLL FEATS - examples. Note that the
          "Sem" feature never occurs in this list because it comes from the
          PDT lemma, not from the PDT tag.
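
          The mapping from a 15-character PDT positional tag to CPOSTAG,
          POSTAG and FEATS can be sketched as follows. This is an
          illustration only, not the conversion script. The function name
          convert_tag is invented, and the sketch assumes that FEATS is
          written as Name=value pairs joined by '|', that positions
          holding '-' are omitted, that an empty list is written as '_',
          and that "Sem" (if any) is appended last. The attached
          conversion table remains the authoritative reference.

```python
# Feature names for tag positions 3..15 as listed in this README.
# Positions 13 and 14 are empty and therefore have no feature name.
FEATURE_NAMES = {
    3: "Gen", 4: "Num", 5: "Cas", 6: "PGe", 7: "PNu", 8: "Per",
    9: "Ten", 10: "Gra", 11: "Neg", 12: "Voi", 15: "Var",
}


def convert_tag(pdt_tag, sem=None):
    """Return (CPOSTAG, POSTAG, FEATS) for a 15-character positional tag.

    `sem` is the optional semantic value taken from the PDT lemma.
    The FEATS serialization used here is an assumption (see lead-in).
    """
    if len(pdt_tag) != 15:
        raise ValueError(f"expected 15 characters, got {len(pdt_tag)}")
    cpostag = pdt_tag[0]                  # 1st character
    postag = pdt_tag[1]                   # 2nd character
    feats = [
        f"{name}={pdt_tag[pos - 1]}"      # positions are 1-based
        for pos, name in FEATURE_NAMES.items()
        if pdt_tag[pos - 1] != "-"        # assumption: '-' means "not set"
    ]
    if sem:
        feats.append(f"Sem={sem}")
    return cpostag, postag, "|".join(feats) or "_"


# Example tag: noun, feminine, singular, nominative, affirmative.
noun = convert_tag("NNFS1-----A----")
```

          Under these assumptions, the example tag yields CPOSTAG "N",
          POSTAG "N" and FEATS "Gen=F|Num=S|Cas=1|Neg=A".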

        Field 7: HEAD         

          Head of current token, which is either a value of ID or zero ('0').
          A value of zero means the token attaches to the virtual root node.
          The dependency structure resulting from the HEAD information can be
          non-projective.
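
          Since the structures can be non-projective, it may be useful to
          find the affected tokens. The sketch below assumes the usual
          definition (not spelled out in this README): an arc is
          projective if every token strictly between the head and the
          dependent is dominated by the head. The function names are
          invented for this example.

```python
def dominates(heads, ancestor, node):
    """True if `ancestor` equals `node` or lies on its path to the root (0).

    `heads[i]` is the HEAD value of the token with ID i + 1.
    """
    for _ in range(len(heads) + 1):       # bounded walk guards against cycles
        if node == ancestor:
            return True
        if node == 0:                     # reached the virtual root
            return False
        node = heads[node - 1]
    return False


def nonprojective_tokens(heads):
    """Return the IDs of tokens whose arc to their HEAD is non-projective."""
    result = []
    for dep, head in enumerate(heads, start=1):
        lo, hi = sorted((dep, head))
        # Every token strictly between head and dependent must be
        # dominated by the head; otherwise the arc is non-projective.
        if any(not dominates(heads, head, k) for k in range(lo + 1, hi)):
            result.append(dep)
    return result


# Token 2 is the root; token 1 depends on token 3 across the root.
crossing = nonprojective_tokens([3, 0, 2])
# Token 2 is the root with one dependent on each side.
flat = nonprojective_tokens([2, 0, 2])
```

          In the first example only token 1 is reported, because token 2
          lies between tokens 1 and 3 but is not dominated by token 3.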
        
        Field 8: DEPREL         

          Type of the dependency relation to the HEAD. See "List of
          analytical functions" in
            Jan Hajič et al.
            A Manual for Analytic Layer Tagging of the Prague Dependency
            Treebank (in Czech). UFAL Technical Report TR-1997-03, Charles
            University, Czech Republic, 1997
          or at
             http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03.html#s1-list-anal-func
          or attached as
             afun.html

          Note that besides the analytical function itself, the value of
          DEPREL may also contain the suffix '_M', which corresponds to the
          is_member attribute in PDT 2.0. It distinguishes members of
          coordination/apposition constructions from shared modifiers of
          such constructions (illustration: "We.Sb sell.Pred fresh.Atr
          vegetable.Obj_M and.Coord fruits.Obj_M"). It is used for exactly
          the same purpose as the suffixes _Co and _Ap in PDT 1.0, as
          described at
            http://ufal.mff.cuni.cz/pdt2.0/doc/manuals/en/a-layer/html/ch03s04.html#s2-coord
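
          Separating the analytical function from the '_M' suffix can be
          sketched as below. The function name split_deprel is invented
          for this example, and the sketch assumes that no analytical
          function name itself ends in '_M'. The input is the
          illustration sentence given above.

```python
def split_deprel(deprel):
    """Split a DEPREL value into (analytical function, is_member flag)."""
    if deprel.endswith("_M"):
        return deprel[:-2], True
    return deprel, False


# The illustration from this README, written as form.DEPREL pairs.
ILLUSTRATION = "We.Sb sell.Pred fresh.Atr vegetable.Obj_M and.Coord fruits.Obj_M"

members = []
for item in ILLUSTRATION.split():
    form, deprel = item.rsplit(".", 1)
    afun, is_member = split_deprel(deprel)
    if is_member:
        members.append((form, afun))
```

          Here the members are "vegetable" and "fruits" (both Obj), while
          "fresh" (Atr) carries no '_M' suffix.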

        Field 9: PHEAD         

          Projective head of current token, which is always an
          underscore because it is not available from the PDT 2.0.


        Field 10: PDEPREL 

          Dependency relation to projective head, which is always an
          underscore because it is not available from the PDT 2.0.

    2.2 Text

        The electronic text sources have been provided by the Institute of the 
        Czech National Corpus. The text material contains samples from the 
        following sources:

        Lidové noviny (daily newspapers), 1991, 1994, 1995 
        Mladá fronta Dnes (daily newspapers), 1992 
        Českomoravský Profit (business weekly), 1994 
        Vesmír (scientific magazine), Academia Publishers, 1992, 1993 

        Due to size limitations, only 40% of the PDT 2.0 analytical data
        are used for CoNLL 2007. The respective subsets are expressed
        as masks in the PDT 2.0 CD-ROM directory structure as follows:

           * train123.conll corresponds to data/full/*amw/train-{1,2,3}/*a.gz
           * dtest.conll corresponds to data/full/*amw/dtest/*a.gz


    2.3 Conversion

        The conversion process started from the PDT 2.0 specific XML-based
        file format called PML (Prague Markup Language). The btred
        conversion script is available at
        http://ufal.mff.cuni.cz/~zabokrtsky/tools/pdt20-to-CoNLL.btred


3. Acknowledgements

        The PDT people for making the treebank.

        Jan Hajič for granting the special license for CoNLL and 
        talking to LDC about it.

        Christopher Cieri, Executive Director of LDC, for arranging
        distribution through LDC.

        Tony Castelletto, Publications Programmer at LDC, for handling
        the distribution.