This is the readme file for the Czech part of the CoNLL-X Shared Task.

Version: $Id: README,v 1.2 2006/01/08 02:24:04 sabine Exp $


1. Preamble

    1.1 Source

        The Prague Dependency Treebank 1.0, see http://ufal.mff.cuni.cz/pdt/

        The PDT 1.0 has been developed by the Institute of Formal and Applied 
        Linguistics and the Center for Computational Linguistics, Charles 
        University, Prague (see http://ufal.mff.cuni.cz/).

        The PDT 1.0 is available from LDC, catalog number LDC2001T10.
        
        Please note that the training-test data split is NOT the same as the
        official split of PDT 1.0! PDT 1.0 contains 73,088 non-empty training
        sentences, 7,319 non-empty development-test sentences and a similar
        number of evaluation-test sentences annotated on the analytical layer.
        In contrast, the CoNLL-X subset contains 72,703 training sentences
        and 365 test sentences, and these may overlap with the PDT 1.0
        training set.

    1.2 License

        See license.htm

2. Documentation

    2.1 Data format

        Data adheres to the following rules:

        * Data files contain one or more sentences separated by a
          blank line.

        * A sentence consists of one or more tokens, each one starting on a
          new line.

        * A token consists of ten fields described in the list
          below. Fields are separated by one tab.

        * All data files will contain these ten fields, although only
          the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are
          guaranteed to contain non-underscore values for all
          languages.

        * Data files are UTF-8 encoded (Unicode).
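
        The rules above can be sketched as a small reader. This is only an
        illustration, not part of the shared task tools: the function name
        read_sentences is hypothetical, and the column names are taken from
        the field list in this section.

```python
from typing import Dict, Iterable, Iterator, List

# Column names in the order used by the field list of this README.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of token dicts keyed by FIELDS."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line ends the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        columns = line.split("\t")  # fields are separated by one tab
        if len(columns) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(columns)}")
        sentence.append(dict(zip(FIELDS, columns)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence
```

        The reader takes any iterable of lines, for example a file opened
        with encoding="utf-8". All values are kept as strings, so ID and
        HEAD have to be converted with int() before numeric use.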

        "Manual" in the list below refers to:
                Jiří Hana, Hana Hanová
                Manual for Morphological Annotation
                CKL Technical Report TR-2002-14, Charles University in Prague, 
                Czech Republic, 2002.

        available online at:
          http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html
        and attached in this folder as:
          mman.html

        Field 1: ID

          Token counter, starting at 1 for each new sentence.

        Field 2: FORM

          Word form or punctuation mark.

        Field 3: LEMMA         

          The lemma of the FORM. This is the "lemma proper" of PDT 1.0, see
          section 2.1. "Lemma structure" of the "Manual" or
          http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#lemma

          Note that we used the manually corrected lemmas and morphological tags
          as the basis for the CoNLL-X data, not the automatically predicted ones
          that are also available in PDT 1.0.

        Field 4: CPOSTAG 

          Coarse-grained part-of-speech tag. This is the first character of 
          the PDT 1.0 morphological tag (positional tag), see section
          2.2.1.1. "Part of speech" of the "Manual" or
          http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tags

        Field 5: POSTAG         

          Fine-grained part-of-speech tag. This is the second character of
          the PDT 1.0 morphological tag (positional tag), see section
          2.2.1.2. "Detailed part of speech" of the "Manual" or
          http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tags

        Field 6: FEATS         

          List of set-valued syntactic and/or morphological features. These 
          come from the 3rd to 15th characters of the PDT 1.0 morphological tag
          (positional tag), see sections
          2.2.1.3. to 2.2.1.13. of the "Manual" or
          http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#pos-tags
          
          The correspondence between the positions in the PDT 1.0 morphological tag
          (03..15) and the feature names:
          03 = Gen (gender)
          04 = Num (number)
          05 = Cas (case)
          06 = PGe (possessor's gender)
          07 = PNu (possessor's number)
          08 = Per (person)
          09 = Ten (tense)
          10 = Gra (degree of comparison)
          11 = Neg (negativeness)
          12 = Voi (voice)
          13 ... empty
          14 ... empty
          15 = Var (stylistic variation)

          In addition, we put the semantic information (originally in the
          second part of the lemma) in FEATS. See section 
          2.1.2. "Semantic Information" of the "Manual" or
          http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/mman.html#sem-info
          Most of the time, this semantic feature encodes the type of a proper name /
          named entity (geographical name, given name, surname etc.)
          
          The value is stored in the "Sem" feature.
          
          The attached file pdt2conll_tag_conversion_table.txt lists 4294 known
          PDT tags together with their equivalents in the CoNLL format. Out of
          these, 1941 actually occur in the training part of the CoNLL data;
          for these tags, up to five of the most frequent examples are also
          given. The columns are: PDT tag - CoNLL CPOSTAG - CoNLL POSTAG -
          CoNLL FEATS - examples. Note that the "Sem" feature never occurs in
          this list because it comes from the PDT lemma, not the PDT tag.
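
          The mapping from a positional tag to CPOSTAG, POSTAG and FEATS can
          be sketched as follows. This is not the conversion script that was
          actually used; split_pdt_tag is a hypothetical name. The sketch
          assumes that a "-" in the tag means "not applicable" and is left
          out, that features are written as Name=value and joined with "|",
          and that "Sem" is appended last. The attached conversion table
          remains the authoritative reference.

```python
# Feature names for positions 3..15 of the positional tag, as listed above.
# None marks the two positions that the list gives as empty.
FEATURE_NAMES = ["Gen", "Num", "Cas", "PGe", "PNu", "Per", "Ten",
                 "Gra", "Neg", "Voi", None, None, "Var"]


def split_pdt_tag(tag, sem=None):
    """Return (CPOSTAG, POSTAG, FEATS) for a 15-character PDT tag.

    sem is the optional semantic value taken from the PDT lemma.
    """
    if len(tag) != 15:
        raise ValueError(f"expected a 15-character tag, got {tag!r}")
    feats = []
    for name, value in zip(FEATURE_NAMES, tag[2:]):
        # Assumption: "-" means the category does not apply and is skipped.
        if name is not None and value != "-":
            feats.append(f"{name}={value}")
    if sem is not None:
        # Assumption: the lemma-derived feature is placed after the others.
        feats.append(f"Sem={sem}")
    # Assumption: "|" joins the features; "_" stands for an empty list.
    return tag[0], tag[1], "|".join(feats) or "_"
```

          For instance, split_pdt_tag("NNMS1-----A----") returns
          ("N", "N", "Gen=M|Num=S|Cas=1|Neg=A").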

        Field 7: HEAD

          Head of current token, which is either a value of ID or zero ('0').
          A value of zero means the token attaches to the virtual root node.
          The dependency structure resulting from the HEAD information can be
          non-projective.
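
          To illustrate what non-projective means here, the sketch below
          lists the tokens whose arc to their HEAD is non-projective. It
          assumes the usual definition (an arc is projective if every token
          strictly between head and dependent is dominated by the head, with
          the virtual root placed at position 0); the function name
          nonprojective_tokens is hypothetical.

```python
from typing import List


def nonprojective_tokens(heads: List[int]) -> List[int]:
    """Return the IDs of tokens whose arc to their head is non-projective.

    heads[i] is the HEAD value of the token with ID i + 1; 0 is the root.
    """
    def is_dominated(token: int, ancestor: int) -> bool:
        # Walk up the HEAD chain; 'seen' guards against malformed cycles.
        seen = set()
        current = heads[token - 1]
        while True:
            if current == ancestor:
                return True
            if current == 0 or current in seen:
                return False
            seen.add(current)
            current = heads[current - 1]

    result = []
    for dependent, head in enumerate(heads, start=1):
        low, high = sorted((head, dependent))
        # Every token strictly inside the arc must be dominated by the head.
        for between in range(low + 1, high):
            if not is_dominated(between, head):
                result.append(dependent)
                break
    return result
```

          For a four-token sentence with HEAD values 0, 4, 1, 1 the function
          returns [2]: the arc from token 4 to token 2 spans token 3, which
          is not dominated by token 4.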
        
        Field 8: DEPREL         

          Dependency relation to the HEAD. See "List of analytical functions"
          attached as
            afun.html
          or online at
            http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/Doc/aman-en/ch03.html#s1-list-anal-func
          or
            Jan Hajič, Jarmila Panevová, Eva Buráňová, Zdeňka Urešová, Alla Bémová,
            Jan Štěpánek, Petr Pajas, Jiří Kárník:
            Anotace na analytické rovině. Návod pro anotátory
            (A Manual for Analytic Layer Tagging of the Prague Dependency
            Treebank (in Czech)). ÚFAL Technical Report TR-1997-03, Charles
            University in Prague, Czech Republic, 1997-1999

            or its English translation, both available at
            http://ufal.mff.cuni.cz/pdt/Corpora/PDT_1.0/References/index.html

        Field 9: PHEAD         

          Projective head of current token, which is always an
          underscore because it is not available from the Czech
          treebank.

        Field 10: PDEPREL 

          Dependency relation to projective head, which is always an
          underscore because it is not available from the Czech treebank.

    2.2 Text

        The electronic text sources have been provided by the Institute of the 
        Czech National Corpus. The text material contains samples from the 
        following sources:

        Lidové noviny (daily newspapers), 1991, 1994, 1995 
        Mladá fronta Dnes (daily newspapers), 1992 
        Českomoravský Profit (business weekly), 1994 
        Vesmír (scientific magazine), Academia Publishers, 1992, 1993 

    2.3 Conversion

        The conversion process started from the PDT-specific SGML
        format called CSTS.

3. Acknowledgements

        The PDT people for making the treebank.

        Jan Hajič for granting the special license for CoNLL-X and 
        talking to LDC about it.

        Christopher Cieri, Executive Director of LDC, for arranging
        distribution through LDC.

        Tony Castelletto, Publications Programmer at LDC, for handling
        the distribution.