This is the readme file for the Hungarian part of the CoNLL 2007 Shared Task.

Version: $Id: README,v 1.2 2005/12/12 16:15:46 erwin Exp $


1. Preamble

    1.1 Source
	
	The Szeged Treebank (SzTB) is available from
	http://www.inf.u-szeged.hu/hlt
		
	The original phrase-structured XML treebank was converted to
	dependency format by an automatic C# conversion program
	(CoNLL-conv.exe).

    1.2 Copyright

	The Szeged Treebank is copyrighted material.

	    * (C) 2000-2007 by the
	      - Institute of Informatics at the University of Szeged, Hungary
	        (Árpád tér 2., 6720 Szeged, Hungary, http://www.inf.u-szeged.hu)
	      - Institute of Linguistics at the Hungarian Academy of Sciences
	        (Benczúr u. 33., 1399 Budapest POB. 701/518, Hungary, http://www.nyelvtud.hu)
	      - MorphoLogic Ltd. Budapest
	        (Orbánhegyi út. 5., 1126 Budapest, Hungary, http://www.morphologic.hu)
	      who own the copyright to all annotations in the Szeged Treebank
	      version 2.0.

	The annotations in the Szeged Treebank 2.0 were carried out in
	three R&D projects, each supported by the Ministry of Education:

	    * 2000-2002: IKTA 27/2000 (POS tagged corpus)
	    * 2001-2003: NKFP 2/017/2001 (NP structure annotations added)
	    * 2003-2005: IKTA 037/2002 (treebank annotations added)

    1.3 License

	The copyright owners of the Szeged Treebank listed above
	(the Institute of Informatics at the University of Szeged,
	the Institute of Linguistics at the Hungarian Academy of Sciences,
	and MorphoLogic Ltd. Budapest) grant you the right to use
	the Szeged Treebank free of charge for education and research
	purposes after you have signed the license document and
	transferred it to the copyright owners. If you participate in
	the CoNLL shared task 2007 competition, you are required
	to send the documents back to the CoNLL shared task organizers.

2. Documentation

    2.1 Data format

    	Data adheres to the following rules:

    	* Data files contain one or more sentences separated by a
	  blank line.

    	* A sentence consists of one or more tokens, each one starting
	  on a new line.

    	* A token consists of ten fields described in the table
	  below. Fields are separated by one tab character.

	* All data files contain these ten fields, although only
          the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are
          guaranteed to contain non-underscore values for all
          languages.

	* Data files are UTF-8 encoded (Unicode).
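
	The rules above can be read with a few lines of Python. The
	sketch below is only an illustration, not part of the shared task
	tooling: the function name read_sentences and the two-token sample
	(including the relation label DEP) are invented here, and the
	column names are the ten field names described in this file.

```python
from typing import Dict, Iterable, Iterator, List

# The ten column names, in the order given in this README.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

Token = Dict[str, str]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of tokens (dicts keyed by field name)."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}"
            )
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


# Invented two-token sample, not taken from the treebank.
SAMPLE = (
    "1\tfoo\tfoo\tN\tNc\t_\t2\tDEP\t_\t_\n"
    "2\tbar\tbar\tV\tVm\t_\t0\tROOT\t_\t_\n"
    "\n"
)
sentences = list(read_sentences(SAMPLE.splitlines()))
```

	To read a real data file, pass an open file handle instead, for
	example open(path, encoding="utf-8"), where path stands for
	whatever name your copy of the data has.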


	Field 1: ID 	

	    Token counter, starting at 1 for each new sentence.

	Field 2: FORM

	    Word form or punctuation symbol

	Field 3: LEMMA 	

	    Stem of word form.

	Field 4: CPOSTAG 

	    Coarse-grained part-of-speech tag. 

	    --------------------------------
	    Value:  Description:
	    --------------------------------
	    A	    adjectives
	    C	    conjunctions
	    I	    interjections
	    M	    numerals
	    N	    nouns
	    O       other token symbols (e-mails, web addresses, etc.)
	    P	    pronouns
	    R	    adverbs
	    S	    adpositions
	    T	    articles
	    V	    verbs
	    X	    foreign words
	    Y	    abbreviations
	    Z	    mistyped words
	    WPUNCT  word punctuations
            SPUNCT  punctuations delimiting sentences (.,?,!)
	    --------------------------------

	Field 5: POSTAG 	

	    Fine-grained part-of-speech tag

	    -------------------------------------------
	    Value:	Description:
	    -------------------------------------------
	    Af		normal adjective
	    Cc		coordinating conjunction
	    Cs		subordinating conjunction 
	    I		interjection
	    Io		single-word sentences
            Mc          cardinal numerals
            Md          distributive numerals
            Mf          fractal numerals
	    Mo          ordinal numerals
	    Np		proper nouns
	    Nc		common nouns
            Oh          words ending in hyphens
            Oi          identifiers
            On          numbers written in digits
	    Pd		demonstrative pronouns
            Pg          general pronouns 
	    Pi		indefinite pronouns
            Pp          personal pronouns
            Pq          interrogative pronouns
            Pr          relative pronouns
            Ps          possessive pronouns
            Px          reflexive pronouns
            Py          reciprocal pronouns
            Rd          demonstrative adverbs
            Rg          general adverbs
            Ri          indefinite adverbs
            Rl          personal adverbs
	    Rm          modifiers 
            Rp          particles, preverbs
            Rq          interrogative adverbs
            Rr          relative adverbs
            Rv          verbal adverbs
            Rx          other adverbs
	    St		adpositions (postpositions)
            Tf          definite article
            Ti          indefinite article
	    Va		auxiliary verb
            Vm          main verb
	    X	        foreign words
	    Y           abbreviations
            Z           mistyped words
	    WPUNCT      word punctuations
            SPUNCT      punctuations delimiting sentences (.,?,!)
	    -------------------------------------------

	Field 6: FEATS 	

	    List of set-valued syntactic and/or morphological
	    features. See the file dep_szegedtreebank_en.pdf for more
	    information.
	
	Field 7: HEAD

	    Non-projective head of current token, which is either a
	    value of ID or zero ('0')
	
	Field 8: DEPREL 	

	    Dependency relation to the non-projective head, which is
	    'ROOT' when the value of HEAD is zero. See
	    dep_szegedtreebank_en.pdf for a description of the dependency
	    relations.

	Field 9: PHEAD 	

	    Projective head of current token, which is always an
	    underscore because it is not available from the Hungarian
	    treebank.

	Field 10: PDEPREL 

	    Dependency relation to projective head, which is always an
	    underscore because it is not available from the Hungarian
	    treebank.
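
	The constraints stated for the ID, HEAD, DEPREL, PHEAD and
	PDEPREL fields can be checked mechanically. What follows is a
	minimal sketch, assuming each sentence has already been split
	into rows of ten strings. The name check_heads and the sample
	rows (including the relation label DEP) are hypothetical, and
	only rules stated in this file are tested.

```python
from typing import List, Sequence


def check_heads(rows: Sequence[Sequence[str]]) -> List[str]:
    """Check one sentence against the ID, HEAD, DEPREL, PHEAD, PDEPREL rules.

    Each row is one token line already split on tabs into ten strings.
    Returns problem descriptions; the list is empty if none were found.
    """
    problems: List[str] = []
    size = len(rows)
    for position, row in enumerate(rows, start=1):
        token_id, head, deprel = row[0], row[6], row[7]
        phead, pdeprel = row[8], row[9]
        # ID counts tokens from 1 within each sentence.
        if token_id != str(position):
            problems.append(f"token {position}: unexpected ID {token_id!r}")
        # HEAD is either zero or the ID of a token in the same sentence.
        if not (head.isascii() and head.isdigit() and int(head) <= size):
            problems.append(f"token {position}: bad HEAD {head!r}")
        # DEPREL is 'ROOT' when HEAD is zero.
        elif int(head) == 0 and deprel != "ROOT":
            problems.append(f"token {position}: HEAD is 0 but DEPREL is {deprel!r}")
        # PHEAD and PDEPREL are always underscores in the Hungarian data.
        if phead != "_" or pdeprel != "_":
            problems.append(f"token {position}: PHEAD/PDEPREL not '_'")
    return problems


# Invented rows, not taken from the treebank.
GOOD = [
    ["1", "foo", "foo", "N", "Nc", "_", "2", "DEP", "_", "_"],
    ["2", "bar", "bar", "V", "Vm", "_", "0", "ROOT", "_", "_"],
]
BAD = [
    ["1", "foo", "foo", "N", "Nc", "_", "5", "DEP", "_", "_"],   # HEAD out of range
    ["2", "bar", "bar", "V", "Vm", "_", "0", "DEP", "_", "_"],   # HEAD 0 without ROOT
]
good_problems = check_heads(GOOD)
bad_problems = check_heads(BAD)
```

	The check deliberately does not require that HEAD differ from the
	token's own ID or that each sentence have exactly one root, since
	this file states neither rule.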
    
    2.2 Text

	The text material consists of articles from the weekly
	HVG (World Economy Weekly, http://www.hvg.hu) and from the
	daily newspaper Népszabadság (http://www.nol.hu).
	The training and test data together were collected
	from the September 4, 1999 issue of HVG and from the
	April 3, 1999 issue of Népszabadság. Several complete
	articles were put into the test data, and the remaining text
	was put into the training data.

    2.3 Statistics

        Training    
	-------------------------------
	#sentences		  6034
	#tokens			131799
	#non-punct tokens	111464
        #punct tokens            20335

	#coarse pos tags	    16
	#fine pos tags		    42
	#deprels		    49
	-------------------------------	
        Test
	-------------------------------
	#sentences		   390
	#tokens			  7344
	#non-punct tokens	  6090
        #punct tokens             1254

	#coarse pos tags	    16
	#fine pos tags		    41
	#deprels		    45
	-------------------------------	
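
	Counts of this kind can be recomputed from a data file. The
	sketch below shows one possible way and is not the script that
	produced the tables above. It assumes that a token counts as
	punctuation when its CPOSTAG is WPUNCT or SPUNCT, which this file
	does not state explicitly. The name corpus_statistics and the
	sample text are invented for illustration.

```python
from typing import Dict

# Assumption: punctuation tokens are those whose CPOSTAG is one of the two
# punctuation tags listed for field 4. This README does not say how the
# published punctuation counts were obtained.
PUNCT_TAGS = {"WPUNCT", "SPUNCT"}


def corpus_statistics(text: str) -> Dict[str, int]:
    """Count sentences, tokens and distinct tags in tab-separated data."""
    sentences = tokens = punct = 0
    cpostags, postags, deprels = set(), set(), set()
    inside_sentence = False
    for line in text.splitlines():
        if not line.strip():
            # Blank lines separate sentences.
            inside_sentence = False
            continue
        if not inside_sentence:
            sentences += 1
            inside_sentence = True
        fields = line.split("\t")
        tokens += 1
        if fields[3] in PUNCT_TAGS:
            punct += 1
        cpostags.add(fields[3])   # field 4: CPOSTAG
        postags.add(fields[4])    # field 5: POSTAG
        deprels.add(fields[7])    # field 8: DEPREL
    return {
        "sentences": sentences,
        "tokens": tokens,
        "non-punct tokens": tokens - punct,
        "punct tokens": punct,
        "coarse pos tags": len(cpostags),
        "fine pos tags": len(postags),
        "deprels": len(deprels),
    }


# Invented sample with two sentences, not taken from the treebank.
SAMPLE = (
    "1\tfoo\tfoo\tN\tNc\t_\t2\tDEP\t_\t_\n"
    "2\tbar\tbar\tV\tVm\t_\t0\tROOT\t_\t_\n"
    "3\t.\t.\tSPUNCT\tSPUNCT\t_\t0\tROOT\t_\t_\n"
    "\n"
    "1\tbaz\tbaz\tN\tNp\t_\t0\tROOT\t_\t_\n"
)
stats = corpus_statistics(SAMPLE)
```

	The published figures are internally consistent in the same way:
	111464 + 20335 = 131799 for the training data and
	6090 + 1254 = 7344 for the test data.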

    2.4 Conversion

	We started from the TEI XML version of the Szeged Treebank
        and converted its phrase structures with a C# program (conll-conv.exe)
        written by Zoltán Alexin (alexin@inf.u-szeged.hu).


3. Acknowledgements

        We thank the colleagues, linguists and programmers listed in
        treebank_description.pdf, who built the Szeged Treebank between
        2000 and 2005.

        We also thank Zoltán Alexin, who converted the phrase-structured
        treebank to a dependency treebank by an automatic procedure. Send
        your questions and remarks to Zoltán Alexin, alexin@inf.u-szeged.hu