This is the readme file for the Hungarian part of the CoNLL 2007 Shared Task.

Version: $Id: README,v 1.2 2005/12/12 16:15:46 erwin Exp $


1. Preamble

    1.1 Source

    The Szeged Treebank (SzTB) is available from http://www.inf.u-szeged.hu/hlt

    The original phrase-structured XML treebank was converted by an
    automatic C# conversion program (CoNLL-conv.exe).

    1.2 Copyright

    The Szeged Treebank is copyrighted material.

    * (C) 2000-2007 by the
      - Institute of Informatics at the University of Szeged, Hungary
        (Árpád tér 2., 6720 Szeged, Hungary, http://www.inf.u-szeged.hu)
      - Institute of Linguistics at the Hungarian Academy of Sciences
        (Benczúr u. 33., 1399 Budapest POB. 701/518, Hungary,
        http://www.nyelvtud.hu)
      - MorphoLogic Ltd. Budapest
        (Orbánhegyi út. 5., 1126 Budapest, Hungary,
        http://www.morphologic.hu)
      who own the copyright to all annotations in the Szeged Treebank
      version 2.0.

    The annotations in the Szeged Treebank 2.0 were carried out in the
    following projects:

      - 2000-2002: IKTA 27/2000 R&D project (POS tagged corpus),
        supported by the Ministry of Education
      - 2001-2003: NKFP 2/017/2001 R&D project (NP structure annotations
        added), supported by the Ministry of Education
      - 2003-2005: IKTA 037/2002 R&D project (treebank annotations added),
        supported by the Ministry of Education

    1.3 License

    The copyright owners of the Szeged Treebank listed above (the Institute
    of Informatics at the University of Szeged, the Institute of Linguistics
    at the Hungarian Academy of Sciences, and MorphoLogic Ltd. Budapest)
    grant you the right to use the Szeged Treebank free of charge for
    education and research purposes after you have signed the license
    document and transferred it to the copyright owners. If you participate
    in the CoNLL shared task 2007 competition, you are required to send the
    documents back to the CoNLL shared task organizers.


2. Documentation

    2.1 Data format

    Data adheres to the following rules:

    * Data files contain one or more sentences separated by a blank line.
    * A sentence consists of one or more tokens, each one starting on a new
      line.

    * A token consists of ten fields described in the table below. Fields
      are separated by one tab character.

    * All data files contain these ten fields, although only the ID, FORM,
      CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to contain
      non-underscore values for all languages.

    * Data files are UTF-8 encoded (Unicode).

    Field 1: ID
        Token counter, starting at 1 for each new sentence.

    Field 2: FORM
        Word form or punctuation symbol.

    Field 3: LEMMA
        Stem of word form.

    Field 4: CPOSTAG
        Coarse-grained part-of-speech tag.

        --------------------------------
        Value:   Description:
        --------------------------------
        A        adjectives
        C        conjunctions
        I        interjections
        M        numerals
        N        nouns
        O        other token symbols (e-mails, web addresses, etc.)
        P        pronouns
        R        adverbs
        S        adpositions
        T        articles
        V        verbs
        X        foreign words
        Y        abbreviations
        Z        mistyped words
        WPUNCT   word punctuations
        SPUNCT   punctuations delimiting sentences (.,?,!)
        --------------------------------

    Field 5: POSTAG
        Fine-grained part-of-speech tag.

        -------------------------------------------
        Value:   Description:
        -------------------------------------------
        Af       normal adjective
        Cc       coordinating conjunction
        Cs       subordinating conjunction
        I        interjection
        Io       single-word sentences
        Mc       cardinal numerals
        Md       distributive numerals
        Mf       fractal numerals
        Mo       ordinal numerals
        Np       proper nouns
        Nc       common nouns
        Oh       words ending in hyphens
        Oi       identifiers
        On       numbers written in digits
        Pd       demonstrative pronouns
        Pg       general pronouns
        Pi       indefinite pronouns
        Pp       personal pronouns
        Pq       interrogative pronouns
        Pr       relative pronouns
        Ps       possessive pronouns
        Px       reflexive pronouns
        Py       reciprocal pronouns
        Rd       demonstrative adverbs
        Rg       general adverbs
        Ri       indefinite adverbs
        Rl       personal adverbs
        Rm       modifiers
        Rp       particles, preverbs
        Rq       interrogative adverbs
        Rr       relative adverbs
        Rv       verbal adverbs
        Rx       other adverbs
        St       adpositions (postpositions)
        Tf       definite article
        Ti       indefinite article
        Va       auxiliary verb
        Vm       main verb
        X        foreign words
        Y        abbreviations
        Z        mistyped words
        WPUNCT   word punctuations
        SPUNCT   punctuations delimiting sentences (.,?,!)
        -------------------------------------------

    Field 6: FEATS
        List of set-valued syntactic and/or morphological features.
        See the file dep_szegedtreebank_en.pdf for more information.

    Field 7: HEAD
        Non-projective head of current token, which is either a value of ID
        or zero ('0').

    Field 8: DEPREL
        Dependency relation to the non-projective head, which is 'ROOT' when
        the value of HEAD is zero. See the dep_szegedtreebank_en.pdf
        documentation for a description of the dependency relations.

    Field 9: PHEAD
        Projective head of current token, which is always an underscore
        because it is not available from the Hungarian treebank.

    Field 10: PDEPREL
        Dependency relation to projective head, which is always an
        underscore because it is not available from the Hungarian treebank.

    2.2 Text

    The text material consists of newspaper articles from HVG (World Economy
    Weekly, http://www.hvg.hu) and from the daily newspaper Népszabadság
    (http://www.nol.hu). The test and training data together were collected
    from the September 4, 1999 issue of HVG and from the April 3, 1999 issue
    of Népszabadság. Several complete articles were put into the test data
    and the remaining text was put into the training data.

    2.3 Statistics

    Training
    -------------------------------
    #sentences                6034
    #tokens                 131799
    #non-punct tokens       111464
    #punct tokens            20335
    #coarse pos tags            16
    #fine pos tags              42
    #deprels                    49
    -------------------------------

    Test
    -------------------------------
    #sentences                 390
    #tokens                   7344
    #non-punct tokens         6090
    #punct tokens             1254
    #coarse pos tags            16
    #fine pos tags              41
    #deprels                    45
    -------------------------------

    2.4 Conversion

    We started from the TEI XML version of the Szeged Treebank and converted
    its phrase structures with a C# program (conll-conv.exe) produced by
    Zoltan Alexin (alexin@inf.u-szeged.hu).


3. Acknowledgements

    The colleagues, linguists and programmers listed in
    treebank_description.pdf, who made the Szeged Treebank between 2000 and
    2005.

    Zoltán Alexin, who converted the phrase-structured treebank to a
    dependency treebank by an automatic procedure.

    Send your questions and remarks to Zoltán Alexin, alexin@inf.u-szeged.hu
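
Note on reading the data format (section 2.1): the rules given there
(blank-line separated sentences, ten tab-separated fields per token, UTF-8
encoding) can be read with a few lines of code. The following Python sketch
is not part of the treebank distribution; the function names
read_sentences and check_sentence are hypothetical, and the sample sentence
with its DET, SUBJ and PUNCT labels is invented for illustration only (see
dep_szegedtreebank_en.pdf for the real relation inventory).

```python
from typing import Dict, Iterable, Iterator, List

# The ten field names, in the order given in section 2.1.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

Token = Dict[str, str]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield each sentence as a list of tokens (field name -> value)."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line closes the current sentence.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                "expected %d fields, got %d: %r"
                % (len(FIELDS), len(values), line))
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def check_sentence(sentence: List[Token]) -> bool:
    """Check ID numbering, the HEAD range, and ROOT when HEAD is zero."""
    for position, token in enumerate(sentence, start=1):
        if int(token["ID"]) != position:
            return False
        head = int(token["HEAD"])
        if not 0 <= head <= len(sentence):
            return False
        if head == 0 and token["DEPREL"] != "ROOT":
            return False
    return True


# Invented sample sentence; the dependency labels other than ROOT
# are placeholders, not taken from the treebank.
SAMPLE = (
    "1\tA\ta\tT\tTf\t_\t2\tDET\t_\t_\n"
    "2\tkutya\tkutya\tN\tNc\t_\t3\tSUBJ\t_\t_\n"
    "3\tugat\tugat\tV\tVm\t_\t0\tROOT\t_\t_\n"
    "4\t.\t.\tSPUNCT\tSPUNCT\t_\t3\tPUNCT\t_\t_\n"
)

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        print(len(sent), "tokens, well-formed:", check_sentence(sent))
```

To read a real data file, pass an open file object instead of the sample,
for example open(path, encoding="utf-8"), where path is the location of the
training or test file on your system.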