This is the readme file for the Arabic part of the CoNLL-X Shared Task.

Version: $Id: README,v 1.3 2006/01/09 00:19:19 yuval Exp $

Modified by Dan Zeman for the LDC edition on 2010/4/15.


1. Preamble

1.1 Source

Prague Arabic Dependency Treebank (PADT) 1.0

For further details about the PADT consult the web site:
http://ufal.mff.cuni.cz/padt/PADT_1.0/index.html
and in particular the paper:

Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004.
Prague Arabic Dependency Treebank: Development in Data and Tools.
In Proceedings of the NEMLAR International Conference on Arabic Language
Resources and Tools, pages 110-117, Cairo, Egypt, September 2004.
http://ufal.mff.cuni.cz/padt/PADT_1.0/docs/papers/2004-nemlar-padt.pdf

1.2 Copyright

Portions
Copyright © 2002-2004 Trustees of the University of Pennsylvania,
Copyright © 2000 Agence France Presse,
Copyright © 2001 Al Hayat News Agency,
Copyright © 2002 Ummah Press Service,
Copyright © 2002 An Nahar News Agency,
Copyright © 2003 Xinhua News Agency,
Copyright © 2002-2004 Center for Computational Linguistics & Institute of Formal and Applied Linguistics & Institute of Comparative Linguistics, Charles University in Prague

1.3 License

See license.htm


2. Documentation

2.1 Data format

Data adheres to the following rules:

* Data files contain one or more sentences separated by a blank line.
* A sentence consists of one or more tokens, each one starting on a new line.
* A token consists of ten fields described in the list below. Fields are separated by a single tab character.
* All data files contain these ten fields, although only the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to contain non-underscore values for all languages.
* Data files are UTF-8 encoded (Unicode).

Field 1: ID
Token counter, starting at 1 for each new sentence.

Field 2: FORM
Word form or punctuation mark.
To make the form useful both to people who can read Arabic script and to those who cannot, we have concatenated the form in Arabic script and its transliteration, joined by an underscore.

Field 3: LEMMA
The lemma of the FORM. Again, we concatenated the Arabic script and the transliteration.

Field 4: CPOSTAG
Coarse-grained part-of-speech tag. This is the first character of the PADT 1.0 morphological tag (positional tag). See file morph.txt for a detailed mapping from the coarse and fine POS tags + features to Buckwalter annotation.

Field 5: POSTAG
Fine-grained part-of-speech tag. The first and second characters of the PADT 1.0 morphological tag (positional tag) if the second character is not '-'. Identical to CPOSTAG otherwise. See file morph.txt for a detailed mapping from the coarse and fine POS tags + features to Buckwalter annotation.

Here is a list of POSTAG values with short (and hopefully correct) explanations:

    A   adjective
    C   conjunction/subjunction
    D   adverb
    F   function word, other particle
    FI  interrogative particle
    FN  negation particle
    G   punctuation (not used in UMH subcorpus)
    I   interjection
    N   noun
    P   preposition
    Q   number (not used in UMH subcorpus)
    SD  demonstrative pronoun
    SR  relative pronoun
    S   other pronoun
    T   typo
    VI  verb, perfect
    VP  verb, imperfect
    X   non-alphabetic, also used for punctuation in the UMH subcorpus
    Y   abbreviation
    Z   proper noun

Field 6: FEATS
List of set-valued syntactic and/or morphological features. These come from the 3rd to 10th characters of the PADT 1.0 morphological tag (positional tag). See file morph.txt for a detailed mapping from the coarse and fine POS tags + features to Buckwalter annotation.
They encode the following properties:

    case
        1  nominative
        2  genitive
        4  accusative
    definiteness
        D  definite
        I  indefinite
        R  reduced
        C  complex
    gender
        M  masculine
        F  feminine
    mood
        D  undecided between subjunctive and jussive
        I  indicative
        S  subjunctive
    number
        S  singular
        P  plural
        D  dual
    person
        1  first
        2  second
        3  third
    voice
        P  passive

The attached file tag-examples.txt lists 238 tags that occur in the CoNLL-X training data, each with up to 5 of its most frequent word examples. It has the following columns: CPOSTAG - POSTAG - FEATS - examples.

Field 7: HEAD
Head of current token, which is either a value of ID or zero ('0'). A value of zero means the token attaches to the virtual root node. The dependency structure resulting from the HEAD information can be non-projective.

Field 8: DEPREL
Dependency relation to the HEAD. See file funcs.txt

Field 9: PHEAD
Projective head of current token, which is always an underscore because it is not available from the Arabic treebank.

Field 10: PDEPREL
Dependency relation to projective head, which is always an underscore because it is not available from the Arabic treebank.

2.2 Text

The data were taken from four subcorpora of the PADT: ALH, ANN, XIA, and UMH, which correspond to four news agencies.

Subcorpora issues:
The UMH subcorpus was annotated using a slightly different convention. One part-of-speech tag is used for all non-alphabetic forms, including numbers and punctuation, which have separate tags in the other subcorpora. Also, some of the particles (e.g. 'li-') are attached to the word and the lemma is not available.

2.3 Conversion

The conversion process started from the FS (feature structure) files. The Arabic characters are encoded as Unicode in the range U'd88c' - U'daaf'. In addition, the quotation marks U'c2ab' and U'c2bb' are used.
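
The hexadecimal values quoted above read most naturally as UTF-8 byte sequences rather than as code points. This is an interpretation, not something the treebank documentation states. Under that assumption, a minimal Python sketch (the helper name utf8_hex_to_char is hypothetical and not part of the PADT tools) shows which Unicode code points they correspond to:

```python
def utf8_hex_to_char(hex_bytes: str) -> str:
    """Decode a hex string such as 'd88c' as a UTF-8 byte sequence.

    Assumption: the values in the text are UTF-8 bytes, not code points.
    """
    return bytes.fromhex(hex_bytes).decode("utf-8")


# The two range endpoints and the two quotation marks mentioned in the text.
for hex_bytes in ("d88c", "daaf", "c2ab", "c2bb"):
    ch = utf8_hex_to_char(hex_bytes)
    print(f"{hex_bytes} -> U+{ord(ch):04X}")
```

Read this way, the range runs from U+060C (Arabic comma) to U+06AF, and the two quotation marks are U+00AB and U+00BB.
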
Based on correspondence with Otakar Smrž, some errors were corrected:

ALH20010911.0036_story.syntax.fs:
    morphological tag changed from PREP+NSUFF_FEM_SG to P-----FS--

ANN20021101.0009_story.syntax.fs:
    morphological tag changed from VERB_PERFECT+PVSUFF_SUBJ:3MS+PVSUFF_SUBJ:3MS to VP---3MS--

XIA20030503.0155_story.syntax.fs line 361:
    morphological tag changed from PREP+NSUFF_FEM_SG to P-----FS--

XIA20030503.0194_story.syntax.fs lines 461, 481, 591:
    morphological tag changed from DET+NOUN_PROP+NSUFF_FEM_DU_ACCGEN to Z-----FD2D

UMAAH_e_mar_3rd_2002.1112.fs:
    first sentence, word with ord=43: function changed from AuxG to AuxK

We converted the FS files to the PADT-specific SGML format called CSTS using the any2any script provided with the distribution, and used the SGML files as input. The arguments used were

    any2any -s fs -a csts -f csts

We then used the Python script

    padt2tab.py -f

and selected the sentences for which there were no missing annotations. The list of missing annotations is in the doc/ directory, see file README-errors.txt


3. Acknowledgements

The PADT people for making the treebank.
Otakar Smrž for valuable help during the conversion.
Jan Hajič for granting the special license for CoNLL-X and talking to LDC about it.
Christopher Cieri, Executive Director of LDC, for arranging distribution through LDC.
Tony Castelletto, Publications Programmer at LDC, for handling the distribution.
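

As an illustration of the data format described in section 2.1, here is a minimal Python sketch of a reader. It is not the padt2tab.py script and not an official tool. The function names (read_sentences, split_form, split_positional_tag) are hypothetical, and the sample tokens are invented rather than taken from the treebank. It also assumes that the first underscore in FORM and LEMMA is the one separating Arabic script from transliteration.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

# Column names as listed in section 2.1.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield one sentence at a time as a list of token dictionaries."""
    sentence: List[Dict[str, str]] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            # A blank line separates sentences.
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}")
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:
        yield sentence


def split_form(value: str) -> Tuple[str, str]:
    """Split 'script_transliteration' at the first underscore (an assumption)."""
    script, _, transliteration = value.partition("_")
    return script, transliteration


def split_positional_tag(tag: str) -> Tuple[str, str, str]:
    """Derive (CPOSTAG, POSTAG, FEATS characters) from a 10-character PADT tag."""
    cpostag = tag[0]
    postag = tag[:2] if tag[1] != "-" else cpostag
    return cpostag, postag, tag[2:10]


# Invented sample with two sentences; values are NOT taken from the treebank.
SAMPLE = (
    "1\tكتاب_kitAb\tكتاب_kitAb\tN\tN\t_\t0\tPred\t_\t_\n"
    "2\t._.\t._.\tG\tG\t_\t1\tAuxK\t_\t_\n"
    "\n"
    "1\tفي_fiy\tفي_fiy\tP\tP\t_\t0\tPred\t_\t_\n"
)

sentences = list(read_sentences(SAMPLE.splitlines()))
print(len(sentences), [len(s) for s in sentences])
print(split_form(sentences[0][0]["FORM"]))
print(split_positional_tag("VP---3MS--"))
```

Applied to the corrected tags listed in section 2.3, split_positional_tag("VP---3MS--") gives CPOSTAG "V" and POSTAG "VP", while "P-----FS--" gives "P" for both because its second character is '-'.
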