This is the readme file for the Arabic part of the CoNLL-X Shared Task.

Version: $Id: README,v 1.3 2006/01/09 00:19:19 yuval Exp $
Modified by Dan Zeman for the LDC edition on 2010/4/15.


1. Preamble

    1.1 Source

        Prague Arabic Dependency Treebank (PADT) 1.0

        For further details about the PADT consult the web site:
          http://ufal.mff.cuni.cz/padt/PADT_1.0/index.html
        and in particular the paper:
          Jan Hajič, Otakar Smrž, Petr Zemánek, Jan Šnaidauf, and Emanuel Beška. 2004.
          Prague Arabic Dependency Treebank: Development in Data and Tools.
          In Proceedings of the NEMLAR International Conference on Arabic Language
          Resources and Tools, pages 110-117, Cairo, Egypt, September 2004.
          http://ufal.mff.cuni.cz/padt/PADT_1.0/docs/papers/2004-nemlar-padt.pdf

    1.2 Copyright

        Portions 
        Copyright © 2002-2004 Trustees of the University of Pennsylvania, 
        Copyright © 2000 Agence France Presse, 
        Copyright © 2001 Al Hayat News Agency, 
        Copyright © 2002 Ummah Press Service, 
        Copyright © 2002 An Nahar News Agency, 
        Copyright © 2003 Xinhua News Agency, 
        Copyright © 2002-2004 Center for Computational Linguistics &
                      Institute of Formal and Applied Linguistics &
                      Institute of Comparative Linguistics, 
                      Charles University in Prague

    1.3 License

        See license.htm

2. Documentation

    2.1 Data format

        Data adheres to the following rules:

        * Data files contain one or more sentences separated by a
          blank line.

        * A sentence consists of one or more tokens, each one starting on a
          new line.

        * A token consists of ten fields described in the list
          below. Fields are separated by one tab.

        * All data files contain these ten fields, although only
          the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are
          guaranteed to contain non-underscore values for all
          languages.

        * Data files are UTF-8 encoded (Unicode).
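
        The rules above can be sketched as a small reader. This is only an
        illustration, not part of the distribution: the names FIELDS and
        read_sentences are hypothetical, and the sketch assumes nothing
        beyond the rules listed here (blank line between sentences, ten
        tab-separated fields per token).

```python
from typing import Dict, Iterable, Iterator, List

# Column names as listed in this README (fields 1 to 10).
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]

Token = Dict[str, str]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Yield sentences as lists of tokens; a blank line ends a sentence."""
    sentence: List[Token] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(values)}")
        sentence.append(dict(zip(FIELDS, values)))
    if sentence:  # the last sentence may lack a trailing blank line
        yield sentence
```

        To read a data file, pass an open file object, opened with
        encoding="utf-8", as the lines argument.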

        Field 1: ID         

          Token counter, starting at 1 for each new sentence.

        Field 2: FORM

          Word form or punctuation mark. For the form to be useful for
          people who can as well as those who cannot read Arabic
          script, we have concatenated the form in Arabic script and
          its transliteration with an underscore in the middle.
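
          A minimal sketch of how the two halves might be separated
          again. It assumes that the Arabic-script half never contains
          an ASCII underscore, so that the first underscore is the
          separator; this assumption is not stated by the treebank
          documentation. The function name split_form is hypothetical.

```python
from typing import Tuple


def split_form(form: str) -> Tuple[str, str]:
    """Split FORM into (Arabic script, transliteration).

    Assumption: the first underscore is the separator.
    """
    script, sep, translit = form.partition("_")
    if not sep:
        # No underscore found: return the value unchanged as the script part.
        return form, ""
    return script, translit


# Made-up illustrative value, not taken from the data files.
print(split_form("\u0641\u064a_fy"))
```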

        Field 3: LEMMA         

          The lemma of the FORM. Again we concatenated the Arabic
          script and the transliteration.

        Field 4: CPOSTAG 

          Coarse-grained part-of-speech tag. This is the first character of 
          the PADT 1.0 morphological tag (positional tag). See file morph.txt
          for a detailed mapping from the coarse and fine POS tags + features
          to Buckwalter annotation.


        Field 5: POSTAG         

          Fine-grained part-of-speech tag. The first and second
          character of the PADT 1.0 morphological tag (positional tag)
          if the second character is not '-'. Identical to CPOSTAG
          otherwise. See file morph.txt for a detailed mapping from the
          coarse and fine POS tags + features to Buckwalter annotation.

          Here is a list of POSTAG values with short (and hopefully correct) explanations:
            A  adjective
            C  conjunction/subjunction
            D  adverb
            F  function word, other particle
            FI interrogative particle
            FN negation particle
            G  punctuation (not used in UMH subcorpus)
            I  interjection
            N  noun
            P  preposition
            Q  number (not used in UMH subcorpus)
            SD demonstrative pronoun
            SR relative pronoun
            S  other pronoun
            T  typo
            VI verb, imperfect
            VP verb, perfect
            X  non-alphabetic, also used for punctuation in the UMH subcorpus
            Y  abbreviation
            Z  proper noun
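
          The derivation of CPOSTAG and POSTAG from the positional tag,
          as described above, can be sketched as follows. The function
          name pos_tags is hypothetical; the example tags are the ones
          quoted in section 2.3.

```python
from typing import Tuple


def pos_tags(positional_tag: str) -> Tuple[str, str]:
    """Return (CPOSTAG, POSTAG) derived from a PADT positional tag."""
    cpostag = positional_tag[0]
    second = positional_tag[1:2]
    # POSTAG is the first two characters unless the second one is '-'.
    postag = cpostag if second in ("", "-") else cpostag + second
    return cpostag, postag


print(pos_tags("VP---3MS--"))  # ('V', 'VP')
print(pos_tags("P-----FS--"))  # ('P', 'P')
```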

        Field 6: FEATS         

          List of set-valued syntactic and/or morphological features. These
          come from the 3rd to 10th characters of the PADT 1.0 morphological tag
          (positional tag). See file morph.txt for a detailed mapping from the
          coarse and fine POS tags + features to Buckwalter annotation.


          They encode the following properties:
          case 
                1 nominative
                2 genitive
                4 accusative
          definiteness 
                D definite
                I indefinite
                R reduced
                C complex
          gender
                M masculine
                F feminine
          mood
                D undecided between subjunctive and jussive
                I indicative
                S subjunctive
          number
                S singular
                P plural
                D dual
          person
                1 first
                2 second
                3 third
          voice
                P passive 
          
          The attached file tag-examples.txt lists the 238 tags that occur in
          the CoNLL-X training data, each with up to 5 of its most frequent
          word examples. Its columns are: CPOSTAG - POSTAG - FEATS
          - examples.
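
          The value codes listed above can be kept in a lookup table, as
          in the sketch below. The names FEATURE_VALUES,
          feature_characters and explain are hypothetical. Which tag
          position holds which property is not stated in this file (see
          morph.txt), so the sketch only extracts the relevant
          characters and looks up a code for a property named by the
          caller.

```python
FEATURE_VALUES = {
    "case": {"1": "nominative", "2": "genitive", "4": "accusative"},
    "definiteness": {"D": "definite", "I": "indefinite",
                     "R": "reduced", "C": "complex"},
    "gender": {"M": "masculine", "F": "feminine"},
    "mood": {"D": "undecided between subjunctive and jussive",
             "I": "indicative", "S": "subjunctive"},
    "number": {"S": "singular", "P": "plural", "D": "dual"},
    "person": {"1": "first", "2": "second", "3": "third"},
    "voice": {"P": "passive"},
}


def feature_characters(positional_tag: str) -> str:
    """Return characters 3 to 10 of the positional tag (the FEATS source)."""
    return positional_tag[2:10]


def explain(prop: str, code: str) -> str:
    """Look up the meaning of a value code for a named property."""
    return FEATURE_VALUES[prop].get(code, "unknown")


print(feature_characters("VP---3MS--"))  # ---3MS--
print(explain("case", "2"))              # genitive
```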

        Field 7: HEAD

          Head of current token, which is either a value of ID or zero ('0').
          A value of zero means the token attaches to the virtual root node.
          The dependency structure resulting from the HEAD information can be
          non-projective.
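
          For readers who need to know whether a given sentence is
          projective, the following sketch applies one common
          definition: an arc is projective if every token strictly
          between the head and the dependent is a descendant of the
          head. The function name is_projective is hypothetical and the
          check is not part of the conversion tools.

```python
from typing import List


def is_projective(heads: List[int]) -> bool:
    """heads[i] is the HEAD of token i+1; 0 denotes the virtual root."""
    n = len(heads)

    def dominated_by(node: int, ancestor: int) -> bool:
        # Follow HEAD links upwards from node until ancestor is reached.
        steps = 0
        while node != ancestor:
            if node == 0 or steps > n:  # passed the root, or found a cycle
                return False
            node = heads[node - 1]
            steps += 1
        return True

    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        # Every token strictly between head and dependent
        # must descend from the head.
        for between in range(lo + 1, hi):
            if not dominated_by(between, head):
                return False
    return True


print(is_projective([2, 0, 2]))     # True
print(is_projective([3, 4, 0, 3]))  # False: arcs 3->1 and 4->2 cross
```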
        
        Field 8: DEPREL         

          Dependency relation to the HEAD. See file funcs.txt.

        Field 9: PHEAD         

          Projective head of current token, which is always an
          underscore because it is not available from the Arabic
          treebank.

        Field 10: PDEPREL 

          Dependency relation to projective head, which is always an
          underscore, because it is not available from the Arabic treebank.

    2.2 Text

        The data were taken from four subcorpora of the PADT:
        ALH, ANN, XIA, and UMH which correspond to four news agencies.

        Subcorpora issues:

          The UMH subcorpus was annotated using a slightly different
          convention. One part-of-speech tag is used for all
          non-alphabetic forms, including numbers and punctuation,
          which have separate tags in the other subcorpora. Also, some
          of the particles (e.g. 'li-') are attached to the word and
          the lemma is not available.

    2.3 Conversion

      The conversion process started from the FS (feature structure) files.
      The Arabic characters are encoded in Unicode, in the range U+060C to
      U+06AF (UTF-8 bytes d8 8c to da af). In addition, the quotation marks
      U+00AB and U+00BB (UTF-8 bytes c2 ab and c2 bb) are used.
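
      The hexadecimal values d88c, daaf, c2ab and c2bb can be related to
      Unicode code points with the short sketch below, which treats each
      value as a UTF-8 byte sequence. This is an interpretation of the
      values, not something the conversion tools are known to do; the
      function name utf8_hex_to_codepoint is hypothetical.

```python
def utf8_hex_to_codepoint(hex_bytes: str) -> str:
    """Decode a hex string of UTF-8 bytes; return the code point as U+XXXX."""
    char = bytes.fromhex(hex_bytes).decode("utf-8")
    if len(char) != 1:
        raise ValueError("expected exactly one character")
    return f"U+{ord(char):04X}"


for value in ("d88c", "daaf", "c2ab", "c2bb"):
    print(value, "->", utf8_hex_to_codepoint(value))
# prints:
# d88c -> U+060C
# daaf -> U+06AF
# c2ab -> U+00AB
# c2bb -> U+00BB
```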
      
      Based on correspondence with Otakar Smrž, some errors were corrected:
      
       ALH20010911.0036_story.syntax.fs:
        morphological tag changed from PREP+NSUFF_FEM_SG to P-----FS-- 
      
       ANN20021101.0009_story.syntax.fs:
        morphological tag changed from VERB_PERFECT+PVSUFF_SUBJ:3MS+PVSUFF_SUBJ:3MS to VP---3MS--
      
       XIA20030503.0155_story.syntax.fs line 361:
        morphological tag changed from PREP+NSUFF_FEM_SG to P-----FS--
      
       XIA20030503.0194_story.syntax.fs lines 461, 481, 591:
        morphological tag changed from DET+NOUN_PROP+NSUFF_FEM_DU_ACCGEN to Z-----FD2D
      
       UMAAH_e_mar_3rd_2002.1112.fs:
        first sentence, word with ord=43, changed the function
        from AuxG to AuxK
      
      
      We converted the FS files to the PADT-specific SGML format called
      CSTS using the any2any script provided with the distribution
      and used the resulting SGML files as input for the next step.
      The arguments used were
      
         any2any -s fs -a csts -f csts <file>
      
      We then used the Python script
      
        padt2tab.py -f <csts-file>
      
      and selected the sentences for which there were no missing
      annotations. The list of missing annotations is in the doc/
      directory; see the file README-errors.txt.
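
      The exact selection criterion applied by padt2tab.py is not spelled
      out here. As one hypothetical reading, the sketch below keeps a
      sentence only if none of the columns guaranteed in section 2.1
      (ID, FORM, CPOSTAG, POSTAG, HEAD, DEPREL) is empty or a bare
      underscore. The names REQUIRED, is_complete and select_complete
      are assumptions, and the sketch is not the script's actual logic.

```python
from typing import List

# 0-based indices of ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL.
REQUIRED = (0, 1, 3, 4, 6, 7)

Sentence = List[List[str]]  # one list of ten field values per token


def is_complete(sentence: Sentence) -> bool:
    """True if no required column of any token is empty or a bare underscore."""
    return all(tok[i] not in ("", "_") for tok in sentence for i in REQUIRED)


def select_complete(sentences: List[Sentence]) -> List[Sentence]:
    """Keep only the sentences that pass is_complete."""
    return [s for s in sentences if is_complete(s)]
```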
      

3. Acknowledgements

        The PADT people for making the treebank.

        Otakar Smrž for valuable help during the conversion.

        Jan Hajič for granting the special license for CoNLL-X and 
        talking to LDC about it.

        Christopher Cieri, Executive Director of LDC, for arranging
        distribution through LDC.

        Tony Castelletto, Publications Programmer at LDC, for handling
        the distribution.