Prague Arabic Dependency Treebank in the CoNLL-XI 2007
------------------------------------------------------

README slightly modified on 2010-04-15 by Dan Zeman for the new
publication being prepared at the LDC.

These instructions are also available at the PADT ++ weblog [1].
[Hyperlinks] are listed at the end of this document.


1 Source
========

The CoNLL Shared Task 2007 has been announced. The extended data of
PADT will be used in the competition, and we provide their rough
characteristics:

------------------------------------------------------------------
Total  116,800 tokens  3,044 trees  378 files  annotated on the levels of analytical syntax and morphology
AEP      9,500 tokens    242 trees   29 files  Arabic English Parallel News [2]
AFE     13,000 tokens    411 trees   48 files  Arabic 10K-word English Translation [3]
ALH     14,500 tokens    312 trees   41 files  Arabic Gigaword [4]
ANN     12,500 tokens    209 trees   17 files  Arabic Gigaword
HYT     25,500 tokens    457 trees   47 files  Arabic Gigaword
XIA     26,500 tokens    888 trees  111 files  Arabic Gigaword
XNH     15,000 tokens    525 trees   85 files  Arabic Gigaword
------------------------------------------------------------------


2 CoNLL specifics
=================

This year's data differ from last year's set in two important
respects:

1. The extent and the quality of annotations have improved. We added
   new data sources, esp. AEP and AFE (with paragraph-aligned
   translations available). Other data sources are the newspaper
   texts published by Al Hayat, An Nahar, Ummah Press Service, and
   Xinhua.

2. The morphology of the former data has been reannotated using
   MorphoTrees [5], so that the format of all data is now consistent
   and the morphological tags are considerably more informative.
   Lemmas based on the Buckwalter lexicon [6] are also provided.

The morphological class identifiers (the POSTAG) consist of the
part-of-speech category (the CPOSTAG) and its refinement, and their
meanings read:

------------------------------------------------------------------
VI VP VC    imperfect, perfect, and imperative verb forms
N- A- D-    nouns, adjectives, and adverbs
C- P- I-    conjunctions, prepositions, interjections
G- Q- Y-    graphical symbols, numbers, abbreviations
F- FN FI    particles, esp. negative and interrogative
S- SD SR    pronouns, esp. demonstrative and relative
--          isolated definite articles
Z-          proper names
------------------------------------------------------------------

The attributes and morphosyntactic features (the FEATS) associated
with individual tokens, i.e. the nodes in the dependency tree, include
the following kinds of information. A feature can be linguistically
applicable but unresolved by the annotation, in which case it is not
listed with the token:

------------------------------------------------------------------
Mood     Indicative, Subjunctive, or Jussive of imperfect verbs,
         with D if undecided between S and J
Voice    Active or Passive
Person   1 speaker, 2 addressee, 3 others
Gender   morphologically overt 'gender', Masculine or Feminine
Number   morphologically overt 'number', Singular, Dual, or Plural
Case     1 nominative, 2 genitive, 4 accusative
Defin    morphological 'definiteness', Indefinite, Definite,
         Reduced, or Complex
------------------------------------------------------------------

The attached file tag-examples.txt lists the 294 tags that occur in
the CoNLL-X training data, each with up to five of its most frequent
word examples. The file has the following columns:
CPOSTAG - POSTAG - FEATS - examples.
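
To illustrate how the POSTAG and the FEATS described above fit
together, here is a minimal shell sketch, not part of the PADT
distribution. The function name describe_tags is hypothetical. The
sketch assumes the ten tab-separated columns of the CoNLL format [10]
(FORM in column 2, POSTAG in column 5, FEATS in column 6) and assumes
that the FEATS are written as Name=Value pairs joined by "|". Adjust
it if the actual files differ.

```shell
# Hypothetical helper: for every token line print the FORM, the
# part-of-speech category, its refinement, and each feature as Name:Value.
# ASSUMPTION: FEATS look like "Case=1|Defin=D", or "_" when empty.
describe_tags() {
  awk -F'\t' '
    NF >= 6 {
      cpos = substr($5, 1, 1)   # part-of-speech category (the CPOSTAG)
      refn = substr($5, 2, 1)   # its refinement, "-" if there is none
      printf "%s\t%s\t%s", $2, cpos, refn
      if ($6 != "_") {
        n = split($6, feats, "[|]")
        for (i = 1; i <= n; i++) {
          split(feats[i], kv, "=")
          printf "\t%s:%s", kv[1], kv[2]
        }
      }
      printf "\n"
    }' "$@"
}

# Possible usage on converted data:
#   describe_tags conll/*.conll
```

Lines with fewer than six columns, such as the empty lines between
sentences, are skipped.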

The inventory of analytical dependency functions is explained further
in [7] and [8]:

------------------------------------------------------------------
Pred    verbal predicate                   Coord   coordination
Pnom    nominal predicate                  Apos    apposition
PredE   existential predicate              Ante    anteposition
PredC   conjunction as the clause's head   AuxC    conjunction
PredP   preposition as the clause's head   AuxP    preposition
Sb      subject                            AuxE    emphasizing expression
Obj     object                             AuxM    modifying expression
Adv     adverbial                          AuxY    auxiliary, part of compound
Atr     attribute                          AuxG    graphical symbol
Atv     complement                         AuxK    sentence separator
ExD     ellipsis, no actual dependency     _       excessive token, esp. due to typo
------------------------------------------------------------------

The conversion script [9] from the original FS format to the CoNLL
format [10] produces files with the .conll extension. The script is
run as follows:

------------------------------------------------------------------
btred -Qm padt-conll.btred syntax/*.syntax.fs
mkdir conll
mv syntax/*.syntax.fs.conll conll/
------------------------------------------------------------------

The data use the UTF-8 encoding as required. If rendering the Arabic
script poses problems, however, it may be preferable to view the data
in the Buckwalter transliteration. We recommend using the Encode
Arabic [11] libraries in Perl or Haskell to easily convert the data.
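
Independently of the transliteration, a quick way to check the
converted files is to tally the analytical functions they contain.
The following is a minimal sketch, not part of the PADT distribution.
The function name tally_deprel is hypothetical, and the sketch assumes
that the dependency function (the DEPREL) is the eighth tab-separated
column, as in the CoNLL format [10].

```shell
# Hypothetical helper: count how often each analytical function occurs
# in column 8 and list the counts, most frequent first.
tally_deprel() {
  awk -F'\t' '
    NF >= 8 { count[$8]++ }
    END { for (f in count) printf "%d\t%s\n", count[f], f }
  ' "$@" | LC_ALL=C sort -k1,1nr -k2,2
}

# Possible usage on the output of the conversion:
#   tally_deprel conll/*.conll
```

Since the fields are split on the tab character only, the UTF-8 word
forms pass through untouched.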

To use the Perl library from the command line, code like this will do:

------------------------------------------------------------------
# calling the module's functions in a one-liner
cat PADT-data-in-CoNLL-format | \
    perl -MEncode::Arabic -pe '$_ = encode "buckwalter", decode "utf8", $_'

# running the scripts installed with the module
cat PADT-data-in-CoNLL-format | encode "buckwalter"
------------------------------------------------------------------

To use the module for reducing the vocalization [12], or to choose the
XML-compliant [13] variant of the Buckwalter transliteration, one can
set the modes of conversion easily. Consider e.g. the following
script, which removes any vocalization marks from the tokenized word
forms (the FORM) supplied in the second column of the CoNLL data:

------------------------------------------------------------------
use Encode::Arabic ':modes';

# encode with the full vocalization, decode with the vocalization removed
enmode "buckwalter", 'full', 'xml';
demode "buckwalter", 'noneplus', 'xml';

while ($line = <>) {

    @cols = split /\t/, decode "utf8", $line;

    # pass through lines without a FORM column, e.g. sentence separators
    if (@cols < 2) { print $line; next; }

    # leave word forms that contain ASCII characters as they are
    unless ($cols[1] =~ /[\x{20}-\x{7F}]/) {

        $in_buck = encode "buckwalter", $cols[1];
        $cols[1] = decode "buckwalter", $in_buck;

        # report the transliterated form on the standard error output
        warn $in_buck . "\n";
    }

    print encode "utf8", join "\t", @cols;
}
------------------------------------------------------------------

More examples are available in the CPAN documentation [14].


3 Acknowledgements
==================

Annotators:  Jakub Kráčmar, Viktor Bielický, Iveta Kouřilová,
             Milada Frantová, Tereza Čečáková
Coordinator: Otakar Smrž


4 Online references
===================

 [1] http://ufal.mff.cuni.cz/padt/online/2007/01/conll-shared-task-2007.html
 [2] http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004T18
 [3] http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2003T07
 [4] http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2006T02
 [5] http://sourceforge.net/projects/elixir-fm/
 [6] http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC2004L02
 [7] http://ufal.mff.cuni.cz/padt/PADT_1.0/docs/papers/2004-nemlar-padt.pdf
 [8] http://ufal.mff.cuni.cz/~smrz/CCISSA2006/ccissa-paper.pdf
 [9] http://ufal.mff.cuni.cz/~smrz/CoNLL/padt-conll.btred
[10] http://nextens.uvt.nl/depparse-wiki/DataFormat
[11] http://sourceforge.net/projects/encode-arabic/
[12] http://search.cpan.org/dist/Encode-Arabic/lib/Encode/Arabic/Buckwalter.pm#EXPORTS_%26_MODES
[13] http://www.qamus.org/transliteration.htm
[14] http://search.cpan.org/dist/Encode-Arabic/lib/Encode/Arabic/Buckwalter.pm