ENGLISH/ARABIC TREEBANK 1.0

This release of the English/Arabic TreeBank consists of 52,238 words in
224 files of individual Agence France Presse (AFP) news stories
(corresponding to approximately the first 50K words of the Arabic
Treebank: Part 1 v 3.0 -- LDC Catalog No.: LDC2005T02, ISBN:
1-58563-330-5).  The English translation was provided by LDC, and was
part-of-speech tagged and treebanked for this project.

The data can be found in the following directories:

Penn Treebank-style files (converted from WordFreak annotation files)
    /data/pennTB-style-trees/

WordFreak annotated files
    /data/annotated-files/

Source text files
    /data/rawtext-files/

The guidelines followed for both part-of-speech and treebank annotation
are essentially Penn Treebank II style, with two notable differences:

1. POS: tokenization of hyphenated items ("New York-based" has been
   replaced by "New York - based" for example), and the addition of
   HYPH and AFX tags necessitated by this change in tokenization

2. TreeBank: the addition of the node label NML for sub-NP nominal
   constituents (replacing NX and most NP-internal NAC)

More detailed addenda to the Penn Treebank II guidelines can be found in
    /docs/pos-guidelines-addenda.txt
and
    /docs/treebank-guidelines-addenda.txt

A mapping from the original Arabic Treebank filenames to the current
filenames used in this release can be found in
    /docs/afp-filename.map

Annotators:
    Justin Mott
    Colin Warner

Portions (c) 2000 Agence France Presse,
Portions (c) 2005 Trustees of the University of Pennsylvania

-----------------------------
Ann Bies
bies@ldc.upenn.edu
Linguistic Data Consortium
May 15, 2006
-----------------------------
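
Illustration of the hyphen tokenization (difference 1 above):

The following is a minimal Python sketch of the kind of split described
above, in which "New York-based" becomes "New York - based". It is not
the tool used to prepare this release. The function names
split_hyphenated and retokenize are hypothetical, the input is assumed
to be whitespace-separated text, and the sketch does not assign the
HYPH or AFX tags. The actual rules, including any exceptions, are those
in /docs/pos-guidelines-addenda.txt.

```python
import re


def split_hyphenated(token):
    """Split a token at internal hyphens, keeping each hyphen as a token.

    Hypothetical helper: tokens with no internal hyphen (including
    tokens made up only of hyphens) are returned unchanged.
    """
    if "-" not in token.strip("-"):
        return [token]
    # A capturing group makes re.split keep the hyphens in the result.
    parts = re.split(r"(-)", token)
    # Drop empty strings produced by adjacent or edge hyphens.
    return [p for p in parts if p]


def retokenize(text):
    """Apply split_hyphenated to every whitespace-separated token."""
    tokens = []
    for tok in text.split():
        tokens.extend(split_hyphenated(tok))
    return tokens


if __name__ == "__main__":
    print(" ".join(retokenize("New York-based")))
```

A rule this simple splits every internal hyphen, so it should be read
only as a starting point for comparing other data against the
tokenization used in this release.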