This file contains documentation on the English-Arabic Parallel Treebank v 1.0, Linguistic Data Consortium (LDC) catalog number LDC2006T10, ISBN 1-58563-387-9.
This release of the English-Arabic Treebank consists of 52,238 words in 224 files of individual Agence France Presse (AFP) news stories (corresponding to approximately the first 50K words of the Arabic Treebank: Part 1 v 3.0 -- LDC Catalog No.: LDC2005T02, ISBN: 1-58563-330-5). The English translation was provided by LDC, and was part-of-speech tagged and treebanked for this project.
Data
The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:
- POS: hyphenated items are tokenized at the hyphen (for example, "New York-based" becomes "New York - based"), and the HYPH and AFX tags were added to accommodate this change in tokenization
- TreeBank: the addition of the node label NML for sub-NP nominal constituents (replacing NX and most NP-internal NAC)
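The change in hyphen tokenization described above can be illustrated with a minimal sketch. This is not the tokenizer used to build the corpus: the function names `split_hyphens` and `tag_hyphens` are hypothetical, and the rule shown (split at every hyphen that sits between two word characters) is an assumption that reproduces only the "New York-based" example. The corpus guidelines may treat some hyphenated items differently, and assignment of the AFX tag is not modeled here.

```python
import re

# Assumption: a hyphen is split off only when it sits between two word characters.
_WORD_INTERNAL_HYPHEN = re.compile(r"(?<=\w)-(?=\w)")


def split_hyphens(text):
    """Split whitespace-separated text, making each word-internal hyphen its own token."""
    tokens = []
    for chunk in text.split():
        # Put spaces around each word-internal hyphen, then split again.
        tokens.extend(_WORD_INTERNAL_HYPHEN.sub(" - ", chunk).split())
    return tokens


def tag_hyphens(tokens):
    """Pair each token with HYPH if it is a bare hyphen, else None.

    None means the tag is left to a real part-of-speech tagger.
    """
    return [(tok, "HYPH" if tok == "-" else None) for tok in tokens]


if __name__ == "__main__":
    toks = split_hyphens("New York-based")
    print(toks)               # ['New', 'York', '-', 'based']
    print(tag_hyphens(toks))
```

Only the hyphen token receives a tag in this sketch; the remaining tokens would be tagged by whatever tagger is applied to the data.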
Samples
For an example of the data in this corpus, please review this text sample.
Portions © 2000 Agence France Presse, © 2006 Trustees of the University of Pennsylvania