English-Arabic Treebank v 1.0
|Item Name:||English-Arabic Treebank v 1.0|
|LDC Catalog No.:||LDC2006T10|
|Release Date:||May 18, 2006|
|Language(s):||English, Standard Arabic|
|Language ID(s):||eng, arb|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2006T10 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Bies, Ann. English-Arabic Treebank v 1.0 LDC2006T10. Web Download. Philadelphia: Linguistic Data Consortium, 2006.|
This file contains documentation on the English-Arabic Parallel Treebank v 1.0, Linguistic Data Consortium (LDC) catalog number LDC2006T10, ISBN 1-58563-387-9.
This release of the English-Arabic Treebank consists of 52,238 words in 224 files, each an individual Agence France Presse (AFP) news story. The stories correspond to approximately the first 50K words of the Arabic Treebank: Part 1 v 3.0 (LDC Catalog No. LDC2005T02, ISBN 1-58563-330-5). The English translation was provided by LDC and was part-of-speech tagged and treebanked for this project.
Data
The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:
- POS: hyphenated items are split into separate tokens (for example, "New York-based" is now tokenized as "New York - based"), and the tags HYPH and AFX have been added, as this change in tokenization requires
- Treebank: the node label NML has been added for sub-NP nominal constituents (replacing NX and most NP-internal NAC)
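The tokenization change in the first item can be illustrated with a short Python sketch. It is not the tool used to prepare this release, and the function names `split_hyphenated` and `retokenize` are hypothetical. It assumes that every hyphen standing between two word characters becomes its own token tagged HYPH; the actual guidelines may treat some hyphenated items differently, and assigning AFX (or any other tag) to the remaining pieces is left out.

```python
import re

# A hyphen with a word character on each side; the capturing group makes
# re.split keep the hyphen itself in the result.
HYPHEN_BETWEEN_WORD_CHARS = re.compile(r"(?<=\w)(-)(?=\w)")


def split_hyphenated(token):
    """Return (piece, tag) pairs for one whitespace-delimited token.

    The tag is "HYPH" for a hyphen that was split out, otherwise None
    (tagging the other pieces is outside the scope of this sketch).
    """
    pieces = HYPHEN_BETWEEN_WORD_CHARS.split(token)
    if len(pieces) == 1:
        # No internal hyphen: leave the token untouched.
        return [(token, None)]
    return [(p, "HYPH" if p == "-" else None) for p in pieces]


def retokenize(tokens):
    """Apply split_hyphenated to every token and flatten the result."""
    out = []
    for tok in tokens:
        out.extend(split_hyphenated(tok))
    return out


if __name__ == "__main__":
    for piece, tag in retokenize("New York-based".split()):
        print(piece, tag or "")
```

For the input "New York-based" the sketch yields the four tokens "New", "York", "-", "based", with only the hyphen carrying a tag. A token made up solely of hyphens, such as "--", is left as it is because no word character surrounds it.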