English-Arabic Treebank v 1.0

Item Name: English-Arabic Treebank v 1.0
Author(s): Ann Bies
LDC Catalog No.: LDC2006T10
ISBN: 1-58563-387-9
ISLRN: 021-421-953-520-4
Release Date: May 18, 2006
Member Year(s): 2006
DCMI Type(s): Text
Project(s): GALE
Language(s): English, Standard Arabic
Language ID(s): eng, arb
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2006T10 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Bies, Ann. English-Arabic Treebank v 1.0 LDC2006T10. Web Download. Philadelphia: Linguistic Data Consortium, 2006.


This file contains documentation on the English-Arabic Parallel Treebank v 1.0, Linguistic Data Consortium (LDC) catalog number LDC2006T10, ISBN 1-58563-387-9.

This release of the English-Arabic Treebank consists of 52,238 words in 224 files, each an individual Agence France Presse (AFP) news story. The stories correspond to approximately the first 50K words of the Arabic Treebank: Part 1 v 3.0 (LDC Catalog No. LDC2005T02, ISBN 1-58563-330-5). LDC provided the English translation, which was part-of-speech tagged and treebanked for this project.


The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:
  1. POS: hyphenated items are tokenized at the hyphen (for example, "New York-based" becomes "New York - based"), and the tags HYPH and AFX were added to accommodate this change in tokenization
  2. Treebank: the node label NML was added for sub-NP nominal constituents (replacing NX and most NP-internal NAC)


For an example of the data in this corpus, please review this text sample.
