English-Arabic Treebank v 1.0


Item Name: English-Arabic Treebank v 1.0
Authors: Ann Bies
LDC Catalog No.: LDC2006T10
ISBN: 1-58563-387-9
Release Date: May 18, 2006
Data Type: text
Project(s): GALE
Language(s): English, Modern Standard Arabic
Language ID(s): arb, eng
Distribution: Web Download
Member fee: $0 for 2006 members
Non-member Fee: US $2000.00
Reduced-License Fee: US $1000.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Ann Bies. 2006. English-Arabic Treebank v 1.0. Linguistic Data Consortium, Philadelphia.

Introduction

This file contains documentation on the English-Arabic Parallel Treebank v 1.0, Linguistic Data Consortium (LDC) catalog number LDC2006T10, ISBN 1-58563-387-9.

This release of the English-Arabic Treebank consists of 52,238 words in 224 files, each an individual Agence France Presse (AFP) news story. The material corresponds to approximately the first 50K words of the Arabic Treebank: Part 1 v 3.0 (LDC Catalog No. LDC2005T02, ISBN 1-58563-330-5). The English translation was provided by LDC and was part-of-speech tagged and treebanked for this project.

Data

The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with two notable differences:
  1. POS: hyphenated items are split at the hyphen (for example, "New York-based" becomes "New York - based"), and the tags HYPH and AFX are added to accommodate this change in tokenization
  2. Treebank: the node label NML is added for sub-NP nominal constituents (replacing NX and most NP-internal NAC)
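
The hyphen tokenization described in item 1 can be illustrated with a minimal Python sketch. It is not the tool used to build the corpus: the function name `tokenize`, the rule "split a hyphen that sits between two word characters", and the use of `None` for untagged tokens are all assumptions made for illustration. The sketch assigns only the HYPH tag; it makes no attempt to decide when AFX applies.

```python
import re

# Assumed rule (not taken from the corpus guidelines): a hyphen is split
# off only when it has a word character on both sides.
_HYPHEN = re.compile(r"(?<=\w)(-)(?=\w)")


def tokenize(text):
    """Split on whitespace, then split word-internal hyphens into tokens.

    Returns a list of (token, tag) pairs. The tag is "HYPH" for a hyphen
    that was split out of a larger item and None for everything else.
    """
    tagged = []
    for chunk in text.split():
        # Because the pattern has a capturing group, re.split keeps the
        # hyphens: they land at the odd indices of the result.
        pieces = _HYPHEN.split(chunk)
        for index, piece in enumerate(pieces):
            tag = "HYPH" if index % 2 == 1 else None
            tagged.append((piece, tag))
    return tagged


if __name__ == "__main__":
    for token, tag in tokenize("New York-based"):
        print(token, tag)
```

Under these assumptions, "New York-based" yields the four tokens "New", "York", "-", "based", with only the hyphen tagged HYPH. A free-standing dash is left untagged, since the sketch treats only word-internal hyphens as HYPH.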

Samples

For an example of the data in this corpus, please review this text sample.

Copyright

Portions © 2000 Agence France Presse, © 2006 Trustees of the University of Pennsylvania