Arabic Treebank: Part 1 v 2.0

Item Name:	Arabic Treebank: Part 1 v 2.0
Author(s):	Mohamed Maamouri, Ann Bies, Hubert Jin, Tim Buckwalter
LDC Catalog No.:	LDC2003T06
ISBN:	1-58563-261-9
ISLRN:	333-321-196-670-5
DOI:	https://doi.org/10.35111/vfdx-p575
Release Date:	February 03, 2003
Member Year(s):	2003
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	GALE, TIDES
Application(s):	automatic content extraction, cross-lingual information retrieval, information detection, natural language processing
Language(s):	Standard Arabic
Language ID(s):	arb
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2003T06 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Maamouri, Mohamed, et al. Arabic Treebank: Part 1 v 2.0 LDC2003T06. Web Download. Philadelphia: Linguistic Data Consortium, 2003.
Related Works:
hasVersion
LDC2005T02	Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2010T13	Arabic Treebank: Part 1 v 4.1
isAnnotationOf
LDC2003T12	Arabic Gigaword
hasAnnotation
LDC2004T23	Prague Arabic Dependency Treebank 1.0
hasOutcome
LDC2003T07	Arabic Treebank: Part 1 - 10K-word English Translation
LDC2009T22	Arabic Newswire English Translation Collection
isSimilarWith
LDC2004T02	Arabic Treebank: Part 2 v 2.0
LDC2004T11	Arabic Treebank: Part 3 v 1.0
LDC2005T20	Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
LDC2005T30	Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
LDC2010T08	Arabic Treebank: Part 3 v 3.2
LDC2011T09	Arabic Treebank: Part 2 v 3.1
LDC2012T07	Arabic Treebank - Broadcast News v1.0
LDC2016T02	Arabic Treebank - Weblog
relatesTo
LDC2001T55	Arabic Newswire Part 1

Introduction

Arabic Treebank: Part 1 v 2.0 was developed by the Linguistic Data Consortium (LDC) and contains approximately 140,000 tokens of Arabic text with part-of-speech (POS) and treebank annotation.

The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and general linguistic research on Modern Standard Arabic. LDC was sponsored to develop a 1-million-word Arabic corpus with POS and treebank annotation.

The Penn Arabic Treebank started in November 2001 as part of the DARPA TIDES project, with the objective of annotating a large machine-readable Arabic text corpus both manually and automatically. It is particularly suitable for language developers, computational linguists, and computer scientists who are interested in various aspects of NLP. This corpus is the release of part one of that project.

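The subsequent versions of this corpus are:

LDC2005T02	Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
LDC2010T13	Arabic Treebank: Part 1 v 4.1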

Data

The following table gives a breakdown of the data contained in the entire Arabic Treebank project, including the differences between versions of Parts 1, 2, and 3. The fields are source, number of stories, total number of tokens, number of tokens after clitic separation, and number of Arabic word tokens after punctuation, numbers, and Latin strings have been removed. Where versions differ, the totals at the bottom are calculated from the latest version. No total is given for tokens after clitic separation because that number is missing for Part 4.

Part	Source	Stories	Total Tokens	Tokens After Clitic Separation	Arabic Word Tokens
1 (V 2.0)	Agence France Presse	734	140,265	168,123	N/A
1 (V 3.0 and 4.1)	Agence France Presse	734	145,386	166,068	123,795
2 (V 2.0)	Ummah Press	501	144,199	168,297	125,698
2 (V 3.1)	Ummah Press	501	144,199	169,319	125,709
3 (V 1.0 and 2.0)	An Nahar News Agency	600	340,281	400,213	293,035
3 (V 3.2)	An Nahar News Agency	599	339,710	402,291	292,554
4	Assabah	397	161,915	N/A	146,491
Totals		2,231	791,210		688,549
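
The Totals row can be reproduced from the table itself. The following is a minimal sketch, not LDC tooling: the figures are copied from the rows above, the names `ROWS`, `latest_per_part`, and `totals` are hypothetical, and it assumes that the last row listed for each part is that part's latest version.

```python
# Figures copied from the table above:
# (part, version label, stories, total tokens, Arabic word tokens).
# None marks a value the table lists as N/A.
ROWS = [
    ("1", "V 2.0", 734, 140_265, None),
    ("1", "V 3.0 and 4.1", 734, 145_386, 123_795),
    ("2", "V 2.0", 501, 144_199, 125_698),
    ("2", "V 3.1", 501, 144_199, 125_709),
    ("3", "V 1.0 and 2.0", 600, 340_281, 293_035),
    ("3", "V 3.2", 599, 339_710, 292_554),
    ("4", "", 397, 161_915, 146_491),
]


def latest_per_part(rows):
    """Keep one row per part, assuming later rows are later versions."""
    latest = {}
    for row in rows:
        latest[row[0]] = row  # a later row for the same part overwrites the earlier one
    return list(latest.values())


def totals(rows):
    """Sum stories, total tokens, and Arabic word tokens over the latest versions."""
    chosen = latest_per_part(rows)
    stories = sum(r[2] for r in chosen)
    tokens = sum(r[3] for r in chosen)
    arabic_words = sum(r[4] for r in chosen)
    return stories, tokens, arabic_words


if __name__ == "__main__":
    print(totals(ROWS))  # (2231, 791210, 688549)
```

The three sums match the Totals row (2,231 stories, 791,210 total tokens, 688,549 Arabic word tokens), which confirms that the totals use Part 1 v 3.0/4.1, Part 2 v 3.1, and Part 3 v 3.2.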

This corpus uses Modern Standard Arabic text from the Agence France Presse (AFP) newswire archives for July - November 2000, which was later released in Arabic Gigaword (LDC2003T12). For this work, annotators must be native speakers of Arabic, and they must understand enough linguistics to check morphosyntactic analyses and build syntactic structures.
