Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)

Item Name:	Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
Author(s):	Mohamed Maamouri, Ann Bies, Tim Buckwalter, Hubert Jin, Wigdan Mekki
LDC Catalog No.:	LDC2005T20
ISBN:	1-58563-341-0
ISLRN:	661-115-390-052-2
DOI:	https://doi.org/10.35111/ghrm-vt27
Release Date:	June 15, 2005
Member Year(s):	2005
DCMI Type(s):	Text
Project(s):	GALE, TIDES
Application(s):	automatic content extraction, cross-lingual information retrieval, information detection, natural language processing
Language(s):	Standard Arabic
Language ID(s):	arb
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2005T20 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Maamouri, Mohamed, et al. Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) LDC2005T20. Web Download. Philadelphia: Linguistic Data Consortium, 2005.
Related Works:
isVersionOf	LDC2004T11 Arabic Treebank: Part 3 v 1.0
hasVersion	LDC2010T08 Arabic Treebank: Part 3 v 3.2
hasOutcome	LDC2009T22 Arabic Newswire English Translation Collection
isSimilarWith	LDC2003T06 Arabic Treebank: Part 1 v 2.0
	LDC2004T02 Arabic Treebank: Part 2 v 2.0
	LDC2005T02 Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
	LDC2005T30 Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
	LDC2010T13 Arabic Treebank: Part 1 v 4.1
	LDC2011T09 Arabic Treebank: Part 2 v 3.1
	LDC2012T07 Arabic Treebank - Broadcast News v1.0
	LDC2016T02 Arabic Treebank - Weblog
isProcessedBy	LDC2004L02 Buckwalter Arabic Morphological Analyzer Version 2.0

Introduction

Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis) was developed by the Linguistic Data Consortium (LDC) and contains approximately 300,000 Arabic word tokens with both syntactic treebank annotation and annotation on part of speech (POS), gloss, and word segmentation.

The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and other forms of linguistic research on Modern Standard Arabic in general. LDC was sponsored to develop an Arabic POS and Treebank of 1 million words. This corpus is part three of that project.

Treebanks are language resources that provide annotations of natural languages at various levels of structure: the word level, the phrase level, and the sentence level. Treebanks have become crucially important for both data-driven and general linguistic research. This corpus is designed for those who study and use languages either professionally or academically, and who need text corpora in their work.

The Penn Arabic Treebank, which started in November 2001 as part of the DARPA TIDES project, is particularly suitable for language developers, computational linguists, and computer scientists who are interested in various aspects of NLP. Its objective is to annotate a large machine-readable Arabic text corpus both manually and automatically. As in previous Penn Treebanks, two different kinds of information need to be produced by two different processes, one human and one automatic. The Arabic Treebank project therefore consists of two distinct phases:

Part-of-Speech (POS) tagging, which includes inflectional features and gloss information not traditionally included with POS annotation
Arabic Treebanking (ArabicTB), which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

Data

The following table gives a breakdown of the data contained in the entire Arabic Treebank project, including the figures that differ between versions of Parts 1 and 3. The fields include source, number of stories, total number of tokens, number of tokens after clitic separation, and number of Arabic word tokens after punctuation, numbers, and Latin strings have been taken out. The totals given at the bottom are calculated from the latest versions where discrepancies exist, and do not include tokens after clitic separation, since that number is missing from Part 4.

Part	Source	Stories	Total Tokens	Tokens After Clitic Separation	Arabic Word Tokens
1 (V 2.0)	Agence France Presse	734	140,265	168,123	N/A
1 (V 3.0 and 4.1)	Agence France Presse	734	145,386	166,068	123,795
2	Ummah Press	501	144,199	169,319	125,709
3 (V 1.0 and 2.0)	An Nahar News Agency	600	340,281	400,213	293,035
3 (V 3.2)	An Nahar News Agency	599	339,710	402,291	292,554
4	Assabah	397	161,915	N/A	146,491
Totals		2,231	791,210		688,549
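
As a quick check on the Totals row, the sums can be reproduced from the table. The short Python sketch below copies the rows for the latest version of each part (V 3.0 and 4.1 for Part 1, V 3.2 for Part 3) and adds up the three columns that have totals. The variable names are illustrative only and are not part of the corpus documentation.

```python
# Rows copied from the table above, latest version of each part only:
# (part, stories, total tokens, Arabic word tokens)
latest_versions = [
    ("1 (V 3.0 and 4.1)", 734, 145_386, 123_795),
    ("2", 501, 144_199, 125_709),
    ("3 (V 3.2)", 599, 339_710, 292_554),
    ("4", 397, 161_915, 146_491),
]

# Sum each column; tokens after clitic separation are skipped
# because that figure is not available for Part 4.
stories = sum(row[1] for row in latest_versions)
total_tokens = sum(row[2] for row in latest_versions)
arabic_word_tokens = sum(row[3] for row in latest_versions)

print(stories, total_tokens, arabic_word_tokens)
# prints: 2231 791210 688549
```

The printed values match the Totals row: 2,231 stories, 791,210 total tokens, and 688,549 Arabic word tokens.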

For this corpus, the An Nahar News Agency stories were taken from Arabic Gigaword (LDC2003T12). This corpus is also referred to as ANNAHAR. The new features include complete vocalization of all Imperfect Verb mood endings: Indicative, Subjunctive, and Jussive. Tim Buckwalter's lexicon and morphological analyzer were used to generate a candidate list of POS tags for each word. (Please note that some words do not exist in this lexicon.) The POS annotation task then consists of selecting the correct POS tag from that candidate list.
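
The candidate-list workflow described above can be pictured with a small Python sketch. Everything in it is an assumption made for illustration: the `CANDIDATES` table, its entries and tag names, and the `select_tag` helper are invented, and they do not reproduce the analyzer's actual output format or the corpus tag set. It only shows the shape of the task: a lookup yields candidates, a selection picks one, and a word missing from the lexicon yields no candidates.

```python
from typing import Optional

# Hypothetical candidate lists keyed by transliterated word form.
# Entries and tag names are invented for illustration only.
CANDIDATES = {
    "ktb": ["kataba/VERB_PERFECT", "kutub/NOUN"],
    "fy": ["fiy/PREP"],
}


def select_tag(word: str, choice_index: int) -> Optional[str]:
    """Return the chosen candidate, or None if the word has no lexicon entry."""
    candidates = CANDIDATES.get(word)
    if candidates is None:
        # Some words do not exist in the lexicon, so there is nothing to select.
        return None
    if not 0 <= choice_index < len(candidates):
        raise IndexError(f"no candidate {choice_index} for {word!r}")
    return candidates[choice_index]


print(select_tag("ktb", 1))   # second candidate for a known word
print(select_tag("xyz", 0))   # unknown word, no candidates
```

In this sketch the selection is an index supplied by the caller, standing in for the annotator's choice.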

This corpus has both previous and subsequent versions. They are, respectively:

Arabic Treebank: Part 3 v 1.0 (LDC2004T11) - POS annotation only
Arabic Treebank: Part 3 v 3.2 (LDC2010T08) - Contains significant revisions

Samples

For examples of the data contained in this corpus, please view this POS sample (XML) and this Treebank sample (XML).

Updates

None at this time.
