Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)
|Item Name:||Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis)|
|Author(s):||Mohamed Maamouri, Ann Bies, Tim Buckwalter, Hubert Jin|
|LDC Catalog No.:||LDC2005T02|
|Release Date:||February 15, 2005|
|Application(s):||automatic content extraction, cross-lingual information retrieval, information detection, natural language processing|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2005T02 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Maamouri, Mohamed, et al. Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) LDC2005T02. Web Download. Philadelphia: Linguistic Data Consortium, 2005.|
Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + syntactic analysis) was developed by the Linguistic Data Consortium (LDC) and contains 123,795 Arabic word tokens with part-of-speech (POS) and syntactic treebank annotation. The POS annotation includes the lexical category, inflectional features, a gloss, full vocalization, and case ending.
The goal of the Arabic Treebank project is to support the development of data-driven approaches to natural language processing (NLP), human language technologies, automatic content extraction (topic extraction and/or grammar extraction), cross-lingual information retrieval, information detection, and general linguistic research on Modern Standard Arabic. LDC was sponsored to develop an Arabic corpus of 1 million words with POS and treebank annotation.
The Penn Arabic Treebank began in November 2001 as part of the DARPA TIDES project, with the objective of annotating a large machine-readable Arabic text corpus both manually and automatically. It is particularly suitable for language developers, computational linguists, and computer scientists who are interested in various aspects of NLP. This corpus is a re-release of Part 1 of that project, with improved morphological/part-of-speech annotation (including full vocalization and case endings).
The previous and subsequent versions of this corpus are, respectively, Arabic Treebank: Part 1 v 2.0 and Arabic Treebank: Part 1 v 4.1.
The following table gives a breakdown of the data contained in the entire Arabic Treebank project, with discrepancies between versions for Parts 1 and 3. The fields are the source, the number of stories, the total number of tokens, the number of tokens after clitic separation, and the number of Arabic word tokens remaining after punctuation, numbers, and Latin strings have been removed. The totals given at the bottom are calculated from the latest versions where discrepancies exist, and do not include tokens after clitic separation since that number is missing from Part 4.
|Part||Source||Stories||Total Tokens||Tokens After Clitic Separation||Arabic Word Tokens|
|1 (V 2.0)||Agence France Presse||734||140,265||168,123||N/A|
|1 (V 3.0 and 4.1)||Agence France Presse||734||145,386||166,068||123,795|
|3 (V 1.0 and 2.0)||An Nahar News Agency||600||340,281||400,213||293,035|
|3 (V 3.2)||An Nahar News Agency||599||339,710||402,291||292,554|
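The version discrepancies in the table can be checked with a few lines of arithmetic. The following is a minimal Python sketch, not part of the corpus distribution: the counts are copied from the rows above, while the names `ROWS`, `clitic_expansion`, and `version_deltas` are hypothetical helpers introduced here for illustration. It assumes the rows for each part are listed from the earliest version to the latest.

```python
# Counts copied from the table above:
# (part, version label, total tokens, tokens after clitic separation)
ROWS = [
    ("1", "V 2.0", 140_265, 168_123),
    ("1", "V 3.0 and 4.1", 145_386, 166_068),
    ("3", "V 1.0 and 2.0", 340_281, 400_213),
    ("3", "V 3.2", 339_710, 402_291),
]


def clitic_expansion(total_tokens, separated_tokens):
    """Return how many tokens exist after clitic separation per source token."""
    return separated_tokens / total_tokens


def version_deltas(rows):
    """Return, per part, (change in total tokens, change in separated tokens).

    Assumes each part's rows are ordered from the earliest to the latest
    version, so the change is computed as latest minus earliest.
    """
    by_part = {}
    for part, _label, total, separated in rows:
        by_part.setdefault(part, []).append((total, separated))
    return {
        part: (counts[-1][0] - counts[0][0], counts[-1][1] - counts[0][1])
        for part, counts in by_part.items()
    }


if __name__ == "__main__":
    for part, label, total, separated in ROWS:
        ratio = clitic_expansion(total, separated)
        print(f"Part {part} ({label}): {ratio:.3f} tokens per source token")
    print(version_deltas(ROWS))
```

For Part 1, the later versions have 5,121 more total tokens but 2,055 fewer tokens after clitic separation than V 2.0. For Part 3 the direction is reversed: 571 fewer total tokens and 2,078 more tokens after clitic separation.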
This corpus uses Modern Standard Arabic text from the Agence France Presse (AFP) newswire archives for July through November 2000, originally released in Arabic Gigaword (LDC2003T12). For this work, annotators must be native speakers of Arabic, and they must understand enough linguistics to check morphosyntactic analysis and build syntactic structures.