Prague Arabic Dependency Treebank 1.0

Item Name: Prague Arabic Dependency Treebank 1.0
Author(s): Jan Hajic, Otakar Smrz, Petr Zemanek, Petr Pajas, Jan Snaidauf, Emanuel Beska, Jakub Kracmar, Kamila Hassanova
LDC Catalog No.: LDC2004T23
ISBN: 1-58563-319-4
ISLRN: 034-001-778-929-8
Release Date: November 19, 2004
Member Year(s): 2004
DCMI Type(s): Text
Data Source(s): newswire
Project(s): TIDES, GALE
Application(s): language teaching, language modeling, information retrieval, information extraction, cross-lingual information retrieval, machine translation, parsing, tagging
Language(s): Standard Arabic
Language ID(s): arb
License(s): Prague Arabic Dependency Treebank 1.0
Online Documentation: LDC2004T23 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Hajic, Jan, et al. Prague Arabic Dependency Treebank 1.0 LDC2004T23. Web Download. Philadelphia: Linguistic Data Consortium, 2004.

Introduction

The Prague Arabic Dependency Treebank (PADT) consists of multi-level linguistic annotations of Modern Standard Arabic and also provides a variety of unique software implementations designed for general use in Natural Language Processing (NLP).

The PADT project can be summarized as an open-ended activity of the Center for Computational Linguistics, the Institute of Formal and Applied Linguistics, and the Institute of Comparative Linguistics, Charles University in Prague, consisting in multi-level annotation of Arabic language resources in the light of the theory of Functional Generative Description. The project is a younger sibling of the Prague Dependency Treebank for Czech and is maintained in cooperation with the Linguistic Data Consortium, University of Pennsylvania, which releases non-annotated corpora of Arabic newswire and develops an independent Penn Arabic Treebank.

Data

The corpus of PADT 1.0 consists of morphologically and analytically annotated newswire texts of Modern Standard Arabic, which originate from the Arabic Gigaword and the plain data of Penn Arabic Treebank, Part 1 and Penn Arabic Treebank, Part 2.

The PADT 1.0 distribution comprises over 113,500 tokens of data that are annotated analytically and provided with disambiguated morphological information. In addition, the release includes complete MorphoTrees annotations covering more than 148,000 tokens, 49,000 of which have also received analytical processing. The contents are further divided into data sets as indicated in the Table.

Data Set  [A] Tokens  [M] Tokens  Tokens/Para  Tokens/Doc  Original Data Provider  News Period     Related Corpora
AFP       13,000      N/A         34.6 [N/A]   260 [N/A]   Agence France Presse    July 2000       Penn ATB Part 1
UMH       38,500      N/A         43.6 [N/A]   290 [N/A]   Ummah Press Service     Spring 2002     Penn ATB Part 2
XIN       13,500      N/A         31.2 [N/A]   155 [N/A]   Xinhua News Agency      May 2003        Arabic Gigaword
ALH       10,000      73,500      47.0 [47.8]  405 [405]   Al Hayat News Agency    September 2001  Arabic Gigaword
ANN       12,500      25,500      60.3 [50.3]  740 [630]   An Nahar News Agency    November 2002   Arabic Gigaword
XIA       26,500      49,500      29.7 [25.9]  235 [205]   Xinhua News Agency      May 2003        Arabic Gigaword

In the Table, the token columns give the number of syntactic units that are annotated [A] analytically and [M] within MorphoTrees. The next columns give approximate ratios of tokens per paragraph and tokens per document, with the values for MorphoTrees annotation in brackets. The sets of selected documents may cover only a few days of the specified period.
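The per-set figures in the Table can be checked against the overall totals quoted above. The following is a minimal sketch in Python, assuming the rounded token counts exactly as printed in the Table; the names `DATA_SETS` and `totals` are illustrative and not part of the PADT distribution. It also assumes that the 49,000 analytically processed MorphoTrees tokens correspond to the [A] counts of the three sets that have [M] annotation, which the source does not state explicitly.

```python
# Rounded token counts copied from the Table: (analytical [A], MorphoTrees [M]).
# None marks an N/A entry. These are approximations, not exact corpus counts.
DATA_SETS = {
    "AFP": (13_000, None),
    "UMH": (38_500, None),
    "XIN": (13_500, None),
    "ALH": (10_000, 73_500),
    "ANN": (12_500, 25_500),
    "XIA": (26_500, 49_500),
}


def totals(data_sets):
    """Return (all [A] tokens, all [M] tokens, [A] tokens of sets that have [M])."""
    analytical = sum(a for a, _ in data_sets.values())
    morphotrees = sum(m for _, m in data_sets.values() if m is not None)
    # Assumption: sets with MorphoTrees annotation contribute their [A] tokens
    # to the "analytically processed" share of the MorphoTrees data.
    both = sum(a for a, m in data_sets.values() if m is not None)
    return analytical, morphotrees, both


analytical_total, morphotrees_total, both_total = totals(DATA_SETS)
print(analytical_total, morphotrees_total, both_total)
```

Under these assumptions the rounded sums come to 114,000 analytical tokens, 148,500 MorphoTrees tokens, and 49,000 tokens with both kinds of annotation. This agrees with the quoted "over 113,500", "more than 148,000", and "49,000" once the rounding of the per-set figures is taken into account.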

Samples

Preview of paragraph morphology tree. New analytical rendering style.

Support

PADT 1.0 was supported by the Ministry of Education of the Czech Republic, projects LN00A063 and MSM113200006, and by the Grant Agency of the Czech Republic, project 405/02/0823.

Updates

Updates or bug fixes may be available in the LDC catalog entry for this corpus, or at the PADT website.

Your questions and suggestions are welcome at padt (at) ckl (dot) mff (dot) cuni (dot) cz.
