Buckwalter Arabic Morphological Analyzer Version 1.0

Item Name: Buckwalter Arabic Morphological Analyzer Version 1.0
Author(s): Tim Buckwalter
LDC Catalog No.: LDC2002L49
ISBN: 1-58563-257-0
ISLRN: 435-186-167-011-2
Release Date: November 8, 2002
Member Year(s): 2002
DCMI Type(s): Text
Data Source(s): dictionaries
Project(s): TIDES, GALE
Application(s): information retrieval, machine translation, natural language processing
Language(s): Standard Arabic
Language ID(s): arb
License(s): BAMA Agreement
Online Documentation: LDC2002L49 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Buckwalter, Tim. Buckwalter Arabic Morphological Analyzer Version 1.0 LDC2002L49. Web Download. Philadelphia: Linguistic Data Consortium, 2002.

Introduction

Buckwalter Arabic Morphological Analyzer Version 1.0 was produced by the Linguistic Data Consortium (LDC) under catalog number LDC2002L49 and ISBN 1-58563-257-0. The analyzer is used for morphological analysis and part-of-speech (POS) tagging of Arabic text.

Data

The data consists primarily of three Arabic-English lexicon files: prefixes (299 entries), suffixes (618 entries), and stems (82,158 entries representing 38,600 lemmas). The lexicons are supplemented by three morphological compatibility tables that control prefix-stem combinations (1,648 entries), stem-suffix combinations (1,285 entries), and prefix-suffix combinations (598 entries). The code for morphological analysis and POS tagging is contained in a Perl script. The documentation is a readme file that describes the lexicon files, the morphological compatibility tables, and the morphological analysis algorithm, and that includes a summary of stem morphological categories and a table of the author's Arabic transliteration system.
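The interplay between the three lexicons and the three compatibility tables can be illustrated with a minimal Python sketch. It is not the distributed Perl script: the lexicon entries, the category labels (P0, N, S0, and so on), and the function names are all invented for illustration, and the real files are far larger and use their own category names. The sketch assumes that a word is split into every possible prefix + stem + suffix segmentation and that a segmentation is accepted only when all three category pairs appear in the corresponding tables.

```python
# Toy lexicons mapping a transliterated string to its morphological categories.
# All entries and category labels are hypothetical, not taken from the LDC files.
PREFIXES = {"": ["P0"], "w": ["P_conj"], "Al": ["P_def"], "wAl": ["P_conj_def"]}
STEMS = {"ktAb": ["N"], "ktb": ["V_perf"]}
SUFFIXES = {"": ["S0"], "t": ["S_perf_1s"]}

# Toy compatibility tables: sets of category pairs that may co-occur.
PREFIX_STEM = {
    ("P0", "N"), ("P_conj", "N"), ("P_def", "N"), ("P_conj_def", "N"),
    ("P0", "V_perf"), ("P_conj", "V_perf"),
}
STEM_SUFFIX = {("N", "S0"), ("V_perf", "S0"), ("V_perf", "S_perf_1s")}
PREFIX_SUFFIX = {
    ("P0", "S0"), ("P_conj", "S0"), ("P_def", "S0"), ("P_conj_def", "S0"),
    ("P0", "S_perf_1s"), ("P_conj", "S_perf_1s"),
}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return the segmentations whose parts are all in the lexicons and
    whose categories are pairwise compatible according to the three tables."""
    results = []
    for prefix, stem, suffix in segmentations(word):
        if prefix not in PREFIXES or stem not in STEMS or suffix not in SUFFIXES:
            continue
        for p_cat in PREFIXES[prefix]:
            for s_cat in STEMS[stem]:
                for x_cat in SUFFIXES[suffix]:
                    if ((p_cat, s_cat) in PREFIX_STEM
                            and (s_cat, x_cat) in STEM_SUFFIX
                            and (p_cat, x_cat) in PREFIX_SUFFIX):
                        results.append((prefix, stem, suffix))
    return results


if __name__ == "__main__":
    for w in ("wAlktAb", "ktbt", "Alktb"):
        print(w, analyze(w))
```

In this toy setup, "Alktb" yields no analysis because the definite-article prefix category is not listed as compatible with the verb stem category, which is the kind of filtering the compatibility tables are described as providing.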

Updates

Six files in the data were named with letter case that did not match their names in the documentation and the script, which caused the analyzer to crash on case-sensitive systems. This problem has been fixed, and the corrected version of the analyzer is now available for download.

The Linguistic Data Consortium is releasing this software under the GNU General Public License. Organizations interested in licensing the lexicon and/or morphological analyzer for commercial use should contact QAMUS LLC, 448 South 48th St., Philadelphia, PA 19143, ATTN: Tim Buckwalter, email: info@qamus.org

Note

This corpus is available free of charge as a web download; to obtain the data, submit a request to ldc@ldc.upenn.edu. There is a $100 charge if the data is requested on CD-ROM.
