
Buckwalter Arabic Morphological Analyzer Version 2.0

Item Name:	Buckwalter Arabic Morphological Analyzer Version 2.0
Author(s):	Tim Buckwalter
LDC Catalog No.:	LDC2004L02
ISBN:	1-58563-324-0
ISLRN:	694-194-540-336-4
DOI:	https://doi.org/10.35111/050q-5r95
Release Date:	December 15, 2004
Member Year(s):	2004
DCMI Type(s):	Text
Project(s):	GALE, TIDES
Application(s):	information retrieval, machine translation, natural language processing
Language(s):	Standard Arabic, English
Language ID(s):	arb, eng
License(s):	BAMA Agreement
Online Documentation:	LDC2004L02 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Buckwalter, Tim. Buckwalter Arabic Morphological Analyzer Version 2.0 LDC2004L02. Web Download. Philadelphia: Linguistic Data Consortium, 2004.
Related Works:
isVersionOf LDC2002L49 Buckwalter Arabic Morphological Analyzer Version 1.0
hasVersion LDC2010L01 LDC Standard Arabic Morphological Analyzer (SAMA) Version 3.1
relatesTo LDC2005T30 Arabic Treebank: Part 4 v 1.0 (MPG Annotation)
processes LDC2005T20 Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis)
LDC2006T15 Gulf Arabic Conversational Telephone Speech, Transcripts
LDC2006T20 Arabic Broadcast News Transcripts
LDC2016S09 IARPA Babel Pashto Language Pack IARPA-babel104b-v0.4bY
LDC2025S05 IWSLT 2022-2023 Shared Task Training, Development and Test Set

Introduction

Buckwalter Arabic Morphological Analyzer Version 2.0 was developed by Tim Buckwalter at the Linguistic Data Consortium (LDC) and contains a Perl script for morphological analysis and part-of-speech (POS) tagging of Arabic text. The release includes lexicons with approximately 83,000 entries covering Arabic prefixes, suffixes, and stems, as well as compatibility tables that the script consults when analyzing text.

The analyzer considers each Arabic word token in all possible prefix-stem-suffix segmentations and lists all known/possible annotation solutions, POS labels, and glosses. The generated output may then be reviewed by users, and the most appropriate annotation selected from among several choices.
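The segmentation step described above can be illustrated with a minimal Python sketch. This is not the Perl script shipped in the release. The function name `segmentations` and the `max_prefix` and `max_suffix` limits are assumptions chosen for illustration, and the input is assumed to be a transliterated word token.

```python
from typing import Iterator, Tuple


def segmentations(word: str, max_prefix: int = 4,
                  max_suffix: int = 6) -> Iterator[Tuple[str, str, str]]:
    """Yield every (prefix, stem, suffix) split of `word` with a non-empty stem.

    The length limits are illustrative assumptions, not values taken from
    the catalog description.
    """
    n = len(word)
    # The prefix may be empty; at least one character must remain for the stem.
    for p in range(0, min(max_prefix, n - 1) + 1):
        # The suffix may be empty; the stem keeps at least one character.
        for s in range(0, min(max_suffix, n - p - 1) + 1):
            yield word[:p], word[p:n - s], word[n - s:]


if __name__ == "__main__":
    for prefix, stem, suffix in segmentations("wktb"):
        print(repr(prefix), repr(stem), repr(suffix))
```

Each candidate produced this way would still need to be looked up in the lexicons and filtered by the compatibility tables before it counts as a solution.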

This tool has been used frequently for LDC releases of annotated Arabic text.

Data

The data consists primarily of the Perl script, lexicons, and compatibility tables.

Here are the three Arabic-English lexicon files:

Prefixes (299 entries)
Suffixes (618 entries)
Stems (82,158 entries representing 38,600 lemmas)

The lexicons are supplemented by three morphological compatibility tables used for controlling possible word part combinations:

Prefix-stem (1,648 entries)
Stem-suffix (1,285 entries)
Prefix-suffix (598 entries)
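
As a rough illustration of how three lexicons and three compatibility tables can work together, the following Python sketch accepts a segmentation only when all three pairwise category checks pass. Everything in it is hypothetical: the toy entries, the category names, the glosses, and the `analyze` function are invented for this example and do not reproduce the file formats or contents of the release.

```python
from typing import List, NamedTuple, Tuple


class Entry(NamedTuple):
    category: str  # morphological category used in the compatibility checks
    gloss: str     # English gloss


# Toy lexicons keyed by surface form; "" stands for "no prefix" / "no suffix".
PREFIXES = {"": [Entry("Pref-0", "")], "w": [Entry("Pref-Wa", "and")]}
STEMS = {"ktb": [Entry("PV", "write"), Entry("N", "books")]}
SUFFIXES = {"": [Entry("Suff-0", "")], "t": [Entry("PVSuff-t", "I/you/she")]}

# Toy compatibility tables: sets of category pairs that may combine.
PREFIX_STEM = {("Pref-0", "PV"), ("Pref-0", "N"),
               ("Pref-Wa", "PV"), ("Pref-Wa", "N")}
STEM_SUFFIX = {("PV", "PVSuff-t"), ("PV", "Suff-0"), ("N", "Suff-0")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-0", "PVSuff-t"),
                 ("Pref-Wa", "Suff-0"), ("Pref-Wa", "PVSuff-t")}


def analyze(word: str) -> List[Tuple[str, str, str, str]]:
    """Return (prefix, stem, suffix, stem category) for each valid solution."""
    solutions = []
    n = len(word)
    for p in range(n):              # prefix length, stem stays non-empty
        for s in range(n - p):      # suffix length, stem stays non-empty
            prefix, stem, suffix = word[:p], word[p:n - s], word[n - s:]
            for pre in PREFIXES.get(prefix, []):
                for st in STEMS.get(stem, []):
                    if (pre.category, st.category) not in PREFIX_STEM:
                        continue
                    for suf in SUFFIXES.get(suffix, []):
                        if ((st.category, suf.category) in STEM_SUFFIX and
                                (pre.category, suf.category) in PREFIX_SUFFIX):
                            solutions.append(
                                (prefix, stem, suffix, st.category))
    return solutions


if __name__ == "__main__":
    print(analyze("wktbt"))
    print(analyze("ktb"))
```

In this toy setup the noun reading of the stem is rejected for "wktbt" because its category does not pair with the verbal suffix in the stem-suffix table, while the bare form "ktb" keeps both readings.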

The documentation consists of a readme file with a description of the lexicon files, the morphological compatibility tables, the morphology analysis algorithm, a summary of stem morphological categories, and a table with the author's Arabic transliteration system.

Samples

To see an example of the analyzer's output, please examine this sample.

Updates

There are no updates available at this time.

Additional Licensing Instructions

This 'members-only' corpus is available to current members who can request the data at the listed reduced-license fee. Contact ldc@ldc.upenn.edu for information about becoming a member.
