This file contains documentation on the Buckwalter Arabic Morphological Analyzer Version 2.0, Linguistic Data Consortium (LDC) catalog number LDC2004L02 and ISBN 1-58563-311-9.
Note: This release, unlike Version 1, is available only to LDC members. To find out how to join, please consult our FAQ. Additional licensing terms apply. To examine the license, please follow the Member License Online link above. You will also be presented with this license upon download and asked to accept it. You must accept the terms in order for the download to proceed.
The data consists primarily of three Arabic-English lexicon files: prefixes (548 entries), suffixes (906 entries), and stems (78,839 entries representing 40,219 lemmas). The lexicons are supplemented by three morphological compatibility tables used for controlling prefix-stem combinations (2,435 entries), stem-suffix combinations (1,612 entries), and prefix-suffix combinations (1,138 entries). The actual code for morphological analysis and POS tagging is contained in a Perl script (AraMorph.pl). A sample input file (infile.txt) and the corresponding output file (outfile.xml) are provided. The documentation consists of a readme file with a description of the three lexicon files, the three morphological compatibility tables, the morphology analysis algorithm, and a table with the author's Arabic transliteration system.
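As a rough illustration of how three lexicons and three compatibility tables can work together, the following Python sketch splits a transliterated word into every possible prefix-stem-suffix segmentation, looks each part up in its lexicon, and keeps only those analyses whose category pairs appear in all three tables. This is not the code in AraMorph.pl: the lexicon entries, the category names (Pref-0, Pref-Wa, N, Suff-0, NSuff-At), and the function names are made up for the example, and the real file formats are described in the readme.

```python
from itertools import product

# Toy lexicons: surface form -> list of (category, gloss).
# Every entry and category name here is hypothetical, for illustration only.
PREFIXES = {"": [("Pref-0", "")], "w": [("Pref-Wa", "and")]}
STEMS = {"ktAb": [("N", "book")]}
SUFFIXES = {"": [("Suff-0", "")], "At": [("NSuff-At", "[fem.pl.]")]}

# Toy compatibility tables: sets of permitted category pairs.
PREFIX_STEM = {("Pref-0", "N"), ("Pref-Wa", "N")}
STEM_SUFFIX = {("N", "Suff-0")}
PREFIX_SUFFIX = {("Pref-0", "Suff-0"), ("Pref-Wa", "Suff-0")}


def segmentations(word):
    """Yield every (prefix, stem, suffix) split with a non-empty stem."""
    n = len(word)
    for i in range(n):                  # prefix is word[:i]
        for j in range(i + 1, n + 1):   # stem is word[i:j], suffix is word[j:]
            yield word[:i], word[i:j], word[j:]


def analyze(word):
    """Return (prefix, stem, suffix, gloss) for each compatible analysis."""
    results = []
    for pre, stem, suf in segmentations(word):
        # All three parts must be present in their lexicons.
        if pre not in PREFIXES or stem not in STEMS or suf not in SUFFIXES:
            continue
        for (pc, pg), (sc, sg), (xc, xg) in product(
            PREFIXES[pre], STEMS[stem], SUFFIXES[suf]
        ):
            # All three pairwise compatibility checks must pass.
            if (
                (pc, sc) in PREFIX_STEM
                and (sc, xc) in STEM_SUFFIX
                and (pc, xc) in PREFIX_SUFFIX
            ):
                gloss = " + ".join(g for g in (pg, sg, xg) if g)
                results.append((pre, stem, suf, gloss))
    return results


print(analyze("wktAb"))   # one analysis: prefix "w", stem "ktAb", empty suffix
print(analyze("ktAbAt"))  # no analysis: the toy stem-suffix table rejects N + NSuff-At
```

In this toy setup the second word is rejected only because the invented stem-suffix table does not list that pair. It shows how the compatibility tables filter segmentations that the lexicons alone would accept.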
To see an example of the analyzer's output, please examine this sample.
The release is available to 2004 and 2006 members via download here. Copies may also be requested on CD for an additional fee of US$150.
Copyright: Portions © 2002-2004 QAMUS LLC (www.qamus.org), © 2002-2004 Trustees of the University of Pennsylvania