Main Documentation for the LDC Moroccan Arabic - English Lexical Database

LDC-Catalog-ID: LDC2023L01


1.0 Introduction

The LDC Moroccan Arabic - English Lexical Database comprises a set of five
interrelated tables.  The combined content of the table set is essentially
equivalent to the Arabic-English portion of the Georgetown Dictionary of
Moroccan Arabic (Maamouri, 2018), which in turn was drawn mostly from the
Arabic-English portion of the Dictionary of Moroccan Arabic (Harrell, 1966).
 
The core feature of the dictionary is that it presents each Moroccan Arabic
(MA) word both as an orthographic form in Arabic script and as a pronunciation
form using the International Phonetic Alphabet (IPA).

The Arabic script spellings are based on the etymological values of the Arabic
alphabet, consistent with their common usage in Modern Standard Arabic (MSA).
Spellings are supplemented where necessary by vowel diacritic marks to help
clarify distinctive properties of MA dialectal forms.

The IPA pronunciations show the relations between the etymology-based Arabic
consonant values and their pronunciation values in MA.  These relations are
discussed in section 4.2 below.

The five tables, described in detail in section 3 below, are:

 - Roots (ary_root.tab): the set of consonantal bases that make up the primary
   organization of dictionary content

 - Lemmas (ary_lemma.tab): the inventory of dictionary headwords, represented
   by their numeric IDs only, along with part-of-speech labels and the numeric
   IDs of their associated roots

 - Wordforms (ary_wordform.tab): one or more Arabic orthographic forms and
   pronunciations for each lemma, including both citation forms and some
   related forms (particularly the plural forms for various nouns and
   adjectives)

 - English definitions (ary_eng_def.tab): one or more English definitions for
   each lemma, sometimes including common MA collocations having idiomatic
   meanings

 - Example phrases (ary_phrase.tab): one or more sentences to illustrate usage
   for a given definition in many (but not all) lemmas, with the MA sentence
   presented in both Arabic and IPA orthography, along with an English
   translation


2.0 Overview of Contents

The five tables are found in the "data" directory; the number of entries in
each table is as follows:

 Roots          3,567
 Lemmas        14,255
 Wordforms     19,927
 Definitions   24,911
 Phrases        4,418

The lemma inventory includes 6,923 nouns, 4,083 verbs, 2,957 adjectives
(including comparatives), 75 proper nouns, and 42 adverbs, along with several
dozen entries in 11 closed-class POS categories.

The phrases contain a total of over 21,800 MA word tokens (in both Arabic
script and IPA), and nearly 33,600 English word tokens.

The "docs" directory contains the following:

 - ary_schema.sql, ary_schema.mysql: table definitions (see 4.1 below)

 - ary_ipa_arabic.tab: table listing the IPA symbols used, their phonetic
   meanings, and their relationships to Arabic letters (see 4.2)

 - abbreviations.tab: list of abbreviations used in definition text (see 4.3)


3.0 Details of Database Structure

Each table is presented here as a tab-delimited, plain-text file, with Unicode
UTF-8 character encoding and UNIX/Linux-style line terminations (line-feed
character only, no carriage-return).  The first line of each file contains the
column labels, which are explained in the subsections below.

Although it is not strictly necessary, the numeric IDs assigned to the entries
are globally unique across all tables -- i.e. each table uses a distinct range
of ID numbers to identify its rows.  (There can
be gaps in the sequence of ID numbers within each table.)
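
As a minimal sketch (not part of the distributed material), the file format
described above could be read in Python roughly as follows.  The function name
"read_table" and the sample row are illustrative assumptions, and quote
handling is switched off on the assumption that fields are never wrapped in
quotation marks.

```python
import csv
import io


def read_table(stream):
    """Return the rows of one tab-delimited table as a list of dicts.

    The first line of the stream supplies the column labels.  Quoting is
    disabled (an assumption about the data): every field is taken literally.
    """
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# In-memory stand-in for a table file; the values are invented.
sample = io.StringIO("id\troot_id\tpos\n100001\t10001\tnoun\n", newline="")
rows = read_table(sample)
```

For a real file, the stream would typically be opened with
open(path, encoding="utf-8", newline="") so that the UTF-8 text and the
line-feed terminators are passed through unchanged.  All values come back as
strings, so numeric IDs need an explicit int() conversion where required.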


3.1 The Roots Table: ary_root.tab

  1	id -- globally unique numeric identifier
  2	ltrs -- Arabic letter sequence comprising the root
  3	r_type -- etymological status of the root
  4	r_indx -- index number for semantic differentiation

 Notes:
 - ID numbers range from 10001 to 13570
 - Every letter of every root is preceded and followed by a space -- e.g.:
   " ي و م "
 - "r_type" is one of:
   -- Foreign (the "root" is just the consonant skeleton of a borrowed term)
   -- Standard_Arabic (roots shared by MSA and/or other Arabic dialects)
   -- Dialect_Specific (roots that are not borrowed, and are unique to MA)
   -- Unspecified (five roots have not been categorized)
 - "r_indx" ranges from 1 to 5 -- i.e. some roots have as many as five
   distinct semantic subgroups of lemmas
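
The spacing convention for "ltrs" and the "r_type" inventory listed above can
be handled along the following lines.  This is only a sketch; the helper names
are hypothetical, and the example string reproduces the root shown in the
notes.

```python
# The four r_type values listed in the notes above.
R_TYPES = {"Foreign", "Standard_Arabic", "Dialect_Specific", "Unspecified"}


def root_letters(ltrs):
    """Split a root's "ltrs" value into its individual Arabic letters."""
    # str.split() with no argument also discards the leading/trailing spaces.
    return ltrs.split()


def root_key(ltrs):
    """Join the letters into an unspaced string, e.g. for use as a dict key."""
    return "".join(root_letters(ltrs))


def is_known_r_type(value):
    """True if the value is one of the four documented r_type labels."""
    return value in R_TYPES


# The example root from the notes, written with escapes to avoid any
# ambiguity in right-to-left display: YEH, WAW, MEEM.
example = " \u064a \u0648 \u0645 "
letters = root_letters(example)
```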


3.2 The Lemmas Table: ary_lemma.tab

  1	id -- globally unique numeric identifier
  2	root_id -- numeric ID of the ary_root entry for the lemma
  3	pos -- part-of-speech label for the lemma ("noun", "verb", etc.)

 Notes:
 - ID numbers range from 100001 to 114258
 - POS labels are all-lower-case with no abbreviations


3.3 The Wordforms Table: ary_wordform.tab

  1	id -- globally unique numeric identifier
  2	lemma_id -- numeric ID of the ary_lemma entry for the wordform
  3	orth -- Arabic script orthography
  4	pron -- IPA pronunciation
  5	form -- one of: "citation", "broken_plural", "fem_plural"
  6	w_usage -- one of: "archaic", "common"

 Notes:
 - ID numbers range from 200001 to 219927
 - There is always exactly one "citation" form for each lemma_id; this is the
   singular form for nouns and adjectives, and the 3rd-person-preterite form
   for verbs
 - There are only a handful of "archaic" forms
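
The "exactly one citation form per lemma" property noted above can be checked
with a few lines of Python.  The sketch below assumes the wordform rows have
already been read into dicts keyed by the column labels; the function names
and the sample rows are invented for illustration.

```python
from collections import Counter


def citation_counts(wordform_rows):
    """Count the rows marked "citation" for each lemma_id."""
    return Counter(
        row["lemma_id"] for row in wordform_rows if row["form"] == "citation"
    )


def lemmas_without_single_citation(lemma_ids, wordform_rows):
    """Return the lemma IDs that do not have exactly one citation form."""
    counts = citation_counts(wordform_rows)
    return [lemma_id for lemma_id in lemma_ids if counts[lemma_id] != 1]


# Invented rows, reduced to the two columns that the check needs.
rows = [
    {"lemma_id": "100001", "form": "citation"},
    {"lemma_id": "100001", "form": "broken_plural"},
    {"lemma_id": "100002", "form": "citation"},
]
problems = lemmas_without_single_citation(["100001", "100002", "100003"], rows)
```

With the distributed tables, the returned list would be expected to be empty.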


3.4 The English Definitions Table: ary_eng_def.tab

  1	id -- globally unique numeric identifier
  2	lemma_id -- numeric ID of the ary_lemma entry for the definition
  3	sense_label -- sequence number relative to other definitions for the lemma
  4	etext -- English definition

 Notes:
 - ID numbers range from 500001 to 524987
 - The number of senses per lemma (reflected in "sense_label") ranges from 1 to 28
 - The numeric ordering by "sense_label" may be arbitrary (i.e. does not
   necessarily represent relative frequency or preference in usage)
 - The "etext" field may include word tokens or phrases in Arabic script
   and/or IPA, providing common collocations with their specialized or
   idiomatic meanings in English
 - When a definition contains Arabic text, each contiguous Arabic string is
   surrounded by Unicode direction control characters:
   -- U+202B (RIGHT-TO-LEFT EMBEDDING) marks the beginning of each Arabic string 
   -- U+202C (POP DIRECTIONAL FORMATTING) marks the end of each Arabic string
 - When a definition contains IPA text, each contiguous pronunciation string is
   surrounded by square brackets:
   -- '[' marks the beginning of each IPA string
   -- ']' marks the end of each IPA string
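
The delimiters described above make it possible to separate the Arabic, IPA
and English parts of an "etext" value with regular expressions.  The following
is a sketch under the assumption that square brackets are used only for IPA
strings; the function name and the sample definition are invented.

```python
import re

RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING, opens each Arabic string
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING, closes each Arabic string

ARABIC_SPAN = re.compile(RLE + r"(.*?)" + PDF)
IPA_SPAN = re.compile(r"\[(.*?)\]")


def split_etext(etext):
    """Separate an etext value into Arabic strings, IPA strings and English."""
    arabic = ARABIC_SPAN.findall(etext)
    ipa = IPA_SPAN.findall(etext)
    # Remove both kinds of embedded span, then normalize the whitespace.
    english = IPA_SPAN.sub(" ", ARABIC_SPAN.sub(" ", etext))
    english = " ".join(english.split())
    return {"arabic": arabic, "ipa": ipa, "english": english}


# Invented sample, not taken from the data.
sample = "day; " + RLE + "\u064a\u0648\u0645" + PDF + " [jum] (sample only)"
parts = split_etext(sample)
```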


3.5 The Phrases Table: ary_phrase.tab

  1	id -- globally unique numeric identifier
  2	def_id -- numeric ID of the ary_eng_def entry for the phrase
  3	eng_text -- English translation of the phrase
  4	ara_text -- MA phrase in Arabic script
  5	ipa_text -- MA phrase in IPA

 Notes:
 - ID numbers range from 1 to 4422
 - Each phrase is linked to a specific entry in ary_eng_def; many ary_eng_def
   entries do not have example phrases
 - Each word token in ara_text is matched by a corresponding token in ipa_text
 - There are a few entries that contain an Arabic word token in the eng_text
   field; in these cases, the Arabic token is bracketed by the direction
   control characters U+202B ... U+202C
 - The ara_text field is always surrounded by direction control characters:
   initial U+202B and final U+202C
 - The ipa_text field is NOT surrounded by square brackets
 - Two entries contain a forward-slash as a separate token (" / ") in the
   eng_text field, to indicate alternative English phrasings
 - 18 entries contain a forward-slash as a separate token in both Arabic and
   IPA fields, to indicate alternative word choices for a phrase
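
Because each token in ara_text has a counterpart in ipa_text, the two fields
can be paired token by token once the direction control characters are
removed.  The sketch below assumes that tokens are separated by whitespace;
the function names and the sample phrase are invented for illustration.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def strip_direction_marks(text):
    """Remove the direction control characters that wrap Arabic strings."""
    return text.replace(RLE, "").replace(PDF, "")


def align_tokens(ara_text, ipa_text):
    """Pair each Arabic-script token with its IPA counterpart."""
    ara_tokens = strip_direction_marks(ara_text).split()
    ipa_tokens = ipa_text.split()
    if len(ara_tokens) != len(ipa_tokens):
        raise ValueError(
            "token counts differ: %d Arabic vs. %d IPA"
            % (len(ara_tokens), len(ipa_tokens))
        )
    return list(zip(ara_tokens, ipa_tokens))


# Invented two-token phrase, not taken from the data.
ara = RLE + "\u0647\u0627\u062f \u0627\u0644\u064a\u0648\u0645" + PDF
pairs = align_tokens(ara, "had ljum")
```

A separate " / " token that appears in both fields is paired with itself, so
the alternative-wording entries mentioned above should need no special
handling.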


4.0 Accompanying Documentation

4.1 Schema Definitions

The two schema definition files, "ary_schema.mysql" and "ary_schema.sql", are
essentially the same; the only difference is that the "mysql" version includes
additional specifications intended for use with a MySQL or MariaDB server.
These involve the type of database file format (InnoDB) and the character
encoding to be used.

The "mysql" version has been tested on a current MariaDB server, and the other
version has been tested with both PostgreSQL and SQLite.  Users should first
use the server of their choice to create an empty database, and execute the
commands in the chosen schema definition file to create the tables.  Once the
database and tables are in place, the tab-delimited files in data/ can be read
into each table in the sequence: root, lemma, wordform, eng_def, phrase.

In the case of SQLite, the first two steps can be carried out with a single
shell command line; for example (using a "bash" or equivalent shell, and
assuming that the command-line interface program for SQLite is installed and
appears in the user's shell PATH as "sqlite3"), the following creates a file
called "ary_eng_dict.db" to store the database, and defines the five tables
that comprise the schema:

  sqlite3 ary_eng_dict.db < ary_schema.sql

Then the following sequence of commands can be used to load the five tables:

  echo ".import --skip 1 ary_root.tab ary_root" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db
  echo ".import --skip 1 ary_lemma.tab ary_lemma" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db
  echo ".import --skip 1 ary_wordform.tab ary_wordform" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db
  echo ".import --skip 1 ary_eng_def.tab ary_eng_def" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db
  echo ".import --skip 1 ary_phrase.tab ary_phrase" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db

Of course, users can also load the tables into any spreadsheet application.
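
For users who prefer a script to the shell commands above, the same loading
sequence can be sketched with Python's standard "sqlite3" module.  This is an
illustration only: it assumes that the tables have already been created from
the schema file, that the column labels in each file's header line match the
column names in the schema, and that fields are not quote-wrapped.  The
function names are hypothetical.

```python
import csv
import sqlite3

# Loading sequence given above: root, lemma, wordform, eng_def, phrase.
LOAD_ORDER = ["ary_root", "ary_lemma", "ary_wordform", "ary_eng_def", "ary_phrase"]


def load_table(conn, table, stream):
    """Insert the rows of one tab-delimited stream into an existing table.

    Table and column names are interpolated into the SQL text, so this
    should only be used with trusted input such as the distributed files.
    """
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)  # first line holds the column labels
    columns = ", ".join(header)
    placeholders = ", ".join("?" for _ in header)
    sql = f"INSERT INTO {table} ({columns}) VALUES ({placeholders})"
    with conn:  # commit on success, roll back on error
        cursor = conn.executemany(sql, (row for row in reader if row))
    return cursor.rowcount


def load_all(db_path, data_dir="data"):
    """Load the five tables, in order, into an already-initialized database."""
    conn = sqlite3.connect(db_path)
    try:
        for table in LOAD_ORDER:
            path = f"{data_dir}/{table}.tab"
            with open(path, encoding="utf-8", newline="") as stream:
                load_table(conn, table, stream)
    finally:
        conn.close()
```

A call such as load_all("ary_eng_dict.db") would then take the place of the
five ".import" commands.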

4.2 Summary of Pronunciation-Orthographic Mappings

The file "ary_ipa_arabic.tab" contains five columns, as follows:

  1	ipa_ltr -- "b", etc.
  2	ipa_chr -- Unicode code-point value ("U+0062", etc.)
  3	ara_ltr -- "ب", etc.
  4	ara_chr -- Unicode code-point value ("U+0628", etc.)
  5	description -- phonetic definition ("voiced bilabial stop", etc.)

Two entries in this table have "(na)" as the value for columns 3 and 4:

 - IPA "e" represents an unstressed short vowel; when it appears in an IPA
   token, the corresponding Arabic orthography may contain either a "fatha" or
   "kasra" diacritic, or (in many phrase words) no diacritic at all; in
   effect, this IPA vowel represents a "schwa"-like quality, such that the
   articulatory position of the vowel is not a distinctive feature.

 - IPA "ː" (U+02D0) represents vowel length (it occurs only after "a", "i" or
   "u"); when it appears, the corresponding Arabic orthography will have some
   form of "alef" character (for "aː"), or "ي" (for "iː") or "و" (for "uː").
   Note that this character was chosen (instead of the ASCII colon ":"),
   because it is classified in Unicode as a "modifier letter" (rather than as
   punctuation), so that Unicode-aware processes will not treat it as a word
   boundary character.

One entry in this table ("ʔ ... glottal stop") relates to a set of six Arabic
letters, listed together in columns 3 and 4.  The use of one or another Arabic
letter in the position of a glottal stop depends on contextual and
etymological properties of the given word.
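
The practical effect of choosing U+02D0 over the ASCII colon can be seen with
a short Python check.  This is a sketch only; the token "kaːn" is an invented
example, and the "\w+" pattern stands in for whatever Unicode-aware tokenizer
a user might apply.

```python
import re
import unicodedata

LENGTH_MARK = "\u02d0"  # MODIFIER LETTER TRIANGULAR COLON


def word_tokens(text):
    """Split text into runs of Unicode word characters."""
    return re.findall(r"\w+", text)


# The length mark is a letter-class character, so the token stays whole.
with_length_mark = word_tokens("ka" + LENGTH_MARK + "n")

# The ASCII colon is punctuation, so the same token is split in two.
with_ascii_colon = word_tokens("ka:n")

length_mark_category = unicodedata.category(LENGTH_MARK)
colon_category = unicodedata.category(":")
```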

4.3 List of Abbreviations Used in English Definition Text

The file "abbreviations.tab" is a two-column table with abbreviations in
column 1 and their full meanings in column 2.  These abbreviations show up in
various entries of the fourth column ("etext") of "ary_eng_def.tab".


5.0 Notes on Orthographic Conventions

The IPA and Arabic spellings differ in terms of how segmental phonemic length
is represented for consonants and vowels.

In IPA, long consonants are indicated by doubling the consonant letter -- e.g.
"b" (short) vs. "bb" (long) -- and long vowels are indicated by placing a
vowel length mark ("modifier letter triangular colon") immediately after the
vowel letter: "a, i, u" (short) vs. "aː, iː, uː" (long).

In Arabic script orthography, long consonants are indicated by the "shadda"
diacritic mark applied to the consonant letter, and long vowels are indicated
by the use of either an alef character (for long "a") or one of the semivowel
characters ("و" for long "u", "ي" for long "i").  Short vowels are often not
indicated, but when they are, this is done by attaching a vowel diacritic mark
("fatha", "damma", or "kasra") to the preceding consonant letter.

In general, the "wordform" table has all short vowels represented as diacritic
marks in the Arabic orthography; in the "phrase" table, short-vowel diacritics
are often omitted where the vowel quality is predictable from context, or not
phonologically distinctive.
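
Since short-vowel diacritics are present in one table and often absent in the
other, comparing Arabic spellings across the two may require removing those
marks first.  The following sketch shows one possible approach; the function
names are hypothetical, the sample strings are invented, and only the three
short-vowel marks named above are removed (the shadda and any other marks are
left in place).

```python
import re

FATHA, DAMMA, KASRA = "\u064e", "\u064f", "\u0650"
SHADDA = "\u0651"
SHORT_VOWEL_MARKS = {FATHA, DAMMA, KASRA}
LENGTH_MARK = "\u02d0"  # IPA vowel length mark


def strip_short_vowels(orth):
    """Remove fatha, damma and kasra from an Arabic-script string."""
    return "".join(ch for ch in orth if ch not in SHORT_VOWEL_MARKS)


def same_skeleton(orth_a, orth_b):
    """Compare two spellings while ignoring short-vowel diacritics."""
    return strip_short_vowels(orth_a) == strip_short_vowels(orth_b)


def has_long_consonant(orth):
    """True if the Arabic spelling marks a long consonant with shadda."""
    return SHADDA in orth


def long_vowels(pron):
    """Return the long vowels (vowel + length mark) in an IPA string."""
    return re.findall("[aiu]" + LENGTH_MARK, pron)


# Invented string: KAF+fatha, TEH+shadda+fatha, BEH.
vocalized = "\u0643\u064e\u062a\u0651\u064e\u0628"
bare = strip_short_vowels(vocalized)
```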


6.0 Acknowledgments

The effort to produce this dictionary was initiated by grant awards from the
International Research Studies Program of the U.S. Department of Education
(#P017A0800441).  Additional support has been provided by Georgetown
University Press, and by the Linguistic Data Consortium.

Development of the dictionary took place over two phases.  In the first phase,
from 2010 to 2012, three native Moroccans were enlisted to work at the LDC on
the arduous task of both reorganizing and transliterating the original 1966
dictionary content into the form being presented here: Fatima Zohra Laghrissi,
Ikram Youssef and Youness Nabirh confronted the most difficult phase of the
project with diligence and dedication.

In the second phase, from 2013 to 2015, we benefitted immensely from the
efforts of researchers and advanced students at Al-Akhawayn University in
Ifrane, Morocco.  Dr. Abdellah Chekayri helped to refine the conventions for
both Arabic and IPA orthography.  Dr. Violetta Cavalli Sforza was instrumental
in bringing substantive improvements to our quality-control procedures and
tools, and led a well-organized annotation process to review and improve the
phase 1 content by filling gaps, extending coverage for modern usage, and
resolving inconsistencies.  She directed the Al-Akhawayn University team of
annotators: Youssef Ismaili, Hind Saddiki, Meryem Daiki, Sra El Hamdaoui,
Oualid El Meriague, Rachid Lamsairhri, Souhail Meftah, Ibtissam Ouazzani,
Hachem Saddiki and Maha Skah.


7.0 References

Harrell, R., Sobelman, H., eds. (1966 [2008]): A Dictionary of Moroccan Arabic.
  Georgetown University Press.  Washington, D.C.

Maamouri, M., ed. (2018): The Georgetown Dictionary of Moroccan Arabic.
  Georgetown University Press.  Washington, D.C.