Main Documentation for the LDC Iraqi Arabic - English Lexical Database

LDC-Catalog-ID: LDC2025L01


1.0 Introduction

The LDC Iraqi Arabic - English Lexical Database comprises a set of six
interrelated tables.  The combined content of the table set is essentially
equivalent to the Arabic-English portion of the Georgetown Dictionary of
Iraqi Arabic (Maamouri, 2013), which in turn was drawn mostly from the
Arabic-English portion of "A Dictionary of Iraqi Arabic" (Woodhead, D.R. and
Beene, W., eds, 2003).
 
The core feature of the dictionary is that it presents each Iraqi Arabic (IA)
word both as an orthographic form in Arabic script and as a pronunciation form
using the International Phonetic Alphabet (IPA).

The Arabic script spellings are based on the etymological values of the Arabic
alphabet, consistent with their common usage in Modern Standard Arabic (MSA).
Spellings are supplemented where necessary by vowel diacritic marks to help
clarify distinctive properties of IA dialectal forms.

The IPA pronunciations show the relations between the etymology-based Arabic
consonant values and their pronunciation values in IA.  These relations are
discussed in section 4.2 below.

The six tables, described in detail in section 3 below, are:

 - Roots (acm_root.tab): the set of consonantal bases that make up the primary
   organization of dictionary content

 - Lemmas (acm_lemma.tab): the inventory of dictionary headwords, represented
   by their numeric IDs only, along with part-of-speech labels and the numeric
   IDs of their associated roots

 - Wordforms (acm_wordform.tab): one or more Arabic orthographic forms and
   pronunciations for each lemma, including both citation forms and some
   related forms (particularly the plural forms for various nouns and
   adjectives)

 - Multi-word expressions (acm_mwe.tab): a partial, selective inventory of
   common phrases, comprising significant collocations, idioms, and some word
   sequences that take on the status of invariant function words

 - English definitions (acm_eng_def.tab): one or more English definitions for
   each lemma, sometimes including common IA collocations having idiomatic
   meanings

 - Example phrases (acm_phrase.tab): one or more sentences to illustrate usage
   for a given definition in many (but not all) lemmas, with the IA sentence
   presented in both Arabic and IPA orthography, along with an English
   translation

Another key feature of this set of tables is that they are mutually compatible
with the tables that comprise the previously published Moroccan Arabic-English
Lexical Database (LDC2023L01) -- see section 5.3 below for details on using
the two sets of tables as a single, unified database.


2.0 Overview of Contents

The six tables are found in the "data" directory; the number of entries
in each table is as follows:

 Roots          4,512
 Lemmas        17,224
 MWEs             261
 Wordforms     22,988
 Definitions   23,834
 Phrases       15,714

The lemma inventory includes 8448 nouns (including a few entries marked as
"nouncollective", "noununit", "nounverbal", or "proper noun"), 4867 verbs
(including entries marked as "verbpassive" and "verbpseudo"), 3619 adjectives
(including comparatives), and 135 adverbs, along with 155 entries in 10
closed-class POS categories.

The phrase inventory comprises a total of over 67,200 IA word tokens (in
both Arabic script and IPA), and over 120,800 English word tokens.

The "docs" directory contains the following:

 - acm_schema.sql, acm_schema.mysql: table definitions (see 4.1 below)

 - acm_ipa_arabic.tab: table listing the IPA symbols used, their phonetic
   meanings, and their relationships to Arabic letters (see 4.2)

 - abbreviations.tab: list of abbreviations used in definition text (see 4.3)


3.0 Details of Database Structure

Each table is presented here as a tab-delimited, plain-text file, with Unicode
UTF-8 character encoding and UNIX/Linux-style line terminations (line-feed
character only, no carriage-return).  The first line of each file contains the
column labels, which are explained in the subsections below.

While it is not strictly necessary to do so, the numeric IDs assigned to each
entry in each table are globally unique across all tables -- i.e. each table
uses a distinct range of ID numbers to uniquely identify each row.  (There can
be gaps in the sequence of ID numbers within each table.)
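
As a minimal sketch of how this format can be read (assuming Python 3 and its
standard "csv" module; the helper name "read_table" and the sample row are
illustrative only and are not part of the release):

```python
import csv
import io

def read_table(stream):
    """Yield one dict per data row, keyed by the column labels on line 1."""
    # QUOTE_NONE keeps any quotation marks in text fields as literal characters.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        yield row

# Invented sample in the acm_lemma.tab layout (not real database content).
sample = io.StringIO("id\troot_id\tpos\n100001\t10001\tnoun\n")
rows = list(read_table(sample))
print(rows[0]["pos"])
```

For a real file, the stream would come from something like
open("data/acm_lemma.tab", encoding="utf-8", newline="").  All values are
returned as strings, so numeric IDs need an explicit int() conversion.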


3.1 The Roots Table: acm_root.tab

  1	id -- globally unique numeric identifier
  2	ltrs -- Arabic letter sequence comprising the root
  3	r_type -- etymological status of the root
  4	r_indx -- index number for semantic differentiation

 Notes:
 - ID numbers range from 10001 to 14512
 - Every letter of every root is preceded and followed by a space -- e.g.:
   " ي و م "
 - "r_type" is one of:
   -- Foreign (the "root" is just the consonant skeleton of a borrowed term)
   -- Standard_Arabic (roots shared by MSA and/or other Arabic dialects)
   -- Dialect_Specific (roots that are not borrowed, and are unique to IA)
 - "r_indx" ranges from 1 to 9 -- i.e. some roots have as many as nine
   distinct semantic subgroups of lemmas
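
The space-padded "ltrs" format can be handled as in the following sketch
(Python 3 assumed; the function names "root_letters" and "pad_root" are
hypothetical helpers, not part of the release):

```python
def root_letters(ltrs):
    """Split a space-padded root string into a list of letters."""
    return ltrs.split()

def pad_root(letters):
    """Rebuild the stored form: every letter preceded and followed by a space."""
    return " " + " ".join(letters) + " "

# The example root from the note above, written with escapes:
# U+064A (yeh), U+0648 (waw), U+0645 (meem).
stored = " \u064a \u0648 \u0645 "
letters = root_letters(stored)
print(len(letters))
```

One practical consequence of the padding is that an exact-match lookup on
"ltrs" has to include the leading and trailing spaces, which is what
"pad_root" restores.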


3.2 The Lemmas Table: acm_lemma.tab

  1	id -- globally unique numeric identifier
  2	root_id -- numeric ID of the acm_root entry for the lemma
  3	pos -- part-of-speech label for the lemma ("noun", "verb", etc.)

 Notes:
 - ID numbers range from 100001 to 117224
 - POS labels are all-lower-case with no abbreviations


3.3 The Multiword Expressions Table: acm_mwe.tab

  1	id -- globally unique numeric identifier
  2	lemma_id -- numeric ID of the acm_lemma entry for one word of the MWE
  3	entry_type -- one of: collocation, idiom, function_mwe
  4	orth -- Arabic script orthography
  5	pron -- IPA pronunciation
  6	etext -- English translation of the expression

 Notes:
 - ID numbers range from 200001 to 200258
 - Each MWE is linked to the lemma entry for just one of its component words 
 - Multiple MWE entries may be linked to a single lemma entry, because they
   have the given word in common.
 - Some acm_phrase entries are linked to MWE entries (see 3.6 below)


3.4 The Wordforms Table: acm_wordform.tab

  1	id -- globally unique numeric identifier
  2	lemma_id -- numeric ID of the acm_lemma entry for the wordform
  3	orth -- Arabic script orthography
  4	pron -- IPA pronunciation
  5	form -- one of: "citation", "broken_plural", "fem_plural"
  6	w_usage -- one of: "archaic", "common"

 Notes:
 - ID numbers range from 300001 to 322988
 - There is always exactly one "citation" form for each lemma_id; this is the
   singular form for nouns and adjectives, and the 3rd-person-preterite form
   for verbs
 - There are roughly 500 "archaic" forms
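
The "exactly one citation form per lemma" property can be used and checked as
in the sketch below.  It assumes Python 3 with the standard "sqlite3" module;
the in-memory tables are stand-ins whose column names come from sections 3.2
and 3.4 but whose column types and rows are invented for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE acm_lemma    (id INTEGER, root_id INTEGER, pos TEXT);
CREATE TABLE acm_wordform (id INTEGER, lemma_id INTEGER, orth TEXT,
                           pron TEXT, form TEXT, w_usage TEXT);
-- Invented placeholder rows, not real dictionary content.
INSERT INTO acm_lemma VALUES (100001, 10001, 'noun');
INSERT INTO acm_wordform VALUES
  (300001, 100001, 'orth-sg', 'pron-sg', 'citation', 'common'),
  (300002, 100001, 'orth-pl', 'pron-pl', 'broken_plural', 'common');
""")

def citation_form(conn, lemma_id):
    """Return (orth, pron) of the citation form for one lemma, or None."""
    sql = ("SELECT orth, pron FROM acm_wordform "
           "WHERE lemma_id = ? AND form = 'citation'")
    return conn.execute(sql, (lemma_id,)).fetchone()

def lemmas_without_single_citation(conn):
    """List (lemma_id, n) for lemmas whose citation-form count is not 1."""
    sql = """
        SELECT l.id, COUNT(w.id)
        FROM acm_lemma AS l
        LEFT JOIN acm_wordform AS w
               ON w.lemma_id = l.id AND w.form = 'citation'
        GROUP BY l.id
        HAVING COUNT(w.id) != 1
    """
    return conn.execute(sql).fetchall()

print(citation_form(conn, 100001))
```

On the full database, "lemmas_without_single_citation" would be expected to
return an empty list if the note above holds for every lemma.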


3.5 The English Definitions Table: acm_eng_def.tab

  1	id -- globally unique numeric identifier
  2	lemma_id -- numeric ID of the acm_lemma entry for the definition
  3	sense_label -- sequence number relative to other definitions for the lemma
  4	etext -- English definition

 Notes:
 - ID numbers range from 600001 to 623834
 - The number of senses per lemma (reflected in "sense_label") ranges from 1 to 15
 - The numeric ordering by "sense_label" may be arbitrary (i.e. does not
   necessarily represent relative frequency or preference in usage)
 - The "etext" field may include word tokens or phrases in Arabic script
   and/or IPA, providing common collocations with their specialized or
   idiomatic meanings in English
 - When a definition contains Arabic text, each contiguous Arabic string is
   surrounded by Unicode direction control characters:
   -- U+202B (RIGHT-TO-LEFT EMBEDDING) marks the beginning of each Arabic string 
   -- U+202C (POP DIRECTIONAL FORMATTING) marks the end of each Arabic string
 - When a definition contains IPA text, each contiguous pronunciation string is
   surrounded by square brackets:
   -- '[' marks the beginning of each IPA string
   -- ']' marks the end of each IPA string
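
The delimiters described above make it possible to separate the three kinds
of text in an "etext" value.  The following is a sketch only (Python 3
assumed; "split_etext" is a hypothetical helper, the sample string is invented
in the documented format, and the code assumes that square brackets are not
used for anything other than IPA):

```python
import re

RLE, PDF = "\u202b", "\u202c"          # direction control characters
ARABIC_RUN = re.compile(RLE + r"(.*?)" + PDF)
IPA_RUN = re.compile(r"\[(.*?)\]")

def split_etext(etext):
    """Return (arabic_strings, ipa_strings, remaining_english_text)."""
    arabic = ARABIC_RUN.findall(etext)
    ipa = IPA_RUN.findall(etext)
    # Remove both kinds of embedded strings, then tidy the whitespace.
    plain = IPA_RUN.sub("", ARABIC_RUN.sub("", etext))
    plain = " ".join(plain.split())
    return arabic, ipa, plain

# Invented example, not an actual database entry.
sample = "gloss one; " + RLE + "\u064a\u0648\u0645" + PDF + " [jo\u02d0m] gloss two"
arabic, ipa, plain = split_etext(sample)
print(plain)
```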


3.6 The Phrases Table: acm_phrase.tab

  1	id -- globally unique numeric identifier
  2	def_id -- numeric ID of an acm_eng_def entry (may be null)
  3	mwe_id -- numeric ID of an acm_mwe entry (may be null)
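  4	eng_text -- English translation of the phrase
  5	ara_text -- IA phrase in Arabic script orthography
  6	ipa_text -- IA phrase in IPA pronunciation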

 Notes:
 - ID numbers range from 1 to 15714
 - Each phrase is linked to a specific entry in either acm_eng_def or acm_mwe
   (i.e. def_id is null when mwe_id is not null, and vice-versa)
 - Many acm_eng_def and acm_mwe entries do not have example phrases
 - Each word token in ara_text is matched by a corresponding token in ipa_text
 - The ara_text field is always surrounded by direction control characters:
   initial U+202B and final U+202C
 - The ipa_text field is NOT surrounded by square brackets
 - One entry contains a forward-slash as a separate token (" / ") in the
   eng_text field, to indicate alternative English phrasings
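
The token correspondence between "ara_text" and "ipa_text" can be exploited
as in this sketch.  It assumes Python 3 and that tokens are separated by
whitespace in both fields (an assumption, not something stated above);
"align_tokens" is a hypothetical helper and the sample values are invented.

```python
RLE, PDF = "\u202b", "\u202c"   # direction controls wrapped around ara_text

def align_tokens(ara_text, ipa_text):
    """Pair each Arabic-script token with its IPA counterpart."""
    # The control characters are not whitespace, so strip them explicitly.
    ara_tokens = ara_text.strip().strip(RLE + PDF).split()
    ipa_tokens = ipa_text.split()
    if len(ara_tokens) != len(ipa_tokens):
        raise ValueError("token counts differ: %d vs %d"
                         % (len(ara_tokens), len(ipa_tokens)))
    return list(zip(ara_tokens, ipa_tokens))

# Invented two-token example, not an actual database entry.
ara = RLE + "\u064a\u0648\u0645 \u064a\u0648\u0645" + PDF
ipa = "jo\u02d0m jo\u02d0m"
pairs = align_tokens(ara, ipa)
print(len(pairs))
```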


4.0 Accompanying Documentation

4.1 Schema definitions

The two schema definition files, "acm_schema.mysql" and "acm_schema.sql", are
essentially the same; the only difference is that the "mysql" version includes
additional specifications intended for use with a MySQL or MariaDB server.
These involve the type of database file format (InnoDB) and the character
encoding to be used.

The "mysql" version has been tested on a current MariaDB server, and the other
version has been tested with both PostgreSQL and SQLite.  Users should first
use the server of their choice to create an empty database, and execute the
commands in the chosen schema definition file to create the tables.  Once the
database and tables are in place, the tab-delimited files in data/ can be read
into each table in the sequence: root, lemma, wordform, mwe, eng_def, phrase.

In the case of SQLite, the first two steps can be carried out with a single
shell command line; for example (using a "bash" or equivalent shell, and
assuming that the command-line interface program for SQLite is installed and
appears in the user's shell PATH as "sqlite3"), the following creates a file
called "acm_eng_dict.db" to store the database, and defines the six tables
that comprise the schema:

  sqlite3 acm_eng_dict.db < acm_schema.sql

Then the following sequence of commands can be used to load the six tables
(run from the directory that contains the "*.tab" files):

  echo ".import --skip 1 acm_root.tab acm_root" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
  echo ".import --skip 1 acm_lemma.tab acm_lemma" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
  echo ".import --skip 1 acm_wordform.tab acm_wordform" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
  echo ".import --skip 1 acm_mwe.tab acm_mwe" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
  echo ".import --skip 1 acm_eng_def.tab acm_eng_def" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
  echo ".import --skip 1 acm_phrase.tab acm_phrase" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db

Of course, users can also load the tables into any spreadsheet application.
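
For users who prefer a scripting language to the "sqlite3" shell, the same
load can be sketched with Python's standard library.  This is an illustration
under stated assumptions: "load_tab" is a hypothetical helper, the CREATE
TABLE statement is a stand-in for acm_schema.sql with assumed column types,
the sample rows are invented, and treating empty fields as NULL is a guess
about how null values appear in the files.

```python
import csv
import io
import sqlite3

def load_tab(conn, table, stream):
    """Insert every data row of a tab-delimited stream into an existing table."""
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    header = next(reader)                       # line 1 holds the column labels
    cols = ", ".join(header)
    marks = ", ".join("?" for _ in header)
    # Table and column names are taken on trust from the caller and the file.
    sql = f"INSERT INTO {table} ({cols}) VALUES ({marks})"
    # Assumption: an empty field stands for NULL (e.g. def_id or mwe_id).
    rows = ([v if v != "" else None for v in row] for row in reader)
    conn.executemany(sql, rows)
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE acm_lemma (id INTEGER, root_id INTEGER, pos TEXT)")
sample = io.StringIO(
    "id\troot_id\tpos\n100001\t10001\tnoun\n100002\t10001\tverb\n")
load_tab(conn, "acm_lemma", sample)
count = conn.execute("SELECT COUNT(*) FROM acm_lemma").fetchone()[0]
print(count)
```

With a real database file, the tables would be loaded in the same order as
the shell commands above, passing each opened "*.tab" file as the stream.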

4.2 Summary of pronunciation-orthographic mappings

The file "acm_ipa_arabic.tab" contains five columns, as follows:

  1	ipa_ltr -- "b", etc.
  2	ipa_chr -- Unicode code-point value ("U+0062", etc.)
  3	ara_ltr -- "ب", etc.
  4	ara_chr -- Unicode code-point value ("U+0628", etc.)
  5	description -- phonetic definition ("voiced bilabial stop", etc.)

One entry in this table has "(na)" as the value for columns 3 and 4:

 - IPA "ː" (U+02D0) represents vowel length (it occurs only after "a", "i" or
   "u"); when it appears, the corresponding Arabic orthography will have some
   form of "aleph" character (for "aː"), or "ي" (for "iː") or "و" (for "uː").
   Note that this character was chosen (instead of the ASCII colon ":"),
   because it is classified in Unicode as a "modifier letter" (rather than as
   punctuation), so that Unicode-aware processes will not treat it as a word
   boundary character.
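
   The effect of this choice can be seen in a short sketch (Python 3 assumed;
   the string "ba" + length mark + "b" is an invented example, not a database
   entry):

```python
import re
import unicodedata

LENGTH_MARK = "\u02d0"   # MODIFIER LETTER TRIANGULAR COLON

# General category "Lm" means "Letter, modifier".
category = unicodedata.category(LENGTH_MARK)

# A Unicode-aware word pattern keeps the length mark inside the word...
with_mark = re.findall(r"\w+", "ba" + LENGTH_MARK + "b")
# ...whereas an ASCII colon would split the word in two.
with_colon = re.findall(r"\w+", "ba:b")

print(category, len(with_mark), len(with_colon))
```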

One entry in the table ("ʔ ... glottal stop") relates to a set of six Arabic letters,
listed together in columns 3 and 4.  The use of one or another Arabic letter
in the position of a glottal stop depends on contextual and etymological
properties of the given word.

One entry ("ð̣ ... voiced pharyngealized dental fricative") is actually a
digraph (letter plus under-dot diacritic mark), and so has two Unicode code
point values in column 2; also, this entry represents the Iraqi pronunciation
of two distinct Arabic consonant letters, so both Arabic letters are given
(separated by comma + space) in columns 3 and 4.

4.3 List of abbreviations used in English definition text

The file "abbreviations.tab" is a two-column table with abbreviations in
column 1 and their full meanings in column 2.  These abbreviations show up in
various entries of the fourth column ("etext") of "acm_eng_def.tab".


5.0 Notes on various conventions used in the tables

5.1 Orthography

The IPA and Arabic spellings differ in terms of how segmental phonemic length
is represented for consonants and vowels.

In IPA, long consonants are indicated by doubling the consonant letter -- e.g.
"b" (short) vs. "bb" (long) -- and long vowels are indicated by placing a
vowel length mark ("modifier letter triangular colon") immediately after the
vowel letter: "a, i, u" (short) vs. "aː, iː, uː" (long).

In Arabic script orthography, long consonants are indicated by the "shadda"
diacritic mark applied to the consonant letter, and long vowels are indicated
by the use of either an alef character (for long "a") or one of the semivowel
characters ("و" for long "u", "ي" for long "i").  Short vowels are often not
indicated, but when they are, this is done by attaching a vowel diacritic mark
("fatha", "damma", or "kasra") to the preceding consonant letter.

In general, the "wordform" table has all short vowels represented as diacritic
marks in the Arabic orthography; in the "phrase" table, short-vowel diacritics
are often omitted where the vowel quality is predictable from context, or not
phonologically distinctive.

Minor note on the ordering of Arabic diacritic marks: the original annotation
to create Arabic script orthographic forms for word forms and phrases was done
via Buckwalter transliteration.  In virtually all typical uses of Buckwalter,
the Arabic consonant length mark (shadda) is placed immediately after the
consonant it modifies; if any short-vowel diacritics are also used on the same
long consonant, they are placed after the shadda mark.  This conflicts with
the "canonical" ordering of Arabic diacritics as applied by the Unicode
Standard, which places shadda after short vowel marks.  In the present
release, the Buckwalter ordering has been maintained.
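
A practical consequence is that applying Unicode normalization to the data
will reorder these marks, so string comparisons against the tables are safest
when both sides are normalized the same way.  The sketch below (Python 3
assumed; "same_spelling" is a hypothetical helper and the letter sequence is
only an illustration) shows the reordering:

```python
import unicodedata

BEH, SHADDA, FATHA = "\u0628", "\u0651", "\u064e"

# Order used in this release: consonant, shadda, then short vowel.
buckwalter_order = BEH + SHADDA + FATHA

# Normalization applies canonical ordering, which moves shadda after fatha.
normalized = unicodedata.normalize("NFC", buckwalter_order)

def same_spelling(a, b):
    """Compare two strings while ignoring differences in diacritic order."""
    return unicodedata.normalize("NFD", a) == unicodedata.normalize("NFD", b)

print(normalized == buckwalter_order)
```

Here "normalized" holds the sequence consonant, fatha, shadda, so a plain
equality test against the released form fails, while "same_spelling" treats
the two orderings as equivalent.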

5.2 Multiword expressions, English definitions, and Phrases

The use of a separate table for multiword expressions was introduced fairly
late in the process of preparing the Iraqi data for release as an LDC corpus.
It was not in place during the annotation that led up to the publication of the
Georgetown Dictionary of Iraqi Arabic in 2013.

As a result, the "acm_mwe" table presented here is rather small.  Many items
in the "acm_eng_def" and "acm_phrase" tables could have been treated as MWE
entries, but have been left in place in those other two tables because that is
where they were placed during the main annotation effort carried out before
2013.

In particular, about 300 "acm_eng_def" entries have "etext" values that begin
with a parenthetical remark of the form "(with ...)", providing a word in
Arabic (such as a preposition, sometimes including an IPA pronunciation), 
followed by an English translation for the combination of the lemma with the
parenthesized word.

Also, in a number of "acm_phrase" entries, the "ara_text", "ipa_text", and
"eng_text" values present just a brief phrase (i.e. a collocation or idiom),
rather than a full sentence.

5.3 Inter-operability with other LDC Dialectal Arabic dictionaries

The previously published Moroccan Arabic-English Lexical Database (LDC2023L01)
uses the same overall database design as the Iraqi Arabic-English Lexical
Database.  The differences between the two can be summarized as follows:

 - table names begin with "acm_" (for Iraqi) vs. "ary_" (for Moroccan)
 - Iraqi has the "acm_mwe" table, while Moroccan has no "ary_mwe" table
 - the Iraqi "acm_phrase" table has a column called "mwe_id" for linking
   phrase entries to mwe entries; the Moroccan "ary_phrase" table does not
   have such a column

It's important to note that the two sets of tables are, for all practical
purposes, mutually compatible, to the extent that all the tables in both
dialects can be included in a single database -- e.g. a single SQLite file.

Here's an adaptation of the instructions from section 4.1 above, for loading
both dialects into a single database, using the file name "ara_dialects.db":

  sqlite3 ara_dialects.db < ary_schema.sql
  sqlite3 ara_dialects.db < acm_schema.sql

  echo ".import --skip 1 ary_root.tab ary_root" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 acm_root.tab acm_root" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 ary_lemma.tab ary_lemma" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 acm_lemma.tab acm_lemma" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 ary_wordform.tab ary_wordform" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 acm_wordform.tab acm_wordform" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 acm_mwe.tab acm_mwe" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 ary_eng_def.tab ary_eng_def" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 acm_eng_def.tab acm_eng_def" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 ary_phrase.tab ary_phrase" | sqlite3 -cmd ".mode tabs" ara_dialects.db
  echo ".import --skip 1 acm_phrase.tab acm_phrase" | sqlite3 -cmd ".mode tabs" ara_dialects.db

The command lines above assume that the 2 "*.sql" schema files and 11 "*.tab"
files have already been copied into the user's current working directory.

Combining the data this way will support queries that scan both dialects with
regard to common roots, shared wordforms, related glosses, etc., in addition
to allowing queries on one dialect to be easily applied to the other dialect,
simply by changing the first three letters of the table name(s) involved.
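
A cross-dialect query of this kind might look like the following sketch.  It
assumes Python 3 with the standard "sqlite3" module; the in-memory tables are
toy stand-ins trimmed to two columns, the rows use placeholder letters rather
than real roots, and "ary_root" is assumed to share the "ltrs" column name
with "acm_root".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Toy stand-ins, trimmed to the two columns the query needs.
CREATE TABLE acm_root (id INTEGER, ltrs TEXT);
CREATE TABLE ary_root (id INTEGER, ltrs TEXT);
-- Invented rows with placeholder letters; not real dictionary content.
INSERT INTO acm_root VALUES (10001, ' x y z '), (10002, ' p q r ');
INSERT INTO ary_root VALUES (1, ' x y z '), (2, ' k l m ');
""")

def shared_roots(conn):
    """Return (ltrs, iraqi_id, moroccan_id) for roots found in both dialects."""
    sql = """
        SELECT a.ltrs, a.id, m.id
        FROM acm_root AS a
        JOIN ary_root AS m ON m.ltrs = a.ltrs
        ORDER BY a.id, m.id
    """
    return conn.execute(sql).fetchall()

def count_rows(conn, dialect, table):
    """Run the same query on either dialect by swapping the table prefix."""
    assert dialect in ("acm", "ary")
    sql = f"SELECT COUNT(*) FROM {dialect}_{table}"
    return conn.execute(sql).fetchone()[0]

shared = shared_roots(conn)
print(len(shared))
```

Because a letter sequence can occur more than once within a dialect (with
different index values for semantic differentiation), a join on "ltrs" alone
may return several rows for the same root.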

(Note that the Moroccan database schema can be modified trivially, adding the
"ary_mwe" table, and the "mwe_id" field to the "ary_phrase" table, to make it
fully equivalent to the Iraqi schema, even though these added elements would
remain empty.)
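
One way such a modification could be made in SQLite is sketched below.  This
is an assumption-laden illustration: the "ary_phrase" definition is a stand-in
whose columns are assumed to mirror section 3.6 without "mwe_id", the
"ary_mwe" columns are copied from section 3.3, and all column types are
guesses rather than the contents of either schema file.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Stand-in for the existing Moroccan phrase table (assumed layout).
conn.execute("""
    CREATE TABLE ary_phrase (id INTEGER, def_id INTEGER,
                             eng_text TEXT, ara_text TEXT, ipa_text TEXT)
""")

# The two additions described in the note above.
conn.executescript("""
CREATE TABLE ary_mwe (id INTEGER, lemma_id INTEGER, entry_type TEXT,
                      orth TEXT, pron TEXT, etext TEXT);
ALTER TABLE ary_phrase ADD COLUMN mwe_id INTEGER;
""")

# PRAGMA table_info returns one row per column; field 1 is the column name.
cols = [row[1] for row in conn.execute("PRAGMA table_info(ary_phrase)")]
print(cols)
```

Note that ALTER TABLE appends the new column at the end, whereas "mwe_id" is
the third column of "acm_phrase"; this matters only for operations that rely
on column position rather than column name.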


6.0 Acknowledgments

The effort to produce this dictionary was initiated by grant awards from the
International Research Studies Program of the U.S. Department of Education
(#P017A0800441).  Additional support has been provided by Georgetown
University Press, and by the Linguistic Data Consortium.

Most of the development work for the dictionary took place from 2008 to 2011.
Native Iraqis were enlisted to work at the LDC in the arduous task of both
reorganizing and transliterating the original 2003 dictionary content into the
form being presented here: Alyaa Aboud, Tamara Ali Janeb, Zainab Alsawaf and
Safa Ismail confronted the most difficult phase of the project with diligence
and dedication.


7.0 References

Woodhead, D.R. and Beene, W., eds. (2003): A Dictionary of Iraqi Arabic,
  Arabic - English.  Georgetown University Press.  Washington, D.C.

Maamouri, M., ed. (2013): The Georgetown Dictionary of Iraqi Arabic.
  Georgetown University Press.  Washington, D.C.