Main Documentation for the LDC Iraqi Arabic - English Lexical Database

LDC-Catalog-ID: LDC2025L01

1.0 Introduction

The LDC Iraqi Arabic - English Lexical Database comprises a set of six interrelated tables. The combined content of the table set is essentially equivalent to the Arabic-English portion of the Georgetown Dictionary of Iraqi Arabic (Maamouri, 2013), which in turn was drawn mostly from the Arabic-English portion of "A Dictionary of Iraqi Arabic" (Woodhead, D.R. and Beene, W., eds, 2003).

The core feature of the dictionary is to present each Iraqi Arabic (IA) word as both an orthographic form in Arabic script, and a pronunciation form using the International Phonetic Alphabet (IPA). The Arabic script spellings are based on the etymological values of the Arabic alphabet, consistent with their common usage in Modern Standard Arabic (MSA). Spellings are supplemented where necessary by vowel diacritic marks to help clarify distinctive properties of IA dialectal forms. The IPA pronunciations show the relations between the etymology-based Arabic consonant values and their pronunciation values in IA. These relations are discussed in section 4.2 below.

The six tables, described in detail in section 3 below, are:

- Roots (acm_root.tab): the set of consonantal bases that make up the primary organization of dictionary content
- Lemmas (acm_lemma.tab): the inventory of dictionary headwords, represented by their numeric IDs only, along with part-of-speech labels and the numeric IDs of their associated roots
- Wordforms (acm_wordform.tab): one or more Arabic orthographic forms and pronunciations for each lemma, including both citation forms and some related forms (particularly the plural forms for various nouns and adjectives)
- Multi-word expressions (acm_mwe.tab): a partial, selective inventory of common phrases, comprising significant collocations, idioms, and some word sequences that take on the status of invariant function words
- English definitions (acm_eng_def.tab): one or more English definitions for each lemma, sometimes including common IA collocations having idiomatic meanings
- Example phrases (acm_phrase.tab): one or more sentences to illustrate usage for a given definition in many (but not all) lemmas, with the IA sentence presented in both Arabic and IPA orthography, along with an English translation

Another key feature of this set of tables is that they are mutually compatible with the tables that comprise the previously published Moroccan Arabic-English Lexical Database (LDC2023L01) -- see section 5.3 below for details on using the two sets of tables as a single, unified database.

2.0 Overview of Contents

The six tables are found in the "data" directory; the number of entries in each table is as follows:

    Roots           4,512
    Lemmas         17,224
    MWEs              261
    Wordforms      22,988
    Definitions    23,834
    Phrases        15,714

The lemma inventory includes 8,448 nouns (including a few entries marked as "nouncollective", "noununit", "nounverbal", or "proper noun"), 4,867 verbs (including entries marked as "verbpassive" and "verbpseudo"), 3,619 adjectives (including comparatives), and 135 adverbs, along with 155 entries in 10 closed-class POS categories. The phrase inventory comprises a total of over 67,200 IA word tokens (in both Arabic script and IPA), and over 120,800 English word tokens.

The "docs" directory contains the following:

- acm_schema.sql, acm_schema.mysql: table definitions (see 4.1 below)
- acm_ipa_arabic.tab: table listing the IPA symbols used, their phonetic meanings, and their relationships to Arabic letters (see 4.2)
- abbreviations.tab: list of abbreviations used in definition text (see 4.3)

3.0 Details of Database Structure

Each table is presented here as a tab-delimited, plain-text file, with Unicode UTF-8 character encoding and UNIX/Linux-style line terminations (line-feed character only, no carriage-return). The first line of each file contains the column labels, which are explained in the subsections below.

While it is not strictly necessary to do so, the numeric IDs assigned to the entries in each table are globally unique across all tables -- i.e. each table uses a distinct range of ID numbers to uniquely identify each row. (There can be gaps in the sequence of ID numbers within each table.)

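The file conventions described above (tab delimiters, UTF-8, line-feed terminators, one header line of column labels) can be handled by a generic reader. The following Python sketch shows one possible approach; it is not part of the release. The helper name "read_tab_table" is hypothetical, and the two sample rows are invented stand-ins laid out like acm_lemma.tab.

```python
import csv
import io

def read_tab_table(stream):
    """Read a tab-delimited table whose first line holds the column labels.

    Returns a list of dicts keyed by column label; all values are strings.
    """
    # QUOTE_NONE: treat any quote characters in the text as ordinary data.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

# Invented sample in the layout of acm_lemma.tab (not real database rows).
sample = io.StringIO(
    "id\troot_id\tpos\n"
    "100001\t10001\tnoun\n"
    "100002\t10001\tverb\n",
    newline="",
)
rows = read_tab_table(sample)
print(rows[0]["pos"])

# For a real file, open it as UTF-8 and disable newline translation:
# with open("data/acm_lemma.tab", encoding="utf-8", newline="") as f:
#     rows = read_tab_table(f)
```

Because every value is returned as a string, ID columns need an explicit int() conversion before numeric comparison.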
3.1 The Roots Table: acm_root.tab

    1 id -- globally unique numeric identifier
    2 ltrs -- Arabic letter sequence comprising the root
    3 r_type -- etymological status of the root
    4 r_indx -- index number for semantic differentiation

Notes:
- ID numbers range from 10001 to 14512
- Every letter of every root is preceded and followed by a space -- e.g.: " ي و م "
- "r_type" is one of:
  -- Foreign (the "root" is just the consonant skeleton of a borrowed term)
  -- Standard_Arabic (roots shared by MSA and/or other Arabic dialects)
  -- Dialect_Specific (roots that are not borrowed, and are unique to IA)
- "r_index" ranges from 1 to 9 -- i.e. some roots have as many as nine distinct semantic subgroups of lemmas

3.2 The Lemmas Table: acm_lemma.tab

    1 id -- globally unique numeric identifier
    2 root_id -- numeric ID of the acm_root entry for the lemma
    3 pos -- part-of-speech label for the lemma ("noun", "verb", etc.)

Notes:
- ID numbers range from 100001 to 117224
- POS labels are all-lower-case with no abbreviations

3.3 The Multiword Expressions Table: acm_mwe.tab

    1 id -- globally unique numeric identifier
    2 lemma_id -- numeric ID of the acm_lemma entry for one word of the MWE
    3 entry_type -- one of: collocation, idiom, function_mwe
    4 orth -- Arabic script orthography
    5 pron -- IPA pronunciation
    6 etext -- English translation of the expression

Notes:
- ID numbers range from 200001 to 200258
- Each MWE is linked to the lemma entry for just one of its component words
- Multiple MWE entries may be linked to a single lemma entry, because they have the given word in common.
- Some acm_phrase entries are linked to MWE entries (see 3.6 below)

3.4 The Wordforms Table: acm_wordform.tab

    1 id -- globally unique numeric identifier
    2 lemma_id -- numeric ID of the acm_lemma entry for the wordform
    3 orth -- Arabic script orthography
    4 pron -- IPA pronunciation
    5 form -- one of: "citation", "broken_plural", "fem_plural"
    6 w_usage -- one of: "archaic", "common"

Notes:
- ID numbers range from 300001 to 322988
- There is always exactly one "citation" form for each lemma_id; this is the singular form for nouns and adjectives, and the 3rd-person-preterite form for verbs
- There are roughly 500 "archaic" forms

3.5 The English Definitions Table: acm_eng_def.tab

    1 id -- globally unique numeric identifier
    2 lemma_id -- numeric ID of the acm_lemma entry for the definition
    3 sense_label -- sequence number relative to other definitions for the lemma
    4 etext -- English definition

Notes:
- ID numbers range from 600001 to 623834
- The number of senses per lemma (reflected in "sense_label") ranges from 1 to 15
- The numeric ordering by "sense_label" may be arbitrary (i.e. does not necessarily represent relative frequency or preference in usage)
- The "etext" field may include word tokens or phrases in Arabic script and/or IPA, providing common collocations with their specialized or idiomatic meanings in English
- When a definition contains Arabic text, each contiguous Arabic string is surrounded by Unicode direction control characters:
  -- U+202B (RIGHT-TO-LEFT EMBEDDING) marks the beginning of each Arabic string
  -- U+202C (POP DIRECTIONAL FORMATTING) marks the end of each Arabic string
- When a definition contains IPA text, each contiguous pronunciation string is surrounded by square brackets:
  -- '[' marks the beginning of each IPA string
  -- ']' marks the end of each IPA string

3.6 The Phrases Table: acm_phrase.tab

    1 id -- globally unique numeric identifier
    2 def_id -- numeric ID of an acm_eng_def entry (may be null)
    3 mwe_id -- numeric ID of an acm_mwe entry (may be null)
    4 eng_text -- English translation of the phrase
    5 ara_text -- the IA phrase in Arabic script orthography
    6 ipa_text -- the IA phrase in IPA

Notes:
- ID numbers range from 1 to 15714
- Each phrase is linked to a specific entry in either acm_eng_def or acm_mwe (i.e. def_id is null when mwe_id is not null, and vice-versa)
- Many acm_eng_def and acm_mwe entries do not have example phrases
- Each word token in ara_text is matched by a corresponding token in ipa_text
- The ara_text field is always surrounded by direction control characters: initial U+202B and final U+202C
- The ipa_text field is NOT surrounded by square brackets
- One entry contains a forward-slash as a separate token (" / ") in the eng_text field, to indicate alternative English phrasings

4.0 Accompanying Documentation

4.1 Schema definitions

The two schema definition files, "acm_schema.mysql" and "acm_schema.sql", are essentially the same; the only difference is that the "mysql" version includes additional specifications intended for use with a MySQL or MariaDB server. These involve the type of database file format (InnoDB) and the character encoding to be used.
The "mysql" version has been tested on a current MariaDB server, and the other version has been tested with both PostgreSQL and SQLite.

Users should first use the server of their choice to create an empty database, and execute the commands in the chosen schema definition file to create the tables. Once the database and tables are in place, the tab-delimited files in data/ can be read into their tables in the sequence: root, lemma, wordform, mwe, eng_def, phrase.

In the case of SQLite, the first two steps can be carried out with a single shell command line; for example (using a "bash" or equivalent shell, and assuming that the command-line interface program for SQLite is installed and appears in the user's shell PATH as "sqlite3"), the following creates a file called "acm_eng_dict.db" to store the database, and defines the six tables that comprise the schema:

    sqlite3 acm_eng_dict.db < acm_schema.sql

Then the following sequence of commands can be used to load the six tables:

    echo ".import --skip 1 acm_root.tab acm_root" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
    echo ".import --skip 1 acm_lemma.tab acm_lemma" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
    echo ".import --skip 1 acm_wordform.tab acm_wordform" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
    echo ".import --skip 1 acm_mwe.tab acm_mwe" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
    echo ".import --skip 1 acm_eng_def.tab acm_eng_def" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db
    echo ".import --skip 1 acm_phrase.tab acm_phrase" | sqlite3 -cmd ".mode tabs" acm_eng_dict.db

Of course, users can also load the tables into any spreadsheet application.

4.2 Summary of pronunciation-orthographic mappings

The file "acm_ipa_arabic.tab" contains five columns, as follows:

    1 ipa_ltr -- "b", etc.
    2 ipa_chr -- Unicode code-point value ("U+0062", etc.)
    3 ara_ltr -- "ب", etc.
    4 ara_chr -- Unicode code-point value ("U+0628", etc.)
    5 description -- phonetic definition ("voiced bilabial stop", etc.)

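The code-point columns (2 and 4) can be converted back into characters with a small helper. The Python sketch below is only an illustration: the function name "code_points_to_chars" is hypothetical, and because the separator used in entries holding more than one code point is an assumption here, the helper simply collects every "U+XXXX" token it finds and ignores anything else (including a "(na)" placeholder).

```python
import re

# Matches tokens such as "U+0062" and captures the hexadecimal digits.
CODE_POINT = re.compile(r"U\+([0-9A-Fa-f]{4,6})")

def code_points_to_chars(field):
    """Return the list of characters named by the "U+XXXX" tokens in a field.

    Text that is not a code-point token (separators, "(na)") is ignored,
    so a field without any token yields an empty list.
    """
    return [chr(int(hex_digits, 16)) for hex_digits in CODE_POINT.findall(field)]

print(code_points_to_chars("U+0062"))   # ['b']
print(code_points_to_chars("U+0628"))   # ['ب']
print(code_points_to_chars("(na)"))     # []
```

For an entry that lists a base letter plus a combining mark, joining the returned list with "".join(...) gives the composed symbol.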
One entry in this table has "(na)" as the value for columns 3 and 4:

- IPA "ː" (U+02D0) represents vowel length (it occurs only after "a", "i" or "u"); when it appears, the corresponding Arabic orthography will have some form of "aleph" character (for "aː"), or "ي" (for "iː") or "و" (for "uː"). Note that this character was chosen (instead of the ASCII colon ":"), because it is classified in Unicode as a "modifier letter" (rather than as punctuation), so that Unicode-aware processes will not treat it as a word boundary character.

One entry ("ʔ ... glottal stop") relates to a set of six Arabic letters, listed together in columns 3 and 4. The use of one or another Arabic letter in the position of a glottal stop depends on contextual and etymological properties of the given word.

One entry ("ð̣ ... voiced pharyngealized dental fricative") is actually a digraph (letter plus under-dot diacritic mark), and so has two Unicode code point values in column 2; also, this entry represents the Iraqi pronunciation of two distinct Arabic consonant letters, so both Arabic letters are given (separated by comma + space) in columns 3 and 4.

4.3 List of abbreviations used in English definition text

The file "abbreviations.tab" is a two-column table with abbreviations in column 1 and their full meanings in column 2. These abbreviations show up in various entries of the fourth column ("etext") of "acm_eng_def.tab".

5.0 Notes on various conventions used in the tables

5.1 Orthography

The IPA and Arabic spellings differ in terms of how segmental phonemic length is represented for consonants and vowels. In IPA, long consonants are indicated by doubling the consonant letter -- e.g. "b" (short) vs. "bb" (long) -- and long vowels are indicated by placing a vowel length mark ("modifier letter triangular colon") immediately after the vowel letter: "a, i, u" (short) vs. "aː, iː, uː" (long).

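The IPA length conventions, and the reason for preferring U+02D0 over the ASCII colon, can be checked with a few lines of Python. This is a minimal sketch rather than a tool shipped with the database: the function names are hypothetical, and the sample strings are illustrative only (they are not quoted from the tables).

```python
import re
import unicodedata

LENGTH_MARK = "\u02D0"  # MODIFIER LETTER TRIANGULAR COLON

def long_vowels(ipa):
    """Return each long vowel (a, i or u followed by the length mark)."""
    return re.findall("[aiu]" + LENGTH_MARK, ipa)

def long_consonants(ipa):
    """Return each letter written twice in a row, skipping vowels."""
    doubled = re.findall(r"([^\W\d_])\1", ipa)
    return [ch for ch in doubled if ch not in "aiu" + LENGTH_MARK]

# Illustrative strings only; they are not taken from the database.
with_mark = "ka" + LENGTH_MARK + "n"
with_colon = "ka:n"
sample = "sabba" + LENGTH_MARK + "k"

# The length mark is a letter (category Lm), so \w+ keeps the word whole,
# whereas an ASCII colon splits it into two tokens.
print(unicodedata.category(LENGTH_MARK))
print(re.findall(r"\w+", with_mark))
print(re.findall(r"\w+", with_colon))
print(long_vowels(sample), long_consonants(sample))
```

The doubled-letter pattern only compares single code points, so a doubled digraph (base letter plus combining under-dot) would need separate handling.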
In Arabic script orthography, long consonants are indicated by the "shadda" diacritic mark applied to the consonant letter, and long vowels are indicated by the use of either an alef character (for long "a") or one of the semivowel characters ("و" for long "u", "ي" for long "i"). Short vowels are often not indicated, but when they are, this is done by attaching a vowel diacritic mark ("fatha", "damma", or "kasra") to the preceding consonant letter. In general, the "wordform" table has all short vowels represented as diacritic marks in the Arabic orthography; in the "phrase" table, short-vowel diacritics are often omitted where the vowel quality is predictable from context, or not phonologically distinctive.

Minor note on the ordering of Arabic diacritic marks: the original annotation to create Arabic script orthographic forms for word forms and phrases was done via Buckwalter transliteration. In virtually all typical uses of Buckwalter, the Arabic consonant length mark (shadda) is placed immediately after the consonant it modifies; if any short-vowel diacritics are also used on the same long consonant, they are placed after the shadda mark. This conflicts with the "canonical" ordering of Arabic diacritics as applied by the Unicode Standard, which places shadda after short vowel marks. In the present release, the Buckwalter ordering has been maintained.

5.2 Multiword expressions, English definitions, and Phrases

The use of a separate table for multiword expressions was introduced fairly late in the process of preparing the Iraqi data for release as an LDC corpus. It was not in place during the annotation that led up to the publication of the Georgetown Dictionary of Iraqi Arabic in 2013. As a result, the "acm_mwe" table presented here is rather small.
Many items in the "acm_eng_def" and "acm_phrase" tables could have been treated as MWE entries, but have been left in place in those other two tables because that is where they were placed during the main annotation effort carried out before 2013. In particular, about 300 "acm_eng_def" entries have "etext" values that begin with a parenthetical remark of the form "(with ...)", providing a word in Arabic (such as a preposition, sometimes including an IPA pronunciation), followed by an English translation for the combination of the lemma with the parenthesized word. Also, in a number of "acm_phrase" entries, the "ara_text", "ipa_text", and "eng_text" values present just a brief phrase (i.e. a collocation or idiom), rather than a full sentence.

5.3 Inter-operability with other LDC Dialectal Arabic dictionaries

The previously published Moroccan Arabic-English Lexical Database (LDC2023L01) uses the same overall database design as the Iraqi Arabic-English Lexical Database. The differences between the two can be summarized as follows:

- table names begin with "acm_" (for Iraqi) vs. "ary_" (for Moroccan)
- Iraqi has the "acm_mwe" table, while Moroccan has no "ary_mwe" table
- the Iraqi "acm_phrase" table has a column called "mwe_id" for linking phrase entries to mwe entries; the Moroccan "ary_phrase" table does not have such a column

The two sets of tables are, for all practical purposes, mutually compatible: all the tables in both dialects can be included in a single database -- e.g. a single SQLite file.

Here's an adaptation of the instructions from section 4.1 above, for loading both dialects into a single database, using the file name "ara_dialects.db":

    sqlite3 ara_dialects.db < ary_schema.sql
    sqlite3 ara_dialects.db < acm_schema.sql
    echo ".import --skip 1 ary_root.tab ary_root" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 acm_root.tab acm_root" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 ary_lemma.tab ary_lemma" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 acm_lemma.tab acm_lemma" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 ary_wordform.tab ary_wordform" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 acm_wordform.tab acm_wordform" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 acm_mwe.tab acm_mwe" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 ary_eng_def.tab ary_eng_def" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 acm_eng_def.tab acm_eng_def" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 ary_phrase.tab ary_phrase" | sqlite3 -cmd ".mode tabs" ara_dialects.db
    echo ".import --skip 1 acm_phrase.tab acm_phrase" | sqlite3 -cmd ".mode tabs" ara_dialects.db

The command lines above assume that the 2 "*.sql" schema files and 11 "*.tab" files have already been copied into the user's current working directory.

Combining the data this way will support queries that scan both dialects with regard to common roots, shared wordforms, related glosses, etc., in addition to allowing queries on one dialect to be easily applied to the other dialect, simply by changing the first three letters of the table name(s) involved. (Note that the Moroccan database schema can be modified trivially, adding the "ary_mwe" table, and the "mwe_id" field to the "ary_phrase" table, to make it fully equivalent to the Iraqi schema, even though these added elements would remain empty.)

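The idea of applying one query to either dialect by changing the table-name prefix can be sketched with Python's built-in sqlite3 module. Everything below is an assumption made for illustration: the CREATE TABLE statements are simplified stand-ins (the real definitions are in the schema files), the function names are hypothetical, and every row, including the Moroccan ID numbers, is an invented placeholder.

```python
import sqlite3

def create_tables(conn, prefix):
    """Create simplified stand-ins for three of the tables (assumed types)."""
    conn.executescript(f"""
        CREATE TABLE {prefix}_root (id INTEGER PRIMARY KEY, ltrs TEXT);
        CREATE TABLE {prefix}_lemma (id INTEGER PRIMARY KEY,
                                     root_id INTEGER, pos TEXT);
        CREATE TABLE {prefix}_wordform (id INTEGER PRIMARY KEY, lemma_id INTEGER,
                                        orth TEXT, pron TEXT, form TEXT);
    """)

def citation_forms(conn, prefix, root_letters):
    """List (pos, orth, pron) for the citation forms under one root.

    The prefix is interpolated into the SQL, so it must come from a fixed,
    trusted set such as "acm" or "ary"; the root letters are bound safely.
    """
    query = f"""
        SELECT l.pos, w.orth, w.pron
        FROM {prefix}_root AS r
        JOIN {prefix}_lemma AS l ON l.root_id = r.id
        JOIN {prefix}_wordform AS w ON w.lemma_id = l.id
        WHERE r.ltrs = ? AND w.form = 'citation'
        ORDER BY w.id
    """
    return conn.execute(query, (root_letters,)).fetchall()

# Root letters in the spaced format of section 3.1 (yeh, waw, meem).
ROOT = " \u064A \u0648 \u0645 "

conn = sqlite3.connect(":memory:")
for prefix in ("acm", "ary"):
    create_tables(conn, prefix)

# Invented placeholder rows; ORTH-x / PRON-x stand in for real spellings.
conn.execute("INSERT INTO acm_root VALUES (10001, ?)", (ROOT,))
conn.execute("INSERT INTO acm_lemma VALUES (100001, 10001, 'noun')")
conn.execute("INSERT INTO acm_wordform VALUES (300001, 100001, 'ORTH-A', 'PRON-A', 'citation')")
conn.execute("INSERT INTO acm_wordform VALUES (300002, 100001, 'ORTH-B', 'PRON-B', 'broken_plural')")
conn.execute("INSERT INTO ary_root VALUES (1, ?)", (ROOT,))
conn.execute("INSERT INTO ary_lemma VALUES (2, 1, 'noun')")
conn.execute("INSERT INTO ary_wordform VALUES (3, 2, 'ORTH-C', 'PRON-C', 'citation')")

iraqi = citation_forms(conn, "acm", ROOT)
moroccan = citation_forms(conn, "ary", ROOT)
print(iraqi, moroccan)
```

With the real tables loaded as described above, the same function could be pointed at "ara_dialects.db" instead of an in-memory database.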
6.0 Acknowledgments

The effort to produce this dictionary was initiated by grant awards from the International Research Studies Program of the U.S. Department of Education (#P017A0800441). Additional support has been provided by Georgetown University Press, and by the Linguistic Data Consortium.

Most of the development work for the dictionary took place from 2008 to 2011. Native Iraqis were enlisted to work at the LDC in the arduous task of both reorganizing and transliterating the original 2003 dictionary content into the form being presented here: Alyaa Aboud, Tamara Ali Janeb, Zainab Alsawaf and Safa Ismail confronted the most difficult phase of the project with diligence and dedication.

7.0 References

Woodhead, D.R. and Beene, W., eds. (2003): A Dictionary of Iraqi Arabic, Arabic - English. Georgetown University Press. Washington, D.C.

Maamouri, M., ed. (2013): The Georgetown Dictionary of Iraqi Arabic. Georgetown University Press. Washington, D.C.