Main Documentation for the LDC Moroccan Arabic - English Lexical Database
LDC-Catalog-ID: LDC2023L01

1.0 Introduction

The LDC Moroccan Arabic - English Lexical Database comprises a set of five interrelated tables. The combined content of the table set is essentially equivalent to the Arabic-English portion of the Georgetown Dictionary of Moroccan Arabic (Maamouri, 2018), which in turn was drawn mostly from the Arabic-English portion of the Dictionary of Moroccan Arabic (Harrell, 1966).

The core feature of the dictionary is to present each Moroccan Arabic (MA) word as both an orthographic form in Arabic script, and a pronunciation form using the International Phonetic Alphabet (IPA). The Arabic script spellings are based on the etymological values of the Arabic alphabet, consistent with their common usage in Modern Standard Arabic (MSA). Spellings are supplemented where necessary by vowel diacritic marks to help clarify distinctive properties of MA dialectal forms. The IPA pronunciations show the relations between the etymology-based Arabic consonant values and their pronunciation values in MA. These relations are discussed in section 4.2 below.
The five tables, described in detail in section 3 below, are:

- Roots (ary_root.tab): the set of consonantal bases that make up the primary organization of dictionary content
- Lemmas (ary_lemma.tab): the inventory of dictionary headwords, represented by their numeric IDs only, along with part-of-speech labels and the numeric IDs of their associated roots
- Wordforms (ary_wordform.tab): one or more Arabic orthographic forms and pronunciations for each lemma, including both citation forms and some related forms (particularly the plural forms for various nouns and adjectives)
- English definitions (ary_eng_def.tab): one or more English definitions for each lemma, sometimes including common MA collocations having idiomatic meanings
- Example phrases (ary_phrase.tab): one or more sentences to illustrate usage for a given definition in many (but not all) lemmas, with the MA sentence presented in both Arabic and IPA orthography, along with an English translation

2.0 Overview of Contents

The five tables are found in the "data" directory; the number of entries in each table is as follows:

  Roots          3,567
  Lemmas        14,255
  Wordforms     19,927
  Definitions   24,911
  Phrases        4,418

The lemma inventory includes 6,923 nouns, 4,083 verbs, 2,957 adjectives (including comparatives), 75 proper nouns, and 42 adverbs, along with several dozen entries in 11 closed-class POS categories. The phrases contain a total of over 21,800 MA word tokens (in both Arabic script and IPA), and nearly 33,600 English word tokens.
The "docs" directory contains the following:

- ary_schema.sql, ary_schema.mysql: table definitions (see 4.1 below)
- ary_ipa_arabic.tab: table listing the IPA symbols used, their phonetic meanings, and their relationships to Arabic letters (see 4.2)
- abbreviations.tab: list of abbreviations used in definition text (see 4.3)

3.0 Details of Database Structure

Each table is presented here as a tab-delimited, plain-text file, with Unicode UTF-8 character encoding and UNIX/Linux-style line terminations (line-feed character only, no carriage-return). The first line of each file contains the column labels, which are explained in the subsections below.

While it is not strictly necessary to do so, the numeric IDs assigned to each entry in each table are globally unique across all tables -- i.e. each table uses a distinct range of ID numbers to uniquely identify each row. (There can be gaps in the sequence of ID numbers within each table.)

3.1 The Roots Table: ary_root.tab

  1 id     -- globally unique numeric identifier
  2 ltrs   -- Arabic letter sequence comprising the root
  3 r_type -- etymological status of the root
  4 r_indx -- index number for semantic differentiation

Notes:
- ID numbers range from 10001 to 13570
- Every letter of every root is preceded and followed by a space -- e.g.: " ي و م "
- "r_type" is one of:
  -- Foreign (the "root" is just the consonant skeleton of a borrowed term)
  -- Standard_Arabic (roots shared by MSA and/or other Arabic dialects)
  -- Dialect_Specific (roots that are not borrowed, and are unique to MA)
  -- Unspecified (five roots have not been categorized)
- "r_indx" ranges from 1 to 5 -- i.e. some roots have as many as five distinct semantic subgroups of lemmas

3.2 The Lemmas Table: ary_lemma.tab

  1 id      -- globally unique numeric identifier
  2 root_id -- numeric ID of the ary_root entry for the lemma
  3 pos     -- part-of-speech label for the lemma ("noun", "verb", etc.)
Notes:
- ID numbers range from 100001 to 114258
- POS labels are all-lower-case with no abbreviations

3.3 The Wordforms Table: ary_wordform.tab

  1 id       -- globally unique numeric identifier
  2 lemma_id -- numeric ID of the ary_lemma entry for the wordform
  3 orth     -- Arabic script orthography
  4 pron     -- IPA pronunciation
  5 form     -- one of: "citation", "broken_plural", "fem_plural"
  6 w_usage  -- one of: "archaic", "common"

Notes:
- ID numbers range from 200001 to 219927
- There is always exactly one "citation" form for each lemma_id; this is the singular form for nouns and adjectives, and the 3rd-person-preterite form for verbs
- There are only a handful of "archaic" forms

3.4 The English Definitions Table: ary_eng_def.tab

  1 id          -- globally unique numeric identifier
  2 lemma_id    -- numeric ID of the ary_lemma entry for the definition
  3 sense_label -- sequence number relative to other definitions for the lemma
  4 etext       -- English definition

Notes:
- ID numbers range from 500001 to 524987
- The number of senses per lemma (reflected in "sense_label") ranges from 1 to 28
- The numeric ordering by "sense_label" may be arbitrary (i.e. it does not necessarily represent relative frequency or preference in usage)
- The "etext" field may include word tokens or phrases in Arabic script and/or IPA, providing common collocations with their specialized or idiomatic meanings in English
- When a definition contains Arabic text, each contiguous Arabic string is surrounded by Unicode direction control characters:
  -- U+202B (RIGHT-TO-LEFT EMBEDDING) marks the beginning of each Arabic string
  -- U+202C (POP DIRECTIONAL FORMATTING) marks the end of each Arabic string
- When a definition contains IPA text, each contiguous pronunciation string is surrounded by square brackets:
  -- '[' marks the beginning of each IPA string
  -- ']' marks the end of each IPA string

3.5 The Phrases Table: ary_phrase.tab

  1 id       -- globally unique numeric identifier
  2 def_id   -- numeric ID of the ary_eng_def entry for the phrase
  3 eng_text -- English translation of the phrase
  4 ara_text -- MA phrase in Arabic script
  5 ipa_text -- MA phrase in IPA

Notes:
- ID numbers range from 1 to 4422
- Each phrase is linked to a specific entry in ary_eng_def; many ary_eng_def entries do not have example phrases
- Each word token in ara_text is matched by a corresponding token in ipa_text
- There are a few entries that contain an Arabic word token in the eng_text field; in these cases, the Arabic token is bracketed by the direction control characters U+202B ... U+202C
- The ara_text field is always surrounded by direction control characters: initial U+202B and final U+202C
- The ipa_text field is NOT surrounded by square brackets
- Two entries contain a forward-slash as a separate token (" / ") in the eng_text field, to indicate alternative English phrasings
- 18 entries contain a forward-slash as a separate token in both Arabic and IPA fields, to indicate alternative word choices for a phrase

4.0 Accompanying Documentation

4.1 Schema Definitions

The two schema definition files, "ary_schema.mysql" and "ary_schema.sql", are essentially the same; the only difference is that the "mysql" version includes additional specifications intended for use with a MySQL or MariaDB server. These specify the storage engine (InnoDB) and the character encoding to be used. The "mysql" version has been tested on a current MariaDB server, and the other version has been tested with both PostgreSQL and SQLite.

Users should first use the server of their choice to create an empty database, and then execute the commands in the chosen schema definition file to create the tables. Once the database and tables are in place, the tab-delimited files in data/ can be read into each table in the sequence: root, lemma, wordform, eng_def, phrase.
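As one possible illustration of the loading sequence just described, the following Python sketch reads each tab-delimited file and inserts its rows into an SQLite database whose tables have already been created from the schema file. It is a minimal sketch, not part of the distributed corpus: the function names are hypothetical, and it assumes (without having verified it against the schema files) that the column labels on the first line of each .tab file match the column names defined in the schema.

```python
import csv
import sqlite3
from pathlib import Path

# Load order given in section 4.1: tables that are referred to by other
# tables (e.g. roots, lemmas) are loaded before the tables that refer to them.
LOAD_ORDER = ["root", "lemma", "wordform", "eng_def", "phrase"]


def load_table(conn, tab_path, table):
    """Insert all data rows of one tab-delimited file into an existing table.

    ASSUMPTION: the header labels in the file are valid column names
    of the target table.
    """
    with open(tab_path, encoding="utf-8", newline="") as fh:
        # QUOTE_NONE: the files are plain tab-delimited text, so quote
        # characters inside a field are treated as ordinary data.
        reader = csv.reader(fh, delimiter="\t", quoting=csv.QUOTE_NONE)
        header = next(reader)
        rows = [row for row in reader if row]  # ignore blank lines
    columns = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    sql = f'INSERT INTO "{table}" ({columns}) VALUES ({placeholders})'
    conn.executemany(sql, rows)
    conn.commit()
    return len(rows)


def load_all(conn, data_dir):
    """Load the five tables in dependency order; return row counts per table."""
    counts = {}
    for name in LOAD_ORDER:
        table = f"ary_{name}"
        counts[table] = load_table(conn, Path(data_dir) / f"{table}.tab", table)
    return counts


# Example use (paths are assumptions about the local layout):
#   conn = sqlite3.connect("ary_eng_dict.db")
#   print(load_all(conn, "data"))
```

If the load succeeds, the returned counts can be compared against the table sizes listed in section 2.0.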
In the case of SQLite, the first two steps can be carried out with a single shell command line; for example (using a "bash" or equivalent shell, and assuming that the command-line interface program for SQLite is installed and appears in the user's shell PATH as "sqlite3"), the following creates a file called "ary_eng_dict.db" to store the database, and defines the five tables that comprise the schema:

    sqlite3 ary_eng_dict.db < ary_schema.sql

Then the following sequence of commands can be used to load the five tables:

    echo ".import --skip 1 ary_root.tab ary_root" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db
    echo ".import --skip 1 ary_lemma.tab ary_lemma" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db
    echo ".import --skip 1 ary_wordform.tab ary_wordform" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db
    echo ".import --skip 1 ary_eng_def.tab ary_eng_def" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db
    echo ".import --skip 1 ary_phrase.tab ary_phrase" | sqlite3 -cmd ".mode tabs" ary_eng_dict.db

Of course, users can also load the tables into any spreadsheet application.

4.2 Summary of Pronunciation-Orthographic Mappings

The file "ary_ipa_arabic.tab" contains five columns, as follows:

  1 ipa_ltr     -- "b", etc.
  2 ipa_chr     -- Unicode code-point value ("U+0062", etc.)
  3 ara_ltr     -- "ب", etc.
  4 ara_chr     -- Unicode code-point value ("U+0628", etc.)
  5 description -- phonetic definition ("voiced bilabial stop", etc.)

Two entries in this table have "(na)" as the value for columns 3 and 4:

- IPA "e" represents an unstressed short vowel; when it appears in an IPA token, the corresponding Arabic orthography may contain either a "fatha" or "kasra" diacritic, or (in many phrase words) no diacritic at all; in effect, this IPA vowel represents a "schwa"-like quality, such that the articulatory position of the vowel is not a distinctive feature.
- IPA "ː" (U+02D0) represents vowel length (it occurs only after "a", "i" or "u"); when it appears, the corresponding Arabic orthography will have some form of "aleph" character (for "aː"), or "ي" (for "iː") or "و" (for "uː"). Note that this character was chosen instead of the ASCII colon ":" because it is classified in Unicode as a "modifier letter" (rather than as punctuation), so that Unicode-aware processes will not treat it as a word boundary character.

One entry in this table ("ʔ ... glottal stop") relates to a set of six Arabic letters, listed together in columns 3 and 4. The use of one or another Arabic letter in the position of a glottal stop depends on contextual and etymological properties of the given word.

4.3 List of Abbreviations Used in English Definition Text

The file "abbreviations.tab" is a two-column table with abbreviations in column 1 and their full meanings in column 2. These abbreviations show up in various entries of the fourth column ("etext") of "ary_eng_def.tab".

5.0 Notes on Orthographic Conventions

The IPA and Arabic spellings differ in terms of how segmental phonemic length is represented for consonants and vowels. In IPA, long consonants are indicated by doubling the consonant letter -- e.g. "b" (short) vs. "bb" (long) -- and long vowels are indicated by placing a vowel length mark ("modifier letter triangular colon") immediately after the vowel letter: "a, i, u" (short) vs. "aː, iː, uː" (long).

In Arabic script orthography, long consonants are indicated by the "shadda" diacritic mark applied to the consonant letter, and long vowels are indicated by the use of either an alef character (for long "a") or one of the semivowel characters ("و" for long "u", "ي" for long "i"). Short vowels are often not indicated, but when they are, this is done by attaching a vowel diacritic mark ("fatha", "damma", or "kasra") to the preceding consonant letter.
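The IPA length conventions above, and the reason given in 4.2 for preferring U+02D0 over the ASCII colon, can be illustrated with a short Python sketch. The function names are hypothetical, and the sample strings in the usage note are made up for illustration rather than taken from the database. The consonant check also assumes that each consonant is written as a single code point, which may not hold for every symbol listed in "ary_ipa_arabic.tab".

```python
import re
import unicodedata

LENGTH_MARK = "\u02d0"  # MODIFIER LETTER TRIANGULAR COLON


def tokenize(text):
    """Split text into word tokens using Unicode-aware word characters."""
    return re.findall(r"\w+", text)


def long_vowels(ipa):
    """Return every long vowel (a, i or u followed by the length mark)."""
    return re.findall(f"[aiu]{LENGTH_MARK}", ipa)


def long_consonants(ipa):
    """Return every consonant letter that is doubled (i.e. long).

    The character class matches any letter except the vowel letters
    and the length mark itself; \\1 requires that letter to repeat.
    ASSUMPTION: one code point per consonant.
    """
    return re.findall(rf"([^\W\d_aeiu{LENGTH_MARK}])\1", ipa)


def is_modifier_letter(char):
    """True if Unicode classifies the character as a modifier letter (Lm)."""
    return unicodedata.category(char) == "Lm"
```

With these definitions, tokenize("daːr") keeps the token whole, whereas the same string written with an ASCII colon ("da:r") is split into two tokens at the colon; is_modifier_letter(LENGTH_MARK) is true, while is_modifier_letter(":") is false.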
In general, the "wordform" table has all short vowels represented as diacritic marks in the Arabic orthography; in the "phrase" table, short-vowel diacritics are often omitted where the vowel quality is predictable from context, or not phonologically distinctive.

6.0 Acknowledgments

The effort to produce this dictionary was initiated by grant awards from the International Research Studies Program of the U.S. Department of Education (#P017A0800441). Additional support has been provided by Georgetown University Press, and by the Linguistic Data Consortium.

Development of the dictionary took place over two phases. In the first phase, from 2010 to 2012, three native Moroccans were enlisted to work at the LDC on the arduous task of both reorganizing and transliterating the original 1966 dictionary content into the form being presented here: Fatima Zohra Laghrissi, Ikram Youssef and Youness Nabirh confronted the most difficult phase of the project with diligence and dedication.

In the second phase, from 2013 to 2015, we benefitted immensely from the efforts of researchers and advanced students at Al-Akhawayn University in Ifrane, Morocco. Dr. Abdellah Chekayri helped to refine the conventions for both Arabic and IPA orthography. Dr. Violetta Cavalli Sforza was instrumental in bringing substantive improvements to our quality-control procedures and tools, and led a well-organized annotation process to review and improve the phase 1 content by filling gaps, extending coverage for modern usage, and resolving inconsistencies. She directed the Al-Akhawayn University team of annotators: Youssef Ismaili, Hind Saddiki, Meryem Daiki, Sra El Hamdaoui, Oualid El Meriague, Rachid Lamsairhri, Souhail Meftah, Ibtissam Ouazzani, Hachem Saddiki and Maha Skah.

7.0 References

Harrell, R., Sobelman, H., eds. (1966 [2008]): A Dictionary of Moroccan Arabic. Georgetown University Press. Washington, D.C.

Maamouri, M., ed. (2018): The Georgetown Dictionary of Moroccan Arabic. Georgetown University Press. Washington, D.C.