Gadalla, Hassan, Hanaa Kilany, Howaida Arram, Ashraf Yacoub, Alaa El-Habashi, Amr Shalaby, Krisjanis Karins, Everett Rowson, Robert MacIntyre, Paul Kingsbury, David Graff and Cindie McLemore, Nov. 1998: LDC Callhome Egyptian Colloquial Arabic Lexicon. Philadelphia: Linguistic Data Consortium, University of Pennsylvania.

-----------------------------------------------------------
Description of the LDC Egyptian Colloquial Arabic lexicon
-----------------------------------------------------------

CONTENTS

 1. Summary abstract
 2. Lexicon information fields
 3. Orthographic convention (romanization)
 4. Orthographic convention (Arabic script)
 5. Character/letter correspondence table
 6. Phonology table
 7. Stress information
 8. Morphological tags
 9. Word source and frequency
10. Arabic script/romanization correspondence table

-----------------------------------------------------------------------
1. Summary abstract

The LDC Arabic lexicon was compiled primarily to support the project on Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department of Defense. This lexicon represents the first electronic pronunciation dictionary of Egyptian Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect of ECA that this dictionary represents is Cairene Arabic. This lexicon consists of 51,202 words. The LDC Arabic lexicon contains tab-separated information fields for each word: orthographic representation in both the LDC romanization and Arabic script, followed by morphological, phonological, stress, source, and frequency information.

The lexical entries found in this lexicon come from four sources:

  (1) the 80 LVCSR CallHome training transcripts,
  (2) the 20 LVCSR CallHome development test (devtest) transcripts,
  (3) the 40 LVCSR CallHome evaluation test (evltest) transcripts that have been used prior to September 1998 in benchmark tests organized by NIST, and
  (4) entries from the Badawi & Hinds print dictionary of Egyptian Colloquial Arabic [Badawi, El-Said and Hinds, Martin (1986) "A Dictionary of Egyptian Arabic: Arabic-English". Librairie du Liban.]

-----------------------------------------------------------------------
2. Lexicon information fields

The LDC Arabic lexicon contains nine tab-separated information fields:

  Field 1: orthographic form (headword) in LDC romanized script
  Field 2: orthographic form of the headword in Arabic script
  Field 3: pronunciation of the headword
  Field 4: primary stress information of the headword
  Field 5: morphological analysis of the headword
  Field 6: word frequency in training transcripts
  Field 7: word frequency in devtest transcripts
  Field 8: word frequency in evaltest transcripts
  Field 9: source from which the word entry was derived

In the fields containing pronunciation, stress and morphological information, alternate forms or analyses are separated by two slashes "//". Each of these fields is described further in sections 3 - 9 below.

-----------------------------------------------------------------------
3. Orthographic convention (LDC romanization)

The first field in the Arabic lexicon contains the romanized orthographic representation of the Arabic word. The bulk of the words found in this lexicon come from the transcripts of the 140 LVCSR Arabic conversations collected and transcribed at the LDC. The original transcription of the recorded conversations was done in the romanized version of ECA developed at the LDC.

The romanized orthography of ECA (using ASCII characters) is phonemically based, and attempts to preserve both word identity and word pronunciation while limiting ambiguity. More documentation on this can be found with the released LVCSR CallHome Arabic transcripts.

-----------------------------------------------------------------------
4. Orthographic convention (Arabic script)

The second field in the Arabic lexicon contains the Arabic script equivalent of the romanized headword from which it is derived. In turn, the LVCSR Arabic transcripts were converted from the original romanized script to Arabic script via replacement with the orthographic form found in this lexicon.

The Arabic script representations of words in this lexicon were created using the Arabic character set available in MULE (Multi-Lingual Emacs). The character correspondences are one-to-one where this is possible (see the correspondence table in section 10). There are a number of general instances where the romanized character sequence differs from the Arabic script character sequence:

1. In verbal forms, the romanized script indicates stem-vowel length distinctions which are not found in the Arabic script.

2. Where the romanized script writes (historical) /th/ as the spoken /s/ or /t/, and (historical) /dh/ as /z/ or /d/, the Arabic script version writes both the /th/ and /dh/ where these are pronounced as /s/ and /z/ respectively. This is schematized below:

MSA:                  s   th    t   th    z   dh    d   dh
                       \ /       \ /       \ /       \ /
LDC romanization:       s         t         z         d
                       / \        |        / \        |
LDC ECA script:       s   th      t       z   dh      d

3. The LDC romanized script indicates "doubled" consonants in ECA with two orthographic letters. The Arabic script version would be expected to indicate consonant quantity or duration with a "shadda". Since the "shadda" is unfortunately not currently available in the MULE Arabic character set, consonant duration is not indicated in the Arabic script.

4. Initial vowel correspondences between the romanized and Arabic script versions are the following:

     Romanized    Arabic script
     a            Alif
     A            Alif with madda
     E/I          Alif and ya
     i            Alif
     O/U          Alif and waw
     u            Alif

-----------------------------------------------------------------------
5. Character/letter correspondence table

Refer to the file "scr2rom.tbl", whose contents are also presented below in Section 10.

-----------------------------------------------------------------------
6. Phonology table

The third field in the lexicon contains pronunciation information for each headword. The phonetic symbols used are adapted from the romanization of ECA (see sections 3 and 10). The symbol used, its phonetic description, and an example word from Arabic are provided in the table below.

This lexicon contains some alternate pronunciations of words, including the variants of the words with the morphophonemic marker "tEh marbUta" /B/.

In most words, orthographic /q/ is pronounced as a voiceless glottal stop in ECA. However, in those somewhat rare instances where it is pronounced as a voiceless uvular stop, its pronunciation is given as [Q]. In other cases, the pronunciation is left as [q]. This gives rise to two phonetic symbols used for the glottal stop: /C/ and /q/. However, retaining these two symbols in the pronunciation field allows one to trace the origin of the glottal stop: either a hamza or qAf.

If there is more than one pronunciation of a headword, the alternate pronunciations are separated by a "//".
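
The conventions just described for the pronunciation field (alternates separated by "//", and /C/ versus /q/ recording the origin of a glottal stop) can be sketched in a few lines of Python. This is only an illustrative sketch, not part of the LDC distribution: the function names are hypothetical, the sample strings are made up rather than taken from the lexicon, and the code assumes each consonant symbol is a single character in the pronunciation string.

```python
# Minimal sketch; names and sample strings are hypothetical, not from the lexicon.
ALT_SEP = "//"  # separator between alternate pronunciations (section 6)

# Both symbols are pronounced as a glottal stop; the symbol records its source.
GLOTTAL_ORIGIN = {"C": "hamza", "q": "qAf"}


def split_alternates(pron_field):
    """Split a pronunciation field into its alternate pronunciations."""
    return [p.strip() for p in pron_field.split(ALT_SEP)]


def glottal_stop_origins(pron):
    """Return (position, origin) for each glottal-stop symbol in one pronunciation."""
    return [(i, GLOTTAL_ORIGIN[ch]) for i, ch in enumerate(pron) if ch in GLOTTAL_ORIGIN]


if __name__ == "__main__":
    made_up_field = "Calam//qalam"  # hypothetical field value, for illustration only
    for pron in split_alternates(made_up_field):
        print(pron, glottal_stop_origins(pron))
```

Note that [Q] is deliberately left out of the mapping, since it marks the uvular stop rather than a glottal stop.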

Phonology table of the LDC Arabic lexicon

LDC symbol   Phonetic description                        Sample word

C            voiceless glottal stop
b            voiced bilabial stop
t            voiceless dental stop
g            voiced velar stop
H            voiceless pharyngeal fricative
x            voiceless velar fricative
d            voiced dental stop
r            voiced alveolar flap
z            voiced alveolar fricative
s            voiceless alveolar fricative
$            voiceless alveopalatal fricative
S            voiceless alveolar velarized fricative
D            voiced dental velarized stop
T            voiceless dental velarized stop
Z            voiced velarized interdental fricative
c            voiced pharyngeal fricative
G            voiced uvular fricative
f            voiceless labio-dental fricative
q            voiceless glottal stop
Q            voiceless uvular stop
k            voiceless velar stop
l            voiced alveolar lateral
m            voiced bilabial nasal
n            voiced alveolar nasal
h            voiceless glottal fricative
w            voiced bilabial continuant
y            voiced palatal continuant
v            voiced labio-dental fricative
j            voiced alveopalatal affricate

@            low front unrounded vowel
a            low back unrounded vowel
i            high front unrounded vowel
u            high back rounded vowel
%            long @
A            long a
I            long i
O            long back mid rounded vowel
U            long u
E            long front mid unrounded vowel
ay           front upgliding diphthong
aw           back upgliding diphthong

-----------------------------------------------------------------------
7. Stress information

The fourth information field in the lexicon contains information about the primary word stress of the headword. Each syllable of the word is indicated by a number, with unstressed syllables indicated by "0" and the stressed syllable indicated by "1". Only one stress per word is indicated. If there are multiple pronunciations for a word, the single stress pattern applies to all pronunciations. (In this release, there is one entry having two stress patterns, separated by "//". In this case, there are two pronunciations, also separated by "//"; the first stress entry relates to the first pronunciation, the second stress entry to the second pronunciation.)

-----------------------------------------------------------------------
8. Morphological tags

The fifth information field of the Arabic lexicon contains morphological information about the headword. The abbreviations used are explained below. The basic pattern for the morphology information is determined by the part of speech for the entry. The morphological components are separated by ``+'' or ``-'', as indicated in the table below. If there is more than one possible morphological parse for a given word, the different parses are separated by two slashes "//".

The first entry for any morphological tag is the base (or traditional "look-up" form) of the headword.

Part of speech tags:

  :adj            adjective
  :adv            adverb
  :article        definite article
  :conj           conjunction
  :dem            demonstrative pronoun
  :interj         interjection
  :modal          modal verb
  :noun           noun
  :num            numeral
  :part           particle
  :part-itr       interrogative particle
  :part-neg       negative particle
  :part-voc       vocative particle
  :part-int       introductory particle
  :pple-act       active participle
  :pple-pass      passive participle
  :prep           preposition
  :pro            pronoun
  :prorel         relative pronoun
  :vbn            verbal noun
  :verb           verb
  :advpiece       part of a multi-word adverb
  :conjpiece      part of a multi-word conjunction
  :nounportion    part of a multi-word noun
  :interjportion  part of a multi-word interjection

Morphological attributes:

  +amb            ambiguous
  +article        definite article
  +coll           collective
  +conj(_prefix)  conjunction prefix (e.g. /fa/)
  +DO             direct object
  +IO             indirect object
  +elative        elative
  +fut            future tense
  +gen            genitive suffix
  +imp            imperfect tense
  +inv            invariant
  +neg            negative marker
  +nom            nominative suffix
  +part           particle not as a separate part of speech
  +past           past tense
  +prep_prefix    prepositional prefix (e.g. /li/)
  +pres           present tense
  +prop           proper name
  +subj           subjunctive mood
  +sufxprep       suffixal preposition /l/ (for indirect object)
  -1st            first person
  -2nd            second person
  -3rd            third person
  -sg             singular
  [-/+]dual       dual
  [-/+]fem        feminine
  [-/+]inan       inanimate
  [-/+]masc       masculine
  [-/+]plural     plural

The last set of attributes may be preceded by either ``+'' or ``-'', depending on whether they directly follow a part-of-speech tag or some other attribute. (That is, part-of-speech tags are always followed immediately by ``+'', while other attributes may be followed by ``-''.)

Relative to the earlier release of the Egyptian Arabic lexicon, we have made some changes in the naming of morphological attributes, to improve consistency in the lexicon.

-----------------------------------------------------------------------
9. Word source and frequency

All word frequency information is based upon the romanized headword found in the first column of the dictionary.

Training words (field 6): The sixth tab-separated field in the lexicon contains information about frequency of the word in the training transcripts.

Devtest words (field 7): The seventh tab-separated field in the lexicon contains information about frequency of the word in the development test (devtest) transcripts.

Evaltest words (field 8): The eighth tab-separated field in the lexicon contains information about frequency of the word in the evaluation test (evltest) transcripts; 40 of these transcripts have been used in LVCSR benchmark tests as of this release. There are an additional 60 evltest transcripts that remain "unexposed", and words that are unique to these transcripts have been withheld from release in this lexicon, pending their use in future benchmarks.

Word source (field 9): The primary source from which a word is derived is encoded by a single letter in this field, as follows:

  T - word initially included from training transcripts
  D - word initially included from devtest transcripts
  E - word initially included from (exposed) evltest transcripts
  B - word initially included from the Badawi & Hinds dictionary (but may have subsequently been found in one or more transcripts)

-----------------------------------------------------------------------
10. Arabic script/romanization correspondence table

The character correspondences between Arabic script and the LDC romanization of ECA are provided in the table below, along with a phonetic description of the symbol used. (The Arabic script characters in this table were originally encoded for viewing in mule.) This table is also stored in the file "scr2rom.tbl".

LDC correspondence table for Egyptian Colloquial Arabic

Arabic   LDC   Arabic name   Phonetic description

ء        C     hamza         voiceless glottal stop (frequently combined with an adjacent alif, yA, or wAw "chair" or realized as "madda")
ب        b     bA            voiced or voiceless bilabial stop
ت        t     tA            voiceless dental stop
ج        g     gIm           voiced velar stop
ج        j     jIm           voiced alveopalatal affricate
ح        H     HA            voiceless pharyngeal fricative
خ        x     xA            voiceless velar fricative
د        d     dAl           voiced dental stop
ر        r     rA            voiced alveolar flap
ز        z     zEn           voiced alveolar fricative
ذ        z     dhAl          voiced alveolar fricative
س        s     sIn           voiceless alveolar fricative
ث        s     thA           voiceless alveolar fricative
ش        $     $In           voiceless alveopalatal fricative
ص        S     SAD           voiceless alveolar velarized fricative
ض        D     DAD           voiced dental velarized stop
ط        T     Tah           voiceless dental velarized stop
ظ        Z     Zah           voiced velarized interdental fricative
ع        c     cEn           voiced pharyngeal fricative
غ        G     GEn           voiced uvular fricative
ف        f     fA            voiceless labio-dental fricative
ف        v     vi            voiced labio-dental fricative
ق        q     qAf           voiceless uvular stop
ك        k     kAf           voiceless velar stop
ل        l     lAm           voiced alveolar lateral
م        m     mIm           voiced bilabial nasal
ن        n     nUn           voiced alveolar nasal
ه        h     hA            voiceless glottal fricative
و        w     wAw           voiced bilabial continuant
ى/ي      y     yA            voiced palatal continuant
                             (ى - connected only on right or unconnected)
                             (ي - connected on both sides or left only)
ة        B     tEh marbuta   morphophonemic feminine marker

         a     fatHa         low front unrounded vowel
         i     kasra         high front unrounded vowel
         u     Damma         high back rounded vowel
ا        A     alif          long a
ى/ي      I     yA            long i
و        O     wAw           long back mid rounded vowel
و        U     wAw           long u
ى/ي      E     yA            long front mid unrounded vowel
         ay                  front upgliding diphthong
         aw                  back upgliding diphthong
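
As a closing illustration, the nine-field record layout from section 2, the stress/pronunciation pairing from section 7, and the ``+''/``-'' tag structure from section 8 can be sketched as a small Python reader. This is a hedged sketch rather than an LDC tool. The names LexEntry, parse_entry, stress_for and split_parse are hypothetical, and the sample record is invented for illustration and is not an entry from the lexicon. The code also assumes that the three frequency fields hold plain integers, that the stress field can be treated as an opaque string, and that each morphological parse has the shape base:pos+attr-attr.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple

ALT_SEP = "//"  # separates alternate pronunciations, stress patterns, and parses


@dataclass
class LexEntry:
    roman: str            # field 1: headword in LDC romanization
    script: str           # field 2: headword in Arabic script
    prons: List[str]      # field 3: pronunciation(s)
    stresses: List[str]   # field 4: stress pattern(s)
    morphs: List[str]     # field 5: morphological parse(s)
    freq_train: int       # field 6
    freq_devtest: int     # field 7
    freq_evaltest: int    # field 8
    source: str           # field 9: T, D, E, or B


def parse_entry(line: str) -> LexEntry:
    """Split one tab-separated lexicon line into its nine fields."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    roman, script, pron, stress, morph, f_trn, f_dev, f_evl, source = fields
    return LexEntry(roman, script, pron.split(ALT_SEP), stress.split(ALT_SEP),
                    morph.split(ALT_SEP), int(f_trn), int(f_dev), int(f_evl), source)


def stress_for(entry: LexEntry, i: int) -> str:
    """Stress pattern for the i-th pronunciation.

    A single pattern applies to every pronunciation; when the counts match,
    patterns and pronunciations are paired in order.
    """
    if len(entry.stresses) == len(entry.prons):
        return entry.stresses[i]
    return entry.stresses[0]


def split_parse(parse: str) -> Tuple[str, str, List[str]]:
    """Split one parse into (base, part of speech, attributes).

    Assumes the shape base:pos+attr-attr. The part-of-speech tag runs up to
    the first "+", because tags such as "part-neg" contain a hyphen.
    """
    base, _, rest = parse.partition(":")
    pos, _, attrs = rest.partition("+")
    return base, pos, [a for a in re.split(r"[+-]", attrs) if a]


if __name__ == "__main__":
    # Invented record for illustration only; "SCRIPT" stands in for field 2.
    sample = "kitAb\tSCRIPT\tkitAb\t01\tkitAb:noun+masc-sg\t12\t3\t1\tT"
    entry = parse_entry(sample)
    print(entry.roman, stress_for(entry, 0), split_parse(entry.morphs[0]))
```

Splitting the part-of-speech tag at the first ``+'' relies on the statement in section 8 that part-of-speech tags are always followed immediately by ``+''. Real entries may need additional handling that this sketch does not attempt.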