Gadalla, Hassan, Hanaa Kilany, Howaida Arram, Ashraf Yacoub, Alaa El-Habashi,
	Amr Shalaby, Krisjanis Karins, Everett Rowson, Robert
	MacIntyre, Paul Kingsbury, David Graff and Cindie McLemore,
	Nov. 1998:  LDC Callhome Egyptian Colloquial Arabic Lexicon.
	Philadelphia: Linguistic Data Consortium, University of
	Pennsylvania.

	-----------------------------------------------------------
	  Description of the LDC Egyptian Colloquial Arabic lexicon
	-----------------------------------------------------------

CONTENTS

	1.   Summary abstract
	2.   Lexicon information fields
	3.   Orthographic convention (LDC romanization)
	4.   Orthographic convention (Arabic script)
	5.   Character/letter correspondence table
	6.   Phonology table
	7.   Stress information
	8.   Morphological tags
	9.   Word source and frequency
	10.  Arabic script/romanization correspondence table


-----------------------------------------------------------------------
1.  Summary abstract

	The LDC Arabic lexicon was compiled primarily for support of
the project on Large Vocabulary Conversational Speech Recognition
(LVCSR), sponsored by the U.S. Department of Defense.  

	This lexicon represents the first electronic pronunciation
dictionary of Egyptian Colloquial Arabic (ECA), the spoken variety of
Arabic found in Egypt.  The dialect of ECA that this dictionary
represents is Cairene Arabic.  

	This lexicon consists of 51,202 words.  Each entry contains
tab-separated information fields: the orthographic representation in
both the LDC romanization and Arabic script, plus morphological,
phonological, stress, source, and frequency information.

	The lexical entries found in this lexicon come from four
sources: (1) the 80 LVCSR CallHome training transcripts, (2) the 20
LVCSR CallHome development test (devtest) transcripts, (3) the 40
LVCSR CallHome evaluation test (evltest) transcripts that have been
used prior to September 1998 in benchmark tests organized by NIST,
and (4) entries from the Badawi & Hinds print dictionary of Egyptian
Colloquial Arabic [Badawi, El-Said and Hinds, Martin (1986) "A
Dictionary of Egyptian Arabic: Arabic-English". Librairie du Liban.]


-----------------------------------------------------------------------
2.  Lexicon information fields

	The LDC Arabic lexicon contains nine tab-separated information
fields:

Field 1:  orthographic form (headword) in LDC romanized script
Field 2:  orthographic form of the headword in Arabic script
Field 3:  pronunciation of the headword
Field 4:  primary stress information of the headword
Field 5:  morphological analysis of the headword
Field 6:  word frequency in training transcripts
Field 7:  word frequency in devtest transcripts
Field 8:  word frequency in evaltest transcripts
Field 9:  source from which the word entry was derived

	In the fields containing pronunciation, stress and
morphological information, alternate forms or analyses are separated
by two slashes "//".  More on each of these fields is described in
sections 3 - 9 below.
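
The nine-field layout above can be read with a few lines of Python.
This is only a minimal sketch: the names LexEntry, split_alternates and
parse_entry are illustrative and not part of the release, the sample
line is made up, and the code assumes that the three frequency fields
hold plain integers.

```python
from typing import List, NamedTuple

ALT_SEP = "//"  # separator between alternate forms, as described above


class LexEntry(NamedTuple):
    romanized: str              # field 1
    arabic: str                 # field 2
    pronunciations: List[str]   # field 3, alternates split on "//"
    stress: List[str]           # field 4, alternates split on "//"
    morphology: List[str]       # field 5, alternates split on "//"
    freq_train: int             # field 6
    freq_devtest: int           # field 7
    freq_evltest: int           # field 8
    source: str                 # field 9


def split_alternates(value: str) -> List[str]:
    """Split a field on "//" and trim surrounding whitespace."""
    return [part.strip() for part in value.split(ALT_SEP)]


def parse_entry(line: str) -> LexEntry:
    """Parse one tab-separated lexicon line into a LexEntry."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 9:
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")
    rom, ara, pron, stress, morph, f_train, f_dev, f_evl, source = fields
    # Assumption: the frequency fields are plain integers.
    return LexEntry(rom, ara,
                    split_alternates(pron),
                    split_alternates(stress),
                    split_alternates(morph),
                    int(f_train), int(f_dev), int(f_evl),
                    source)


# Made-up entry for illustration only; not taken from the lexicon.
sample = "kitAb\tكتاب\tkitAb\t01\tkitAb:noun+masc-sg\t12\t3\t1\tT"
entry = parse_entry(sample)
print(entry.pronunciations, entry.freq_train, entry.source)
```

Lines with a different number of fields are rejected rather than
silently padded, so that format problems surface early.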


-----------------------------------------------------------------------
3.  Orthographic convention (LDC romanization)

	The first field in the Arabic lexicon contains the romanized
orthographic representation of the Arabic word.  The bulk of the words
found in this lexicon come from the transcripts of the 140 LVCSR Arabic
conversations collected and transcribed at the LDC.  The original
transcription of the recorded conversations was done in the romanized
version of ECA developed at the LDC.  The romanized orthography of ECA
(using ASCII characters) is phonemically based, and attempts to preserve
both word identity and word pronunciation while limiting ambiguity.
More documentation on this can be found with the released LVCSR CallHome
Arabic transcripts.

-----------------------------------------------------------------------
4.  Orthographic convention (Arabic script)

	The second field in the Arabic lexicon contains the Arabic
script equivalent of the romanized headword from which it is derived.
In turn, the LVCSR Arabic transcripts were converted from the original 
romanized script to Arabic script via replacement with the
orthographic form found in this lexicon.  
	The Arabic script representations of words in this lexicon
were created using the Arabic character set available in MULE
(Multi-Lingual Emacs).  The character correspondences are one-to-one
where this is possible (see the correspondence table in section 5).
There are a number of general instances where the romanized character
sequence differs from the Arabic script character sequence:

	1.  In verbal forms, the romanized script indicates stem-vowel 
length distinctions which are not found in the Arabic script.

	2.  Where the romanized script writes (historical) /th/
as the spoken /s/ or /t/, and (historical) /dh/ as /z/ or /d/, the
Arabic script version writes both the /th/ and /dh/ where these are
pronounced as /s/ and /z/ respectively.  This is schematized below:

MSA:                     s  th   t  th               z  dh  d  dh
                          \ /     \ /                 \ /    \ /
LDC romanization:          s       t                   z      d
                          / \      |                  / \     |
LDC ECA script:          s  th     t                 z  dh    d


	3.  The LDC romanized script indicates "doubled" consonants in 
ECA with two orthographic letters.  The Arabic script version would be 
expected to indicate consonant quantity or duration with a "shadda".
Since the "shadda" is unfortunately not currently available in the
MULE Arabic character set, consonant duration is not indicated in the
Arabic script.  

	4.  Initial vowel correspondences between the romanized and
Arabic script versions are as follows:

        a   Alif
        A   Alif with madda
        E/I Alif and ya
        i   Alif
        O/U Alif and waw
        u   Alif


-----------------------------------------------------------------------
5.  Character/letter correspondence table

	Refer to the file "scr2rom.tbl", whose contents are also
presented below in Section 10.


-----------------------------------------------------------------------
6.  Phonology table

	The third field in the lexicon contains pronunciation
information for each headword.  The phonetic symbols used are adapted
from the romanization of ECA referred to in section 5 above.  The
symbol used, its phonetic description, and an example word from Arabic
are provided in the table below.  This lexicon contains some alternate
pronunciations of words, including the variants of the words with the
morphophonemic marker "tEh marbUta" /B/.  In most words, orthographic
/q/ is pronounced as a voiceless glottal stop in ECA.  However, in
those somewhat rare instances where it is pronounced as a voiceless
uvular stop, its pronunciation is given as [Q].  In other cases,
the pronunciation is left as [q].  This gives rise to two phonetic
symbols used for the glottal stop: /C/ and /q/.  Retaining these two
symbols in the pronunciation field allows one to trace the origin of
the glottal stop: either a hamza or qAf.

If there is more than one pronunciation of a headword, the alternate
pronunciations are separated by a "//".
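
Since /C/ and /q/ both stand for the glottal stop, a recognizer may
want to merge them while a linguistic study may want to keep the
distinction.  The sketch below shows both uses.  The function names and
the sample strings are hypothetical, and the code assumes that each
consonant symbol is a single character in the pronunciation string.

```python
def merge_glottal_stops(pron: str) -> str:
    """Collapse /q/ (glottal stop from qAf) into /C/ (glottal stop from hamza).

    [Q] is left untouched, since it marks a different pronunciation.
    """
    return pron.replace("q", "C")


def glottal_stop_origins(pron: str) -> list:
    """List the orthographic origin of each glottal stop, in order."""
    origin = {"C": "hamza", "q": "qAf"}
    return [origin[ch] for ch in pron if ch in origin]


# Made-up pronunciation strings, for illustration only.
print(merge_glottal_stops("qalb"))
print(glottal_stop_origins("qaraC"))
```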


	Phonology table of the LDC Arabic lexicon

    LDC symbol	Phonetic description			Sample word

	C	voiceless glottal stop
	b	voiced bilabial stop
	t	voiceless dental stop
	g	voiced velar stop
	H	voiceless pharyngeal fricative
	x	voiceless velar fricative
	d	voiced dental stop
	r	voiced alveolar flap
	z	voiced alveolar fricative
	s	voiceless alveolar fricative
	$	voiceless alveopalatal fricative
	S	voiceless alveolar velarized fricative
	D	voiced dental velarized stop
	T	voiceless dental velarized stop
	Z	voiced velarized interdental fricative
	c	voiced pharyngeal fricative
	G	voiced uvular fricative
	f	voiceless labio-dental fricative
	q	voiceless glottal stop
	Q	voiceless uvular stop
	k	voiceless velar stop
	l	voiced alveolar lateral
	m	voiced bilabial nasal
	n	voiced alveolar nasal
	h	voiceless glottal fricative
	w	voiced bilabial continuant
	y	voiced palatal continuant

	v	voiced labio-dental fricative
	j	voiced alveopalatal affricate


	@	low front unrounded vowel
	a	low back unrounded vowel
	i	high front unrounded vowel
	u	high back rounded vowel

	%	long @
	A	long a
	I	long i
	O	long back mid rounded vowel
	U	long u
	E	long front mid unrounded vowel

	ay	front upgliding diphthong
	aw	back upgliding diphthong


-----------------------------------------------------------------------
7.  Stress information

	The fourth information field in the lexicon contains
information about the primary word stress of the headword.  Each
syllable of the word is indicated by a number, with unstressed
syllables indicated by "0" and the stressed syllable indicated by "1".
Only one stress per word is indicated.  If there are multiple
pronunciations for a word, the single stress pattern applies to all
pronunciations.  (In this release, there is one entry having two stress
patterns, separated by "//".  That entry also has two pronunciations,
separated by "//"; the first stress pattern relates to the first
pronunciation, and the second stress pattern to the second.)
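
The pairing rule described above (one stress pattern shared by all
pronunciations, or one pattern per pronunciation) can be sketched as
follows.  The function names are illustrative, the sample values are
made up, and the code assumes that a stress pattern is written as a run
of "0" and "1" digits, one per syllable; any other characters are
ignored.

```python
from typing import List, Tuple


def align_stress(prons: List[str], stresses: List[str]) -> List[Tuple[str, str]]:
    """Pair each pronunciation with the stress pattern that applies to it."""
    if len(stresses) == 1:
        # A single stress pattern applies to every pronunciation.
        return [(pron, stresses[0]) for pron in prons]
    if len(stresses) != len(prons):
        raise ValueError("stress patterns and pronunciations do not line up")
    # Otherwise the n-th pattern belongs to the n-th pronunciation.
    return list(zip(prons, stresses))


def stressed_syllable(pattern: str) -> int:
    """Return the 0-based index of the syllable marked "1"."""
    digits = [ch for ch in pattern if ch in "01"]
    if digits.count("1") != 1:
        raise ValueError(f"expected exactly one stressed syllable in {pattern!r}")
    return digits.index("1")


# Made-up values, for illustration only.
print(align_stress(["kitAb", "kitab"], ["01"]))
print(stressed_syllable("010"))
```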


-----------------------------------------------------------------------
8.  Morphological tags

	The fifth information field of the Arabic lexicon contains
morphological information about the headword.  The abbreviations used
are explained below.  The basic pattern for the morphology information
is determined by the part of speech for the entry.  The morphological
components are separated by ``+'' or ``-'', as indicated in the table
below.

	If there is more than one possible morphological parse for a
given word, the different parses are separated by two slashes "//".

	The first entry for any morphological tag is the base (or
traditional "look-up" form) of the headword.


Part of speech tags:

:adj		adjective
:adv		adverb
:article	definite article
:conj		conjunction
:dem		demonstrative pronoun
:interj		interjection
:modal		modal verb
:noun		noun
:num		numeral
:part		particle
:part-itr	interrogative particle
:part-neg	negative particle
:part-voc	vocative particle
:part-int	introductory particle
:pple-act	active participle
:pple-pass	passive participle
:prep		preposition
:pro		pronoun
:prorel		relative pronoun
:vbn		verbal noun
:verb		verb
:advpiece	part of a multi-word adverb
:conjpiece	part of a multi-word conjunction
:nounportion	part of a multi-word noun
:interjportion	part of a multi-word interjection

Morphological attributes:

+amb		ambiguous
+article	definite article
+coll		collective
+conj(_prefix)	conjunction prefix (e.g. /fa/)
+DO		direct object
+IO		indirect object
+elative	elative
+fut		future tense
+gen		genitive suffix
+imp		imperfect tense
+inv		invariant
+neg		negative marker
+nom		nominative suffix
+part		particle not as a separate part of speech
+past		past tense
+prep_prefix	prepositional prefix (e.g. /li/)
+pres		present tense
+prop		proper name
+subj		subjunctive mood
+sufxprep	suffixal preposition /l/ (for indirect object)

-1st		first person
-2nd		second person
-3rd		third person
-sg		singular

[-/+]dual	dual
[-/+]fem	feminine
[-/+]inan	inanimate
[-/+]masc	masculine
[-/+]plural	plural


The last set of attributes may be preceded by either ``+'' or ``-'',
depending on whether they directly follow a part-of-speech tag or some
other attribute.  (That is, part-of-speech tags are always immediately
followed by ``+'', while other attributes may be followed by ``-''.)
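
As a rough illustration of the tag layout described in this section,
the sketch below splits one morphological parse into its base form,
part-of-speech tag, and attributes.  It assumes the shape
"base:pos+attr-attr...", with the base form containing no ":" and no
attribute containing "+" or "-" internally; both the function name and
the sample analyses are hypothetical.

```python
import re


def parse_morph(analysis: str) -> dict:
    """Split one morphological parse into base, part of speech, attributes."""
    base, colon, rest = analysis.partition(":")
    if not colon:
        raise ValueError(f"no part-of-speech tag in {analysis!r}")
    # The part-of-speech tag runs up to the first "+"; it may itself
    # contain "-" (for example "part-neg" or "pple-act").
    pos, plus, attrs = rest.partition("+")
    # Each attribute keeps its leading "+" or "-" marker.
    attributes = re.findall(r"[+-][^+-]+", plus + attrs)
    return {"base": base, "pos": pos, "attributes": attributes}


# Made-up analyses, for illustration only.
print(parse_morph("katab:verb+past-3rd-masc-sg"))
print(parse_morph("mi$:part-neg"))
```

Alternate parses separated by "//" would be split first, and each part
passed to parse_morph separately.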

Relative to the earlier release of the Egyptian Arabic lexicon, we
have made some changes in the naming of morphological attributes, to
improve consistency in the lexicon.

-----------------------------------------------------------------------
9.  Word source and frequency

All word frequency information is based upon the romanized headword
found in the first column of the dictionary.  

Training words (field 6):

	The sixth tab-separated field in the lexicon contains
information about frequency of the word in the training transcripts.


Devtest words (field 7):

	The seventh tab-separated field in the lexicon contains
information about frequency of the word in the development test
(devtest) transcripts.  


Evaltest words (field 8):

	The eighth tab-separated field in the lexicon contains
information about frequency of the word in the evaluation test (evltest)
transcripts; 40 of these transcripts have been used in LVCSR benchmark
tests as of this release.  There are an additional 60 evltest
transcripts that remain "unexposed", and words that are unique to these
transcripts have been withheld from release in this lexicon, pending
their use in future benchmarks.

Word source (field 9):

	The primary source from which a word is derived is encoded by a
single letter in this field, as follows:

  T - word initially included from training transcripts
  D - word initially included from devtest transcripts
  E - word initially included from (exposed) evltest transcripts
  B - word initially included from the Badawi & Hinds dictionary (but
	may have subsequently been found in one or more transcripts)
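
The source codes and frequency fields can be combined to answer simple
questions, such as whether a dictionary-derived word was ever observed
in the released transcripts.  The following is a sketch only; the
helper names are illustrative, and the frequencies are assumed to have
been read as integers already.

```python
SOURCE_LABELS = {
    "T": "training transcripts",
    "D": "devtest transcripts",
    "E": "exposed evltest transcripts",
    "B": "Badawi & Hinds dictionary",
}


def describe_source(code: str) -> str:
    """Expand the single-letter source code from field 9."""
    try:
        return SOURCE_LABELS[code]
    except KeyError:
        raise ValueError(f"unknown source code {code!r}") from None


def total_frequency(train: int, devtest: int, evltest: int) -> int:
    """Sum the counts from fields 6, 7 and 8."""
    return train + devtest + evltest


def dictionary_only(code: str, train: int, devtest: int, evltest: int) -> bool:
    """True for a Badawi & Hinds word not found in any released transcript."""
    return code == "B" and total_frequency(train, devtest, evltest) == 0


print(describe_source("B"), dictionary_only("B", 0, 0, 0))
```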


-----------------------------------------------------------------------
10.  Arabic script/romanization correspondence table

	The character correspondences between Arabic script and the
LDC romanization of ECA are provided in the table below, along with a
phonetic description of the symbol used.  (Viewing the Arabic script
characters in this table may require MULE or another editor that
supports the file's Arabic character encoding.)  This table is
also stored in the file "scr2rom.tbl".

	LDC correspondence table for Egyptian Colloquial Arabic

Arabic	LDC	Arabic name	Phonetic description

ء	C	hamza		voiceless glottal stop
				(frequently combined with an adjacent alif,
				yA, or wAw "chair" or realized as "madda")
ب	b	bA		voiced or voiceless bilabial stop
ت	t	tA		voiceless dental stop
ج	g	gIm		voiced velar stop
ج	j	jIm		voiced alveopalatal affricate
ح	H	HA		voiceless pharyngeal fricative
خ	x	xA		voiceless velar fricative
د	d	dAl		voiced dental stop
ر	r	rA		voiced alveolar flap
ز	z	zEn		voiced alveolar fricative
ذ	z	dhAl		voiced alveolar fricative
س	s	sIn		voiceless alveolar fricative
ث	s	thA		voiceless alveolar fricative
ش	$	$In		voiceless alveopalatal fricative
ص	S	SAD		voiceless alveolar velarized fricative
ض	D	DAD		voiced dental velarized stop
ط	T	Tah		voiceless dental velarized stop
ظ	Z	Zah		voiced velarized interdental fricative
ع	c	cEn		voiced pharyngeal fricative
غ	G	GEn		voiced uvular fricative
ف	f	fA		voiceless labio-dental fricative
ف	v	vi		voiced labio-dental fricative
ق	q	qAf		voiceless uvular stop
ك	k	kAf		voiceless velar stop
ل	l	lAm		voiced alveolar lateral
م	m	mIm		voiced bilabial nasal
ن	n	nUn		voiced alveolar nasal
ه	h	hA		voiceless glottal fricative
و	w	wAw		voiced bilabial continuant
ى/ي	y	yA		voiced palatal continuant
				(ى- connected only on right or unconnected)
				(ي- connected on both sides or left only)

ة	B	tEh marbuta	morphophonemic feminine marker

	a	fatHa		low front unrounded vowel
	i	kasra		high front unrounded vowel
	u	Damma		high back rounded vowel

ا	A	alif		long a
ى/ي	I	yA		long i
و	O	wAw		long back mid rounded vowel
و	U	wAw		long u
ى/ي	E	yA		long front mid unrounded vowel

	ay			front upgliding diphthong
	aw			back upgliding diphthong