	-----------------------------------------------------------
	Description of the CallHome telephone speech and transcript 
		corpus for Egyptian Colloquial Arabic
	-----------------------------------------------------------

	April, 1997

	Project leader:		Krisjanis Karins

	Consultation:		Mark Liberman
				Cynthia McLemore
				Everett Rowson

	Transcribers/
	corrections:		Howaida Arram
				Alaa El-Habashi
				Hassan A. Gadalla
				Hanaa Kilany
				Amr A. Shalaby
				Ashraf Yacoub

	Programming support:	Robert MacIntyre


CONTENTS

	1.  Summary abstract
	2.  Data acquisition
	3.  Data verification
	4.  Speaker demographics
	5.  Data transcription - General
	6.  Data transcription - Non-lexemes
	7.  Data transcription - Egyptian Arabic special conventions
	8.  Quality control (QC) procedures
	9.  Conversion to Arabic script
	10. Arabic script/romanization correspondence table
	11. ISO 8859-6 character encoding

-----------------------------------------------------------------------
1.  Summary abstract

	The CallHome Arabic corpus of telephone speech was collected
and transcribed by the Linguistic Data Consortium primarily in support
of the project on Large Vocabulary Conversational Speech Recognition
(LVCSR), sponsored by the U.S. Department of Defense.

	This release of the CallHome Arabic corpus consists of 120
unscripted telephone conversations between native speakers of Egyptian
Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt.
The dialect of ECA that this corpus represents is Cairene Arabic.

	The transcripts cover a contiguous 5 or 10 minute segment
taken from a recorded conversation lasting up to 30 minutes.  All
speakers were aware that they were being recorded.  They were given no
guidelines concerning what they should talk about.  Once a caller was
recruited to participate, he/she was given a free choice of whom to
call.  Most participants called family members or close friends
overseas.  All calls originated in North America and were placed
overseas.  The distribution of call destinations can be found in the
file "spkrinfo.doc".

	The transcripts are timestamped by speaker turn, and are
provided in the LDC standardized orthography.  The timestamps are
aligned with the speech signal.

	Given the lack of a standard orthographic system for ECA, the
LDC developed a standard romanized orthography for the language.  The
romanized orthography uses ASCII characters and is phonemically based.
It strives to maintain both word pronunciation information and word
identity, while minimizing ambiguity.  Once the transcripts were
completed in romanized form, they were converted back to Arabic script
via lookup through the LDC lexicon of Egyptian Colloquial Arabic.
Both the romanized as well as the Arabic script versions of the
transcripts are found in this release.


-----------------------------------------------------------------------
2.  Data acquisition

	Speakers were solicited by the LDC to participate in this
telephone speech collection effort via the internet, publications
(advertisements), and personal contacts.  A total of 200 call
originators were found, each of whom placed a telephone call via a
toll-free robot operator maintained by the LDC.  Access to the robot
operator was possible via a unique Personal Identification Number
(PIN) issued by the recruiting staff at the LDC when the caller
enrolled in the project.  The participants were made aware that their
telephone call would be recorded, as were the call recipients.  The
call was allowed only if both parties agreed to being recorded.  Each
caller was allowed to talk up to 30 minutes.  Upon successful
completion of the call, the caller was paid $20 (in addition to making
a free long-distance telephone call).  Each caller was allowed to
place only one telephone call.

	Although the goal of the call collection effort was to have
unique speakers in all calls, a handful of repeat speakers are
included in the corpus.  Specific information on this can be found in
the files "spkrinfo.doc" and "callinfo.doc".

	In all, 200 calls were transcribed.  Of these, 80 have been
designated as training calls, 20 as development test calls, and 100 as
evaluation test calls.  For each of the training and development test
calls, a contiguous 10-minute region was selected for transcription;
for the evaluation test calls, a 5-minute region was transcribed.  For
the present publication, 20 of the evaluation test calls are being
released; the remaining 80 test calls are being held in reserve for
future LVCSR benchmark tests.


-----------------------------------------------------------------------
3.  Data verification

	After a successful call was completed, a human audit of each
telephone call was conducted to verify that the proper language was
spoken, to check the quality of the recording, and to check by ear
that the caller had not participated before.  Many
fraudulent calls were caught in this manner, since when requesting a
second PIN, some people gave different (false) names and addresses.

-----------------------------------------------------------------------
4.  Speaker demographics

	Refer to the files "spkrinfo.doc" and "callinfo.doc", which
describe the files "spkrinfo.tbl" and "callinfo.tbl" respectively.


-----------------------------------------------------------------------
5.  Data transcription - General

	All CallHome telephone conversations were transcribed using
the general conventions described below.  In addition to these general
conventions, each language also specified a finite set of
"non-lexemes" (hesitation sounds) provided in section 6.  The
additional special conventions for Egyptian Arabic are provided in
section 7 below.  

	The transcription was carried out on Sun 4 workstations.  The
transcription was done using the emacs text editor which was linked to
the visual and auditory soundwave from the telephone recording in an
xwaves window.  A program written at the LDC linked the xwaves signal
to the emacs buffer so that a highlighted region of the soundwave
could be brought into the emacs buffer as a timestamp via a simple
keystroke.  Similarly, the transcribers could listen to any timemarked
turn in the transcript, and view the aligned soundwave as well.  Thus,
the transcribers had a visual as well as auditory signal that they
were transcribing.  Both the visual and auditory signal were broken
into two separate channels that could be reviewed separately or
together.

	The transcribers were given the transcription conventions
provided below as a guideline for transcribing the telephone
conversations.


               CALLHOME TRANSCRIPTION CONVENTIONS - General


What to transcribe:	10 contiguous minutes (600 seconds) from the
			recorded telephone conversations (5 minutes
			for evaltest calls).  This should not include
			the beginning of the conversation
                        where the speakers are getting permission
                        for being recorded.
			

Definition of turns:    Separate turns are defined by the following
                        criteria:

                (1) speaker change, e.g.

                        A:  Well I was thinking about that

                        B:  I know I talked to &Jan about it yesterday

                (2) within one speaker's stretch of talk, a long
                turn should be broken up in terms of what makes
                grammatical/semantic sense, e.g.

                        A: And I told her %um I didn't I wasn't
                        setting you up to be a spiritual director or
                        anything {laugh} but I did say to her that if she
                        were to talk if she felt that she wanted to
                        talk about her prayer experience in Spanish

                        A: that you would probably be able to certainly
                        to understand her but to empathize a little bit
                        with what she was experiencing

                (3) If there is an extra-long pause within a
                single speaker's turn, break the turn up into two
                turns, e.g.

                        B: When we were fishing out on &Lake &Travis last
                        August I thought I saw, %uh [[long pause]]

                        B: %uh, &George &Martin, but I wasn't sure it was him.


Timestamps:             Each speaker turn is marked with a unique timestamp
                        (in seconds). The timestamps mark the beginning and
                        end time of each turn relative to the beginning of the
                        recording. Each timestamp is precise to the 100th of a
                        second, and is in the format: beginning time [space]
                        ending time, followed by the turn.
                        Some samples:

                27.98 28.72 A: You know so

                137.49 139.47 A: yeah {breath} (( )) [distortion]

                284.54 286.79 B: %ah &Lydia &Van &Damme.
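
The timestamp format above can be read mechanically.  The following
is a minimal Python sketch, not the LDC's own tooling.  The regular
expression is inferred from the three samples only; the name
"parse_turn" and the allowance for a digit after the speaker letter
are assumptions made for illustration.

```python
import re

# Assumed line shape, inferred from the samples above:
#   <start> <end> <speaker>: <text>
# Both times carry exactly two decimals (precision of 1/100 second).
TURN_RE = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) ([A-Z]\d*): ?(.*)$")

def parse_turn(line):
    """Return (start, end, speaker, text), or None if the line is not a turn."""
    m = TURN_RE.match(line.strip())
    if m is None:
        return None
    start, end, speaker, text = m.groups()
    return float(start), float(end), speaker, text

if __name__ == "__main__":
    print(parse_turn("27.98 28.72 A: You know so"))
```

Lines that do not begin with two timestamps and a speaker label
(blank lines, continuation lines of a long turn) yield None.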


Special Conventions:


    Acronyms            Acronyms pronounced like a word are written in all caps
                        with no spaces, e.g.

                        AIDS    NARAL

                        Acronyms pronounced like the individual letters are
                        written in all caps with spaces between the letters:

                        C I A           H I V           C E O

    Numbers             Write all numbers out, do not use digits

                        twenty-two      nineteen-ninety-five

    Interjections       Use the most standard spelling (as given on the
                        lexicon list, if it's there); don't try to
                        represent lengthening by writing repeated letters
                        (like 'ooooh').

                        uh-huh  mm-hm   uh-oh   okay    jeez

    Punctuation		Transcribers are free to add any punctuation
			that they feel is helpful to someone reading
			the transcript.  


Special symbols:


    Noises, conversational phenomena, foreign words, etc. are marked
    with special symbols.  In the table below, "text" represents any
    word or descriptive phrase.

    {text}              sound made by the talker

                        {laugh} {cough} {sneeze} {breath} {lipsmack}

    [text]              sound not made by the talker (background or channel)

                        [distortion]    [static]       [background]   

    [/text]             end of continuous or intermittent sound not made by
                        the talker (beginning marked with previous [text/])

    [[text]]            comment; most often used to describe unusual
                        characteristics of immediately preceding or following
                        speech (as opposed to separate noise event)

                        [[drawn out]]

    ((text))            unintelligible; text is best guess at transcription

                        ((coffee klatch))

    (( ))               unintelligible; can't even guess text

                        (( ))


    <language text>     speech in another language

                        <English going to California>

    <? (( ))>           ? indicates unrecognized language; (( )) indicates
                        untranscribable speech

                        <? ayo canoli>  <? (( ))>

    -text		partial word
    text-               
			-tion absolu- 

    #text#              simultaneous speech on the same channel
                        (simultaneous speech on different channels is not
                        explicitly marked, but is identifiable as such by
                        reference to time marks)

    //text//            aside (talker addressing someone in background)

                        //quit it, I'm talking to your sister!//

    +text+              mispronounced word (spell it in usual orthography)

                        +probably+

   **text**             idiosyncratic word, not in common use

                        **poodle-ish**

    %text               This symbol flags non-lexemes, which are
			general hesitation sounds.  See the section on
			non-lexemes below to see a complete list for
			each language.  

			%mm %uh 

    &text               used to mark proper names and place names

                        &Mary &Jones    &Arizona        &Harper's
                        &Fiat           &Joe's &Grill


    text --             marks end of interrupted turn and continuation
    -- text             of same turn after interruption, e.g.

                        A: I saw &Joe yesterday coming out of --

                        B: You saw &Joe?!

                        A: -- the music store on &Seventeenth and &Chestnut.
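
As an illustration of how the special symbols above can be located in
a transcript line, here is a hedged Python sketch.  It covers only a
subset of the symbols (it leaves out #text#, +text+, **text**, partial
words and "--"), and the category names and the function "find_marks"
are invented for this example.

```python
import re

# Order matters: double delimiters must be tried before single ones,
# otherwise "[[comment]]" would be read as a "[sound]".
PATTERNS = [
    ("comment",        r"\[\[.*?\]\]"),
    ("unintelligible", r"\(\(.*?\)\)"),
    ("talker_sound",   r"\{.*?\}"),
    ("other_sound",    r"\[.*?\]"),
    ("other_language", r"<.*?>"),
    ("aside",          r"//.*?//"),
    ("non_lexeme",     r"%[^\s,.?!]+"),
    ("proper_name",    r"&[^\s,.?!]+"),
]
SCANNER = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in PATTERNS))

def find_marks(text):
    """Return (category, matched_text) pairs in left-to-right order."""
    return [(m.lastgroup, m.group()) for m in SCANNER.finditer(text)]

if __name__ == "__main__":
    for mark in find_marks("yeah {breath} (( )) [distortion]"):
        print(mark)
```

Trailing sentence punctuation is deliberately excluded from "%" and
"&" tokens, so "&Martin," is reported as "&Martin".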

-----------------------------------------------------------------------
6.  Data transcription - Non-lexemes

	For LVCSR purposes, some of the speech sounds uttered by the
conversational participants were deemed to be "non-lexemes" or
periodic sound sequences that are not listed as words in the
pronunciation dictionary.  The "non-lexemes" are distinct from the set
of interjections such as "OkkE" and "A" which are considered as words
in the lexicon.  The "non-lexemes" can loosely be considered as
hesitation sounds that a speaker makes while speaking.  While the
spelling of these sounds is somewhat arbitrary, the transcribers were
given a finite list from which to choose in order to maintain
orthographic consistency.  

	Below is a histogram of the type and frequency of non-lexemes
occurring in the 80 training and 20 devtest transcriptions.


Arabic training and devtest transcripts:

	5061 %ah
	1659 %E
	1275 %M
	569 %mhm
	405 %ha
	76 %uh
	66 %yA
	54 %aha
	53 %yaa
	51 %hm
	10 %hum
	10 %Ah
	5 %ayyO
	5 %Eyy
	4 %yuu
	3 %ih
	3 %yO
	3 %O
	2 %wAw
	2 %hi
	1 %Hay
	1 %yOO
	1 %OhO
	1 %hE


-----------------------------------------------------------------------
7.  Data transcription - Egyptian Arabic special conventions


            PRINCIPLES OF TRANSCRIBING ECA (special conventions)


1.  Spelling

When a question arises about the proper spelling of a word (such as
/H/ or /h/, /a/ or /i/), our "authoritative" source is the Badawi &
Hinds "Dictionary of Egyptian Arabic".  In general, we are avoiding
writing long vowels at the ends of words (with some exceptions below).
Initial glottal stops are not written, since they are fully predictable
and occur before all word-initial vowels.  

If a romanized spelling could have more than one Arabic script
equivalent, disambiguate the word with an "=" followed by a unique
character or numeral.  Verbs are often ambiguous between a final
"alif" and a final "hamza".  Our general convention is to indicate "=a"
for the first and "=h" for the second condition.  All other
ambiguities simply get a digit, such as "=1", "=2", etc.

NB: each disambiguated romanized word appears as a separate entry in
the LDC Arabic lexicon.
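
The "=" convention can be handled with a simple split.  The sketch
below is illustrative only: "WORD" is a placeholder rather than a
real lexicon entry, and the function names are assumptions.

```python
def split_disambiguator(token):
    """Split 'word=tag' into (word, tag); tag is None when no '=' is present."""
    word, sep, tag = token.partition("=")
    return (word, tag) if sep else (token, None)

def describe_tag(tag):
    """Interpret a tag according to the convention described above."""
    if tag is None:
        return "unambiguous"
    if tag == "a":
        return "final alif"
    if tag == "h":
        return "final hamza"
    if tag.isdigit():
        return "numbered alternative"
    return "other"

if __name__ == "__main__":
    for token in ["WORD", "WORD=a", "WORD=h", "WORD=2"]:
        word, tag = split_disambiguator(token)
        print(token, "->", word, describe_tag(tag))
```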


2.  Definite articles

The definite article /il/ is followed by a "+" if immediately followed
by a noun, regardless of its actual pronunciation.  Some examples:

        il+rAgil        "the man"
        il+salAm        "the peace"
        il+qizAzaB      "the bottle"

The exception to this involves a high frequency set phrase:

        ilHamdulillA     "Thank God"

When the definite article 'il' precedes a word that begins with 'k'
or 'g', assimilation of the 'l' is variable, producing either /ikk/
or /ilk/, and /igg/ or /ilg/ respectively.  In the devtest and
training transcripts, the pronunciation used in each such case is
notated as:

	il+k		unassimilated
	il(k)+k		assimilated
	il+g		unassimilated
	il(g)+g		assimilated
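
A consumer of the transcripts may want to recover both the plain
"il+" spelling and the recorded pronunciation.  The following Python
sketch shows one way to do so; the function name and the sample words
"kitAb" and "gamal" are assumptions made for illustration, not entries
taken from the corpus.

```python
import re

# "il(k)+k..." or "il(g)+g...": the bracketed letter must repeat after the "+".
ASSIMILATED_RE = re.compile(r"il\(([kg])\)\+(?=\1)")
UNASSIMILATED_RE = re.compile(r"il\+[kg]")

def article_assimilation(token):
    """Return (plain_spelling, status); status is None when not applicable."""
    if ASSIMILATED_RE.search(token):
        return ASSIMILATED_RE.sub("il+", token), "assimilated"
    if UNASSIMILATED_RE.search(token):
        return token, "unassimilated"
    return token, None

if __name__ == "__main__":
    for token in ["il(k)+kitAb", "il+gamal", "il+rAgil"]:
        print(token, "->", article_assimilation(token))
```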


2.1.  Definite articles and proper names

If a proper name is preceded by the definite article /il/, place the
"&" symbol after the "+" before the name itself:

        il+&sucudiyyaB
        il+&raml


3.  tEh marbUta "B"

In ECA, many feminine nouns and some feminine adjectives ending in
/-a/ can be pronounced as either [-a] or [-it], depending upon what
word comes after it in a sentence.  To capture the generalization that
only the pronunciation is changing, all words which have the tEh 
marbUta in MSA are written with a final /-B/ for ECA, regardless of
the actual pronunciation.  The rules for deriving the set of pronunciations
are in the Lexicon.  Examples:

        HAgaB
        ca$araB
        diyya
        tuscumiyyaB
        tuscumiyyaB wi xamsIn
        tuscumiyyaB dulAr
        baqiyyaB
        mAmaB   (many speakers say "mamti" for 'my mama')

In the devtest and training transcripts, words that end in the
orthographic symbol 'B' (e.g. feminine nouns) which may be pronounced
either [a] or [it] are coded for the specific pronunciation used in
each case with the alternatives 'B~' and 'B(t)', respectively:

	B~      [a]
	B(t)    [it]
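
The two codes can be stripped to recover the citation form together
with the pronunciation that was heard.  This is a minimal Python
sketch under the assumption that the code is always the last thing in
the token; the function name is hypothetical.

```python
def teh_marbuta(token):
    """Return (citation_form, ending); ending is 'a', 'it', or None."""
    if token.endswith("B(t)"):
        return token[:-len("(t)")], "it"   # drop "(t)", keep the final "B"
    if token.endswith("B~"):
        return token[:-len("~")], "a"      # drop "~", keep the final "B"
    return token, None                     # no pronunciation code present

if __name__ == "__main__":
    for token in ["HAgaB~", "ca$araB(t)", "bAba"]:
        print(token, "->", teh_marbuta(token))
```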


4.  Verbal prefixes and suffixes (not pronominal suffixes)

Verbal prefixes and suffixes will be written as part of the verb (just
as in MSA), without the use of "+" or the inclusion of a space.  The
vowel deletions which occur in such forms will be recorded in the
spelling.  Some examples:

        biyitxAniq      (not bi+yitxAniq)       "he is fighting"
        biyifham        "he understands"
        Hayifham        "he will understand"
        mafhimti$       "I don't understand"


5.  Pronominal suffixes

Pronominal suffixes are also written as part of the word without a "+"
or space between the verb and the suffix.  The reason for this is to
maintain consistency with negated verbs such as /mafahimtaha$/ "I
don't understand it", where the /$/ remains attached to the verb as in
(4) above.  Examples:

        katabha         "he wrote it (fem.)"
        katabu          "he wrote it (masc.)"
        katabtaha       "I wrote it (fem.)"


6.  "Inseparable" prepositions

The "inseparable" prepositions /bi-/ "with", /li-/ "to, for", /ka-/
"like" are all written with the following word, separated by a "+".
If the definite article comes between the inseparable preposition and
the word stem, it is written in the same manner.  Examples:

        bi+il+lEl	"at night"
        li+il+madInaB   "to the city"


In addition to the definite article and inseparable prepositions, the
conjunction "fa" is written together with the word that it modifies,
separated by a "+" symbol.  Example:

	fa+xalAS	"that's it"
	fa+xallIni	"let me"
	

Finally, there are three instances where a "+" symbol is included
after the word "ya":

	fa+ya+rEt 
	ya+rEt
	ya+rEtak


7.  Numerals

The numerals should all be written in citation form.  The lexicon will
include the rules for deriving their pronunciation, since numerals
behave differently from other adjectives.  Examples:

        xamsaB          "five"
        ca$araB         "ten"
        ca$araB ayyAm   "ten days"

Note: The word for "days" will always be written /ayyAm/ even though
it is pronounced [iyyAm] after the numerals 3-10.  We will include
this as a rule in the lexicon.


8.  Foreign words and placenames

Foreign words are transcribed using the convention <English research
assistant>.  However, there are some instances where the words or
placenames have been nativized.  These words should be written as
pronounced.  Some examples:

        &niujirsi		"New Jersey"
        &niuyOrk		"New York"
        &lusanjilus		"Los Angeles"
        yA			"yeah"
        <English seven up>      "Seven Up"


9.  Standard spellings

The words below should be transcribed as shown, regardless of variant
pronunciations.

        matuqcudI$      "don't sit down"
        kuwayyis        "well"
        nuSS            "half"
        bass            "enough"
        bAba            "father"
        mAmaB           "mother"
        diqiqtEn        "two minutes"
        mazilt          "still"
        walla           "or / a short version of wallAhi"
        la              "no"


10.  Words with variable spellings

The words below should be transcribed as shown depending upon what one
hears:

        iwci     /      iwca            "don't"
        buqq     /      buqqi           "mouth, my mouth"
        ca       /      cala            "on"
        kat      /      kAnit           "was"
        laHsan   /      li+aHsan        "for the better"
        ca$An    /      cala$An         "because"
        ana xadt /      ana axadt       "I took it"


11.  Miscellaneous cases

The following phrases are written as one word, the reason being that
they are high frequency and occur as set phrases:

        in$ACallA / in$alla     "God willing" (depending upon what is said)
        biCiznillA              "God willing"
        wallAhi                 "swear to God"
        liCinn                  "because"
        allAhuakbar             "God is greatest"


12.  On indicating dialectal words

If a speaker pronounces a word in marked dialect (especially if the
word changes shape due to the dialect), the word will be flagged as if
it is a foreign word with either <Upper word> or <Delta word>, the two
main dialect areas of Egypt.  The same is true if a word follows the
grammatical pattern of Modern Standard Arabic, in which case it is
marked as <MSA word>.


-----------------------------------------------------------------------
8.  Quality control (QC) procedures

	The transcripts were created in an iterative manner.  The
first step was to transcribe and timestamp the appropriate portion of
each conversation.  Once this was completed, formatting and spelling
were checked and corrected.  A second pass was then made over all of
the transcripts, in which both content and formatting were checked
once more.  Throughout this process, small improvements were
constantly made and re-checked for accuracy.  In most instances, a
third (or even fourth) pass was made over the transcript to verify its
accuracy.  

Spelling: 

	As the telephone conversations were being transcribed, the
words found in the transcripts were being compiled for inclusion in
pronunciation dictionaries also being prepared by the LDC.  As the
lexicon workers compiled lists of words, they checked (among other
things) for spelling errors.  The lists of spelling/typo errors found
in the transcripts were compiled, and a program was run over the
transcripts to replace a misspelled word with its correct spelling.
Thus, work on the pronunciation dictionaries of the respective
languages helped to double-check the proper spelling of all words in
the transcripts.  
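
	The replacement step can be pictured as a whole-token
substitution driven by a list of known errors.  The Python sketch
below is an assumption about the general approach, not the LDC
program itself; the misspelling "kuwayis" is invented for the example.

```python
def apply_corrections(line, corrections):
    """Replace whole space-delimited tokens that appear in `corrections`."""
    # split(" ") rather than split(), so runs of spaces survive the round trip
    return " ".join(corrections.get(tok, tok) for tok in line.split(" "))

if __name__ == "__main__":
    fixes = {"kuwayis": "kuwayyis"}   # hypothetical error list
    print(apply_corrections("A: kuwayis bass", fixes))
```

Matching whole tokens only means that a correction never rewrites
part of a longer word.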

Syntax:  

	To check the well-formedness of the bracketing, a program was
written which goes over the transcripts and notes any apparent
irregularities.  This program was later adapted for on-line use by the
transcribers to be used while creating the transcripts.  A final
syntax check was run over all transcripts before the final release.
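
	A bracketing check of this general kind can be sketched with
a stack.  The Python code below is a simplified illustration and not
the LDC program: it tests only that the characters { } [ ] ( ) < >
are balanced and properly nested, and the function name is
hypothetical.

```python
PAIRS = {"{": "}", "[": "]", "(": ")", "<": ">"}
CLOSERS = set(PAIRS.values())

def bracket_errors(text):
    """Return a list of (position, message) for unbalanced delimiters."""
    errors, stack = [], []
    for pos, ch in enumerate(text):
        if ch in PAIRS:
            stack.append((ch, pos))
        elif ch in CLOSERS:
            if stack and PAIRS[stack[-1][0]] == ch:
                stack.pop()
            else:
                errors.append((pos, f"unexpected {ch!r}"))
    # Anything still on the stack was opened but never closed.
    errors.extend((pos, f"unclosed {ch!r}") for ch, pos in stack)
    return errors

if __name__ == "__main__":
    print(bracket_errors("yeah {breath} (( )) [distortion]"))
    print(bracket_errors("yeah {breath (( ))"))
```

Double delimiters such as "(( ))" and "[[ ]]" pass because each
character is paired individually.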

Timestamps:

	To check the well-formedness of timestamps, a program was
developed that checked for (1) overlapping timestamps, (2) start times
that are greater than end times, (3) turns that are missing
timestamps, (4) the proper formatting of a blank line before each
timestamp, (5) proper number of digits in each timestamp, and (6) the
proper marking of the speaker id.  This procedure was folded into the
syntax checking procedure to be used on-line by the transcribers.  
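
	Checks (1) to (3) can be sketched as follows.  This Python
code is an illustration under stated assumptions, not the LDC
program: overlap is tested only between turns of the same speaker
(simultaneous speech on different channels is legitimate), a turn
"missing a timestamp" is taken to be a line that starts directly with
a speaker label, and all names are hypothetical.

```python
import re

TURN_RE = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) ([A-Z]\d*):")
BARE_SPEAKER_RE = re.compile(r"^[A-Z]\d*:")

def check_timestamps(lines):
    """Return (line_number, problem) pairs; line numbers start at 1."""
    problems = []
    last_end = {}                      # latest end time seen per speaker
    for n, line in enumerate(lines, start=1):
        m = TURN_RE.match(line)
        if m is None:
            if BARE_SPEAKER_RE.match(line):
                problems.append((n, "missing timestamp"))
            continue                   # blank or continuation line
        start, end, spk = float(m.group(1)), float(m.group(2)), m.group(3)
        if start > end:
            problems.append((n, "start after end"))
        if start < last_end.get(spk, 0.0):
            problems.append((n, "overlaps previous turn of same speaker"))
        last_end[spk] = max(end, last_end.get(spk, 0.0))
    return problems

if __name__ == "__main__":
    sample = ["27.98 28.72 A: You know so", "", "28.10 27.50 B: yeah"]
    print(check_timestamps(sample))
```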

Content:

	To check that the properly spelled and formatted transcription
actually matched the spoken signal, a second human pass was made over
all of the transcripts.  In many instances, three or more passes were
made as well.


-----------------------------------------------------------------------
9.  Conversion to Arabic script

	Once the transcripts and quality check were completed on the
romanized version of the transcripts, the transcripts were converted
to Arabic script.  This was done by an automatic lookup-and-replace
process using the LDC Arabic lexicon, which includes one (or more)
Arabic script equivalents for every romanized entry.  In the cases
where the Arabic conversion is ambiguous (where there is more than one
possible script version for a given romanized word), the correct
version was hand-chosen in the context of the converted Arabic script
transcript.

	There are a number of general instances where the romanized
character sequence differs from the Arabic script character sequence:

	1.  In verbal forms, the romanized script indicates stem-vowel 
length distinctions which are not found in the Arabic script.

	2.  Where the romanized script writes (standard) /th/ as the
spoken /s/ or /t/, and (standard) /dh/ as /z/ or /d/, the Arabic
script version writes both the /th/ and /dh/ where these are
pronounced as /s/ and /z/ respectively.  This is schematized below:

MSA:                     s  th   t  th               z  dh  d  dh
                          \ /     \ /                 \ /    \ /
LDC romanization:          s       t                   z      d
                          / \      |                  / \     |
LDC ECA script:          s  th     t                 z  dh    d


	3.  The LDC romanized script indicates "doubled" consonants in
ECA with two orthographic letters.  The Arabic script version would be
expected to indicate consonant quantity or duration with a "shadda".
Unfortunately, the Arabic character set provided by the MULE text
editor which was used to create the Arabic script equivalents does not
contain "shadda" as a possible character.  Therefore consonant
duration is not indicated in the Arabic script version of the
transcripts.

	4.  Initial vowel correspondences between the romanized and
Arabic script versions are as follows:

        a   Alif
        A   Alif with madda
        E/I Alif and ya
        i   Alif
        O/U Alif and waw
        u   Alif

Since the glottal stop or "hamza" is usually not pronounced in ECA, we
do not regularly include this character either in the romanization or
in the Arabic script.

	5.  In the romanized version of the transcripts, indirect
object suffixes on verbal forms are written as part of the word.  This
follows native speaker intuition and understanding of where a word
boundary occurs.  In the Arabic script version, however, indirect
object suffixes are written separately from the verbal stem, as is the
practice in MSA.  This convention was decided upon for the sake of
readability.  The two-to-one correspondence from Arabic script to
romanized word can be found in the Arabic lexicon (where the headword
is provided first in the romanized form).

	6.  As indicated in section 10 below, there are two instances
where we have distinctive characters in the romanized version of the 
transcripts which are merged in the Arabic script.  These are:

	Roman		Arabic
	f		f		voiceless labio-dental fricative
	v		f		voiced labio-dental fricative
	g		g		voiced velar stop
	j		g		voiced alveopalatal affricate


	7.  Of all of the non-letter characters utilized in the
romanized version of the transcripts, only the "%", "&", and "+...+"
have been rendered in the Arabic script version.  Their encoding is
discussed in section 11 below.
	In addition, where the proper name flag "&" occurs following
the definite article "il+" (such as in the word "il+&a$raf") or an
inseparable preposition (such as in "bi+&amAl") in the romanized
transcripts, the corresponding character ";" always occurs prior to
the entire word in the Arabic script version.


-----------------------------------------------------------------------
10.  Arabic script/romanization correspondence table

	Refer to the file "scr2rom.tbl"


-----------------------------------------------------------------------
11.  ISO 8859-6 character encoding


	Refer to the file "iso-spec.doc"