-----------------------------------------------------------
Description of the CallHome telephone speech and transcript
corpus for Egyptian Colloquial Arabic
-----------------------------------------------------------

April, 1997

Project leader:              Krisjanis Karins

Consultation:                Mark Liberman
                             Cynthia McLemore
                             Everett Rowson

Transcribers/corrections:    Howaida Arram
                             Alaa El-Habashi
                             Hassan A. Gadalla
                             Hanaa Kilany
                             Amr A. Shalaby
                             Ashraf Yacoub

Programming support:         Robert MacIntyre


CONTENTS

 1. Summary abstract
 2. Data acquisition
 3. Data verification
 4. Speaker demographics
 5. Data transcription - General
 6. Data transcription - Non-lexemes
 7. Data transcription - Egyptian Arabic special conventions
 8. Quality control (QC) procedures
 9. Conversion to Arabic script
10. Arabic script/romanization correspondence table
11. ISO 8859-6 character encoding

-----------------------------------------------------------------------
1. Summary abstract

The CallHome Arabic corpus of telephone speech was collected and transcribed by the Linguistic Data Consortium primarily in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department of Defense.

This release of the CallHome Arabic corpus consists of 120 unscripted telephone conversations between native speakers of Egyptian Colloquial Arabic (ECA), the spoken variety of Arabic found in Egypt. The dialect of ECA that this corpus represents is Cairene Arabic.

Each transcript covers a contiguous 5- or 10-minute segment taken from a recorded conversation lasting up to 30 minutes.

All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends overseas. All calls originated in North America and were placed overseas.
The distribution of call destinations can be found in the file "spkrinfo.doc".

The transcripts are timestamped by speaker turn, and are provided in the LDC standardized orthography. The timestamps are aligned with the speech signal.

Given the lack of a standard orthographic system for ECA, the LDC developed a standard romanized orthography for the language. The romanized orthography uses ASCII characters and is phonemically based. It strives to maintain both word pronunciation information and word identity, while minimizing ambiguity. Once the transcripts were completed in romanized form, they were converted to Arabic script via lookup through the LDC lexicon of Egyptian Colloquial Arabic. Both the romanized and the Arabic script versions of the transcripts are found in this release.

-----------------------------------------------------------------------
2. Data acquisition

Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements), and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project. The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk for up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call.

Although the goal of the call collection effort was to have unique speakers in all calls, a handful of repeat speakers are included in the corpus.
Specific information on this can be found in the files "spkrinfo.doc" and "callinfo.doc".

In all, 200 calls were transcribed. Of these, 80 have been designated as training calls, 20 as development test calls, and 100 as evaluation test calls. For each of the training and development test calls, a contiguous 10-minute region was selected for transcription; for the evaluation test calls, a 5-minute region was transcribed. For the present publication, 20 of the evaluation test calls are being released; the remaining 80 test calls are being held in reserve for future LVCSR benchmark tests.

-----------------------------------------------------------------------
3. Data verification

After a successful call was completed, a human audit of each telephone call was conducted to verify that the proper language was spoken, to check the quality of the recording, and to check by ear that the caller had not participated before. Many fraudulent calls were caught in this manner, since some people gave different (false) names and addresses when requesting a second PIN.

-----------------------------------------------------------------------
4. Speaker demographics

Refer to the files "spkrinfo.doc" and "callinfo.doc", which describe the files "spkrinfo.tbl" and "callinfo.tbl" respectively.

-----------------------------------------------------------------------
5. Data transcription - General

All CallHome telephone conversations were transcribed using the general conventions described below. In addition to these general conventions, a finite set of "non-lexemes" (hesitation sounds) was specified for each language; these are provided in section 6. The additional special conventions for Egyptian Arabic are provided in section 7 below.

The transcription was carried out on Sun 4 workstations. The transcription was done using the emacs text editor, which was linked to the visual and auditory soundwave from the telephone recording in an xwaves window.
A program written at the LDC linked the xwaves signal to the emacs buffer so that a highlighted region of the soundwave could be brought into the emacs buffer as a timestamp via a simple keystroke. Similarly, the transcribers could listen to any timemarked turn in the transcript, and view the aligned soundwave as well. Thus, the transcribers had a visual as well as an auditory signal to transcribe from. Both the visual and the auditory signal were broken into two separate channels that could be reviewed separately or together.

The transcribers were given the transcription conventions below as a guideline for transcribing the telephone conversations.

CALLHOME TRANSCRIPTION CONVENTIONS - General

What to transcribe:

  10 contiguous minutes (600 seconds) from the recorded telephone conversations (5 minutes for evaltest calls). This should not include the beginning of the conversation, where the speakers are getting permission to be recorded.

Definition of turns:

  Separate turns are defined by the following criteria:

  (1) speaker change, e.g.

      A: Well I was thinking about that
      B: I know I talked to &Jan about it yesterday

  (2) within one speaker's stretch of talk, a long turn should be broken up in terms of what makes grammatical/semantic sense, e.g.

      A: And I told her %um I didn't I wasn't setting you up to be a spiritual director or anything {laugh} but I did say to her that if she were to talk if she felt that she wanted to talk about her prayer experience in Spanish
      A: that you would probably be able to certainly to understand her but to empathize a little bit with what she was experiencing

  (3) If there is an extra-long pause within a single speaker's turn, break the turn up into two turns, e.g.

      B: When we were fishing out on &Lake &Travis last August I thought I saw, %uh [[long pause]]
      B: %uh, &George &Martin, but I wasn't sure it was him.

Timestamps:

  Each speaker turn is marked with a unique timestamp (in seconds).
  The timestamps mark the beginning and end time of each turn relative to the beginning of the recording. Each timestamp is precise to the 100th of a second, and is in the format: beginning time [space] ending time, followed by the turn. Some samples:

      27.98 28.72 A: You know so
      137.49 139.47 A: yeah {breath} (( )) [distortion]
      284.54 286.79 B: %ah &Lydia &Van &Damme.

Special Conventions:

  Acronyms

    Acronyms pronounced like a word are written in all caps with no spaces, e.g.

      AIDS
      NARAL

    Acronyms pronounced like the individual letters are written in all caps with spaces between the letters:

      C I A
      H I V
      C E O

  Numbers

    Write all numbers out; do not use digits.

      twenty-two
      nineteen-ninety-five

  Interjections

    Use the most standard spelling (as given on the lexicon list, if it's there); don't try to represent lengthening by repeating letters (like 'ooooh').

      uh-huh
      mm-hm
      uh-oh
      okay
      jeez

  Punctuation

    Transcribers are free to add any punctuation that they feel is helpful to someone reading the transcript.

Special symbols:

  Noises, conversational phenomena, foreign words, etc. are marked with special symbols. In the table below, "text" represents any word or descriptive phrase.

  {text}        sound made by the talker
                  {laugh} {cough} {sneeze} {breath} {lipsmack}

  [text]        sound not made by the talker (background or channel)
                  [distortion] [static] [background]

  [/text]       end of continuous or intermittent sound not made by the talker (beginning marked with previous [text/])

  [[text]]      comment; most often used to describe unusual characteristics of immediately preceding or following speech (as opposed to separate noise event)
                  [[drawn out]]

  ((text))      unintelligible; text is best guess at transcription
                  ((coffee klatch))

  (( ))         unintelligible; can't even guess text
                  (( ))

                speech in another language
                ?
                indicates unrecognized language; (( )) indicates untranscribable speech

  -text         partial word
  text-           -tion absolu-

  #text#        simultaneous speech on the same channel (simultaneous speech on different channels is not explicitly marked, but is identifiable as such by reference to time marks)

  //text//      aside (talker addressing someone in background)
                  //quit it, I'm talking to your sister!//

  +text+        mispronounced word (spell it in usual orthography)
                  +probably+

  **text**      idiosyncratic word, not in common use
                  **poodle-ish**

  %text         This symbol flags non-lexemes, which are general hesitation sounds. See the section on non-lexemes below for a complete list for each language.
                  %mm %uh

  &text         used to mark proper names and place names
                  &Mary &Jones &Arizona &Harper's &Fiat &Joe's &Grill

  text --       marks end of interrupted turn and continuation
  -- text       of same turn after interruption, e.g.
                  A: I saw &Joe yesterday coming out of --
                  B: You saw &Joe?!
                  A: -- the music store on &Seventeenth and &Chestnut.

-----------------------------------------------------------------------
6. Data transcription - Non-lexemes

For LVCSR purposes, some of the speech sounds uttered by the conversational participants were deemed to be "non-lexemes" or periodic sound sequences that are not listed as words in the pronunciation dictionary. The "non-lexemes" are distinct from the set of interjections such as "OkkE" and "A", which are considered words in the lexicon. The "non-lexemes" can loosely be considered hesitation sounds that a speaker makes while speaking. While the spelling of these sounds is somewhat arbitrary, the transcribers were given a finite list from which to choose in order to maintain orthographic consistency.

Below is a histogram of the type and frequency of non-lexemes occurring in the 80 training and 20 devtest transcriptions.
Arabic training and devtest transcripts:

    5061 %ah
    1659 %E
    1275 %M
     569 %mhm
     405 %ha
      76 %uh
      66 %yA
      54 %aha
      53 %yaa
      51 %hm
      10 %hum
      10 %Ah
       5 %ayyO
       5 %Eyy
       4 %yuu
       3 %ih
       3 %yO
       3 %O
       2 %wAw
       2 %hi
       1 %Hay
       1 %yOO
       1 %OhO
       1 %hE

-----------------------------------------------------------------------
7. Data transcription - Egyptian Arabic special conventions

PRINCIPLES OF TRANSCRIBING ECA (special conventions)

1. Spelling

When a question arises about the proper spelling of a word (such as /H/ or /h/, /a/ or /i/), our "authoritative" source is the Badawi & Hinds "Dictionary of Egyptian Arabic".

In general, we avoid writing long vowels at the ends of words (with some exceptions below).

Initial glottal stops are not written, since they are fully predictable and occur before all word-initial vowels.

If a romanized spelling could have more than one Arabic script equivalent, disambiguate the word with an "=" followed by a unique character or numeral. Verbs are often ambiguous between a final "alif" and a final "hamza". Our general convention is to indicate the first condition with "=a" and the second with "=h". All other ambiguities simply get a digit, such as "=1", "=2", etc. NB: each disambiguated romanized word appears as a separate entry in the LDC Arabic lexicon.

2. Definite articles

The definite article /il/ is written with a following "+" when it immediately precedes a noun, regardless of its actual pronunciation. Some examples:

    il+rAgil        "the man"
    il+salAm        "the peace"
    il+qizAzaB      "the bottle"

The exception to this involves a high frequency set phrase:

    ilHamdulillA    "Thank God"

When the definite article 'il' precedes a word that begins with 'k' or 'g', assimilation of the 'l' is variable, producing either /ikk/ or /ilk/, and /igg/ or /ilg/ respectively. In the devtest and training transcripts, the particular pronunciation of each such case is notated as:

    il+k            unassimilated
    il(k)+k         assimilated
    il+g            unassimilated
    il(g)+g         assimilated

2.1.
Definite articles and proper names

If a proper name is preceded by the definite article /il/, place the "&" symbol after the "+" before the name itself:

    il+&sucudiyyaB
    il+&raml

3. tEh marbUta "B"

In ECA, many feminine nouns and some feminine adjectives ending in /-a/ can be pronounced as either [-a] or [-it], depending upon what word follows in the sentence. To capture the generalization that only the pronunciation is changing, all words which have the tEh marbUta in MSA are written with a final /-B/ for ECA, regardless of the actual pronunciation. The rules for deriving the set of pronunciations are in the Lexicon.

Examples:

    HAgaB ca$araB diyya tuscumiyyaB tuscumiyyaB wi xamsIn tuscumiyyaB dulAr baqiyyaB mAmaB (many speakers say "mamti" for 'my mama')

In the devtest and training transcripts, words that end in the orthographic symbol 'B' (e.g. feminine nouns), which may be pronounced either [a] or [it], are coded for the specific pronunciation used in each case with the alternatives 'B~' and 'B(t)', respectively:

    B~              [a]
    B(t)            [it]

4. Verbal prefixes and suffixes (not pronominal suffixes)

Verbal prefixes and suffixes will be written as part of the verb (just as in MSA), without the use of "+" or the inclusion of a space. The vowel deletions which occur in such forms will be recorded in the spelling. Some examples:

    biyitxAniq      "he is fighting" (not bi+yitxAniq)
    biyifham        "he understands"
    Hayifham        "he will understand"
    mafhimti$       "I don't understand"

5. Pronominal suffixes

Pronominal suffixes are also written as part of the word, without a "+" or space between the verb and the suffix. The reason for this is to maintain consistency with negated verbs such as /mafahimtaha$/ "I don't understand it", where the /$/ remains attached to the verb as in (4) above. Examples:

    katabha         "he wrote it (fem.)"
    katabu          "he wrote it (masc.)"
    katabtaha       "I wrote it (fem.)"

6.
"Inseparable" prepositions The "inseparable" prepositions /bi-/ "with", /li-/ "to, for", /ka-/ "like" are all written with the following word, separated by a "+". If the definite article comes between the inseparable preposition and the word stem, it is written in the same manner. Example: bi+il+lEl "at night" li+il+madInaB "to the city" In addition to the definite article and inseparable prepositions, the conjunction "fa" is written together with the word that it modifies, separated by a "+" symbol. Example: fa+xalAS "that's it" fa+xallIni "let me" Finally, there are three instances where a "+" symbol is included after the word "ya": fa+ya+rEt ya+rEt ya+rEtak 7. Numerals The numerals should all be written in citation form. The lexicon will include the rules for deriving their pronunciation, since numerals behave differently from other adjectives. Examples: xamsaB "five" ca$araB "ten" ca$araB ayyAm "ten days" Note: The word for "days" will always be written /ayyAm/ even though it is pronounced [iyyAm] after the numerals 3-10. We will include this as a rule in the lexicon. 8. Foreign words and placenames Foreign words are transcribed using the convention . However, there are some instances where the words or placenames have been nativized. These words should be written as pronounced. Some examples: &niujirsi "New Jersey" &niuyOrk "New York" &lusanjilus "Los Angeles" yA "yeah" "Seven Up" 9. Standard spellings The words below should be transcribed as shown, regardless of variant pronunciations. matuqcudI$ "don't sit down" kuwayyis "well" nuSS "half" bass "enough" bAba "father" mAmaB "mother" diqiqtEn "two minutes" mazilt "still" walla "or / a short version of wallAhi" la "no" 10. Words with variable spellings The words below should be transcribed as shown depending upon what one hears: iwci / iwca "don't" buqq / buqqi "mouth, my mouth" ca / cala "on" kat / kAnit "was" laHsan / li+aHsan "for the better" ca$An / cala$An "because" ana xadt / ana axadt "I took it" 11. 
Miscellaneous cases

The following phrases are written as one word, the reason being that they are high frequency and occur as set phrases:

    in$ACallA / in$alla     "God willing" (depending upon what is said)
    biCiznillA              "God willing"
    wallAhi                 "swear to God"
    liCinn                  "because"
    allAhuakbar             "God is greatest"

12. On indicating dialectal words

If a speaker pronounces a word in marked dialect (especially if the word changes shape due to the dialect), the word will be flagged as if it is a foreign word with either or , the two main dialect areas of Egypt. The same is true if a word follows the grammatical pattern of Modern Standard Arabic, in which case it is marked as .

-----------------------------------------------------------------------
8. Quality control (QC) procedures

The transcripts were created in an iterative manner. The first step was to transcribe and timestamp the appropriate portion of each conversation. Once this was completed, formatting and spelling were checked and corrected. A second pass was then made over all of the transcripts, in which both content and formatting were checked once more. Throughout this process, small improvements were constantly made and re-checked for accuracy. In most instances, a third (or even fourth) pass was made over the transcript to verify its accuracy.

Spelling: As the telephone conversations were being transcribed, the words found in the transcripts were being compiled for inclusion in pronunciation dictionaries also being prepared by the LDC. As the lexicon workers compiled lists of words, they checked (among other things) for spelling errors. The spelling errors and typos found in the transcripts were compiled into lists, and a program was run over the transcripts to replace each misspelled word with its correct spelling. Thus, work on the pronunciation dictionaries of the respective languages helped to double-check the spelling of all words in the transcripts.
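
The spelling-replacement step described above can be pictured with a minimal sketch. This is not the LDC program, which is not included in this documentation; it only illustrates one way such a pass could work. The function name and the two misspellings in the correction list are hypothetical (the target spellings "kuwayyis" and "nuSS" are the standard spellings listed in section 7), and the sketch assumes that a word is any whitespace-delimited token.

```python
import re

# Hypothetical correction list (misspelling -> standard spelling).
# The entries are illustrative only, not taken from the LDC lists.
CORRECTIONS = {"kwayyis": "kuwayyis", "nuS": "nuSS"}

# Assumption: a "word" is any run of non-whitespace characters.
TOKEN = re.compile(r"\S+")

def correct_line(line, corrections):
    """Replace whole tokens found in the correction list, keeping spacing."""
    def fix(match):
        token = match.group(0)
        # Only exact whole-token matches are replaced, so timestamps,
        # speaker labels and flagged tokens such as %ah pass through.
        return corrections.get(token, token)
    return TOKEN.sub(fix, line)

print(correct_line("27.98 28.72 A: bass kwayyis", CORRECTIONS))
```

Because only exact tokens are matched, a misspelling with punctuation attached (for example "kwayyis.") would be left alone in this sketch; a real pass would need a policy for such cases.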
Syntax: To check the well-formedness of the bracketing, a program was written which goes over the transcripts and notes any apparent irregularities. This program was later adapted for on-line use by the transcribers while creating the transcripts. A final syntax check was run over all transcripts before the final release.

Timestamps: To check the well-formedness of timestamps, a program was developed that checked for (1) overlapping timestamps, (2) start times that are greater than end times, (3) turns that are missing timestamps, (4) the proper formatting of a blank line before each timestamp, (5) the proper number of digits in each timestamp, and (6) the proper marking of the speaker id. This procedure was folded into the syntax checking procedure to be used on-line by the transcribers.

Content: To check that the properly spelled and formatted transcription actually matched the spoken signal, a second human pass was made over all of the transcripts. In many instances, three or more passes were made.

-----------------------------------------------------------------------
9. Conversion to Arabic script

Once transcription and quality checking of the romanized version of the transcripts were completed, the transcripts were converted to Arabic script. This was done by an automatic lookup-and-replace process using the LDC Arabic lexicon, in which one (or more) Arabic script equivalents are included for every romanized entry. In the cases where the Arabic conversion is ambiguous (where there is more than one possible script version for a given romanized word), the correct version was hand-chosen in the context of the converted Arabic script transcript.

There are a number of general instances where the romanized character sequence differs from the Arabic script character sequence:

1. In verbal forms, the romanized script indicates stem-vowel length distinctions which are not found in the Arabic script.

2.
Where the romanized script writes (standard) /th/ as the spoken /s/ or /t/, and (standard) /dh/ as /z/ or /d/, the Arabic script version writes both the /th/ and /dh/ where these are pronounced as /s/ and /z/ respectively. This is schematized below:

    MSA:                s   th      t   th      z   dh      d   dh
                         \ /         \ /         \ /         \ /
    LDC romanization:     s           t           z           d
                         / \          |          / \          |
    LDC ECA script:     s   th        t         z   dh        d

3. The LDC romanized script indicates "doubled" consonants in ECA with two orthographic letters. The Arabic script version would be expected to indicate consonant quantity or duration with a "shadda". Unfortunately, the Arabic character set provided by the MULE text editor, which was used to create the Arabic script equivalents, does not contain "shadda" as a possible character. Therefore consonant duration is not indicated in the Arabic script version of the transcripts.

4. Initial vowel correspondences between the romanized and Arabic script versions are as follows:

    a       Alif
    A       Alif with madda
    E/I     Alif and ya
    i       Alif
    O/U     Alif and waw
    u       Alif

Since the glottal stop or "hamza" is usually not pronounced in ECA, we do not regularly include this character either in the romanization or in the Arabic script.

5. In the romanized version of the transcripts, indirect object suffixes on verbal forms are written as part of the word. This follows native speaker intuition and understanding of where a word boundary occurs. In the Arabic script version, however, indirect object suffixes are written separately from the verbal stem, as is the practice in MSA. This convention was decided upon for the sake of readability. The two-to-one correspondence from Arabic script to romanized word can be found in the Arabic lexicon (where the headword is provided first in the romanized form).

6. As indicated in section 10 below, there are two instances where we have distinctive characters in the romanized version of the transcripts which are merged in the Arabic script.
These are:

    Roman   Arabic
    f       f       voiceless labio-dental fricative
    v       f       voiced labio-dental fricative
    g       g       voiced velar stop
    j       g       voiced alveopalatal affricate

7. Of all of the non-letter characters utilized in the romanized version of the transcripts, only the "%", "&", and "+...+" have been rendered in the Arabic script version. Their encoding is discussed in section 11 below. In addition, where the proper name flag "&" occurs following the definite article "il+" (such as in the word "il+&a$raf") or an inseparable preposition (such as in "bi+&amAl") in the romanized transcripts, the corresponding character ";" always occurs prior to the entire word in the Arabic script version.

-----------------------------------------------------------------------
10. Arabic script/romanization correspondence table

Refer to the file "scr2rom.tbl"

-----------------------------------------------------------------------
11. ISO 8859-6 character encoding

Refer to the file "iso-spec.doc"
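
-----------------------------------------------------------------------
Illustration: reading the turn format

For readers who wish to process the transcripts, the turn format of section 5 (beginning time, space, ending time, speaker id, text) and some of the timestamp checks listed in section 8 can be sketched as follows. This is a minimal illustration under stated assumptions, not the LDC checking program. The function names are hypothetical. It assumes that speaker ids are the single letters A or B, as in the samples in section 5, and it treats overlap as an error only between turns of the same speaker, since simultaneous speech on different channels is expected.

```python
import re

# Assumed turn shape: "start end SPEAKER: text", with times written to
# exactly two decimal places and speaker ids A or B (an assumption
# based on the samples in section 5).
TURN = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) ([AB]): ?(.*)$")

def parse_turn(line):
    """Return (start, end, speaker, text), or None if the line is malformed."""
    m = TURN.match(line)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2)), m.group(3), m.group(4)

def check_turns(lines):
    """Return a list of (line number, problem) pairs for a transcript."""
    problems = []
    last_end = {}  # latest end time seen so far, per speaker
    for n, line in enumerate(lines, start=1):
        if not line.strip():
            continue  # blank separator lines are skipped
        turn = parse_turn(line)
        if turn is None:
            problems.append((n, "malformed turn"))
            continue
        start, end, speaker, _text = turn
        if start > end:
            problems.append((n, "start after end"))
        if start < last_end.get(speaker, 0.0):
            problems.append((n, "overlaps previous turn of same speaker"))
        last_end[speaker] = max(end, last_end.get(speaker, 0.0))
    return problems

sample = [
    "27.98 28.72 A: You know so",
    "",
    "28.10 29.00 B: yeah",
    "28.50 28.40 A: oops",
    "30 31 A: bad",
]
for problem in check_turns(sample):
    print(problem)
```

In the sample, the fourth line is reported twice (its start follows its end, and it begins before the previous turn of speaker A has finished), and the fifth line is reported as malformed because its times lack the two decimal places.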