Publication Title: BBN/AUB DARPA Babylon Levantine Arabic Corpus

Authors: BBN Technologies (with American University of Beirut as subcontractor)
John Makhoul, Bushra Zawaydeh, Frederick Choi, David Stallard

Primary contact: Dave Stallard (stallard@bbn.com, 617.873.2825)

Project: DARPA Babylon Project

Background:

This is a corpus of transcribed, spontaneous speech, recorded from subjects speaking in Levantine colloquial Arabic. Levantine Arabic is the dialect of Arabic spoken by ordinary people in Lebanon, Jordan, Syria, and Palestine. It differs significantly from Modern Standard Arabic (MSA), the written and "official" form of Arabic, in that it is a spoken rather than a written language. It includes different word pronunciations, and even different words, from MSA.

This corpus was developed with funding from the Defense Advanced Research Projects Agency (DARPA), as part of the Babylon program. The Babylon program is intended to advance the state of the art in speech-to-speech translation systems, both by creating new technology and by developing systems for field use. More information on the Babylon program may be found at http://darpa-babylon.mitre.org.

BBN was funded under Babylon to develop a limited English/Arabic refugee/medical speech translation system for a handheld computer, and collected this corpus as part of its work. The corpus should be useful to anyone attempting speech recognition in Levantine colloquial Arabic, including for speech translation and spoken dialog systems. At the time of this writing this corpus is, as far as we know, the only publicly available transcribed corpus of Levantine Arabic.

Data type: speech, text

Data sources: microphone

Collection Procedure:

The corpus was recorded using a close-talking, noise-cancelling headset microphone (the Andrea Electronics NC-65). A Java-based data-collection tool, developed by BBN, was used to do the collection.
This tool allowed the experimenter to select a particular scenario, and then step through the questions in it. To ask a question, the experimenter would click the "ask" button, and the tool would play a prerecorded Arabic prompt, corresponding to the Arabic translation of the question. Upon completion of the prompt, the tool would go into listening mode. The subject would speak his reply, which the tool would record. When the subject was finished, the experimenter would click a "stop" button on the GUI, and then go on to the next question. Thus, end-pointing of the speech was done manually rather than automatically.

As an additional feature, the tool indicated the volume of the recording as "low", "normal", or "too high". The experimenter could then tell the subject to speak either louder or softer, and rerecord his response.

Approximately 20% of the corpus was recorded by BBN using paid subjects recruited in the Boston area from May 2002 to September 2002. This portion of the corpus was the first to be collected. Subsequently, the remaining 80% was recorded by the American University of Beirut (AUB), under subcontract to BBN, from July 2002 to November 2002. AUB students and staff served as both experimenters and subjects. This portion of the corpus was recorded in Beirut, Lebanon, on the AUB campus.

The subjects in the corpus were responding to refugee/medical questions ("Where is your pain?", "How old are you?", etc.), and were playing the part of refugees. Subjects were each given a part to play that prescribed what information they were to give in response to the questions, but were told to express themselves naturally, in their own way, in Arabic. To avoid priming subjects to give their answers with a particular Arabic wording, the parts were given in English rather than Arabic. (All subjects were thus bilingual.)

The following is an example scenario:

You are Maraam Samiir Shamali. You were born on 8/7/1971 in Kuwait. You are now 31 years old.
Your mother Nabiila Habiib and your 5 brothers and sisters live in Amman. You weigh about 50 kilos, and your height is 150 centimeters. You have been living in Jabal Husein in Amman since 1980. You live in front of Frer School. As for education, you have a bachelors in education. You are a Christian. You work as a teacher in Amman in the Frer School. You make 200 dinars per month. You live with 4 people. You are single and you have no children.

Applications: speech translation, speech recognition, spoken dialog systems

Languages: Levantine colloquial Arabic

Special license: n/a

Grant number and funding agency: Sponsored by DARPA and Monitored by SPAWAR Systems Center under Contract No. N66001-99-D-8615

Copyright statement: Copyright BBNT Solutions LLC, 2003

Corpus Statistics:
  Number of subjects: 164
  Number of utterances: 75900
  Total audio size: 6.5 GB
  Number of hours: 45
  Total text size: 3.1 MB
  Vocabulary: 15K words
  Total words: 336K words

Corpus description:

The corpus takes the form of a set of files. For each utterance in the corpus there are two files, one for the audio, the other for the transcription. Audio files have the suffix "wav", while transcription files have the suffix "txt". The base of the file name encodes the date and time of the recording session, the subject ID number, and the 3-digit utterance number (utterance numbers start with '000'). The format of this utterance ID is:

  MM-DD-YYYY_HHMMSS_III_uttNNN

Thus, the files:

  09-09-2002_124530_266_utt002.wav
  09-09-2002_124530_266_utt002.txt

are, respectively, the audio and transcription of the third utterance of a session started at 12:45:30 PM on September 9, 2002, using subject number 266 (subject IDs start with '1', and are between 1 and 3 digits long, inclusive).
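The utterance ID format described above can be decoded mechanically. The following is a minimal sketch, not part of the distribution: the names `UtteranceId` and `parse_utterance_id` are hypothetical, and the pattern accepts either "-" or "_" between the date fields, on the assumption that both occur (the format above uses hyphens, while the IDs in the CD contents listing use underscores).

```python
import re
from datetime import datetime
from typing import NamedTuple


class UtteranceId(NamedTuple):
    """Fields encoded in a corpus file name (hypothetical helper type)."""
    session_start: datetime  # date and time the recording session started
    subject_id: int          # 1 to 3 digits in the file name
    utterance_index: int     # zero-based: utt000 is the first utterance


# MM-DD-YYYY_HHMMSS_III_uttNNN, with an optional .wav or .txt suffix.
# Accepting "_" as a date separator is an assumption, not documented behavior.
_ID_PATTERN = re.compile(
    r"(\d{2})[-_](\d{2})[-_](\d{4})_(\d{6})_(\d{1,3})_utt(\d{3})(?:\.(?:wav|txt))?"
)


def parse_utterance_id(name: str) -> UtteranceId:
    """Split a file name or bare utterance ID into its encoded fields."""
    match = _ID_PATTERN.fullmatch(name)
    if match is None:
        raise ValueError(f"not a recognized utterance ID: {name!r}")
    month, day, year, hhmmss, subject, utt = match.groups()
    # strptime also rejects impossible dates and times with ValueError.
    session_start = datetime.strptime(year + month + day + hhmmss, "%Y%m%d%H%M%S")
    return UtteranceId(session_start, int(subject), int(utt))
```

For the example files above, `parse_utterance_id("09-09-2002_124530_266_utt002.wav")` gives a session start of 2002-09-09 12:45:30, subject ID 266, and utterance index 2, that is, the third utterance of the session.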
Data Type: Speech
In directory: /data/audio
Number of files: 75900 Levantine Arabic audio files zipped using WinZip (www.winzip.com)
Named with a unique ID based on the date it was collected
Total size: 6.5 GB
Number of hours: 45
Details:
  Format: MS WAV (signed PCM)
  Channel Count: 1
  Sampling rate: 16000 samples/sec
  Sample size: 16 bits/sample
Audio description: The audio was recorded in MS WAV, signed PCM. The sampling rate was 16 kHz, with 16-bit resolution.

Data Type: Text
In directory: /data/text
Number of files: 75900 transcription text files zipped using WinZip (www.winzip.com)
Named according to the audio file it corresponds to.
Total size: 3.1 MB
Vocabulary: 15K words
Total words: 336K words
Details:
  Format: UTF-8 Unicode Arabic text
Text description: All transcriptions are Unicode Arabic, encoded in UTF-8. They do not include the short-vowel diacritics of Arabic script, which are rarely written. As part of the work, we developed a set of transcription guidelines that specified how to spell certain colloquial-only words, and how to reconcile the spelling of differently-pronounced words with their MSA spellings. These guidelines are included as part of this distribution (in the document "BBN-Babylon-transcription-guidelines.pdf", see below under CD1 Contents).

Contents:

Note: Each CD contains the directories /data/audio and /data/text. CD1 also contains the /doc directory (see below for the list of the contents of this directory).

CD1: 5,000 utterances
  ID 05_31_2002_093700_2_utt000 through ID 06_25_2002_110200_19_utt011

CD1 also includes in its /doc directory:
- BBN-BABYLON-README.TXT : This readme file.
- BBN-Babylon-arabic-word-list.txt : A UTF-8 encoded text file of all of the unique Arabic words in the corpus.
- BBN-Babylon-subject-gender-list.txt : Table with subject IDs followed by the gender of the subject.
- BBN-Babylon-transcription-guidelines.pdf : Guidelines for transcription, as Levantine Arabic is not a written language.
  Written by Bushra Zawaydeh and John Makhoul.

CD2: 5,000 utterances
  ID 06_25_2002_110200_19_utt012 through ID 07_09_2002_131644_32_utt004

CD3: 10,000 utterances
  ID 07_09_2002_131644_32_utt005 through ID 07_24_2002_180632_206_utt009

CD4: 10,000 utterances
  ID 07_24_2002_180632_206_utt010 through ID 09_09_2002_152607_227_utt017

CD5: 5,000 utterances
  ID 09_09_2002_152607_227_utt018 through ID 09_12_2002_112358_239_utt010

CD6: 10,000 utterances
  ID 09_12_2002_112358_239_utt011 through ID 09_21_2002_131637_229_utt009

CD7: 5,000 utterances
  ID 09_21_2002_131637_229_utt010 through ID 09_25_2002_120453_252_utt003

CD8: 5,000 utterances
  ID 09_25_2002_120453_252_utt004 through ID 09_27_2002_151932_312_utt004

CD9: 5,000 utterances
  ID 09_27_2002_151932_312_utt005 through ID 10_02_2002_132753_311_utt018

CD10: 5,000 utterances
  ID 10_02_2002_132753_311_utt019 through ID 10_08_2002_152721_279_utt019

CD11: 5,000 utterances
  ID 10_08_2002_152942_279_utt000 through ID 10_15_2002_134909_309_utt013

CD12: 5,900 utterances
  ID 10_15_2002_134909_309_utt014 through ID 11_01_2002_202205_289_utt012

Quality Control: n/a

Suggested Price: TBD
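After unzipping a CD, the audio format stated under "Data Type: Speech" (1 channel, 16000 samples/sec, 16 bits/sample) can be spot-checked with Python's standard `wave` module. This is a minimal sketch and not a tool shipped with the corpus; the function name `check_wav_format` and the wording of its messages are hypothetical.

```python
import wave
from typing import List

# Values taken from the "Data Type: Speech" details in this readme.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE = 16000     # samples per second
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample


def check_wav_format(path: str) -> List[str]:
    """Return a list of mismatches between a WAV header and the readme.

    An empty list means the header matches. The wave module raises
    wave.Error if the file is not a WAV file it can read.
    """
    problems = []
    with wave.open(path, "rb") as wav:
        if wav.getnchannels() != EXPECTED_CHANNELS:
            problems.append(f"channels: {wav.getnchannels()}")
        if wav.getframerate() != EXPECTED_SAMPLE_RATE:
            problems.append(f"sampling rate: {wav.getframerate()}")
        if wav.getsampwidth() != EXPECTED_SAMPLE_WIDTH_BYTES:
            problems.append(f"sample width in bytes: {wav.getsampwidth()}")
    return problems
```

Running the function over every file in /data/audio and reporting any non-empty result is one way to confirm that a copy of the corpus was extracted intact.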