Publication Title: BBN/AUB DARPA Babylon Levantine Arabic Corpus

Authors: BBN Technologies (with American University of Beirut as subcontractor)
John Makhoul, Bushra Zawaydeh, Frederick Choi, David Stallard

Primary contact: Dave Stallard (stallard@bbn.com, 617.873.2825)

Project: DARPA Babylon Project

Background:

This is a corpus of transcribed, spontaneous speech, recorded from subjects speaking in Levantine colloquial Arabic. Levantine Arabic is the dialect of Arabic spoken by ordinary people in Lebanon, Jordan, Syria, and Palestine. It differs significantly from Modern Standard Arabic (MSA), the written and "official" form of Arabic, in that it is a spoken rather than a written language. It has different word pronunciations, and even different words, from MSA.

This corpus was developed with funding from the Defense Advanced Research Projects Agency (DARPA), as part of the Babylon program. The Babylon program is intended to advance the state of the art in speech-to-speech translation systems, both by creating new technology and by developing systems for field use. More information on the Babylon program may be found at http://darpa-babylon.mitre.org.

BBN was funded under Babylon to develop a limited English/Arabic refugee/medical speech translation system for a handheld computer, and collected this corpus as part of its work. The corpus should be useful to anyone attempting speech recognition in Levantine colloquial Arabic, including for speech translation and spoken dialog systems. At the time of this writing this corpus is, as far as we know, the only publicly available transcribed corpus of Levantine Arabic.

Data type: speech, text

Data sources: microphone

Collection Procedure:

The corpus was recorded using a close-talking, noise-cancelling headset microphone (the Andrea Electronics NC-65). A Java-based data-collection tool, developed by BBN, was used to do the collection.
This tool allowed the experimenter to select a particular scenario, and then step through the questions in it. To ask a question, the experimenter would click the "ask" button, and the tool would play out a prerecorded Arabic prompt, corresponding to the Arabic translation of the question. Upon completion of the prompt, the tool would go into listening mode. The subject would speak his reply, which the tool would record. When the subject was finished, the experimenter would click a "stop" button on the GUI, and then go on to the next question. Thus, end-pointing of the speech was done manually rather than automatically.

As an additional feature, the tool indicated the volume of the recording as "low", "normal", or "too high". The experimenter could then tell the subject to speak either louder or softer, and rerecord his response.

Approximately 20% of the corpus was recorded by BBN using paid subjects recruited in the Boston area from May 2002 to September 2002. This portion of the corpus was the first to be collected. Subsequently, the remaining 80% was recorded by the American University of Beirut (AUB), under subcontract to BBN, from July 2002 to November 2002. AUB students and staff served as both experimenters and subjects. This portion of the corpus was recorded in Beirut, Lebanon, on the AUB campus.

The subjects in the corpus were responding to refugee/medical questions ("Where is your pain?", "How old are you?", etc.), and were playing the part of refugees. Each subject was given a part to play that prescribed what information to give in response to the questions, but was told to express that information naturally, in his or her own way, in Arabic. To avoid priming subjects to give their answers with a particular Arabic wording, the parts were given in English rather than Arabic. (All subjects were thus bilingual.) The following is an example scenario:

You are Maraam Samiir Shamali. You were born on 8/7/1971 in Kuwait. You are now 31 years old.
Your mother Nabiila Habiib and your 5 brothers and sisters live in Amman. You weigh about 50 kilos, and your height is 150 centimeters. You have been living in Jabal Husein in Amman since 1980. You live in front of Frer School. As for education, you have a bachelor's in education. You are a Christian. You work as a teacher in Amman in the Frer School. You make 200 dinars per month. You live with 4 people. You are single and you have no children.

Applications: speech translation, speech recognition, spoken dialog systems

Languages: Levantine colloquial Arabic

Special license: n/a

Grant number and funding agency: Sponsored by DARPA and monitored by SPAWAR Systems Center under Contract No. N66001-99-D-8615

Copyright statement: Copyright BBNT Solutions LLC, 2003, portions Copyright 2004 Trustees of the University of Pennsylvania

Corpus Statistics:
  Number of subjects: 164
  Number of utterances: 75900
  Total audio size: 6.5 GB
  Number of hours: 45
  Total text size: 3.1 MB
  Vocabulary: 15K words
  Total words: 336K words

Corpus description:

The corpus takes the form of a set of files. For each utterance in the corpus there are two files, one for the audio and another for the transcription. Audio files have the suffix "wav", while transcription files have the suffix "txt". The base of the file name encodes the subject ID number, the date and time of the recording session, and the 3-digit utterance number (utterance numbers start with '000'). The format of this utterance ID is:

  III_YYYYMMDD_HHMMSS_NNN

Thus, the files:

  266_20020909_124530_002.wav
  266_20020909_124530_002.txt

are, respectively, the audio and text transcription files for the third utterance of a session started at 12:45:30 PM on September 9, 2002, using subject number 266 (subject IDs start with '1', and are between 1 and 3 digits long, inclusive). The files are organized into directories by subject number.

Data Type: Speech
In directory: /data/audio
Number of files: 75900 Levantine Arabic audio files.
Each file is named with its unique utterance ID, which includes the date and time of the recording session.
Total size: 6.5 GB
Number of hours: 45
Details:
  Format: MS WAV (signed PCM)
  Channel Count: 1
  Sampling rate: 16000 samples/sec
  Bit depth: 16 bits/sample
Audio description: The audio was recorded in MS WAV, signed PCM. The sampling rate was 16 kHz, with 16-bit resolution.

Data Type: Text
In directory: /data/text
Number of files: 75900 transcription text files.
Each file is named according to the audio file it corresponds to.
Total size: 3.1 MB
Vocabulary: 15K words
Total words: 336K words
Details:
  Format: UTF-8 Unicode Arabic text
Text description: All transcriptions are Unicode Arabic, encoded in UTF-8. They do not include the short-vowel diacritics of Arabic, which are rarely written. As part of the work, we developed a set of transcription guidelines that specify how to spell certain colloquial-only words, and how to reconcile the spelling of differently-pronounced words with their MSA spellings. These guidelines are included as part of this distribution (in the document "BBN-Babylon-transcription-guidelines.pdf", see below under CD1 Contents).

Data Type: XML
In directory: /data/xml
Number of files: 1 file containing XML markup for all utterances.
The file is named transcription.xml.
Total size: 16 MB
Vocabulary: 15K words
Total words: 336K words
Details:
  Format: XML with UTF-8 Unicode Arabic text.
XML description: An XML version of the above text transcriptions. All transcriptions have been combined into a single file suitable for loading into a database. Each transcription contains an ID corresponding to its filename and a link to the audio file. Tags are XML versions of the annotations described in /docs/BBN-Babylon-transcription-guidelines.pdf.
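
Example (not part of the original distribution): the utterance ID format III_YYYYMMDD_HHMMSS_NNN and the audio format given under Data Type: Speech lend themselves to a quick programmatic check. The following is a minimal Python sketch using only the standard library. The names parse_utterance_id, UtteranceId, and has_documented_format are hypothetical helpers invented for illustration; no such tool ships with the corpus.

```python
import os
import re
import wave
from datetime import datetime
from typing import NamedTuple

# III_YYYYMMDD_HHMMSS_NNN: subject ID (1 to 3 digits), session date,
# session start time, and 3-digit utterance number.
_UTT_ID = re.compile(r"(\d{1,3})_(\d{8})_(\d{6})_(\d{3})")


class UtteranceId(NamedTuple):
    subject: int             # subject ID number
    session_start: datetime  # date and time the recording session started
    utterance: int           # zero-based, as in the file name ('000' is the first)


def parse_utterance_id(name: str) -> UtteranceId:
    """Parse a bare utterance ID, or a .wav/.txt file name or path."""
    base = os.path.basename(name)
    for suffix in (".wav", ".txt"):
        if base.endswith(suffix):
            base = base[: -len(suffix)]
    match = _UTT_ID.fullmatch(base)
    if match is None:
        raise ValueError(f"not a corpus utterance ID: {name!r}")
    subject, date, time, utterance = match.groups()
    # strptime also rejects impossible dates and times (e.g. month 13).
    started = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return UtteranceId(int(subject), started, int(utterance))


def has_documented_format(wav_path: str) -> bool:
    """True if the WAV header says mono, 16000 samples/sec, 16 bits/sample."""
    with wave.open(wav_path, "rb") as wav:
        observed = (wav.getnchannels(), wav.getframerate(), wav.getsampwidth())
    return observed == (1, 16000, 2)  # sample width is reported in bytes
```

For the example file above, parse_utterance_id("266_20020909_124530_002.wav") yields subject 266, a session start of 2002-09-09 12:45:30, and utterance number 2 (the third utterance, since numbering starts at 000). The sketch works on file names only, so it makes no assumption about the exact directory nesting under /data/audio and /data/text.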