Publication Title: BBN/AUB DARPA Babylon Levantine Arabic Corpus

Authors:  
BBN Technologies (with American University of Beirut as subcontractor)
John Makhoul, Bushra Zawaydeh, Frederick Choi, David Stallard
Primary contact: Dave Stallard (stallard@bbn.com, 617.873.2825)

Project: DARPA Babylon

Project Background:

This is a corpus of transcribed, spontaneous speech, recorded from
subjects speaking in Levantine colloquial Arabic.  Levantine Arabic is
the dialect of Arabic spoken by ordinary people in Lebanon, Jordan,
Syria, and Palestine.  It differs significantly from Modern Standard
Arabic (MSA), the written and "official" form of Arabic, in that it is
a spoken rather than a written language.  It has different word
pronunciations, and even different words, from MSA.

This corpus was developed with funding from the Defense Advanced
Research Projects Agency (DARPA), as part of the Babylon program.  The
Babylon program is intended to advance the state of the art in
speech-to-speech translation systems, both by creating new technology
and by developing systems for field use.  More information on the
Babylon program may be found at http://darpa-babylon.mitre.org.  BBN
was funded under Babylon to develop a limited English/Arabic
refugee/medical speech translation system for a handheld computer, and
collected this corpus as part of its work.  The corpus should be useful
to anyone working on speech recognition in Levantine colloquial
Arabic, including for speech translation and spoken dialog
systems.  At the time of this writing, this corpus is, as far as we
know, the only publicly available transcribed corpus of Levantine
Arabic.

Data type: speech, text

Data sources: microphone

Collection Procedure:

The corpus was recorded using a close-talking, noise-cancelling,
headset microphone (the Andrea Electronics NC-65).  A Java-based
data-collection tool, developed by BBN, was used to do the collection.
This tool allowed the experimenter to select a particular scenario,
and then step through the questions in it.  To ask a question, the
experimenter would click the "ask" button, and the tool would play a
prerecorded Arabic prompt, corresponding to the Arabic translation of
the question.  Upon completion of the prompt, the tool would go into
listening mode.  The subject would speak his reply, which the tool
would record.  When the subject was finished, the experimenter would
click a "stop" button on the GUI, and then go on to the next question.
Thus, end-pointing of the speech was done manually rather than
automatically.  As an additional feature, the tool indicated the volume
of the recording as "low", "normal", or "too high".  The experimenter
could then tell the subject to speak either louder or softer, and
rerecord his response.

Approximately 20% of the corpus was recorded by BBN using paid
subjects recruited in the Boston area from May 2002 to September 2002.
This portion of the corpus was the first to be collected.
Subsequently, the remaining 80% was recorded by the American
University of Beirut (AUB), under subcontract to BBN, from July 2002
to November 2002.  AUB students and staff served as both experimenters
and subjects.  This portion of the corpus was recorded in Beirut,
Lebanon, on the AUB campus.

The subjects in the corpus were responding to refugee/medical
questions ("Where is your pain?", "How old are you?", etc.), and were
playing the part of refugees.  Subjects were each given a part to play
that prescribed what information they were to give in response to the
questions, but were told to express themselves naturally, in their own
way, in Arabic.  To avoid priming subjects to give their answer with a
particular Arabic wording, the parts were given in English rather than
Arabic.  (All subjects were thus bilingual.)  The following is an
example scenario:

  You are Maraam Samiir Shamali.  You were born on 8/7/1971 in
  Kuwait. You are now 31 years old.  Your mother Nabiila Habiib and
  your 5 brothers and sisters live in Amman. You weigh about 50 kilos,
  and your height is 150 centimeters. You have been living in Jabal
  Husein in Amman since 1980.  You live in front of Frer School. As
  for education, you have a bachelors in education. You are a
  Christian. You work as a teacher in Amman in the Frer School. You
  make 200 dinars per month. You live with 4 people. You are single
  and you have no children.

Applications: speech translation, speech recognition, spoken dialog systems

Languages: Levantine colloquial Arabic

Special license: n/a

Grant number and funding agency: Sponsored by DARPA and Monitored by
SPAWAR Systems Center under Contract No. N66001-99-D-8615

Copyright statement: Copyright BBNT Solutions LLC, 2003

Corpus Statistics: 

Number of subjects: 164
Number of utterances: 75900
Total audio size: 6.5 GB
Number of hours: 45
Total text size: 3.1 MB
Vocabulary: 15K words
Total words: 336K words

Corpus description:

The corpus takes the form of a set of files.  For each utterance in
the corpus there are two files, one for the audio, the other for the
transcription.  Audio files have the suffix "wav", while transcription
files have the suffix "txt".  The base of the file name encodes the
date and time of the recording session, the subject ID number, and the
3-digit utterance number (utterance numbers start with '000').  The
format of this utterance ID is:

MM-DD-YYYY_HHMMSS_III_uttNNN

Thus, the files:

09-09-2002_124530_266_utt002.wav
09-09-2002_124530_266_utt002.txt

are, respectively, the audio and transcription of the third utterance
of a session started at 12:45:30 PM on September 9, 2002, using
subject number 266 (subject IDs start with '1', and are between 1 and 3
digits long, inclusive).
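
As an illustration of the naming scheme above, the following Python
sketch splits a file name into its fields.  It is not part of the
distribution: the names UtteranceId and parse_utterance_id are
hypothetical, and the pattern assumes that the date fields may be
separated by either hyphens (as in the format above) or underscores
(as in the per-CD ID listing later in this file).

```python
import re
from datetime import datetime
from typing import NamedTuple


class UtteranceId(NamedTuple):
    """Fields encoded in a corpus file name (hypothetical helper type)."""
    session_start: datetime
    subject_id: int
    utterance_number: int  # zero-based: utt000 is the first utterance


# MM-DD-YYYY_HHMMSS_III_uttNNN, optionally followed by .wav or .txt.
# Assumption: the date separator may be "-" or "_".
_ID_PATTERN = re.compile(
    r"^(\d{2})[-_](\d{2})[-_](\d{4})_(\d{6})_(\d{1,3})_utt(\d{3})"
    r"(?:\.(?:wav|txt))?$"
)


def parse_utterance_id(name: str) -> UtteranceId:
    """Parse a file name or bare utterance ID into its components."""
    match = _ID_PATTERN.match(name)
    if match is None:
        raise ValueError(f"not a recognized utterance ID: {name!r}")
    month, day, year, hhmmss, subject, utterance = match.groups()
    start = datetime.strptime(f"{year}{month}{day}{hhmmss}", "%Y%m%d%H%M%S")
    return UtteranceId(start, int(subject), int(utterance))
```

For the example files above, parse_utterance_id would yield a session
start of 2002-09-09 12:45:30, subject 266, and utterance number 2
(that is, the third utterance).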

Data Type: Speech

In directory: /data/audio
Number of files: 75900 Levantine Arabic audio files zipped using WinZip
(www.winzip.com)
Named with a unique ID based on the date it was collected
Total size: 6.5 GB
Number of hours: 45

Details:
Format: MS WAV (signed PCM)
Channel Count: 1
Sampling rate: 16000 samples/sec
Sample size: 16 bits/sample

Audio description:
The audio was recorded in MS WAV, signed PCM.  The sampling rate was
16 kHz, with 16-bit resolution.
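
A user who wants to confirm that an unzipped audio file matches the
format described above could use a check along these lines.  This is
a minimal sketch using Python's standard "wave" module; the function
name check_wav_format and the EXPECTED table are hypothetical and are
not part of the distribution.

```python
import wave

# Format stated in this README: mono, 16-bit samples, 16000 samples/sec.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def check_wav_format(path):
    """Return (duration_in_seconds, mismatches) for one WAV file.

    mismatches maps a property name to (expected, actual) and is empty
    when the file matches the documented format.
    """
    with wave.open(path, "rb") as wav:
        actual = {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }
        duration = wav.getnframes() / wav.getframerate()
    mismatches = {
        key: (EXPECTED[key], actual[key])
        for key in EXPECTED
        if actual[key] != EXPECTED[key]
    }
    return duration, mismatches
```

Summing the returned durations over all files is one way to compare
against the total number of hours reported above.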

Data Type: Text

In directory: /data/text
Number of files: 75900 transcription text files zipped using WinZip
(www.winzip.com)
Named according to the audio file it corresponds to.
Total size: 3.1 MB
Vocabulary: 15K words
Total words: 336K words

Details:
Format: UTF-8 Unicode Arabic text

Text description:
All transcriptions are Unicode Arabic, encoded in UTF-8.  They do not
include the short-vowel diacritics of Arabic, which are rarely
written.  As part of the work, we developed a set of transcription
guidelines that specified how to spell certain colloquial-only words,
and how to reconcile spelling of differently-pronounced words with
their MSA spellings.  These guidelines are included as part of this
distribution (in the document "BBN-Babylon-transcription-guidelines.pdf", 
see below under CD1 Contents).
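
The transcription files can be read with any UTF-8 aware tool.  The
Python sketch below counts word occurrences across a directory of
unzipped transcription files.  It assumes that words are separated by
whitespace, which this README does not state, so the counts may not
match the vocabulary and word totals given above exactly.  The
function name count_words is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_words(text_dir):
    """Count whitespace-separated tokens in every .txt file in text_dir."""
    counts = Counter()
    for path in sorted(Path(text_dir).glob("*.txt")):
        # Transcriptions are UTF-8 encoded Arabic text.
        counts.update(path.read_text(encoding="utf-8").split())
    return counts
```

With counts = count_words("data/text"), len(counts) gives the number
of distinct tokens and sum(counts.values()) the total number of
tokens.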

Contents:

Note: Each CD contains the directories /data/audio and /data/text. CD1
also contains the /doc directory (see below for the list of the
contents of this directory).

CD1: 5,000 utterances
ID 05_31_2002_093700_2_utt000 through 
ID 06_25_2002_110200_19_utt011

CD1 also includes in its /doc directory:
- BBN-BABYLON-README.TXT : This readme file.
- BBN-Babylon-arabic-word-list.txt : A UTF-8 encoded text file of all
of the unique Arabic words in the corpus.
- BBN-Babylon-subject-gender-list.txt : Table with subject IDs
followed by the gender of the subject.
- BBN-Babylon-transcription-guidelines.pdf : Guidelines for
transcription, needed because Levantine Arabic is not a written
language.  Written by Bushra Zawaydeh and John Makhoul.

CD2: 5,000 utterances 
ID 06_25_2002_110200_19_utt012 through 
ID 07_09_2002_131644_32_utt004

CD3: 10,000 utterances
ID 07_09_2002_131644_32_utt005 through 
ID 07_24_2002_180632_206_utt009

CD4: 10,000 utterances
ID 07_24_2002_180632_206_utt010 through
ID 09_09_2002_152607_227_utt017

CD5: 5,000 utterances
ID 09_09_2002_152607_227_utt018 through
ID 09_12_2002_112358_239_utt010

CD6: 10,000 utterances
ID 09_12_2002_112358_239_utt011 through
ID 09_21_2002_131637_229_utt009

CD7: 5,000 utterances
ID 09_21_2002_131637_229_utt010 through
ID 09_25_2002_120453_252_utt003

CD8: 5,000 utterances
ID 09_25_2002_120453_252_utt004 through
ID 09_27_2002_151932_312_utt004

CD9: 5,000 utterances
ID 09_27_2002_151932_312_utt005 through
ID 10_02_2002_132753_311_utt018

CD10: 5,000 utterances
ID 10_02_2002_132753_311_utt019 through
ID 10_08_2002_152721_279_utt019

CD11: 5,000 utterances
ID 10_08_2002_152942_279_utt000 through
ID 10_15_2002_134909_309_utt013

CD12: 5,900 utterances
ID 10_15_2002_134909_309_utt014 through
ID 11_01_2002_202205_289_utt012
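
To find which CD holds a given utterance, the ID ranges above can be
searched programmatically.  The Python sketch below is illustrative
only.  It assumes that utterances are ordered across CDs by session
date and time and then by utterance number, which the ranges above
suggest but this README does not state, and that no two sessions share
a start time.  The names CD_LAST_IDS, sort_key, and cd_for are
hypothetical.

```python
import bisect
import re

# First ID on CD1 and last ID on each CD, copied from the listing above.
FIRST_ID = "05_31_2002_093700_2_utt000"
CD_LAST_IDS = [
    "06_25_2002_110200_19_utt011",   # CD1
    "07_09_2002_131644_32_utt004",   # CD2
    "07_24_2002_180632_206_utt009",  # CD3
    "09_09_2002_152607_227_utt017",  # CD4
    "09_12_2002_112358_239_utt010",  # CD5
    "09_21_2002_131637_229_utt009",  # CD6
    "09_25_2002_120453_252_utt003",  # CD7
    "09_27_2002_151932_312_utt004",  # CD8
    "10_02_2002_132753_311_utt018",  # CD9
    "10_08_2002_152721_279_utt019",  # CD10
    "10_15_2002_134909_309_utt013",  # CD11
    "11_01_2002_202205_289_utt012",  # CD12
]

_ID = re.compile(r"^(\d{2})[-_](\d{2})[-_](\d{4})_(\d{6})_(\d{1,3})_utt(\d{3})")


def sort_key(utt_id):
    """Chronological key: (year, month, day, HHMMSS, utterance number)."""
    match = _ID.match(utt_id)
    if match is None:
        raise ValueError(f"not a recognized utterance ID: {utt_id!r}")
    month, day, year, hhmmss, _subject, utterance = match.groups()
    # Zero-padded digit strings compare correctly as strings.
    return (year, month, day, hhmmss, int(utterance))


_BOUNDARIES = [sort_key(last_id) for last_id in CD_LAST_IDS]


def cd_for(utt_id):
    """Return the CD number (1-12) whose ID range contains utt_id."""
    key = sort_key(utt_id)
    if key < sort_key(FIRST_ID):
        raise ValueError(f"{utt_id!r} precedes the first ID on CD1")
    # Index of the first CD whose last ID is not earlier than this one.
    index = bisect.bisect_left(_BOUNDARIES, key)
    if index == len(_BOUNDARIES):
        raise ValueError(f"{utt_id!r} follows the last ID on CD12")
    return index + 1
```

Because the function only compares against the listed boundaries, it
reports where an ID would fall in the ordering; it does not confirm
that the file actually exists on that CD.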

Quality Control: n/a

Suggested Price: TBD