BBN/AUB DARPA Babylon Levantine Arabic Speech and Transcripts was produced by the Linguistic Data Consortium (LDC), catalog number LDC2005S08, ISBN 1-58563-296-1.
This corpus consists of transcribed, spontaneous speech recorded from subjects speaking Levantine colloquial Arabic. Levantine Arabic is the dialect of Arabic spoken by ordinary people in Lebanon, Jordan, Syria, and Palestine. It differs significantly from Modern Standard Arabic (MSA), the written and "official" form of Arabic, in that it is a spoken rather than a written language. It includes different word pronunciations, and even different words, from MSA.
The corpus was developed with funding from the Defense Advanced Research Projects Agency (DARPA) as part of the Babylon program. The Babylon program is intended to advance the state of the art in speech-to-speech translation systems, both by creating new technology and by developing systems for field use. More information on the Babylon program may be found at this site. BBN was funded under Babylon to develop a limited English/Arabic refugee/medical speech translation system for a handheld computer, and collected this corpus as part of that work. The corpus should be useful to anyone working on speech recognition for Levantine colloquial Arabic, including for speech translation and spoken dialog systems.
An audio sample and its transcription are provided as an example of this corpus.
Portions © 2003 BBNT Solutions LLC, © 2004 Trustees of the University of Pennsylvania