TITLE: Levantine Arabic QT Training Data Set 5 (Speech + Transcripts) [LDC2006T07]

Authors: Mohamed Maamouri (Project head), Tim Buckwalter, Dave Graff, Hubert Jin

This release contains 1660 calls, totaling approximately 250 hours of
telephone conversation in Levantine Arabic.

In this directory (docs), we include the following documents:

1) filelist
   A list of conversation IDs with the prefix 'fsa_'.

2) wordlist.utf8.txt
   Wordlist mapping table.

3) speaker_info.txt
   Speaker information on origin, gender, age (group), etc., as judged by
   the annotators who transcribed the conversations.

4) fla_pindata.tbl
   Speaker PINs and phone numbers for each call in the corpus. This
   information comes from speaker registration. It is useful for grouping
   speech from the same speaker, but the person who spoke in a call was
   not always the one who registered.

This corpus is the combination of 4 former training data sets:
LDC2004E22, LDC2004E66, LDC2005T03 and LDC2005S14 (Speech and Transcripts).

More than half of the speakers are Lebanese. The breakdown of the
dialects is as follows:

     559  JOR
    1853  LEB
     355  PAL
      67  SYR
     484  Lev (general category of Levantine Arabic)

Directory structure

annotation - 1660 transcription files in UTF-8 format.
audio      - 1660 audio files in SPHERE format.
docs       - documentation.

Note: sph2pipe can be used to convert the SPHERE files to WAV files.
For more information, please refer to http://www.ldc.upenn.edu/Using.

For more information, please contact

Mohamed Maamouri  maamouri@ldc.upenn.edu
Tim Buckwalter    timbuck2@ldc.upenn.edu
Dave Graff        graff@ldc.upenn.edu
Hubert Jin        hubertj@ldc.upenn.edu
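
The sph2pipe note above can be illustrated with a small batch-conversion script. This is a minimal sketch and not part of the release. It assumes that sph2pipe is installed and on the PATH, that the files in the audio directory carry a .sph extension (the release notes do not state the extension), and that the output directory is chosen by the user. The helper names sph2pipe_command and convert_directory are hypothetical.

```python
from pathlib import Path
import subprocess


def sph2pipe_command(sph_path, wav_path):
    """Build the argument list for converting one SPHERE file to WAV."""
    # "-f wav" asks sph2pipe to write a RIFF/WAV header on the output.
    return ["sph2pipe", "-f", "wav", str(sph_path), str(wav_path)]


def convert_directory(audio_dir, out_dir, suffix=".sph", dry_run=True):
    """Collect (and optionally run) one sph2pipe command per SPHERE file.

    The ".sph" suffix is an assumption; adjust it to match the release.
    With dry_run=True nothing is executed, so the commands can be
    inspected before any files are written.
    """
    audio_dir, out_dir = Path(audio_dir), Path(out_dir)
    commands = []
    for sph in sorted(audio_dir.glob("*" + suffix)):
        cmd = sph2pipe_command(sph, out_dir / (sph.stem + ".wav"))
        commands.append(cmd)
        if not dry_run:
            out_dir.mkdir(parents=True, exist_ok=True)
            subprocess.run(cmd, check=True)  # raises if sph2pipe fails
    return commands
```

Calling convert_directory("audio", "wav") returns the list of commands without running anything; passing dry_run=False performs the conversion.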