DISCLAIMER: This is a draft annotation of the corpus. We will continue
our QC on the transcription and release the final data as soon as the
job is completed.

=========================================================================

Kickoff Release 2 -- Levantine Arabic CTS Transcripts
(releaseID LDC2005E77)

Authors: Mohamed Maamouri (Project head), Tim Buckwalter, Dave Graff,
         Hubert Jin

LEVANTINE YELLOW ANNOTATION (careful transcription)

The following are definitions of the two AMADAT transcription layers:

1. GREEN TRANSCRIPTION: this layer aims at anchoring dialectal forms to
   similar/related MSA orthographic-based forms/utterances whenever
   possible. This transcription covers the ORTHOGRAPHIC LEVEL.

2. YELLOW TRANSCRIPTION: this layer starts from the Green layer and
   enriches it by providing the following features:
   (a) 3 main short vowels and the 2 allophones [e,o];
   (b) consonantal (phonological/sociolinguistic) variants;
   (c) gemination (consonantal length); nunation (indefiniteness
       marker); other consonantal changes including morphophonemic
       assimilation.
   This more careful transcription covers the SURFACE PHONEMIC LEVEL.
   The exact specifications of this transcription layer will be defined
   by the sponsors prior to project start date.

TRANSCRIPTION OF YELLOW LAYER

As soon as the GREEN layer is started, we propose to get annotators
working on a parallel, more careful annotation (YELLOW), which includes
the following features:

· Short vowels [i,a,u]
· Consonantal (phonological/sociolinguistic) variants
· Gemination (consonantal length)
· Nunation (indefiniteness marker)
· Other consonantal changes including morphophonemic assimilation

DATA RELEASE

GREEN transcriptions of telephone conversations in Levantine Arabic have
been released by LDC since 2004:

* Levantine Arabic QT Training Data Set 1 Transcripts (LDC2004E22)
* Levantine Arabic QT Training Data Set 2 Transcripts (LDC2004E66)
* Levantine Arabic QT Training Data Set 3 Transcripts (LDC2005T03)
* Levantine Arabic QT Training Data Set 4 Transcripts (LDC2005S14)

In this corpus, 282 conversations were selected from Training Data
Sets 1-3, and the YELLOW annotation was done on those conversations by
experienced annotators at LDC. The corresponding speech for the 282
conversations totals 45 hours. This release is self-contained: GREEN
transcription, YELLOW annotation, and audio are all included in the
package.

DIRECTORY STRUCTURE

In this directory (docs), we include the following documentation:

1) filelist
   A list of conversation IDs with the prefix 'fsa_'.

2) wordlist.utf8.txt
   Union of the mapping lists in the previous GREEN releases.

3) speaker_info.txt
   Speaker information on origin, gender, age (group), etc., as judged
   by the annotators who transcribed the conversations.

In the data directory:

1) annotation - 282 annotation files in UTF-8 and TBW format.

   (a) Line format of the files:

         start SPACE end SPACE channel: TAB green TAB yellow

       where SPACE is " " and TAB is "\t";
       green is in UTF-8;
       yellow is in TBW transliteration.

   (b) Metadata annotation: tokens starting with a "%" are
       meta-language.

2) audio - the audio (in sphere format) for the 282 conversations.
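
The annotation line format described above can be read with a few lines
of Python. The following is only a minimal sketch, not a tool shipped
with this release. It assumes that start and end are decimal numbers and
that the channel field is a short label followed by a colon; neither
detail is specified in this document. The names Segment, parse_line and
meta_tokens are hypothetical, and the sample line in the usage note is
made up for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    """One annotation line (hypothetical container, not part of the release)."""
    start: float   # assumed to be a decimal number
    end: float     # assumed to be a decimal number
    channel: str   # channel label with the trailing ":" removed
    green: str     # GREEN layer, UTF-8 text
    yellow: str    # YELLOW layer, TBW transliteration


def parse_line(line: str) -> Segment:
    """Split 'start SPACE end SPACE channel: TAB green TAB yellow'."""
    # The three main fields are separated by TAB characters.
    header, green, yellow = line.rstrip("\n").split("\t", 2)
    # The header itself is separated by single spaces.
    start, end, channel = header.split(" ", 2)
    return Segment(float(start), float(end), channel.rstrip(":"), green, yellow)


def meta_tokens(text: str) -> List[str]:
    """Return the whitespace-separated tokens that start with '%'."""
    return [tok for tok in text.split() if tok.startswith("%")]
```

For example, parse_line("1.50 3.25 A:\tgreen text\tyellow %meta text")
returns a Segment whose channel is "A", and calling meta_tokens on its
yellow field returns ["%meta"]. Files should be opened with
encoding="utf-8" so that the GREEN field is decoded correctly.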

For more information, please contact:

Mohamed Maamouri   maamouri@ldc.upenn.edu
Tim Buckwalter     timbuck2@ldc.upenn.edu
Hubert Jin         hubertj@ldc.upenn.edu