DISCLAIMER: This is a draft annotation of the corpus. We will continue our
            QC on the transcription and release the final data as soon as 
            the job is completed.

=========================================================================

		Kickoff Release 2 -- Levantine Arabic CTS Transcripts
		     (releaseID LDC2005E77) 

Authors: Mohamed Maamouri (Project head), Tim Buckwalter, Dave Graff, Hubert Jin


LEVANTINE YELLOW ANNOTATION (careful transcription)

The following are definitions of the two AMADAT transcription layers:

1. GREEN TRANSCRIPTION:  this layer aims at anchoring dialectal forms to
similar/related MSA orthographic-based forms/utterances whenever possible.
This transcription covers the ORTHOGRAPHIC LEVEL.

2. YELLOW TRANSCRIPTION:  this layer starts from the Green layer and 
enriches it by providing the following features:  

	(a) 3 main short vowels and the 2 allophones [e,o];
	(b) consonantal (phonological/sociolinguistic) variants; 
	(c) gemination (consonantal length); nunation (indefiniteness
	    marker); other consonantal changes including morphophonemic 
	    assimilation.  

This more careful transcription covers the SURFACE PHONEMIC LEVEL. The 
exact specifications of this transcription layer will be defined by the 
sponsors prior to the project start date.

TRANSCRIPTION OF YELLOW LAYER 

As soon as the GREEN layer is started, we propose to get annotators 
working on a parallel, more careful annotation (YELLOW), which includes 
the following features:

·	Short vowels [i,a,u] 
·	Consonantal (phonological/sociolinguistic) variants
·	Gemination (consonantal length) 
·	Nunation (indefiniteness marker) 
·	Other consonantal changes including morphophonemic assimilation  

DATA RELEASE

GREEN transcriptions of telephone conversations in Levantine Arabic 
have been released by LDC since 2004:

    * Levantine Arabic QT Training Data Set 1 Transcripts (LDC2004E22)
    * Levantine Arabic QT Training Data Set 2 Transcripts (LDC2004E66)
    * Levantine Arabic QT Training Data Set 3 Transcripts (LDC2005T03)
    * Levantine Arabic QT Training Data Set 4 Transcripts (LDC2005S14)

For this corpus, 282 conversations were selected from Training Data Sets
1-3, and the YELLOW annotation of those conversations was done by 
experienced annotators at LDC. The corresponding speech for the 282 
conversations totals 45 hours.

This release is self-contained, meaning that GREEN transcription, YELLOW 
annotation and audio are all included in the package.

DIRECTORY STRUCTURE

This directory (docs) contains the following documentation:

   1) filelist

        A list of conversation IDs, each with the prefix 'fsa_'.

   2) wordlist.utf8.txt

	Union of the mapping lists from the previous GREEN releases.

   3) speaker_info.txt

	Speaker information on origin, gender, age (group), etc., as judged
        by the annotators who transcribed the conversations.

In the data directory:

   1) annotation - 282 annotation files in UTF-8 and TBW format.

	(a) Line format of the files:

	    start SPACE end SPACE channel: TAB green TAB yellow
		
                  where SPACE is " " and TAB is "\t"
			green is in UTF-8		
			yellow is in TBW transliteration

	(b) metadata annotation:

	    tokens starting with "%" are meta-language

   2) audio - the audio (in SPHERE format) for the 282 conversations.
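
The annotation line format described above can be read with a short
script. The following Python sketch is illustrative only and is not part
of the release: the names Segment, parse_line and meta_tokens are
hypothetical, the sample line is invented, and "%noise" is a made-up
meta-language token, not one taken from the corpus. The start and end
fields are kept as strings because their exact numeric format is not
specified here.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start: str    # segment start, kept as text (numeric format is an assumption)
    end: str      # segment end, kept as text
    channel: str  # channel label with the trailing ":" removed
    green: str    # GREEN layer (UTF-8 text)
    yellow: str   # YELLOW layer (TBW transliteration)


def parse_line(line: str) -> Segment:
    """Split one line of the form: start SPACE end SPACE channel: TAB green TAB yellow."""
    # Exactly three TAB-separated fields are expected; anything else raises ValueError.
    header, green, yellow = line.rstrip("\r\n").split("\t")
    start, end, channel = header.split(" ")
    if not channel.endswith(":"):
        raise ValueError("channel field does not end with ':': %r" % channel)
    return Segment(start, end, channel[:-1], green, yellow)


def meta_tokens(text: str) -> List[str]:
    """Return the whitespace-separated tokens that start with '%' (meta-language)."""
    return [tok for tok in text.split() if tok.startswith("%")]


if __name__ == "__main__":
    # Invented sample line, for illustration only.
    sample = "1.50 3.25 A:\tgreen text %noise\tyellow text\n"
    seg = parse_line(sample)
    print(seg.channel, seg.start, seg.end)
    print(meta_tokens(seg.green))
```

If the time fields turn out to be plain decimal numbers, they can be
converted with float(seg.start) and float(seg.end) after parsing.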

For more information, please contact

    Mohamed Maamouri     maamouri@ldc.upenn.edu
    Tim Buckwalter       timbuck2@ldc.upenn.edu
    Hubert Jin           hubertj@ldc.upenn.edu