GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Training Part II				     
 	
       Authors: Xuansong Li, Stephen Grimes, Safa Ismael, Stephanie Strassel
                            Mohamed Maamouri, Ann Bies
  	      
		            Linguistic Data Consortium


1. Introduction

 This file contains documentation for the GALE Arabic-English Parallel Aligned 
 Treebank Broadcast News Training Part 2 release. Data were sourced from Arabic 
 broadcast news sources and translated to English. Arabic and English 
 Treebank annotations were performed independently, and the parallel texts 
 were then word aligned to create this release. These data match Arabic 
 treebanked data appearing in parts of ATB7, ATB8, and ATB10 and in the EATB 
 releases.

2. Source Data Profile

2.1 Data Selection 

 During data selection, files with mismatched source and translation 
 segments were excluded, as were badly formatted files and files whose 
 style was atypical of newswire or broadcast news.

2.2 Data Source

 2007-2008 Abu Dhabi TV, 2008 Al Baghdadya TV, 2008 Al Fayha, 2008 Al 
 Iraqiyah, 2007 Aljazeera, 2007 Al Ordiniyah, 2008 Al Sharqiya, 2008 Dubai 
 TV, 2008 Oman TV, 2008 Saudi TV

2.3 Annotation Data Profile

 Language  Genre  Files  Words   Tokens  Segments
 ------------------------------------------------
 Arabic    BN     31     110690  141058  7102

Note: Word count is based on untokenized Arabic source; token count is
based on ATB-tokenized Arabic source.

3. Annotation  

3.1 WA Annotation Task

Word alignment annotation consists of the following tasks:

 - Identifying different types of links: translated (correct or
   incorrect) and not translated (correct or incorrect)

 - Identifying sentence segments not suitable for annotation.
   Annotators may reject a segment that is blank, is incorrectly
   segmented, contains foreign-language text, or has source and
   translation in the same language.
  
 - Tagging unmatched words which are attached to other words or phrases

3.2 WA Annotation Guidelines 

 LDC's word alignment guidelines are adapted from previous task
 specifications including those used in the BLINKER project.

 The updated guidelines used for this corpus are available in the docs
 directory of this release.

 The guidelines can also be accessed from:
 http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Arabic_alignment_guidelines_v6.0.pdf

 Changes to the Arabic guidelines in this release:

 - A vocative particle that has no counterpart is tagged as not
   translated (correct).

 - The word "and" is linked to the comma.

 - In response to a site's request, all unaligned/unmatched words are
   tagged. Unaligned words or phrases with no locally related
   constituent to attach to are tagged as not translated (correct or
   incorrect). Unaligned words or phrases that do have a locally
   related constituent to attach to are tagged as "GLU", which marks
   local word relations among dependency constituents. This is
   represented by an asterisk (*) in the guidelines. The "GLU" cases
   are:

   - English subject pronouns omitted in Arabic are unmatched and
     tagged as "GLU".

   - An unmatched verb "to be" is tagged as "GLU" for an Arabic
     nominal sentence.

   - Unmatched pronouns and relative nouns, when linked to their
     referents, are tagged as "GLU".

   - Unmatched possessives ('s and '), when glued to their owner, are
     tagged as "GLU".

   - When one side has a preposition with no counterpart on the other
     side, the extra preposition, glued to its object, is tagged as
     "GLU".

   - When one language has two or more prepositions and the other has
     only one, the unmatched preposition is tagged as "GLU". The same
     applies to pronouns, except relative pronouns.

3.3 WA Annotation Process 

 This corpus was annotated using the following process:

  - Annotator training to familiarize Arabic WA annotation team with
    guidelines  

  - Annotation to produce first pass annotation on Arabic files.

  - Second pass by senior annotators to review and correct first pass
    annotation. 

  - Quality control by lead annotator for annotation consistency on
    all files. 

  - Automatic and manual sanity checks to ensure file format consistency. 

4. File Format Description

4.1 Overview

 This release distributes four types of files: raw, tokenized, treebank,
 and WA (word alignment). The aligned parallel treebank portion of the
 release contains seven files for each document. The parallel word
 aligned portion of this release contains five files for each document
 (no treebank files), and its tokenized files use a different format
 from those in the aligned parallel treebank portion (see below).

4.2 Details

4.2.1 Arabic (source) .raw

 Generally one sentence per line without markup. Text is encoded in utf-8.

4.2.2 English (translation) .raw

 One or more sentences per line without markup. Raw English files for the
 parallel aligned treebank portion were reduced by the EATB team from UTF-8
 to ASCII, and we include the reduced ASCII files here so that the begin
 and end character offsets provided by EATB and found in the .tkn files
 remain accurate.

4.2.3 Arabic (source) .tkn

 For parallel word alignment (non-treebanked) files, the tokenized Arabic
 source files contain one segment per line. The tokens are space-delimited
 and in utf8 encoding.

 The tokenized files for the parallel aligned treebank portion of the
 release contain more structure: each space-delimited token entry contains
 six semi-colon delimited fields. Because semicolon was used as field
 delimiter, any semicolons in the text appear as "-SC-" in this file.
 The 6 fields are as follows:

 - TokenID: integer sequentially numbered from 1
 - Start: start character offset into .raw file
 - End: end character offset into .raw file
 - Vocalized token (Buckwalter)
 - Input string (utf8)
 - Unvocalized (Buckwalter)
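
 As an illustration of the six-field layout above, the following is a
 minimal Python sketch of a reader for one line of a treebank-portion
 Arabic .tkn file. The names ArabicToken and parse_arabic_tkn_line are
 hypothetical and not part of this release, and the sample line uses
 invented values rather than data from the corpus.

```python
from typing import List, NamedTuple


class ArabicToken(NamedTuple):
    token_id: int      # TokenID, sequentially numbered from 1
    start: int         # start character offset into the .raw file
    end: int           # end character offset into the .raw file
    vocalized: str     # vocalized token (Buckwalter)
    input_string: str  # input string (UTF-8)
    unvocalized: str   # unvocalized token (Buckwalter)


def restore_semicolons(field: str) -> str:
    """Undo the "-SC-" escape used for literal semicolons."""
    return field.replace("-SC-", ";")


def parse_arabic_tkn_line(line: str) -> List[ArabicToken]:
    """Parse one segment (one line) into a list of six-field tokens."""
    tokens = []
    for entry in line.split():
        fields = entry.split(";")
        if len(fields) != 6:
            raise ValueError(f"expected 6 fields, got {len(fields)}: {entry!r}")
        token_id, start, end = (int(f) for f in fields[:3])
        voc, inp, unvoc = (restore_semicolons(f) for f in fields[3:])
        tokens.append(ArabicToken(token_id, start, end, voc, inp, unvoc))
    return tokens


# Hypothetical sample line; the offsets and forms are made up.
sample = "1;0;3;kitAb;كتاب;ktAb 2;5;5;-SC-;-SC-;-SC-"
for token in parse_arabic_tkn_line(sample):
    print(token)
```

 The sketch raises an error on any entry that does not have exactly six
 fields, which makes unexpected formatting visible early.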

 Treebank trace tokens:

 Treebank tree leaves having the POS label -NONE- correspond to trace
 positions in the syntactic tree and contain the "*" character. We give
 these tokens equal start and end positions, both set to the end offset
 of the previous token (even though these tokens have no equivalent in
 the .raw file). For these tokens, the Vocalized, Input String, and
 Unvocalized forms are all identical.

 Known empty tokens for Arabic are:

 *       # Pro-drop subjects and passive traces
 *0*     # Null complementizer or zero WH- pronoun
 *ICH*   # Rightward movement (for the most part, also *RNR*, etc.)
 *RNR*   # Right node raising
 *T*     # WH-traces or any topicalization

 For Arabic, we continue the practice of marking empty tokens as
 not translated (correct).

4.2.4 English (translation) .tkn

 As with Arabic tokenized files in this release, the English tokenized
 files have different structure depending on which portion of the release
 they belong to. For files in the parallel word alignment (non-treebank)
 portion of the release, the tokens are simply space-delimited. Tokenization
 was produced by using MADA followed by manual annotator correction prior to
 word alignment.

 For files in the parallel aligned treebank portion of the release, each
 line contains a space delimited list of tokens, and each token contains
 four semi-colon delimited fields. Any tokens originally containing
 semicolons have -SC- appearing instead of a semicolon.  The four fields
 are as follows:

 - TokenID: integer ID of token, sequentially numbered from 1.
 - Start: start character offset into .raw file
 - End: end character offset into .raw file
 - Token: the token string from the EATB annotated tree

 Treebank trace tokens:

 Treebank trace tokens have the POS label -NONE-. These tokens have no
 corresponding material in the input file. We give equal start and end
 positions for these tokens.

 Known empty tokens are:

 *
 *0*
 *?*
 *EXP*
 *ICH*
 *NOT*
 *RNR*
 *T*
 *U*
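
 The four-field layout and the empty-token list above can be read with
 a short Python sketch such as the one below. The function names are
 hypothetical, and the sample line is invented. The trace test is only
 a heuristic: this document does not state whether End offsets are
 inclusive, so equal offsets alone may not identify a trace, and the
 POS label -NONE- in the .tree file remains the authoritative signal.

```python
from typing import List, Tuple

# Empty tokens listed in this README for the English side.
KNOWN_EMPTY_TOKENS = {
    "*", "*0*", "*?*", "*EXP*", "*ICH*", "*NOT*", "*RNR*", "*T*", "*U*",
}

# (token_id, start, end, token_string)
EnglishToken = Tuple[int, int, int, str]


def parse_english_tkn_line(line: str) -> List[EnglishToken]:
    """Parse one line of a treebank-portion English .tkn file."""
    tokens = []
    for entry in line.split():
        fields = entry.split(";")
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {entry!r}")
        token_id, start, end, text = fields
        # Literal semicolons are stored as "-SC-" in the file.
        tokens.append((int(token_id), int(start), int(end),
                       text.replace("-SC-", ";")))
    return tokens


def looks_like_trace(token: EnglishToken) -> bool:
    """Heuristic: equal offsets AND a string from the known list."""
    _, start, end, text = token
    return start == end and text in KNOWN_EMPTY_TOKENS


# Hypothetical sample line; the offsets are made up.
sample = "1;0;2;The 2;4;6;cat 3;7;7;*T* 4;8;8;-SC-"
for token in parse_english_tkn_line(sample):
    print(token, looks_like_trace(token))
```

 This sketch matches only the bare forms in the list. Whether trace
 tokens in the .tkn files carry any additional suffix is not stated
 here, so that case is left unhandled.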

4.2.5 Arabic (ATB) .tree

 Trees are represented in Penn Treebank format (labeled brackets).
 The tree leaves contain token IDs corresponding to the numbers in the
 tokenized (.tkn) file. Most lines have one tree, but it is possible some
 have more.

4.2.6 English (EATB) .tree

 Trees are represented in Penn Treebank format (labeled brackets).
 The tree leaves contain token IDs corresponding to the numbers in the
 tokenized (.tkn) file. Most lines have one tree, but it is possible some
 have more. Multiple trees may be created based on the translator's
 decision to break an Arabic sentence into multiple English sentences.
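
 Because the leaves of both the ATB and EATB trees hold token IDs
 rather than words, the POS tag of each token can be recovered with a
 simple pattern match. The Python sketch below assumes that every leaf
 has the form "(POS id)" with an integer ID and that POS labels contain
 no whitespace or parentheses. The function name and the sample tree
 are hypothetical.

```python
import re
from typing import List, Tuple

# A leaf is a bracket holding a POS label and an integer token ID.
LEAF_RE = re.compile(r"\(([^\s()]+)\s+(\d+)\)")


def tree_leaves(tree_line: str) -> List[Tuple[str, int]]:
    """Return (pos_label, token_id) pairs in left-to-right order.

    Works the same whether the line holds one tree or several.
    """
    return [(pos, int(tid)) for pos, tid in LEAF_RE.findall(tree_line)]


def trace_token_ids(tree_line: str) -> List[int]:
    """Token IDs whose POS label is -NONE- (treebank trace tokens)."""
    return [tid for pos, tid in tree_leaves(tree_line) if pos == "-NONE-"]


# Hypothetical tree line, not taken from the corpus.
sample = "(S (NP (DT 1) (NN 2)) (VP (VBD 3) (NP (-NONE- 4))))"
print(tree_leaves(sample))
print(trace_token_ids(sample))
```

 The IDs returned here can be looked up in the matching line of the
 .tkn file to recover the token strings and character offsets.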

4.2.7 WA .wa file

 The format of the alignment file is similar to GIZA++ word alignment
 format, but with some enhancements.

 Each line contains a list of space delimited alignments for the
 corresponding sentence. Each alignment is in the following format:

     s-t(linktype)

 where s and t are lists of comma delimited source and translation
 token IDs respectively. Either s or t can be empty, indicating a token
 that is not translated. Valid values for linktype are:

 COR    translated correct
 TIN    translated incorrect
 MTA    meta token: treebank trace or transcription/translation markup

 Additionally, a token number may optionally be followed by a tag
 enclosed in square brackets. Possible tags are:

 GLU    "glued" token
 TYP    typo
 TOK    tokenization error
 MET    meta data: transcription/translation markups or treebank Empty Token
 MRK    similar to MET, but markup is attached to content token

 Examples of valid alignments:

 2[TYP]-4,6(COR)     # Arabic token 2 (a typo) is aligned to English tokens
                       4 and 6. Correct.
 13[GLU],14-10(INC)  # Arabic tokens 13 (tagged as so-called glue) and 14
                       are aligned to English token 10. Incorrect.
 10-(COR)            # Arabic token 10 is not translated/has no English
                       correspondent. Correct.
 -19[TYP](COR)       # English token 19 (a typo) is not translated/has
                       no Arabic correspondent. Correct.
 5[MET]-(MTA)        # Arabic token 5 is a meta token

 Annotators had the option of not annotating a sentence.  In these cases,
 the word "rejected" appears instead of word alignments.  This typically
 happens when automatic sentence alignment failed -- either one of the
 sentences was empty, or they were not translations of one another.
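
 The s-t(linktype) format described in this section can be read with a
 small Python sketch like the one below. The names Alignment and
 parse_wa_line are hypothetical. The sketch assumes at most one
 bracketed tag per token number and treats the link type as an opaque
 uppercase string rather than validating it against a fixed list.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# One token reference: an integer ID with an optional [TAG].
TOKEN_RE = re.compile(r"^(\d+)(?:\[([A-Z]+)\])?$")
# One alignment: source side, hyphen, translation side, (LINKTYPE).
ALIGN_RE = re.compile(r"^([^-()]*)-([^-()]*)\(([A-Z]+)\)$")

TokenRef = Tuple[int, Optional[str]]  # (token_id, tag or None)


class Alignment(NamedTuple):
    source: List[TokenRef]       # Arabic side; empty if not translated
    translation: List[TokenRef]  # English side; empty if not translated
    link_type: str


def _parse_side(side: str) -> List[TokenRef]:
    refs = []
    for item in side.split(",") if side else []:
        match = TOKEN_RE.match(item)
        if match is None:
            raise ValueError(f"bad token reference: {item!r}")
        refs.append((int(match.group(1)), match.group(2)))
    return refs


def parse_wa_line(line: str) -> Optional[List[Alignment]]:
    """Parse one .wa line; return None for a rejected segment."""
    line = line.strip()
    if line == "rejected":
        return None
    alignments = []
    for chunk in line.split():
        match = ALIGN_RE.match(chunk)
        if match is None:
            raise ValueError(f"bad alignment: {chunk!r}")
        alignments.append(Alignment(_parse_side(match.group(1)),
                                    _parse_side(match.group(2)),
                                    match.group(3)))
    return alignments


# The example alignments from this README, joined into one line.
sample = "2[TYP]-4,6(COR) 13[GLU],14-10(INC) 10-(COR) -19[TYP](COR) 5[MET]-(MTA)"
for alignment in parse_wa_line(sample):
    print(alignment)
```

 Returning None for rejected segments keeps them distinct from
 segments that were annotated but happen to contain no alignments.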

5. Data Directory Structure

 - data/parallel_word_aligned_treebank/bn/source/{raw,tokenized}: 
   raw (untokenized) and tokenized Arabic source data
 - data/parallel_word_aligned_treebank/bn/translation/{raw,tokenized}: 
   raw (untokenized) and tokenized English translation data
 - data/parallel_word_aligned_treebank/bn/tree/{ATB,EATB}: Arabic and English 
   Treebank annotated data 
 - data/parallel_word_aligned_treebank/bn/WA: word-level aligned data 

6. Documentation

 - docs/GALE_Arabic_alignment_guidelines_v6.0.pdf: annotator guidelines
   for word alignment
 - docs/files.sha1: the sha1 checksums for data files in package
 - docs/README.txt: this file 
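
 The checksum file can be used to confirm that the data files arrived
 intact. The Python sketch below assumes that each line of
 docs/files.sha1 has the common "<hex digest>  <relative path>" layout
 produced by sha1sum; that layout is an assumption and is not
 documented here. The function names are hypothetical.

```python
import hashlib
from pathlib import Path
from typing import List


def sha1_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Compute the SHA-1 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(chunk_size), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_checksums(manifest_path, root) -> List[str]:
    """Return the relative paths that are missing or do not match.

    Assumes one "<digest> <path>" pair per line (sha1sum style).
    """
    mismatches = []
    text = Path(manifest_path).read_text(encoding="utf-8")
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*")  # sha1sum binary-mode marker
        target = Path(root) / rel_path
        if not target.is_file() or sha1_of(target) != expected.lower():
            mismatches.append(rel_path)
    return mismatches
```

 A call such as verify_checksums("docs/files.sha1", ".") from the top
 of the package would return an empty list if every listed file
 matches, provided the paths in the file are relative to that
 directory.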

7. Data Validation

 The following data consistency checks were performed:

 - Bilingual annotators checked several files by hand to ensure that
   their word alignment annotations were faithfully recorded in the
   output format.
 - It was verified that all files associated with a given document
   contain the same number of sentence segments.
 - It was verified that all tokens for a given sentence were annotated and
   that those annotations appear in the .wa file.
 - It was verified that all token numbers referenced in the .wa file have a
   corresponding token in the .tkn file.
 - For treebank data, it was verified that the syntax trees are well-formed and
   that each token has a part-of-speech tag.
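
 Users who modify the data may want to repeat the two .wa/.tkn
 consistency checks listed above. The following Python sketch shows
 one way to do so for a single segment. It is not the validation code
 used by LDC; the function names are hypothetical, and the token ID
 lists are assumed to have been read from the matching .tkn lines
 beforehand.

```python
import re
from typing import Iterable, List, Set

# source side, hyphen, translation side, (LINKTYPE)
ALIGN_RE = re.compile(r"^([^-()]*)-([^-()]*)\([A-Z]+\)$")


def ids_on_side(side: str) -> Set[int]:
    """Collect the integer token IDs on one side, ignoring [TAG]s."""
    ids = set()
    for item in side.split(","):
        match = re.match(r"\d+", item)
        if match:
            ids.add(int(match.group()))
    return ids


def check_segment(wa_line: str,
                  source_ids: Iterable[int],
                  translation_ids: Iterable[int]) -> List[str]:
    """Return a list of problems found for one segment (empty if none)."""
    wa_line = wa_line.strip()
    if wa_line == "rejected":
        return []  # rejected segments carry no alignments to check
    seen_source, seen_translation = set(), set()
    for chunk in wa_line.split():
        match = ALIGN_RE.match(chunk)
        if match is None:
            return [f"malformed alignment: {chunk!r}"]
        seen_source |= ids_on_side(match.group(1))
        seen_translation |= ids_on_side(match.group(2))
    problems = []
    sides = (("source", seen_source, set(source_ids)),
             ("translation", seen_translation, set(translation_ids)))
    for name, seen, known in sides:
        for tid in sorted(seen - known):
            problems.append(
                f"{name} token {tid} referenced in .wa but missing from .tkn")
        for tid in sorted(known - seen):
            problems.append(
                f"{name} token {tid} present in .tkn but not annotated in .wa")
    return problems


# Hypothetical segment: source token 2 is never annotated.
print(check_segment("1-1(COR)", [1, 2], [1]))
```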

8. Copyright Information

   Portions (c) 2007-2008 Abu Dhabi TV, (c) 2008 Al Baghdadya TV, (c) 2008 
   Al Fayha, (c) 2008 Al Iraqiyah, (c) 2007 Aljazeera, (c) 2007 Al Ordiniyah, 
   (c) 2008 Al Sharqiya, (c) 2008 Dubai TV, (c) 2008 Oman TV, (c) 2008 Saudi 
   TV, (c) 2011 Trustees of the University of Pennsylvania 

9. Acknowledgements

This work was supported in part by the Defense Advanced Research 
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of 
this publication does not necessarily reflect the position or the policy 
of the Government, and no official endorsement should be inferred. 

Special thanks to: Seth Kulick, Wajdi Zaghouani, Justin Mott, Mike Ciul

Special thanks to: Khalda Ahmed, Nahed Gayed, Nancy Gayed, Manal Gobran

10. Contact Information

   If you have questions about this data release, please contact the
   following personnel at the LDC.

   Project manager: Xuansong Li <xuansong@ldc.upenn.edu>
   Technical lead: Stephen Grimes <sgrimes@ldc.upenn.edu>
   Lead annotator: Safa Ismael <safa@ldc.upenn.edu>
   Project consultant: Stephanie Strassel <strassel@ldc.upenn.edu>

--------------------------------------------------------------------------
README Created Feb 8, 2011 Stephen Grimes
README Updated Mar 28, 2011 Xuansong Li
README Updated Jun 8, 2011 Stephen Grimes
README Updated June 29, 2011 Xuansong Li