GALE Arabic-English Parallel Aligned Treebank -- Newswire Training

     Authors: Xuansong Li, Stephen Grimes, Safa Ismael, Dalal Zakhary, 
          Stephanie Strassel, Mohamed Maamouri, Ann Bies         
		   
                     Linguistic Data Consortium


1. Introduction

 This file contains documentation for the GALE Arabic-English Parallel Aligned 
 Treebank Newswire Training release. Data were sourced from Arabic newswire
 sources and translated to English. Arabic and English Treebank annotations 
 were performed independently, and finally the parallel texts were word 
 aligned to create this release. These data match Arabic treebanked data
 appearing in ATB3.

2. Source Data Profile

2.1 Data Selection 

 Newswire and broadcast news files in this release were selected for
 annotation from among the superset of data that had been previously
 treebank annotated for both Arabic and English (translated from
 Arabic). During data selection, files with mismatched source and
 translation segments were excluded. Files with bad format and
 atypical newswire were also excluded.

2.2 Data Source

 The newswire data is from 2002 An Nahar. 
 
2.3 Annotation Data Profile

 Language  Genre  Files   Words  Tokens  Segments
 ------------------------------------------------
 Arabic     NW     364   182351  267520    7711

Note: Word count is based on untokenized Arabic source; token count is
based on ATB-tokenized Arabic source.

3. Annotation  

3.1 WA Annotation Task

Word alignment annotation consists of the following tasks:

 - Identifying different types of links: translated (correct or
   incorrect) and not translated (correct or incorrect)
 - Identifying sentence segments not suitable for annotation.
   Annotators may reject a segment if it is blank, incorrectly
   segmented, contains foreign-language material, or has source and
   translation in the same language.
 - Tagging unmatched words which are attached to other words or phrases

3.2 WA Annotation Guidelines 

 LDC's word alignment guidelines are adapted from previous task
 specifications including those used in the BLINKER project.

 Unaligned words or phrases that have no locally-related constituent to 
 attach to are "aligned" as not-translated. Unaligned words or phrases that
 do have a locally-related constituent to attach to are tagged as "GLU",
 which marks local word relations among dependency constituents.

3.3 WA Annotation Process 

 This corpus was annotated using the following process:

 - Annotator familiarization with word alignment guidelines.
 - First pass word alignment annotation.
 - Second pass annotation by senior annotators to review and correct first 
   pass annotation.
 - Quality control by a lead annotator to check for annotation consistency on
   all files. 
 - Automatic and manual sanity checks to ensure file format consistency. 

4. File Format Description

4.1 Overview

 This release contains four types of files - raw, tokenized, treebank, and 
 "wa". The "raw" format contains the original Arabic/English sentences 
 without any annotation. The "tokenized" format includes the tokenized version 
 of the raw data. These tokens are determined through treebank annotation and 
 may contain Empty Category tokens, which are discussed below. The "treebank" 
 and "wa" files are treebank and word alignment annotations on the
 "tokenized" files.

 Seven files are associated with each document, namely Arabic/English
 raw, Arabic/English tokenized, Arabic/English treebank, and a WA
 file. All seven files associated with a given document name have the same 
 number of lines; that is, annotations of a specific sentence segment share 
 the same line number across all seven files.

 - raw: Arabic/English sentences

 - tokenized: Arabic/English tokenized sentences;

   - the tokens were taken directly from ATB and EATB without modification
   - each token has an ID which ATB, EATB and WA files can refer to
   - ATB, EATB and WA files do NOT contain actual tokens, but only the
     token IDs referencing tokens in the tokenized version

 - treebank: Arabic/English treebank files

   - treebank files in this release appear in bracketed Penn Treebank format
   - each token has a POS tag and all higher-level nodes are labeled
   - because each Arabic token has multiple forms, the trees contain 
     token numbers instead of strings; the vocalized, unvocalized, and
     input string versions can be obtained by finding the token ID in the 
     tokenized file

 - wa: word alignment file

   - word alignment format is described in detail below
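
 Because the seven files of a document share line numbers, they can be
 read in lockstep. The following is a minimal Python sketch of that idea,
 not part of the release tooling. The function name, the dictionary labels,
 and the in-memory example lines are all hypothetical.

```python
from typing import Dict, Iterator, List


def iter_segments(files: Dict[str, List[str]]) -> Iterator[Dict[str, str]]:
    """Yield one dict per sentence segment, keyed by file label.

    `files` maps a caller-chosen label (e.g. "ar_raw", "wa") to the lines
    of that file. Raises ValueError if the line counts differ.
    """
    counts = {label: len(lines) for label, lines in files.items()}
    if len(set(counts.values())) > 1:
        raise ValueError(f"line counts differ: {counts}")
    labels = list(files)
    for row in zip(*(files[label] for label in labels)):
        yield dict(zip(labels, row))


# Hypothetical two-segment document; only three of the seven files are shown.
doc = {
    "ar_raw": ["<Arabic sentence 1>", "<Arabic sentence 2>"],
    "en_raw": ["English sentence 1.", "English sentence 2."],
    "wa": ["1-1(COR)", "rejected"],
}
segments = list(iter_segments(doc))
```

 Checking the line counts before iterating means a truncated file is
 reported as an error rather than silently shortening the document.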

4.2 Details

4.2.1 Arabic (source) .raw

 Generally one sentence per line without markup. Text is encoded in utf-8.

4.2.2 English (translation) .raw

 Generally one sentence per line, but a line may contain more than one
 sentence where the translator chose to introduce additional sentence 
 boundaries. No markup. Text is encoded in utf-8.

4.2.3 Arabic (source) .tkn

 Each line contains a space delimited list of tokens corresponding to 
 a line in the .raw file.  Each token entry contains 8 fields separated 
 by the semicolon character, ";". Because semicolon was used as field 
 delimiter, any semicolons in the text appear as "-SC-" in this file.  
 The 8 fields are as follows:

 - TokenID: integer sequentially numbered from 1
 - Start: start character offset into .raw file
 - End: end character offset into .raw file
 - VOC_STRING: the Arabic utf-8 of the vocalized form; this is the form of
   the token annotators saw during WA annotation 
 - VOCALIZED: the vocalized Buckwalter form of the word, taken from
   the solution
 - IS_TRANS: Buckwalter transliteration of INPUT_STRING
 - UNVOCALIZED: the unvocalized Buckwalter form of the word 
 - INPUT_STRING: utf-8 characters from original .raw file
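
 The field layout above can be read with a few lines of Python. This is a
 minimal sketch, not LDC tooling: the class and function names are
 hypothetical, the example line is invented for illustration, and it assumes
 that no field contains whitespace.

```python
from typing import List, NamedTuple


class ArabicToken(NamedTuple):
    token_id: int
    start: int
    end: int
    voc_string: str
    vocalized: str
    is_trans: str
    unvocalized: str
    input_string: str


def parse_arabic_tkn_line(line: str) -> List[ArabicToken]:
    """Parse one line of an Arabic .tkn file into token records."""
    tokens = []
    for entry in line.split():
        fields = entry.split(";")
        if len(fields) != 8:
            raise ValueError(f"expected 8 fields, got {len(fields)}: {entry!r}")
        # Restore literal semicolons that were written as "-SC-".
        text = [field.replace("-SC-", ";") for field in fields[3:]]
        tokens.append(
            ArabicToken(int(fields[0]), int(fields[1]), int(fields[2]), *text)
        )
    return tokens


# Invented example: a word, an Empty Category token, and a semicolon.
example_line = (
    "1;0;3;كَتَبَ;kataba;ktb;ktb;كتب "
    "2;3;3;*;*;*;*;* "
    "3;3;4;-SC-;-SC-;-SC-;-SC-;-SC-"
)
tokens = parse_arabic_tkn_line(example_line)
```

 Splitting on ";" before replacing "-SC-" matters: doing it in the other
 order would reintroduce the delimiter inside a field.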

 Empty Category tokens:

 Certain treebank tree leaves have the POS label -NONE-; these are 
 Empty Category tokens. These correspond to positions in the syntax tree but
 have no string equivalent in the input. We give equal start and end character
 offsets (see below) for these tokens.  Their entries for VOC_STRING, 
 VOCALIZED, IS_TRANS, UNVOCALIZED, and INPUT_STRING are all identical.

 Known Empty Category tokens for Arabic are:

 *       # Pro-drop subjects and passive traces 
 *0*     # Null complementizer or zero WH- pronoun
 *ICH*	 # Rightward movement (for the most part, also *RNR*, etc.) 
 *RNR*	 # Right node raising 
 *T* 	 # WH-traces or any topicalization

 Empty Category tokens are always explicitly marked as not translated (correct).

 Character Offsets:

 The start and end character offsets are structured as follows. The
 offsets refer to between-character positions. Hence for any two
 consecutive tokens, it is the case that the End offset of the
 preceding token is equal to the Start offset of the following
 token. The first start offset is numbered 0. "Null" tokens such as 
 syntactic traces are likewise given offsets in which Start=End; this
 zero-width pointer references a between-token position in the source text.
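
 The offset conventions described here lend themselves to a simple check.
 The sketch below is illustrative only: the function names and the example
 spans are hypothetical, and it takes the statements above (End of one token
 equals Start of the next, zero-width spans for null tokens) at face value.

```python
from typing import List, Tuple

Span = Tuple[int, int, int]  # (token ID, Start, End)


def zero_width_ids(spans: List[Span]) -> List[int]:
    """Token IDs whose Start equals End (candidate Empty Category tokens)."""
    return [token_id for token_id, start, end in spans if start == end]


def offset_breaks(spans: List[Span]) -> List[Tuple[int, int]]:
    """Pairs of consecutive token IDs where End(previous) != Start(next)."""
    return [
        (prev[0], nxt[0])
        for prev, nxt in zip(spans, spans[1:])
        if prev[2] != nxt[1]
    ]


# Invented spans: token 2 is zero-width and sits between tokens 1 and 3.
spans = [(1, 0, 3), (2, 3, 3), (3, 3, 7)]
empty = zero_width_ids(spans)
breaks = offset_breaks(spans)
```

 An empty result from offset_breaks means the spans tile the line without
 gaps or overlaps, which is the property stated above.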

4.2.4 English (translation) .tkn

 The English .tkn is similar in structure to the Arabic .tkn file but there
 are four semicolon-delimited fields instead of eight. Any tokens originally 
 containing semicolons have -SC- appearing instead of the semicolon. The four 
 fields are as follows:

 - TokenID: integer ID of token, sequentially numbered from 1.
 - Start: start character offset into .raw file
 - End: end character offset into .raw file
 - Token: the representation of the token from the original EATB
   annotated tree

 See 4.2.3 for information on Empty Category tokens. Known Empty Category 
 tokens for English are:

 * 
 *0* 
 *?* 
 *EXP* 
 *ICH* 
 *NOT* 
 *RNR* 
 *T* 
 *U* 
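
 As a rough illustration, the four-field layout and the list above can be
 combined to pick out Empty Category tokens in an English .tkn line. This is
 a sketch under assumptions: the names are hypothetical, the example line
 and its offsets are invented, and the handling of a trailing "-<digits>"
 coindexation suffix (as in "*T*-1") is a guess that may not apply to the
 actual files.

```python
import re
from typing import List, NamedTuple

# Known English Empty Category tokens, as listed in this section.
KNOWN_EMPTY = frozenset(
    {"*", "*0*", "*?*", "*EXP*", "*ICH*", "*NOT*", "*RNR*", "*T*", "*U*"}
)


class EnglishToken(NamedTuple):
    token_id: int
    start: int
    end: int
    token: str


def parse_english_tkn_line(line: str) -> List[EnglishToken]:
    """Parse one line of an English .tkn file into token records."""
    tokens = []
    for entry in line.split():
        fields = entry.split(";")
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {entry!r}")
        token = fields[3].replace("-SC-", ";")  # restore literal semicolons
        tokens.append(
            EnglishToken(int(fields[0]), int(fields[1]), int(fields[2]), token)
        )
    return tokens


def is_empty_category(tok: EnglishToken) -> bool:
    # Assumption: a trailing "-<digits>" index, if present, is ignored.
    base = re.sub(r"-\d+$", "", tok.token)
    return tok.start == tok.end and base in KNOWN_EMPTY


# Invented example line; the offsets are illustrative only.
en_tokens = parse_english_tkn_line(
    "1;0;2;He 2;2;7;said 3;7;7;*0* 4;7;7;*T*-1 5;7;8;-SC-"
)
empty_tokens = [t.token for t in en_tokens if is_empty_category(t)]
```

 Requiring Start=End in addition to the string match keeps a literal
 asterisk in the text from being mistaken for an Empty Category token.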

4.2.5 Arabic (ATB) .tree

 Trees are represented in Penn Treebank format (labeled brackets). 
 Trees are taken from treebank releases but strings in tree leaves were 
 replaced by token IDs corresponding to the numbers in the tokenized (.tkn) 
 file.

 Most lines have one tree, but some may contain more than one tree separated
 by whitespace.
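
 Since tree leaves carry token IDs rather than strings, the leaves can be
 pulled out with a regular expression and looked up in the .tkn file. The
 sketch below is hypothetical: the function name and the example tree are
 invented, and it assumes that POS labels contain no whitespace or
 parentheses.

```python
import re
from typing import List, Tuple

# A leaf is "(POS id)": a label followed by an integer token ID.
LEAF = re.compile(r"\(([^\s()]+)\s+(\d+)\)")


def tree_leaves(tree_line: str) -> List[Tuple[str, int]]:
    """Return (POS label, token ID) pairs in left-to-right order."""
    return [(pos, int(token_id)) for pos, token_id in LEAF.findall(tree_line)]


# Invented tree for illustration; labels are not taken from the corpus.
example_tree = "(S (VP (PV 1) (NP-SBJ (-NONE- 2)) (NP-OBJ (NOUN 3))))"
leaves = tree_leaves(example_tree)
```

 Leaves labeled -NONE- correspond to the Empty Category tokens described
 in 4.2.3.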

4.2.6 English (EATB) .tree

 Annotations were performed on an updated, unreleased version of the
 English Arabic Translation Treebank (EATB). The updates were
 requested by GALE sites and largely pertain to handling of hyphenation
 issues. The final updated version of the EATB corpus will be released
 in the coming months upon completion of the site-requested revisions.

 Trees are represented in Penn Treebank format (using labeled
 brackets).  The trees are similar to the EATB release, except here
 tree leaves were replaced by token IDs corresponding to the token
 numbers in the tokenized (.tkn) file.

 Most lines have one tree, but some may contain more than one tree separated
 by whitespace. Multiple trees arise when one Arabic sentence/tree was 
 translated into or corresponds to multiple English sentences/trees.
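
 A line holding several trees can be split by tracking bracket depth. This
 is a minimal sketch with a hypothetical function name and an invented
 example; it assumes that parentheses occur in .tree files only as tree
 brackets, which seems plausible because leaves are token IDs.

```python
from typing import List


def split_trees(line: str) -> List[str]:
    """Split a .tree line into its top-level bracketed trees."""
    trees: List[str] = []
    depth = 0
    start = 0
    for i, ch in enumerate(line):
        if ch == "(":
            if depth == 0:
                start = i  # a new top-level tree begins here
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                raise ValueError("unbalanced ')' in tree line")
            if depth == 0:
                trees.append(line[start : i + 1])
    if depth != 0:
        raise ValueError("unbalanced '(' in tree line")
    return trees


# Invented line with two trees separated by a space.
two_trees = split_trees(
    "(S (NP (PRP 1)) (VP (VBD 2))) (S (NP (PRP 3)) (VP (VBD 4)))"
)
```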

4.2.7 WA .wa file

 Each line contains a list of space delimited alignments for the 
 corresponding sentence. Each alignment is in the following format:

     s-t(linktype)

 where s and t are comma-delimited lists of source (Arabic) and translation 
 (English) token IDs respectively. s or t can be empty, indicating a 
 not-translated token. Linktype is either COR or INC, indicating if a 
 translation is correct or incorrect. Additionally, tokens may have a tag 
 enclosed in square brackets. Possible tags are:
 
 GLU ("glued" token)
 TYP (typo)

 Examples of valid alignments:

 2[TYP]-4,6(COR)     # Arabic token 2 (a typo) is aligned to English tokens 
                       4 and 6. Correct.
 13[GLU],14-10(INC)  # Arabic tokens 13 (tagged as so-called "glue") and 14 
                       are aligned to English token 10. Incorrect.
 10-(COR)            # Arabic token 10 is not translated/has no English 
                       correspondent. Correct. 
 -19[TYP](COR)       # English token 19 (a typo) is not translated/has
                       no Arabic correspondent. Correct.

 Annotators had the option of not annotating a sentence.  In these cases,
 the word "rejected" appears on the given line. This typically happens 
 when the sentence material was empty, quite short, contained mostly 
 punctuation, or the sentences were translations of one another.
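
 The alignment format can be parsed as follows. This is a minimal sketch
 rather than an official reader: all names are hypothetical, and it accepts
 only the link types (COR, INC) and tags (GLU, TYP) named in this section.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

Tagged = Tuple[int, Optional[str]]  # (token ID, tag or None)


class Alignment(NamedTuple):
    source: List[Tagged]       # Arabic side; empty if not translated
    translation: List[Tagged]  # English side; empty if not translated
    link_type: str             # "COR" or "INC"


ALIGNMENT = re.compile(r"^([^-()]*)-([^-()]*)\((COR|INC)\)$")
TOKEN = re.compile(r"^(\d+)(?:\[(GLU|TYP)\])?$")


def _parse_side(side: str) -> List[Tagged]:
    items = side.split(",") if side else []
    parsed = []
    for item in items:
        match = TOKEN.match(item)
        if match is None:
            raise ValueError(f"bad token reference: {item!r}")
        parsed.append((int(match.group(1)), match.group(2)))
    return parsed


def parse_wa_line(line: str) -> Optional[List[Alignment]]:
    """Return None for a rejected segment, else the list of alignments."""
    line = line.strip()
    if line == "rejected":
        return None
    alignments = []
    for chunk in line.split():
        match = ALIGNMENT.match(chunk)
        if match is None:
            raise ValueError(f"bad alignment: {chunk!r}")
        alignments.append(
            Alignment(
                _parse_side(match.group(1)),
                _parse_side(match.group(2)),
                match.group(3),
            )
        )
    return alignments


# The four example alignments given above, joined into one line.
parsed = parse_wa_line("2[TYP]-4,6(COR) 13[GLU],14-10(INC) 10-(COR) -19[TYP](COR)")
```

 Returning None for "rejected" keeps a rejected segment distinct from a
 segment that simply has no alignments.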

5. Data Directory Structure

 - data/parallel_word_aligned_treebank/nw/source/{raw,tokenized}: 
   raw (un-tokenized) and tokenized Arabic source data  

 - data/parallel_word_aligned_treebank/nw/translation/{raw,tokenized}: 
   raw (un-tokenized) and tokenized English translation data   

 - data/parallel_word_aligned_treebank/nw/tree/{ATB,EATB}: Arabic and 
   English treebank-annotated data 

 - data/parallel_word_aligned_treebank/nw/WA: word-level alignments

6. Documentation

 - docs/file.sha1: sha1 checksums for data files 
 - docs/GALE_Arabic_alignment_guidelines_v6.0.pdf: annotation guidelines
 - docs/README.txt: this file; general documentation for this release. 
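
 The checksums can be verified with a short script. The sketch below rests
 on an assumption that this README does not confirm: that each line of
 docs/file.sha1 has the form "<40 hex digits>  <relative path>", as written
 by the sha1sum utility. The function names are hypothetical, and the demo
 uses a temporary directory in place of the release.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import List


def sha1_of(path: Path) -> str:
    """Hex SHA-1 digest of a file, read in 1 MiB blocks."""
    digest = hashlib.sha1()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def verify_manifest(manifest: Path, root: Path) -> List[str]:
    """Return the listed paths that are missing or whose digest differs."""
    failures = []
    for line in manifest.read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        # Assumed layout: "<digest>  <path>"; sha1sum may prefix the path
        # with "*" in binary mode.
        expected, rel_path = line.split(maxsplit=1)
        target = root / rel_path.lstrip("*")
        if not target.is_file() or sha1_of(target) != expected.lower():
            failures.append(rel_path)
    return failures


# Demo on invented files: one correct entry and one missing file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.txt").write_bytes(b"abc")
    abc_digest = sha1_of(root / "a.txt")
    manifest = root / "file.sha1"
    manifest.write_text(
        f"{abc_digest}  a.txt\n" + "0" * 40 + "  missing.txt\n",
        encoding="utf-8",
    )
    failures = verify_manifest(manifest, root)
```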
 
7. Data Validation

The following data consistency checks were performed:

 - Bilingual annotators checked several files by hand to ensure that
   their word alignment annotations were faithfully recorded in the
   output format.
 - It was verified that all files associated with a given document
   contain the same number of sentence segments.
 - It was verified that all tokens for a given sentence were annotated and
   that those annotations appear in the .wa file.
 - It was verified that all token numbers referenced in a given .wa file have 
   correspondents in the .tkn files.
 - It was verified that all token numbers referenced in the .tree files 
   have correspondents in the .tkn files.

8. Copyright Information

   Portions (c) 2002 An Nahar, (c) 2002, 2011 Trustees of the University of 
   Pennsylvania

9. Acknowledgements

This work was supported in part by the Defense Advanced Research 
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of 
this publication does not necessarily reflect the position or the policy 
of the Government, and no official endorsement should be inferred. 

Special thanks to: Seth Kulick, Wajdi Zaghouani, Justin Mott, Mike Ciul   

Special thanks to: Khalda Ahmed; Nahed Gayed; Nancy Gayed and Manal Gobran. 

10. Contact Information

   If you have questions about this data release, please contact the
   following personnel at the LDC.

  Project manager: Xuansong Li <xuansong@ldc.upenn.edu>
  Technical lead: Stephen Grimes <sgrimes@ldc.upenn.edu>
  Lead annotator: Safa Ismael <safa@ldc.upenn.edu>
  Lead annotator: Dalal Zakhary <rzakhary@ldc.upenn.edu>
  Project consultant: Stephanie Strassel <strassel@ldc.upenn.edu>
-------------------------------------------------------------------
README created Feb 9 2011 by Stephen Grimes
README updated Mar 28 2011 by Xuansong Li
README updated Jun 8 2011 by Stephen Grimes
README updated Jun 27 2011 by Stephen Grimes