GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web

Authors: Xuansong Li, Stephen Grimes, Safa Ismael and Stephanie Strassel
Linguistic Data Consortium

1. Introduction

This file contains documentation for the corpus GALE Arabic-English Word
Alignment Training Part 1 -- Newswire and Web. The corpus includes word
aligned newswire and web data.

2. Source Data Profile

2.1 Data Source

The file names indicate the source from which the data were first
harvested: chiefly news agencies in the case of newswire data, and the
internet in the case of web data. Given the file name "AFP_ARB_20061104",
"AFP" is the abbreviation of the news agency "Agence France Presse",
"ARB" stands for the Arabic language, and "20061104" indicates the date:
11/04/2006.
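For illustration only, the following minimal Python sketch shows one way
the naming convention above could be parsed. It assumes exactly three
underscore-separated fields, as in the example file name; the function and
field names are our own and are not part of the release.

    # Illustrative sketch only (not part of the release): split a file
    # name stem such as "AFP_ARB_20061104" into the fields described in
    # Section 2.1. The field names here are assumptions for this example.
    from datetime import datetime

    def parse_file_name(stem):
        source, language, date_str = stem.split("_")
        return {
            "source": source,      # e.g. "AFP" (Agence France Presse)
            "language": language,  # e.g. "ARB" (Arabic)
            "date": datetime.strptime(date_str[:8], "%Y%m%d").date(),
        }

    # parse_file_name("AFP_ARB_20061104")
    # -> {'source': 'AFP', 'language': 'ARB',
    #     'date': datetime.date(2006, 11, 4)}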
2.2 Annotation Data Profile

Language   Genre   Files    Words   Tokens   Segments
------------------------------------------------------
Arabic     WB        119    59696    81620       4383
Arabic     NW        717   198621   263060       8423
------------------------------------------------------
Total                836   258317   344680      12806

Note: The word count is based on the untokenized Arabic source; the token
count is based on the tokenized Arabic source.

3. Annotation

3.1 Tokenization Correction Task

Arabic source tokens need to be corrected when they are incorrectly
tokenized by the MADA tokenization system (developed by Columbia
University). The correction annotation tasks include:

- Identifying incorrectly tokenized tokens
- Correcting tokens according to the tokenization correction guidelines

3.2 WA Annotation Task

Word alignment annotation consists of the following tasks:

- Identifying different types of links: translated (correct or incorrect)
  and not translated (correct or incorrect)
- Identifying sentence segments not suitable for annotation. Annotators
  may reject segments that are blank, incorrectly segmented, or contain
  foreign languages, or whose source and translation are in the same
  language.
- Tagging unmatched words which are attached to other words or phrases

3.3 WA Annotation Guidelines

3.3.1 Tokenization Correction Guidelines

For WA annotation on sentence-based data, an extra tokenization correction
process is needed to correct incorrectly tokenized tokens produced by the
MADA tokenization system. The Arabic source files were tokenized by MADA
into ATB-style unvocalized tokens. Based on the correction guidelines
provided by Columbia University, LDC compiled tokenization correction
guidelines for the tokenization correction annotation. The guidelines are
available in the docs directory of this release:

./docs/ArabicTokenizationGuidelinesV1.1.pdf

3.3.2 Word Alignment Guidelines

LDC's word alignment guidelines are adapted from previous task
specifications, including those used in the BLINKER project. No changes
have been made to the alignment guidelines since the last delivery. The
guidelines used for this corpus are available in the docs directory of
this release. They can also be accessed from:

http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Arabic_alignment_guidelines_v6.0.pdf

3.4 WA Annotation Process

- Annotator training to familiarize the Arabic WA annotation team with
  the WA guidelines
- Annotation to produce first pass WA annotation on Arabic files
- Second pass by senior annotators to review and correct first pass WA
  annotation
- Quality control by the lead annotator for WA annotation consistency on
  all files
- Automatic and manual sanity checks to ensure file format consistency

4. File Format Description

4.1 Overview

Files distributed in this release include three types - raw, tokenized,
and WA (word alignment). The raw format contains the original
Arabic/English sentences without any annotation.

4.2 Details

4.2.1 Arabic (source) .raw

Generally one sentence per line without markup. Text is encoded in UTF-8.

4.2.2 English (translation) .raw

One or more sentences per line without markup.

4.2.3 Arabic (source) .tkn

For parallel word alignment files, the tokenized Arabic source files
contain one segment per line. Tokenization was produced using MADA,
followed by manual annotator correction prior to word alignment. The
tokens are space-delimited and in UTF-8 encoding. There is no explicit
token numbering; the .wa files reference implicit token numbers starting
from 1, 2, etc.

4.2.4 English (translation) .tkn

As with the Arabic tokenized files in this release, the English tokenized
files have a different structure depending on which portion of the release
they belong to. For files in the parallel word alignment portion of the
release, the tokens are simply space-delimited.

4.2.5 WA .wa file

Each line contains a list of space-delimited alignments for the
corresponding sentence. Each alignment is in the following format:

s-t(linktype)

where s and t are comma-delimited lists of source and translation token
IDs, respectively. s or t can be empty, indicating a not-translated
token. Valid values for linktype are:

COR translated correct
TIN translated incorrect
MTA meta token: transcription/translation markup

Additionally, a token number may optionally be followed by a tag enclosed
in square brackets. Possible tags are:

GLU "glued" token
TYP typo
TOK tokenization error
MET meta data: transcription/translation markup
MRK similar to MET, but the markup is attached to a content token

Examples of valid alignments:

2[TYP]-4,6(COR)    # Arabic token 2 (a typo) is aligned to English
                     tokens 4 and 6. Correct.
13[GLU],14-10(TIN) # Arabic tokens 13 (tagged as a so-called glue token)
                     and 14 are aligned to English token 10. Incorrect.
10-(COR)           # Arabic token 10 is not translated/has no English
                     correspondent. Correct.
-19[TYP](COR)      # English token 19 (a typo) is not translated/has no
                     Arabic correspondent. Correct.
5[MET]-(MTA)       # Arabic token 5 is a meta token.

Annotators had the option of not annotating a sentence. In these cases,
the word "rejected" appears instead of word alignments. This typically
happens when automatic sentence alignment failed -- either one of the
sentences was empty, or they were not translations of one another.
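For readers writing their own loaders, the following minimal Python
sketch (illustrative only, not part of the release) parses a single
alignment in the s-t(linktype) notation above into token IDs, optional
tags, and the link type. The function and field names are our own
assumptions.

    # Illustrative sketch only: parse one alignment from a .wa line,
    # e.g. "2[TYP]-4,6(COR)", "13[GLU],14-10(TIN)", or "-19[TYP](COR)".
    # Function and field names are assumptions, not part of the release.
    import re

    TOKEN = re.compile(r"(\d+)(?:\[([A-Z]+)\])?$")  # id + optional [TAG]

    def parse_token_list(side):
        # An empty side means no token on that side (not translated).
        if not side:
            return []
        tokens = []
        for item in side.split(","):
            m = TOKEN.match(item)
            if not m:
                raise ValueError("bad token: %r" % item)
            tokens.append((int(m.group(1)), m.group(2)))  # (id, tag/None)
        return tokens

    def parse_alignment(alignment):
        body, linktype = alignment[:-1].rsplit("(", 1)  # drop closing ")"
        source, translation = body.split("-", 1)
        return {
            "source": parse_token_list(source),            # e.g. [(2, 'TYP')]
            "translation": parse_token_list(translation),  # e.g. [(4, None), (6, None)]
            "linktype": linktype,                          # COR, TIN, or MTA
        }

    # A .wa line is a space-delimited list of such alignments, or the
    # literal word "rejected" for unannotated segments:
    # if line.strip() != "rejected":
    #     alignments = [parse_alignment(a) for a in line.split()]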
5. Data Directory Structure

- data/parallel_word_aligned/{nw,wb}/source/{raw,tokenized}: raw
  (untokenized) and tokenized Arabic source data
- data/parallel_word_aligned/{nw,wb}/translation/{raw,tokenized}: raw
  (untokenized) and tokenized English translation data
- data/parallel_word_aligned/{nw,wb}/WA: word aligned and tagged data

6. Documentation

- docs/files.sha1: SHA-1 checksums for data files
- docs/README.txt: a general documentation file about this release
- docs/ArabicTokenizationGuidelinesV1.1.pdf: tokenization correction
  guidelines
- docs/GALE_Arabic_alignment_guidelines_v6.0.pdf: Arabic-English
  alignment guidelines

7. Data Validation

7.1 Data Consistency

The following data consistency checks were performed:

- Bilingual annotators checked several files by hand to ensure that their
  word alignment annotations were faithfully recorded in the output
  format.
- It was verified that all files associated with a given document contain
  the same number of sentence segments.
- It was verified that all tokens for a given sentence were annotated and
  that those annotations appear in the .wa file.
- It was verified that all token numbers referenced in the .wa file have
  a corresponding token in the .tkn file.
- For treebank data, it was verified that the syntax trees are
  well-formed and that each token has a part-of-speech tag.

7.2 Sanity Checks

A set of independent sanity checks was performed by a technical staff
member of LDC.

8. Acknowledgements

This work was supported in part by the Defense Advanced Research Projects
Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this
publication does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

Special thanks to annotators: Khalda Ahmed, Nahed Gayed, Nancy Gayed and
Manal Gobran.

9. Contact Information

If you have questions about this data release, please contact the
following personnel at LDC:

Project manager:    Xuansong Li
Technical lead:     Stephen Grimes
Lead annotator:     Safa Ismael
Project consultant: Stephanie Strassel

--------------------------------------------------------------------------
README created April 8, 2011 by Xuansong Li
README updated April 15, 2011 by Stephen Grimes
README updated June 29, 2011 by Xuansong Li