BOLT Egyptian-English Word Alignment Training -- Discussion Forum 

                 Linguistic Data Consortium

 Authors: Xuansong Li, Katherine Peterson, Stephen Grimes, Stephanie Strassel
            

1. Introduction

 The DARPA BOLT Program created new techniques for automated translation
 and linguistic analysis that can be applied to informal genres of text
 and speech common to online and in-person communications. LDC supports the
 BOLT Program by collecting informal data sources including discussion forums,
 text messaging and chat, and conversational telephone speech in English,
 Chinese and Egyptian Arabic, and applying annotations including translation,
 word alignment, Treebanking, PropBanking, co-reference and queries/responses.

 Word alignment data is contained in this release. In machine translation, word
 alignment is a crucial intermediate stage indicating corresponding word
 relations in parallel text. Word alignment data for this release is built on
 Treebank annotation. Tokens resulting from treebank annotation (including empty
 categories/traces) are directly extracted for word-level alignment.

 The word alignment annotation in this release is linguistically oriented and
 supported by linguistic theories, aiming to reach a variety of users from NLP
 fields as well as other research/education fields. For MT research, the data is
 intended for all MT performers with varying MT models. The annotation data format
 is designed to allow MT users to flexibly tailor or customize the annotation for
 different uses. Please see section 8 of this document for how to correctly use
 the data for your model.
 
 This release includes the corpus BOLT Egyptian-English Word Alignment Training -- 
 Discussion Forum. This file contains documentation for this corpus.  

2. Source Data Profile

2.1 Data Source and Selection

 Data used for word alignment are discussion forum posts harvested on-line at 
 LDC in 2012 for the BOLT project. Threads were collected based on the results 
 of manual data scouting by native speaker annotators. Scouts were instructed
 to seek content that is in Egyptian Arabic, original (written by the
 post's author rather than quoted), interactive, and informal.

 The harvested data was further selected and sentence-segmented at LDC for 
 translation by professional translation agencies. A manual selection procedure 
 was used to choose data appropriate for translation. Selection criteria included
 linguistic features and topic features. After selection, selected posts were 
 segmented into sentence units (SU). Then files were assigned to professional 
 translators for careful translation. Translators followed LDC's BOLT Translation 
 guidelines. After translations were completed, bilingual LDC staff performed
 quality control by selecting a proportional sample from each delivery and
 scrutinizing it for mistakes. 

 Egyptian source tree tokens for word alignment were automatically extracted from 
 tree files of BOLT Egyptian Arabic Treebank annotation done on the source 
 discussion forum data harvested by LDC in 2012. The treebank annotation was  
 performed in 2012 and 2013. The Egyptian treebank annotation went through two 
 stages: POS-tagging and syntactic annotation. 

2.2 WA Annotation Data Profile

 Language         Genre             Files   Words     Tree-tokens   Segments
 ---------------------------------------------------------------------------
 Egyptian Arabic  discussion forum  724     400,448   593,723       31,454

 Note: The word count is based on the untokenized Arabic source; the
 tree-token count is based on the ATB-tokenized Arabic source and includes
 empty categories.

3. Annotation

3.1 WA Annotation Task

Word alignment annotation consists of the following tasks:

 - Identifying different types of links: translated (correct or incorrect) and 
   not translated (correct or incorrect)
 - Identifying sentence segments not suitable for annotation. Annotators
   may reject a segment that is blank, incorrectly segmented, or contains
   foreign-language text, or whose source and translation are in the
   same language.
 - Tagging unmatched words which are attached to other words or phrases
 - Tagging incorrect tokenizations missed during tokenization correction
   using the TOK tag
 - Aligning or tagging alternate translations using the ALT tag

3.2 WA Annotation Guidelines

 LDC's Egyptian Arabic word alignment guidelines were adapted from the previous
 GALE WA task specifications (GALE_Arabic_alignment_guidelines_v6.0.pdf).

 The guidelines used for this corpus are available in the docs directory 
 of this release.

3.3 WA Annotation Process

- Annotator training to familiarize the Egyptian Arabic WA annotation team with
  the WA guidelines.
- Annotation to produce first pass WA annotation on Egyptian Arabic files.
- Second pass by senior annotators to review and correct first pass WA
  annotation.
- Quality control by lead annotator for annotation consistency on all files.
- Automatic and manual sanity checks to ensure file format consistency.

4. File Format Description

 This release distributes two types of files: tokenized files and WA (word
 alignment) files. The tokenized format originated from the treebank data.

4.1 Egyptian Arabic (source) .tkn

 The tokenized Egyptian Arabic source files contain one segment per line.
 The tokens are space-delimited and in UTF-8 encoding. The tree tokens result
 from treebank annotation, with empty categories (traces) added.

4.2 English (translation) .tkn

 As with Arabic tokenized files in this release, the English tokenized files 
 contain space-delimited tokens with one segment per line. Tokenization was 
 produced by using a tokenization script on the raw translation files.

4.3 WA .wa file

 The format of the alignment file is similar to GIZA++ word alignment format, but
 with some enhancements.

 Each line contains a list of space-delimited alignments for the corresponding
 sentence. Each alignment is in the following format:

     s-t(linktype)

 where s and t are comma-delimited lists of source and translation token IDs,
 respectively. Either s or t can be empty, indicating a token that was not
 translated. Valid values for linktype are:

 COR    translated correct
 TIN    translated incorrect
 MTA    meta token: treebank trace or transcription/translation markup

 Additionally, a token number may optionally be followed by a tag enclosed in
 square brackets. Possible tags are:

 GLU    "glued" token
 TYP    typo
 TOK    tokenization error
 MET    transcription/translation/treebank-trace markup
 ALT    alternate translation
 MRK    transcription/translation/treebank-trace markup present along
        with the token

 Examples of valid alignments:

 2[TYP]-4,6(COR)     # Arabic token 2 (a typo) is aligned to English tokens
                       4 and 6. Correct.
 13[GLU],14-10(TIN)  # Arabic tokens 13 (tagged as a "glued" token) and 14
                       are aligned to English token 10. Incorrect.
 10-(COR)            # Arabic token 10 is not translated/has no English
                       correspondent. Correct.
 -19[TYP](COR)       # English token 19 (a typo) is not translated/has
                       no Arabic correspondent. Correct.
 5[MET]-(MTA)        # Arabic token 5 is a meta token

 Annotators had the option of not annotating a sentence.  In these cases,
 the word "rejected" appears instead of word alignments.  This typically
 happens when automatic sentence alignment failed -- either one of the
 sentences was empty, or they were not translations of one another.
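
 The alignment format above can be read with a short parser. The following
 Python code is a minimal sketch, not an LDC tool: the names Link,
 parse_side and parse_wa_line are hypothetical, and the code assumes that
 each token ID carries at most one bracketed tag and that link types and
 tags are written in upper-case letters, as in the examples above.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# One token reference: an integer ID plus an optional tag such as TYP or GLU.
TokenRef = Tuple[int, Optional[str]]


class Link(NamedTuple):
    source: List[TokenRef]       # Arabic side; empty if there is no source token
    translation: List[TokenRef]  # English side; empty if there is no English token
    linktype: str                # link type as written in the file, e.g. COR or MTA


# s-t(linktype); either side may be empty.
LINK_RE = re.compile(r"^([^-()]*)-([^-()]*)\(([A-Z]+)\)$")
# A token ID optionally followed by one tag in square brackets (assumption).
TOKEN_RE = re.compile(r"^(\d+)(?:\[([A-Z]+)\])?$")


def parse_side(text: str) -> List[TokenRef]:
    """Parse a comma-delimited list such as '13[GLU],14' into token references."""
    refs = []
    for item in filter(None, text.split(",")):
        match = TOKEN_RE.match(item)
        if match is None:
            raise ValueError(f"bad token reference: {item!r}")
        refs.append((int(match.group(1)), match.group(2)))
    return refs


def parse_wa_line(line: str) -> Optional[List[Link]]:
    """Return the links on one .wa line, or None for a rejected segment."""
    line = line.strip()
    if line == "rejected":
        return None
    links = []
    for field in line.split():
        match = LINK_RE.match(field)
        if match is None:
            raise ValueError(f"bad alignment: {field!r}")
        links.append(
            Link(parse_side(match.group(1)), parse_side(match.group(2)), match.group(3))
        )
    return links
```

 The parser does not restrict the link type to a fixed list, so it reports
 whatever value appears in the file and leaves validation to the caller.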

5. Data Directory Structure

 - data/source/tokenized: tokenized Egyptian Arabic source data
 - data/translation/tokenized: tokenized English translation data
 - data/WA: word-level alignment

6. Documentation

 - docs/filelist.txt: the list of files showing package structures.  
 - docs/BOLT_Eyptian_Arabic_WA_Guidelines_v1.0.pdf: annotation guidelines
 - docs/AEPC_parallel_aligned_TB.pdf: reference paper for WA
 - docs/LREC2010_Arabic_parallel_aligned_TB.pdf: reference paper for WA
 - docs/LREC2012_LDC_parallel_aligned_BOLT.pdf: reference paper for WA
 - docs/LREC2012_wa_autoalign.pdf: reference paper for WA

7. Data Validation and Sanity Checks

 A set of independent data validation and sanity checks has been performed
 by LDC's technical staff, particularly:
 --Bilingual annotators checked several files by hand to ensure that their
   word alignment annotations were faithfully recorded in the output format.
 --It was verified that all directories contain the same number of files
   and that the files share the same filename stems.
 --It was verified that all files associated with a given document contain
   the same number of sentence segments.
 --It was verified that all tokens for a given sentence were annotated and
   that those annotations appear in the .wa file.
 --It was verified that all token numbers referenced in the .wa file have a
   corresponding token in the .tkn file.
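
 Users who modify the data may want to repeat the last two checks on their
 own copies. The following Python code is a minimal sketch of such a check
 for a single segment, not the script used by LDC. The function names are
 hypothetical, and the code assumes that token IDs are 1-based positions in
 the corresponding .tkn line, which this document does not state explicitly.

```python
import re
from typing import List

# s-t(linktype), where each side holds digits, commas and bracketed tags only.
LINK_RE = re.compile(r"^([\d,\[\]A-Z]*)-([\d,\[\]A-Z]*)\([A-Z]+\)$")


def token_ids(side: str) -> List[int]:
    """Extract token IDs from one side of a link, dropping tags like [TYP]."""
    return [int(item.split("[")[0]) for item in side.split(",") if item]


def check_segment(src_line: str, tgt_line: str, wa_line: str) -> List[str]:
    """Return a list of problems found for one segment (empty if none)."""
    problems: List[str] = []
    if wa_line.strip() == "rejected":
        return problems  # rejected segments carry no alignments to check

    n_src = len(src_line.split())
    n_tgt = len(tgt_line.split())
    seen_src, seen_tgt = set(), set()

    for field in wa_line.split():
        match = LINK_RE.match(field)
        if match is None:
            problems.append(f"unparseable alignment {field!r}")
            continue
        seen_src.update(token_ids(match.group(1)))
        seen_tgt.update(token_ids(match.group(2)))

    for name, seen, count in (
        ("source", seen_src, n_src),
        ("translation", seen_tgt, n_tgt),
    ):
        expected = set(range(1, count + 1))  # assumption: IDs start at 1
        for token_id in sorted(seen - expected):
            problems.append(f"{name} token {token_id} has no entry in the .tkn line")
        for token_id in sorted(expected - seen):
            problems.append(f"{name} token {token_id} is not annotated")
    return problems
```

 Running check_segment over the three files of a document line by line, after
 confirming that they have the same number of lines, covers the segment-count
 check as well.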

8. Use of Word Alignment for Your MT Model

 In MT modeling, linguistic-rule-based alignment annotation may not always
 be the MT-preferred approach. To satisfy MT preferences, users can modify
 the annotation by detaching and re-attaching tagged morphemes to automatically
 derive MT-style annotations. For instance, a model may not favor the preposition
 "of" being attached and co-aligned to NP2 in the structure (NP1 (P NP2)), as
 annotated in the alignment data with the example "island (NP1) of Japan (NP2)".
 If the model favors "of" being attached to NP1 instead of NP2, then the
 annotation can be customized in two steps: 1) detach every preposition "of" from
 NP2 via its word tag and 2) re-attach "of" to NP1. Such automatic customization of
 the annotation is possible because all unaligned and attached words are tagged
 with an appropriate word tag.
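
 The two-step customization can be sketched in Python as follows. This is an
 illustration under stated assumptions, not a procedure defined by LDC: it
 assumes that an attached "of" carries the GLU tag on the English side, that
 token IDs are 1-based positions in the English .tkn line, and that the
 token immediately before "of" belongs to NP1. The function name
 reattach_left and the dictionary layout of a link are hypothetical; the
 annotation guidelines in the docs directory remain the authority on how
 attached words are tagged.

```python
from typing import Dict, List, Optional, Tuple

TokenRef = Tuple[int, Optional[str]]  # (token ID, tag or None)
Link = Dict[str, object]              # keys: "src", "tgt", "type"


def reattach_left(
    links: List[Link],
    tgt_tokens: List[str],
    word: str = "of",
    tag: str = "GLU",  # assumption: attached words carry this tag
) -> List[Link]:
    """Move each tagged `word` into the link that holds the preceding token."""
    # Work on copies so the original annotation is left untouched.
    new_links = [
        {"src": list(l["src"]), "tgt": list(l["tgt"]), "type": l["type"]}
        for l in links
    ]

    def link_containing(token_id: int) -> Optional[Link]:
        for candidate in new_links:
            if any(i == token_id for i, _ in candidate["tgt"]):
                return candidate
        return None

    for link in new_links:
        for ref in list(link["tgt"]):
            token_id, ref_tag = ref
            # Assumption: IDs are 1-based, so token_id - 1 indexes the list.
            if ref_tag != tag or tgt_tokens[token_id - 1].lower() != word:
                continue
            host = link_containing(token_id - 1)  # link of the preceding token
            if host is None or host is link:
                continue
            link["tgt"].remove(ref)   # step 1: detach from NP2
            host["tgt"].append(ref)   # step 2: re-attach to NP1
            host["tgt"].sort(key=lambda r: r[0])
    return new_links
```

 For the English tokens "island of Japan" with "of" (token 2) tagged GLU and
 linked together with "Japan" (token 3), the function moves token 2 into the
 link that contains "island" (token 1) and leaves all source sides unchanged.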

9. Acknowledgements

 This material is based upon work supported by the Defense Advanced Research
 Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does
 not necessarily reflect the position or the policy of the Government, and no
 official endorsement should be inferred.

10. Copyright Info

(c) 201, 2013, 2014, 2015 Trustees of the University of Pennsylvania.

11. Contact Information

 If you have questions about this release, please contact the following
 personnel at the LDC.

 Xuansong Li <xuansong@ldc.upenn.edu>
 Stephen Grimes <sgrimes@ldc.upenn.edu>
 Stephanie Strassel <strassel@ldc.upenn.edu>

--------------------------------------------------------------------------
README Created April 12, 2015 by Xuansong Li