BOLT Egyptian-English Word Alignment Training -- Discussion Forum

Linguistic Data Consortium

Authors: Xuansong Li, Katherine Peterson, Stephen Grimes, Stephanie Strassel

1. Introduction

The DARPA BOLT Program created new techniques for automated translation and linguistic analysis that can be applied to informal genres of text and speech common to online and in-person communications. LDC supports the BOLT Program by collecting informal data sources, including discussion forums, text messaging and chat, and conversational telephone speech in English, Chinese and Egyptian Arabic, and by applying annotations including translation, word alignment, Treebanking, PropBanking, co-reference and queries/responses. This release contains word alignment data.

In machine translation, word alignment is a crucial intermediate stage indicating corresponding word relations in parallel text. Word alignment data for this release is built on Treebank annotation. Tokens resulting from treebank annotation (including empty categories/traces) are extracted directly for word-level alignment.

The word alignment annotation in this release is linguistically oriented and supported by linguistic theories, aiming to reach a variety of users from NLP fields as well as other research and education fields. For MT research, the data is intended for all MT performers with varying MT models. The annotation data format is designed to allow MT users to flexibly tailor or customize the annotation for different uses. Please see section 8 of this document for how to use the data correctly for your model.

This release includes the corpus BOLT Egyptian-English Word Alignment Training -- Discussion Forum. This file contains documentation for this corpus.

2. Source Data Profile

2.1 Data Source and Selection

Data used for word alignment are discussion forum posts harvested online at LDC in 2012 for the BOLT project. Threads were collected based on the results of manual data scouting by native speaker annotators.
Scouts were instructed to seek content that is in Egyptian Arabic, original (written by the post's author rather than quoted), interactive, and informal.

The harvested data was further selected and sentence-segmented at LDC for translation by professional translation agencies. A manual selection procedure was used to choose data appropriate for translation. Selection criteria included linguistic features and topic features. After selection, the selected posts were segmented into sentence units (SU). Files were then assigned to professional translators for careful translation. Translators followed LDC's BOLT Translation guidelines. After translations were completed, bilingual LDC staff performed quality control by selecting a proportional sample from each delivery and scrutinizing it for mistakes.

Egyptian source tree tokens for word alignment were automatically extracted from tree files of the BOLT Egyptian Arabic Treebank annotation, which was done on the source discussion forum data harvested by LDC in 2012. The treebank annotation was performed in 2012 and 2013. It went through two stages: POS-tagging and syntactic annotation.

2.2 WA Annotation Data Profile

Language         Genre             Files  Words    Tree-tokens  Segments
---------------  ----------------  -----  -------  -----------  --------
Egyptian Arabic  discussion forum  724    400,448  593,723      31,454

Note: Word count is based on the untokenized Arabic source; token count is based on the ATB-tokenized Arabic source. The tree-token count includes empty categories.

3. Annotation

3.1 WA Annotation Task

Word alignment annotation consists of the following tasks:

- Identifying different types of links: translated (correct or incorrect) and not translated (correct or incorrect)
- Identifying sentence segments not suitable for annotation. Annotators may reject a segment if it is blank, incorrectly segmented, contains foreign-language text, or if the source and translation are in the same language.
- Tagging unmatched words which are attached to other words or phrases
- Tagging incorrect tokenizations missed during tokenization correction, using the TOK tag
- Aligning or tagging alternate translations, using the ALT tag

3.2 WA Annotation Guidelines

LDC's Egyptian Arabic word alignment guidelines were adapted from the previous GALE WA task specifications (GALE_Arabic_alignment_guidelines_v6.0.pdf). The guidelines used for this corpus are available in the docs directory of this release.

3.3 WA Annotation Process

- Annotator training to familiarize the Egyptian Arabic WA annotation team with the WA guidelines.
- Annotation to produce first pass WA annotation on Egyptian Arabic files.
- Second pass by senior annotators to review and correct first pass WA annotation.
- Quality control by lead annotator for annotation consistency on all files.
- Automatic and manual sanity checks to ensure file format consistency.

4. File Format Description

This release distributes two types of files: tokenized and WA (word alignment). The tokenized format originated from treebank data.

4.1 Egyptian Arabic (source) .tkn

The tokenized Egyptian Arabic source files contain one segment per line. The tokens are space-delimited and in UTF-8 encoding. The tree tokens result from treebank annotation, with empty categories (traces) added.

4.2 English (translation) .tkn

As with the Arabic tokenized files in this release, the English tokenized files contain space-delimited tokens with one segment per line. Tokenization was produced by running a tokenization script on the raw translation files.

4.3 WA .wa file

The format of the alignment file is similar to the GIZA++ word alignment format, but with some enhancements. Each line contains a list of space-delimited alignments for the corresponding sentence. Each alignment is in the following format:

    s-t(linktype)

where s and t are comma-delimited lists of source and translation token IDs, respectively.
Either s or t can be empty, indicating a token that is not translated.

Valid values for linktype are:

    COR    translated correct
    TIN    translated incorrect
    MTA    meta token: treebank trace or transcription/translation markup

Additionally, a token number may optionally be followed by a tag enclosed in square brackets. Possible tags are:

    GLU    "glued" token
    TYP    typo
    TOK    tokenization error
    MET    transcription/translation/treebank-trace markup
    ALT    alternate translations
    MRK    transcription/translation/treebank-trace markup present along with the token

Examples of valid alignments:

    2[TYP]-4,6(COR)       # Arabic token 2 (a typo) is aligned to English tokens 4 and 6. Correct.
    13[GLU],14-10(INC)    # Arabic tokens 13 (tagged as so-called glue) and 14 are aligned to English token 10. Incorrect.
    10-(COR)              # Arabic token 10 is not translated/has no English correspondent. Correct.
    -19[TYP](COR)         # English token 19 (a typo) is not translated/has no Arabic correspondent. Correct.
    5[MET]-(MTA)          # Arabic token 5 is a meta token.

Annotators had the option of not annotating a sentence. In these cases, the word "rejected" appears instead of word alignments. This typically happens when automatic sentence alignment failed: either one of the sentences was empty, or they were not translations of one another.

5. Data Directory Structure

- data/source/tokenized: tokenized Egyptian Arabic source data
- data/translation/tokenized: tokenized English translation data
- data/WA: word-level alignment

6. Documentation

- docs/filelist.txt: list of files showing the package structure
- docs/BOLT_Eyptian_Arabic_WA_Guidelines_v1.0.pdf: annotation guidelines
- docs/AEPC_parallel_aligned_TB.pdf: reference paper for WA
- docs/LREC2010_Arabic_parallel_aligned_TB.pdf: reference paper for WA
- docs/LREC2012_LDC_parallel_aligned_BOLT.pdf: reference paper for WA
- docs/LREC2012_wa_autoalign.pdf: reference paper for WA

7. Data Validation and Sanity Checks

A set of independent data validation and sanity checks has been performed by LDC's technical staff, in particular:

- Bilingual annotators checked several files by hand to ensure that their word alignment annotations were faithfully recorded in the output format.
- It was verified that all directories contain the same number of files and that the files have the same filename stems.
- It was verified that all files associated with a given document contain the same number of sentence segments.
- It was verified that all tokens for a given sentence were annotated and that those annotations appear in the .wa file.
- It was verified that all token numbers referenced in the .wa file have a corresponding token in the .tkn file.

8. Use of Word Alignment for Your MT Model

In MT modeling, linguistic-rule-based alignment annotation may not always be the MT-preferred approach. To satisfy MT preferences, users can modify the annotation by detaching and re-attaching tagged morphemes to automatically derive MT-style annotations. For instance, a model may not favor the preposition "of" being attached and co-aligned to NP2 in the structure (NP1 (P NP2)), as annotated in the alignment data with the example "island (NP1) of Japan (NP2)". If the model favors "of" being attached to NP1 instead of NP2, then the annotation can be customized in two steps: 1) detach every preposition "of" from NP2 via its word tag, and 2) re-attach "of" to NP1. Such automatic customization of the annotation is possible because all unaligned and attached words are tagged with an appropriate word tag.

9. Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

10. Copyright Info

(c) 201, 2013, 2014, 2015 Trustees of the University of Pennsylvania.

11. Contact Information

If you have questions about this release, please contact the following personnel at the LDC.

    Xuansong Li
    Stephen Grimes
    Stephanie Strassel

--------------------------------------------------------------------------
README Created April 12, 2015 by Xuansong Li
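
Note on reading .wa files (section 4.3)

The following is a minimal sketch, not part of the LDC tooling, of how one line of a .wa file could be parsed according to the format description in section 4.3. The names Alignment, parse_link and parse_wa_line are hypothetical and were chosen for illustration. The sketch assumes that each token ID carries at most one bracketed tag and that the link type is a run of letters; it does not check the link type against the list of valid values.

```python
import re
from typing import List, NamedTuple, Optional, Tuple

# A token reference: (token ID as written in the file, optional tag such as "TYP").
TokenRef = Tuple[int, Optional[str]]


class Alignment(NamedTuple):
    """One s-t(linktype) entry (hypothetical container, not an LDC type)."""
    source: List[TokenRef]   # Egyptian Arabic side; empty if nothing on that side
    target: List[TokenRef]   # English side; empty if nothing on that side
    link_type: str           # e.g. "COR" or "MTA"; not validated here


# Assumed shape of one entry: <source side>-<translation side>(<link type>)
_LINK_RE = re.compile(r"^([^-]*)-([^(]*)\(([A-Za-z]+)\)$")
# Assumed shape of one token reference: digits, optionally followed by one [TAG]
_TOKEN_RE = re.compile(r"^(\d+)(?:\[([A-Za-z]+)\])?$")


def _parse_side(text: str) -> List[TokenRef]:
    """Parse a comma-delimited list of token IDs, each with an optional tag."""
    if not text:
        return []
    refs = []
    for part in text.split(","):
        match = _TOKEN_RE.match(part)
        if match is None:
            raise ValueError(f"unrecognized token reference: {part!r}")
        refs.append((int(match.group(1)), match.group(2)))
    return refs


def parse_link(field: str) -> Alignment:
    """Parse a single space-delimited field such as '2[TYP]-4,6(COR)'."""
    match = _LINK_RE.match(field)
    if match is None:
        raise ValueError(f"unrecognized alignment: {field!r}")
    source, target, link_type = match.groups()
    return Alignment(_parse_side(source), _parse_side(target), link_type)


def parse_wa_line(line: str) -> Optional[List[Alignment]]:
    """Parse one line of a .wa file; return None for a rejected segment."""
    stripped = line.strip()
    if stripped == "rejected":
        return None
    return [parse_link(field) for field in stripped.split()]
```

Because line i of a .wa file corresponds to line i of the two .tkn files, the lines of the three files can be read in parallel and each .wa line passed to parse_wa_line, skipping segments for which it returns None.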