GALE Arabic-English Word Alignment Training Part 1 -- Newswire and Web

Authors: Xuansong Li, Stephen Grimes, Safa Ismael and Stephanie Strassel
Linguistic Data Consortium

1. Introduction

This file contains documentation for the corpus GALE Arabic-English Word
Alignment Training Part 1 -- Newswire and Web. The corpus includes word
aligned newswire and web data.

2. Source Data Profile

2.1 Data Source

The file names indicate the source from which the data were first
harvested: chiefly news agencies in the case of newswire data, and the
internet in the case of web data. Given the file name "AFP_ARB_20061104",
"AFP" is the abbreviation of the news agency "Agence France Presse",
"ARB" stands for the Arabic language, and "20061104" indicates the date:
11/04/2006.
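For illustration only, the following minimal Python sketch shows one way
the naming convention above could be parsed. It assumes exactly three
underscore-separated fields, as in the example file name; the function and
field names are our own and are not part of the release.

    # Illustrative sketch only (not part of the release): split a file
    # name stem such as "AFP_ARB_20061104" into the fields described in
    # Section 2.1. The field names here are assumptions for this example.
    from datetime import datetime

    def parse_file_name(stem):
        source, language, date_str = stem.split("_")
        return {
            "source": source,      # e.g. "AFP" (Agence France Presse)
            "language": language,  # e.g. "ARB" (Arabic)
            "date": datetime.strptime(date_str[:8], "%Y%m%d").date(),
        }

    # parse_file_name("AFP_ARB_20061104")
    # -> {'source': 'AFP', 'language': 'ARB',
    #     'date': datetime.date(2006, 11, 4)}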
2.2 Annotation Data Profile

Language   Genre   Files    Words   Tokens   Segments
------------------------------------------------------
Arabic     WB        119    59696    81620       4383
Arabic     NW        717   198621   263060       8423
------------------------------------------------------
Total                836   258317   344680      12806

Note: The word count is based on the untokenized Arabic source; the token
count is based on the tokenized Arabic source.

3. Annotation

3.1 Tokenization Correction Task

Arabic source tokens need to be corrected when they are incorrectly
tokenized by the MADA tokenization system (developed by Columbia
University). The correction annotation tasks include:

- Identifying incorrectly tokenized tokens
- Correcting tokens according to the tokenization correction guidelines

3.2 WA Annotation Task

Word alignment annotation consists of the following tasks:

- Identifying different types of links: translated (correct or incorrect)
  and not translated (correct or incorrect)
- Identifying sentence segments not suitable for annotation. Annotators
  may reject segments that are blank, incorrectly segmented, or contain
  foreign languages, or whose source and translation are in the same
  language.
- Tagging unmatched words which are attached to other words or phrases

3.3 WA Annotation Guidelines

3.3.1 Tokenization Correction Guidelines

For WA annotation on sentence-based data, an extra tokenization correction
process is needed to correct incorrectly tokenized tokens produced by the
MADA tokenization system. The Arabic source files were tokenized by MADA
into ATB-style unvocalized tokens. Based on the correction guidelines
provided by Columbia University, LDC compiled tokenization correction
guidelines for the tokenization correction annotation. The guidelines are
available in the docs directory of this release:

./docs/ArabicTokenizationGuidelinesV1.1.pdf

3.3.2 Word Alignment Guidelines

LDC's word alignment guidelines are adapted from previous task
specifications, including those used in the BLINKER project. No changes
have been made to the alignment guidelines since the last delivery. The
guidelines used for this corpus are available in the docs directory of
this release. They can also be accessed from:

http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Arabic_alignment_guidelines_v6.0.pdf

3.4 WA Annotation Process

- Annotator training to familiarize the Arabic WA annotation team with
  the WA guidelines
- Annotation to produce first pass WA annotation on Arabic files
- Second pass by senior annotators to review and correct first pass WA
  annotation
- Quality control by the lead annotator for WA annotation consistency on
  all files
- Automatic and manual sanity checks to ensure file format consistency

4. File Format Description

4.1 Overview

Files distributed in this release include three types - raw, tokenized,
and WA (word alignment). The raw format contains the original
Arabic/English sentences without any annotation.

4.2 Details

4.2.1 Arabic (source) .raw

Generally one sentence per line without markup. Text is encoded in UTF-8.

4.2.2 English (translation) .raw

One or more sentences per line without markup.

4.2.3 Arabic (source) .tkn

For parallel word alignment files, the tokenized Arabic source files
contain one segment per line. Tokenization was produced using MADA,
followed by manual annotator correction prior to word alignment. The
tokens are space-delimited and in UTF-8 encoding. There is no explicit
token numbering; the .wa files reference implicit token numbers starting
from 1, 2, etc.

4.2.4 English (translation) .tkn

As with the Arabic tokenized files in this release, the English tokenized
files have a different structure depending on which portion of the release
they belong to. For files in the parallel word alignment portion of the
release, the tokens are simply space-delimited.

4.2.5 WA .wa file

Each line contains a list of space-delimited alignments for the
corresponding sentence. Each alignment is in the following format:

s-t(linktype)

where s and t are comma-delimited lists of source and translation token
IDs, respectively. s or t can be empty, indicating a not-translated
token. Valid values for linktype are:

COR translated correct
TIN translated incorrect
MTA meta token: transcription/translation markup

Additionally, a token number may optionally be followed by a tag enclosed
in square brackets. Possible tags are:

GLU "glued" token
TYP typo
TOK tokenization error
MET meta data: transcription/translation markup
MRK similar to MET, but the markup is attached to a content token

Examples of valid alignments:

2[TYP]-4,6(COR)    # Arabic token 2 (a typo) is aligned to English
                     tokens 4 and 6. Correct.
13[GLU],14-10(TIN) # Arabic tokens 13 (tagged as a so-called glue token)
                     and 14 are aligned to English token 10. Incorrect.
10-(COR)           # Arabic token 10 is not translated/has no English
                     correspondent. Correct.
-19[TYP](COR)      # English token 19 (a typo) is not translated/has no
                     Arabic correspondent. Correct.
5[MET]-(MTA)       # Arabic token 5 is a meta token.

Annotators had the option of not annotating a sentence. In these cases,
the word "rejected" appears instead of word alignments. This typically
happens when automatic sentence alignment failed -- either one of the
sentences was empty, or they were not translations of one another.
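For readers writing their own loaders, the following minimal Python
sketch (illustrative only, not part of the release) parses a single
alignment in the s-t(linktype) notation above into token IDs, optional
tags, and the link type. The function and field names are our own
assumptions.

    # Illustrative sketch only: parse one alignment from a .wa line,
    # e.g. "2[TYP]-4,6(COR)", "13[GLU],14-10(TIN)", or "-19[TYP](COR)".
    # Function and field names are assumptions, not part of the release.
    import re

    TOKEN = re.compile(r"(\d+)(?:\[([A-Z]+)\])?$")  # id + optional [TAG]

    def parse_token_list(side):
        # An empty side means no token on that side (not translated).
        if not side:
            return []
        tokens = []
        for item in side.split(","):
            m = TOKEN.match(item)
            if not m:
                raise ValueError("bad token: %r" % item)
            tokens.append((int(m.group(1)), m.group(2)))  # (id, tag/None)
        return tokens

    def parse_alignment(alignment):
        body, linktype = alignment[:-1].rsplit("(", 1)  # drop closing ")"
        source, translation = body.split("-", 1)
        return {
            "source": parse_token_list(source),            # e.g. [(2, 'TYP')]
            "translation": parse_token_list(translation),  # e.g. [(4, None), (6, None)]
            "linktype": linktype,                          # COR, TIN, or MTA
        }

    # A .wa line is a space-delimited list of such alignments, or the
    # literal word "rejected" for unannotated segments:
    # if line.strip() != "rejected":
    #     alignments = [parse_alignment(a) for a in line.split()]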
5. Data Directory Structure

- data/parallel_word_aligned/{nw,wb}/source/{raw,tokenized}: raw
  (untokenized) and tokenized Arabic source data
- data/parallel_word_aligned/{nw,wb}/translation/{raw,tokenized}: raw
  (untokenized) and tokenized English translation data
- data/parallel_word_aligned/{nw,wb}/WA: word aligned and tagged data

6. Documentation

- docs/files.sha1: SHA-1 checksums for data files
- docs/README.txt: a general documentation file about this release
- docs/ArabicTokenizationGuidelinesV1.1.pdf: tokenization correction
  guidelines
- docs/GALE_Arabic_alignment_guidelines_v6.0.pdf: Arabic-English
  alignment guidelines

7. Data Validation

7.1 Data Consistency

The following data consistency checks were performed:

- Bilingual annotators checked several files by hand to ensure that their
  word alignment annotations were faithfully recorded in the output
  format.
- It was verified that all files associated with a given document contain
  the same number of sentence segments.
- It was verified that all tokens for a given sentence were annotated and
  that those annotations appear in the .wa file.
- It was verified that all token numbers referenced in the .wa file have
  a corresponding token in the .tkn file.
- For treebank data, it was verified that the syntax trees are
  well-formed and that each token has a part-of-speech tag.

7.2 Sanity Checks

A set of independent sanity checks was performed by a technical staff
member of LDC.

8. Acknowledgements

This work was supported in part by the Defense Advanced Research Projects
Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this
publication does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

Special thanks to annotators: Khalda Ahmed, Nahed Gayed, Nancy Gayed and
Manal Gobran.

9. Contact Information

If you have questions about this data release, please contact the
following personnel at LDC:

Project manager:    Xuansong Li
Technical lead:     Stephen Grimes
Lead annotator:     Safa Ismael
Project consultant: Stephanie Strassel

--------------------------------------------------------------------------
README created April 8, 2011 by Xuansong Li
README updated April 15, 2011 by Stephen Grimes
README updated June 29, 2011 by Xuansong Li