GALE Arabic-English Parallel Aligned Treebank -- Newswire Training

Authors: Xuansong Li, Stephen Grimes, Safa Ismael, Dalal Zakhary,
         Stephanie Strassel, Mohamed Maamouri, Ann Bies

Linguistic Data Consortium

1. Introduction

This file contains documentation for the GALE Arabic-English Parallel
Aligned Treebank Newswire Training release. Data were sourced from Arabic
newswire sources and translated to English. Arabic and English Treebank
annotations were performed independently, and finally the parallel texts
were word aligned to create this release. These data match Arabic
treebanked data appearing in ATB3.

2. Source Data Profile

2.1 Data Selection

Newswire and broadcast news files in this release were selected for
annotation from among the superset of data that had been previously
treebank annotated for both Arabic and English (translated from Arabic).
During data selection, files with mismatched source and translation
segments were excluded. Files with formatting problems or atypical
newswire content were also excluded.

2.2 Data Source

The newswire data are from An Nahar, 2002.

2.3 Annotation Data Profile

Language   Genre   Files   Words    Tokens   Segments
------------------------------------------------------
Arabic     NW      364     182351   267520   7711

Note: Word count is based on untokenized Arabic source; token count is
based on ATB-tokenized Arabic source.

3. Annotation

3.1 WA Annotation Task

Word alignment annotation consists of the following tasks:

- Identifying different types of links: translated (correct or incorrect)
  and not translated (correct or incorrect)
- Identifying sentence segments not suitable for annotation. Annotators
  may reject a segment if it is blank, is incorrectly segmented, contains
  foreign languages, or if the source and translation are in the same
  language.
- Tagging unmatched words which are attached to other words or phrases

3.2 WA Annotation Guidelines

LDC's word alignment guidelines are adapted from previous task
specifications including those used in the BLINKER project. Unaligned
words or phrases that have no locally-related constituent to attach to
are "aligned" as not-translated. Words or phrases that do have a
locally-related constituent to attach to are tagged as "GLU", which shows
local word relations among dependency constituents.

3.3 WA Annotation Process

This corpus was annotated using the following process:

- Annotator familiarization with word alignment guidelines.
- First pass word alignment annotation.
- Second pass annotation by senior annotators to review and correct first
  pass annotation.
- Quality control by a lead annotator to check for annotation consistency
  on all files.
- Automatic and manual sanity checks to ensure file format consistency.

4. File Format Description

4.1 Overview

This release contains four types of files: raw, tokenized, treebank, and
"wa". The "raw" format contains the original Arabic/English sentences
without any annotation. The "tokenized" format includes the tokenized
version of the raw data. These tokens are determined through treebank
annotation and may contain Empty Category tokens, which are discussed
below. The "treebank" and "wa" files are treebank and word alignment
annotations on the "tokenized" files.

Seven files are associated with each document, namely Arabic/English raw,
Arabic/English tokenized, Arabic/English treebank, and a WA file. All
seven files associated with a given document name have the same number of
lines; that is, annotations of a specific sentence segment share the same
line number across all seven files.
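Because a segment occupies the same line number in every file of a document, the files can be read side by side. The following is a minimal sketch of that idea, not part of the release tooling; the function names are hypothetical, and the caller is assumed to supply the paths of the files belonging to one document.

```python
def line_counts(paths):
    """Return a dict mapping each path to the number of lines it contains."""
    counts = {}
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            counts[str(path)] = sum(1 for _ in fh)
    return counts


def same_line_count(paths):
    """True if every file in `paths` has the same number of lines."""
    return len(set(line_counts(paths).values())) <= 1


def iter_segments(paths):
    """Yield one tuple per segment, holding that segment's line from each file.

    Raises ValueError if the files disagree on the number of lines, since
    line-number correspondence would then be unreliable.
    """
    columns = []
    for path in paths:
        with open(path, encoding="utf-8") as fh:
            columns.append([line.rstrip("\n") for line in fh])
    if len({len(col) for col in columns}) > 1:
        raise ValueError("files do not have the same number of lines")
    return zip(*columns)
```

Calling iter_segments with the seven paths of one document would give, for each segment, its raw, tokenized, treebank, and WA lines together.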
- raw: Arabic/English sentences
- tokenized: Arabic/English tokenized sentences
  - the tokens were taken directly from ATB and EATB without modification
  - each token has an ID which ATB, EATB and WA files can refer to
  - ATB, EATB and WA files do NOT contain actual tokens, but only the
    token IDs referencing tokens in the tokenized version
- treebank: Arabic/English treebank files
  - treebank files in this release appear in bracketed Penn Treebank
    format
  - each token has a POS tag and all higher-level nodes are labeled
  - because each Arabic token has multiple forms, the trees contain token
    numbers instead of strings; the vocalized, unvocalized, and input
    string versions can be obtained by finding the token ID in the
    tokenized file
- wa: word alignment file
  - word alignment format is described in detail below

4.2 Details

4.2.1 Arabic (source) .raw

Generally one sentence per line without markup. Text is encoded in utf-8.

4.2.2 English (translation) .raw

Generally one sentence per line, but a line may contain more than one
sentence where the translator chose to introduce additional sentence
boundaries. No markup. Text is encoded in utf-8.

4.2.3 Arabic (source) .tkn

Each line contains a space delimited list of tokens corresponding to a
line in the .raw file. Each token entry contains 8 fields separated by the
semicolon character, ";". Because semicolon was used as field delimiter,
any semicolons in the text appear as "-SC-" in this file.
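The delimiter scheme just described can be unpacked with two splits. This is a minimal sketch under stated assumptions: parse_tkn_line is a hypothetical name, no field is assumed to contain a space, and the sample line is made up for illustration rather than taken from the corpus.

```python
def parse_tkn_line(line, n_fields=8):
    """Split one .tkn line into token entries, each a list of field strings.

    Entries are separated by whitespace and fields by ";". The "-SC-"
    placeholder is turned back into a literal semicolon after splitting.
    """
    tokens = []
    for entry in line.split():
        fields = entry.split(";")
        if len(fields) != n_fields:
            raise ValueError(
                "expected %d fields, got %d in %r" % (n_fields, len(fields), entry)
            )
        tokens.append([field.replace("-SC-", ";") for field in fields])
    return tokens


# Made-up line with two entries; the second token is a semicolon.
sample = "1;0;3;abc;abc;abc;abc;abc 2;3;4;-SC-;-SC-;-SC-;-SC-;-SC-"
parsed = parse_tkn_line(sample)
```

The n_fields parameter is there so the same function could be applied to files with a different field count.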
The 8 fields are as follows:

- TokenID: integer sequentially numbered from 1
- Start: start character offset into .raw file
- End: end character offset into .raw file
- VOC_STRING: the Arabic utf-8 of the vocalized form; this is the form of
  the token annotators saw during WA annotation
- VOCALIZED: the vocalized Buckwalter form of the word, taken from the
  solution
- IS_TRANS: Buckwalter transliteration of INPUT_STRING
- UNVOCALIZED: the unvocalized Buckwalter form of the word
- INPUT_STRING: utf-8 characters from original .raw file

Empty Category tokens:

Certain treebank tree leaves have the POS label -NONE-; these are Empty
Category tokens. These correspond to positions in the syntax tree but have
no string equivalent in the input. We give equal start and end character
offsets (see below) for these tokens. Their entries for VOC_STRING,
VOCALIZED, IS_TRANS, UNVOCALIZED, and INPUT_STRING are all identical.

Known Empty Category tokens for Arabic are:

  *       # Pro-drop subjects and passive traces
  *0*     # Null complementizer or zero WH- pronoun
  *ICH*   # Rightward movement (for the most part, also *RNR*, etc.)
  *RNR*   # Right node raising
  *T*     # WH-traces or any topicalization

Empty category tokens are always explicitly marked as not translated
(correct).

Character Offsets:

The start and end character offsets are structured as follows. The offsets
refer to between-character positions. Hence for any two consecutive
tokens, it is the case that the End offset of the preceding token is equal
to the Start offset of the following token. The first start offset is
numbered 0. For "null" tokens such as syntactic traces, we have adopted
the policy of also giving these tokens offsets in which Start=End; this
zero-width pointer references a between-token position in the source text.

4.2.4 English (translation) .tkn

The English .tkn is similar in structure to the Arabic .tkn file but there
are four semicolon-delimited fields instead of eight.
Any tokens originally containing semicolons have -SC- appearing instead of
the semicolon.

The four fields are as follows:

- TokenID: integer ID of token, sequentially numbered from 1
- Start: start character offset into .raw file
- End: end character offset into .raw file
- Token: the representation of the token from the original EATB annotated
  tree

See 4.2.3 for information on Empty Category tokens. Known Empty Category
tokens for English are:

  *  *0*  *?*  *EXP*  *ICH*  *NOT*  *RNR*  *T*  *U*

4.2.5 Arabic (ATB) .tree

Trees are represented in Penn Treebank format (labeled brackets). Trees
are taken from treebank releases but strings in tree leaves were replaced
by token IDs corresponding to the numbers in the tokenized (.tkn) file.
Most lines have one tree, but some may contain more than one tree
separated by whitespace.

4.2.6 English (EATB) .tree

Annotations were performed on an updated, unreleased version of the
English Arabic Translation Treebank (EATB). The updates were requested by
GALE sites and largely pertain to handling of hyphenation issues. The
final updated version of the EATB corpus will be released in the coming
months upon completion of the site-requested revisions.

Trees are represented in Penn Treebank format (using labeled brackets).
The trees are similar to the EATB release, except here tree leaves were
replaced by token IDs corresponding to the token numbers in the tokenized
(.tkn) file. Most lines have one tree, but some may contain more than one
tree separated by whitespace. Multiple trees arise when one Arabic
sentence/tree was translated into or corresponds to multiple English
sentences/trees.

4.2.7 WA .wa file

Each line contains a list of space delimited alignments for the
corresponding sentence. Each alignment is in the following format:

  s-t(linktype)

where s and t are lists of comma delimited source (Arabic) and translation
(English) token IDs respectively. s or t can be empty, indicating a
not-translated token.
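The s-t(linktype) format lends itself to a small regular-expression parser. The following is a minimal sketch, not LDC tooling; the function names and the returned tuple layout are assumptions made for illustration. It also handles the optional bracketed tags and the "rejected" marker that are described in the rest of this section.

```python
import re

# One alignment: source side, "-", translation side, "(LINKTYPE)".
ALIGNMENT = re.compile(r"^([^-()]*)-([^-()]*)\((\w+)\)$")
# One token reference: an integer ID with an optional [TAG].
TOKEN = re.compile(r"^(\d+)(?:\[(\w+)\])?$")


def parse_side(text):
    """Parse a comma delimited ID list into (token_id, tag_or_None) pairs."""
    if not text:
        return []  # empty side: the other side is not translated
    pairs = []
    for item in text.split(","):
        match = TOKEN.match(item)
        if match is None:
            raise ValueError("bad token reference: %r" % item)
        pairs.append((int(match.group(1)), match.group(2)))
    return pairs


def parse_wa_line(line):
    """Return a list of (source, translation, linktype), or None if rejected."""
    line = line.strip()
    if line == "rejected":
        return None
    alignments = []
    for chunk in line.split():
        match = ALIGNMENT.match(chunk)
        if match is None:
            raise ValueError("bad alignment: %r" % chunk)
        source, translation, linktype = match.groups()
        alignments.append((parse_side(source), parse_side(translation), linktype))
    return alignments
```

Returning None for a rejected segment keeps it distinct from a segment that simply has no alignments, which would parse to an empty list.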
Linktype is either COR or INC, indicating if a translation is correct or
incorrect. Additionally, tokens may have a tag enclosed in square
brackets. Possible tags are:

  GLU   ("glued" token)
  TYP   (typo)

Examples of valid alignments:

  2[TYP]-4,6(COR)      # Arabic token 2 (a typo) is aligned to English
                       # tokens 4 and 6. Correct.
  13[GLU],14-10(INC)   # Arabic tokens 13 (tagged as so-called "glue") and
                       # 14 are aligned to English token 10. Incorrect.
  10-(COR)             # Arabic token 10 is not translated/has no English
                       # correspondent. Correct.
  -19[TYP](COR)        # English token 19 (a typo) is not translated/has
                       # no Arabic correspondent. Correct.

Annotators had the option of not annotating a sentence. In these cases,
the word "rejected" appears on the given line. This typically happens when
the sentence material was empty, quite short, contained mostly
punctuation, or the sentences were translations of one another.

5. Data Directory Structure

- data/parallel_word_aligned_treebank/nw/source/{raw,tokenized}:
  raw (un-tokenized) and tokenized Arabic source data
- data/parallel_word_aligned_treebank/nw/translation/{raw,tokenized}:
  raw (un-tokenized) and tokenized English translation data
- data/parallel_word_aligned_treebank/nw/tree/{ATB,EATB}:
  Arabic and English treebank-annotated data
- data/parallel_word_aligned_treebank/nw/WA:
  word-level alignments

6. Documentation

- docs/file.sha1: sha1 checksums for data files
- docs/GALE_Arabic_alignment_guidelines_v6.0.pdf: annotation guidelines
- docs/README.txt: this file; general documentation for this release

7. Data Validation

The following data consistency checks were performed:

- Bilingual annotators checked several files by hand to ensure that their
  word alignment annotations were faithfully recorded in the output
  format.
- It was verified that all files associated with a given document contain
  the same number of sentence segments.
- It was verified that all tokens for a given sentence were annotated and
  that those annotations appear in the .wa file.
- It was verified that all token numbers referenced in a given .wa file
  have correspondents in the .tkn files.
- It was verified that all token numbers referenced in the .tree files
  have correspondents in the .tkn files.

8. Copyright Information

Portions (c) 2002 An Nahar, (c) 2002, 2011 Trustees of the University of
Pennsylvania

9. Acknowledgements

This work was supported in part by the Defense Advanced Research Projects
Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this
publication does not necessarily reflect the position or the policy of the
Government, and no official endorsement should be inferred.

Special thanks to: Seth Kulick, Wajdi Zaghouani, Justin Mott, Mike Ciul

Special thanks to: Khalda Ahmed, Nahed Gayed, Nancy Gayed, and Manal
Gobran

10. Contact Information

If you have questions about this data release, please contact the
following personnel at the LDC.

Project manager:     Xuansong Li
Technical lead:      Stephen Grimes
Lead annotator:      Safa Ismael
Lead annotator:      Dalal Zakhary
Project consultant:  Stephanie Strassel

-------------------------------------------------------------------
README created Feb 9 2011 by Stephen Grimes
README updated Mar 28 2011 by Xuansong Li
README updated Jun 8 2011 by Stephen Grimes
README updated Jun 27 2011 by Stephen Grimes