GALE Arabic-English Parallel Aligned Treebank -- Broadcast News Training Part II

Authors: Xuansong Li, Stephen Grimes, Safa Ismael, Stephanie Strassel,
         Mohamed Maamouri, Ann Bies

Linguistic Data Consortium

1. Introduction

This file contains documentation for the GALE Arabic-English Parallel
Aligned Treebank -- Broadcast News Training Part 2 release. Data were
sourced from Arabic broadcast news sources and translated into English.
Arabic and English Treebank annotations were performed independently,
and the parallel texts were then word aligned to create this release.
These data match Arabic treebanked data appearing in parts of ATB7,
ATB8, and ATB10 and in the EATB releases.

2. Source Data Profile

2.1 Data Selection

During data selection, files with mismatched source and translation
segments were excluded. Files with formatting problems, or with a style
atypical of newswire or broadcast news, were avoided.

2.2 Data Source

2007-2008 Abu Dhabi TV, 2008 Al Baghdadya TV, 2008 Al Fayha,
2008 Al Iraqiyah, 2007 Aljazeera, 2007 Al Ordiniyah, 2008 Al Sharqiya,
2008 Dubai TV, 2008 Oman TV, 2008 Saudi TV

2.3 Annotation Data Profile

Language  Genre  Files   Words   Tokens  Segments
------------------------------------------------
Arabic    BN        31  110690   141058      7102

Note: Word count is based on the untokenized Arabic source; token count
is based on the ATB-tokenized Arabic source.

3. Annotation

3.1 WA Annotation Task

Word alignment annotation consists of the following tasks:

- Identifying different types of links: translated (correct or
  incorrect) and not translated (correct or incorrect)
- Identifying sentence segments not suitable for annotation. Annotators
  may reject a segment that is blank, is incorrectly segmented, contains
  foreign-language text, or whose source and translation are in the
  same language.
- Tagging unmatched words that are attached to other words or phrases

3.2 WA Annotation Guidelines

LDC's word alignment guidelines are adapted from previous task
specifications, including those used in the BLINKER project. The updated
guidelines used for this corpus are available in the docs directory of
this release. The guidelines can also be accessed from:

http://projects.ldc.upenn.edu/gale/task_specifications/GALE_Arabic_alignment_guidelines_v6.0.pdf

Arabic guidelines changes in this release:

- A vocative particle with no counterpart is left as not translated
  (correct).
- The word "and" is linked to the comma.
- In response to a site request, all unaligned/unmatched words are
  tagged. Unaligned words or phrases that have no locally related
  constituent to attach to are tagged as not translated (correct or
  incorrect). Unaligned words or phrases that do have a locally related
  constituent to attach to are tagged as "GLU", which shows local word
  relations among dependency constituents. This is represented by an
  asterisk (*) in the guidelines:
    - English subject pronouns omitted in Arabic are unmatched and
      tagged as "GLU".
    - An unmatched verb "to be" is tagged as "GLU" for an Arabic nominal
      sentence.
    - Unmatched pronouns and relative nouns, when linked to their
      referents, are tagged as "GLU".
    - Unmatched possessives ('s and '), when glued to their owner, are
      tagged as "GLU".
    - When one side has a preposition with no counterpart on the other
      side, the extra preposition is glued to its object and tagged as
      "GLU".
    - When one language has two or more prepositions where the other
      has only one, the unmatched prepositions are tagged as "GLU". The
      same applies to pronouns, except for relative pronouns.

3.3 WA Annotation Process

This corpus was annotated using the following process:

- Annotator training to familiarize the Arabic WA annotation team with
  the guidelines.
- First-pass annotation of the Arabic files.
- Second pass by senior annotators to review and correct the first-pass
  annotation.
- Quality control by the lead annotator for annotation consistency on
  all files.
- Automatic and manual sanity checks to ensure file format consistency.

4. File Format Description

4.1 Overview

This release distributes four types of files: raw, tokenized, treebank,
and WA (word alignment).

The aligned parallel treebank portion of the release contains seven
files for each document. The parallel word aligned portion of the
release contains five files for each document (no treebank files), and
the format of its tokenized files differs from that found in the aligned
parallel treebank portion (see below).

4.2 Details

4.2.1 Arabic (source) .raw

Generally one sentence per line, without markup. Text is encoded in
UTF-8.

4.2.2 English (translation) .raw

One or more sentences per line, without markup. Raw English files for
the parallel aligned treebank portion were reduced by the EATB team from
UTF-8 to ASCII. The reduced ASCII files are included here to preserve
the accuracy of the begin and end character offsets provided by EATB and
found in the .tkn files.

4.2.3 Arabic (source) .tkn

For parallel word alignment (non-treebanked) files, the tokenized Arabic
source files contain one segment per line. The tokens are
space-delimited and encoded in UTF-8.

The tokenized files for the parallel aligned treebank portion of the
release contain more structure: each space-delimited token entry
contains six semicolon-delimited fields. Because the semicolon is used
as the field delimiter, any semicolon in the text appears as "-SC-" in
this file.
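As an illustration only, the entry structure described above could be
read with a few lines of Python. This is a minimal sketch, not LDC
tooling: the function name parse_tkn_line is hypothetical, the sample
values are invented placeholders rather than corpus data, and the code
assumes that "-SC-" never occurs in a token for any reason other than
standing in for a semicolon.

```python
def parse_tkn_line(line):
    """Split one treebank-portion .tkn line into token entries.

    Each entry is returned as a list of field strings, in the order
    in which the fields appear in the file.
    """
    entries = []
    for entry in line.split():       # entries are space-delimited
        fields = entry.split(";")    # fields are semicolon-delimited
        # Restore literal semicolons that were written as "-SC-".
        fields = [field.replace("-SC-", ";") for field in fields]
        entries.append(fields)
    return entries


# Invented placeholder line with two six-field entries.
sample = "1;0;3;abc;abc;abc 2;4;5;-SC-;-SC-;-SC-"
for fields in parse_tkn_line(sample):
    print(fields)
```

Fields are kept as strings so that the same function works for entries
with a different number of fields; callers can convert the numeric
fields to integers as needed.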
The six fields are as follows:

- TokenID: integer, sequentially numbered from 1
- Start: start character offset into .raw file
- End: end character offset into .raw file
- Vocalized token (Buckwalter)
- Input string (UTF-8)
- Unvocalized (Buckwalter)

Treebank trace tokens:

Treebank tree leaves having the POS label -NONE- correspond to trace
positions in the syntactic tree and contain the "*" character. These
tokens have no equivalent in the .raw file, so their start and end
positions are both set to the end offset of the previous token. For
these tokens, the Vocalized, Input String, and Unvocalized forms are all
identical.

Known empty tokens for Arabic are:

*       # Pro-drop subjects and passive traces
*0*     # Null complementizer or zero WH- pronoun
*ICH*   # Rightward movement (for the most part, also *RNR*, etc.)
*RNR*   # Right node raising
*T*     # WH-traces or any topicalization

For Arabic, we continue the practice of marking empty tokens as not
translated (correct).

4.2.4 English (translation) .tkn

As with the Arabic tokenized files in this release, the English
tokenized files have a different structure depending on which portion of
the release they belong to.

For files in the parallel word alignment (non-treebank) portion of the
release, the tokens are simply space-delimited. Tokenization was
produced by using MADA followed by manual annotator correction prior to
word alignment.

For files in the parallel aligned treebank portion of the release, each
line contains a space-delimited list of tokens, and each token contains
four semicolon-delimited fields. Any token originally containing a
semicolon has "-SC-" in place of the semicolon.

The four fields are as follows:

- TokenID: integer ID of token, sequentially numbered from 1
- Start: start character offset into .raw file
- End: end character offset into .raw file
- Token: the token string from the EATB annotated tree

Treebank trace tokens:

Treebank trace tokens have the POS label -NONE-. These tokens have no
corresponding material in the input file. We give equal start and end
positions for these tokens.

Known empty tokens are:

*  *0*  *?*  *EXP*  *ICH*  *NOT*  *RNR*  *T*  *U*

4.2.5 Arabic (ATB) .tree

Trees are represented in Penn Treebank format (labeled brackets). The
tree leaves contain token IDs corresponding to the numbers in the
tokenized (.tkn) file. Most lines have one tree, but some may have more.

4.2.6 English (EATB) .tree

Trees are represented in Penn Treebank format (labeled brackets). The
tree leaves contain token IDs corresponding to the numbers in the
tokenized (.tkn) file. Most lines have one tree, but some may have more.
Multiple trees can result from a translator's decision to break an
Arabic sentence into multiple English sentences.

4.2.7 WA .wa file

The format of the alignment file is similar to the GIZA++ word alignment
format, with some enhancements. Each line contains a list of
space-delimited alignments for the corresponding sentence. Each
alignment has the following format:

s-t(linktype)

where s and t are comma-delimited lists of source and translation token
IDs, respectively. Either s or t can be empty, indicating a token that
is not translated.

Valid values for linktype are:

COR   translated correct
TIN   translated incorrect
MTA   meta token: treebank trace or transcription/translation markup

Additionally, a token number may optionally be followed by a tag
enclosed in square brackets. Possible tags are:

GLU   "glued" token
TYP   typo
TOK   tokenization error
MET   metadata: transcription/translation markup or treebank empty token
MRK   similar to MET, but the markup is attached to a content token

Examples of valid alignments:

2[TYP]-4,6(COR)     # Arabic token 2 (a typo) is aligned to English
                    # tokens 4 and 6. Correct.
13[GLU],14-10(INC)  # Arabic tokens 13 (tagged as so-called glue) and 14
                    # are aligned to English token 10. Incorrect.
10-(COR)            # Arabic token 10 is not translated/has no English
                    # correspondent. Correct.
-19[TYP](COR)       # English token 19 (a typo) is not translated/has no
                    # Arabic correspondent. Correct.
5[MET]-(MTA)        # Arabic token 5 is a meta token.

Annotators had the option of not annotating a sentence. In these cases,
the word "rejected" appears instead of word alignments. This typically
happens when automatic sentence alignment failed: either one of the
sentences was empty, or the two sentences were not translations of one
another.

5. Data Directory Structure

- data/parallel_word_aligned_treebank/bn/source/{raw,tokenized}:
  raw (untokenized) and tokenized Arabic source data
- data/parallel_word_aligned_treebank/bn/translation/{raw,tokenized}:
  raw (untokenized) and tokenized English translation data
- data/parallel_word_aligned_treebank/bn/tree/{ATB,EATB}:
  Arabic and English Treebank annotated data
- data/parallel_word_aligned_treebank/bn/WA:
  word-level aligned data

6. Documentation

- docs/GALE_Arabic_alignment_guidelines_v6.0.pdf: annotator guidelines
  for word alignment
- docs/files.sha1: the SHA-1 checksums for data files in the package
- docs/README.txt: this file

7. Data Validation

The following data consistency checks were performed:

- Bilingual annotators checked several files by hand to ensure that
  their word alignment annotations were faithfully recorded in the
  output format.
- It was verified that all files associated with a given document
  contain the same number of sentence segments.
- It was verified that all tokens for a given sentence were annotated
  and that those annotations appear in the .wa file.
- It was verified that all token numbers referenced in the .wa file have
  a corresponding token in the .tkn file.
- For treebank data, it was verified that the syntax trees are
  well-formed and that each token has a part-of-speech tag.

8. Copyright Information

Portions (c) 2007-2008 Abu Dhabi TV, (c) 2008 Al Baghdadya TV,
(c) 2008 Al Fayha, (c) 2008 Al Iraqiyah, (c) 2007 Aljazeera,
(c) 2007 Al Ordiniyah, (c) 2008 Al Sharqiya, (c) 2008 Dubai TV,
(c) 2008 Oman TV, (c) 2008 Saudi TV,
(c) 2011 Trustees of the University of Pennsylvania

9. Acknowledgements

This work was supported in part by the Defense Advanced Research
Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of
this publication does not necessarily reflect the position or the policy
of the Government, and no official endorsement should be inferred.

Special thanks to: Seth Kulick, Wajdi Zaghouani, Justin Mott, Mike Ciul

Special thanks to: Khalda Ahmed, Nahed Gayed, Nancy Gayed, Manal Gobran

10. Contact Information

If you have questions about this data release, please contact the
following personnel at the LDC.

Project manager:     Xuansong Li
Technical lead:      Stephen Grimes
Lead annotator:      Safa Ismael
Project consultant:  Stephanie Strassel

--------------------------------------------------------------------------
README Created Feb 8, 2011 Stephen Grimes
README Updated Mar 28, 2011 Xuansong Li
README Updated Jun 8, 2011 Stephen Grimes
README Updated June 29, 2011 Xuansong Li
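
--------------------------------------------------------------------------
Example: reading a .wa line (illustrative only)

The s-t(linktype) format described in Section 4.2.7 might be parsed
along the following lines. This is a sketch under stated assumptions,
not a tool shipped with the release: the names parse_wa_line and
parse_side are hypothetical, each token ID is assumed to carry at most
one bracketed tag, and link types are accepted as any run of uppercase
letters rather than checked against a fixed list. The sample line
simply joins the example alignments from Section 4.2.7.

```python
import re

# One alignment: source side, "-", translation side, "(LINKTYPE)".
ALIGNMENT_RE = re.compile(r"^([^-()]*)-([^-()]*)\(([A-Z]+)\)$")
# One token reference: an integer ID, optionally followed by "[TAG]".
TOKEN_RE = re.compile(r"^(\d+)(?:\[([A-Z]+)\])?$")


def parse_side(side):
    """Return a list of (token_id, tag_or_None) for one side of a link."""
    tokens = []
    if side:  # an empty side means "not translated"
        for item in side.split(","):
            match = TOKEN_RE.match(item)
            if match is None:
                raise ValueError("unrecognized token reference: %r" % item)
            tokens.append((int(match.group(1)), match.group(2)))
    return tokens


def parse_wa_line(line):
    """Parse one .wa line; return None for a rejected sentence."""
    line = line.strip()
    if line == "rejected":
        return None
    links = []
    for item in line.split():  # alignments are space-delimited
        match = ALIGNMENT_RE.match(item)
        if match is None:
            raise ValueError("unrecognized alignment: %r" % item)
        source = parse_side(match.group(1))
        translation = parse_side(match.group(2))
        links.append((source, translation, match.group(3)))
    return links


sample = "2[TYP]-4,6(COR) 13[GLU],14-10(INC) 10-(COR) -19[TYP](COR) 5[MET]-(MTA)"
for source, translation, linktype in parse_wa_line(sample):
    print(source, translation, linktype)
```

Returning None for a rejected sentence keeps it distinct from a line
with no alignments, which yields an empty list.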