English Translation Treebank: An-Nahar Newswire

CatalogID: LDC2012T02
Release date: March 15, 2012
Linguistic Data Consortium
Authors: Ann Bies, Justin Mott, Colin Warner, Seth Kulick

1. Introduction

This publication of English Translation Treebank consists of 461,489 tokens in 599 files of individual An-Nahar newswire stories translated from Arabic to English and annotated for part-of-speech and syntactic structure.

This data is consistent with the most recent English Treebank Annotation Guidelines (the guidelines used for GALE English Treebanks, both LDC English Translation Treebanks and OntoNotes).

This corpus is part of an ongoing effort to produce parallel Arabic and English Treebanks at LDC. The files in this publication are parallel with the files that have the same file IDs in the Arabic Treebank part 3 - v3.2, LDC Catalog No.: LDC2010T08.

2. Annotation

2.1 Tasks and Guidelines

The guidelines followed for both part-of-speech and treebank annotation are essentially Penn Treebank II style, with the following changes:

- changes in the tokenization of hyphenated words
- part-of-speech and tree changes necessitated by those tokenization changes
- updates to the syntactic annotation to comply with the most recent annotation guidelines (including the "Treebank-PropBank merge" or "Treebank IIa" and "treebank c" changes)

The original Penn Treebank II guidelines, the addenda detailing the more recent changes, and the tokenization specifications can be found at http://projects.ldc.upenn.edu/gale/task_specifications/EnglishXBank/.

All co-reference indices are shown on the syntactic node label, including reference indices on the node labels for the empty categories (as in all LDC English Translation Treebank and Arabic Treebank publications).

2.2 Annotation Process

The English source files (translated from the Arabic) were first automatically part-of-speech tagged and parsed, and then the tags and parses were hand corrected.
The QC process consisted of a series of specific searches for over 100 types of potential inconsistency and parser or annotation error. Any errors found in these searches were hand corrected.

Lead annotators for this process were Justin Mott and Colin Warner. Additional annotators were Sudha Arunachalam, Amalle Dublon, Grace Mrowicki, Casey Schroeder, and Sandhya Sundaresan.

3. Source Data Profile

3.1 Data Selection Process

This data was chosen in order to create a parallel treebank with the An-Nahar section of the Arabic Treebank: Arabic Treebank part 3 - v3.2, LDC Catalog No.: LDC2010T08.

3.2 Data Sources and Epochs

The data consists of the English translation (provided by LDC) of Arabic newswire stories from An-Nahar. The news stories in this selection are dated the 15th of each month in 2002.

4. Annotated Data Profile

This data consists of 599 files, each a newswire story translated from Arabic into English, with a total of 461,489 tokens, all of which have been annotated for part-of-speech and syntactic structure.

5. Data Directory Structure

A listing of all of the files in this publication can be found in docs/file.tbl. A listing of the data files can be found in docs/file.ids.

The data directory structure is as follows:

./docs
./data/source   -- the source input data files (English translation)
./data/ag_xml   -- the annotation files in AG format, including all POS and treebank annotation as well as any comments from the annotators
./data/penntree -- the annotation files in Penn Treebank bracketed list style
./data

6. File Format Description

a) Penn Style Trees
b) AG xml

7. Data Validation

automatic tokenization
=> human correction
=> automatic pre-tag and pre-parse
=> human annotation
=> QC correction
=> automatic scripts for "treebank c" revisions

8. DTDs

The DTD files for the AG format are kept in the same directory as the AG xml files, and also in the docs/ directory.

9. Copyright Information

Portions (c) 2002 An Nahar News Agency, (c) 2004, 2005, 2006, 2007, 2010, 2011, 2012 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

Ann Bies, Senior Research Coordinator, Linguistic Data Consortium, bies@ldc.upenn.edu

11. Update Log

This index was updated on March 1, 2012 by Ann Bies.
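
Appendix: reading the Penn style trees (illustrative sketch)

The Penn Treebank bracketed files under ./data/penntree (sections 5 and 6) can be read with a small bracket parser. The following Python sketch is not part of the release and is only one possible approach. The function names (tokenize, read_trees, leaves) are hypothetical, and the sample tree is invented for illustration rather than taken from the corpus. The sketch assumes that parentheses occurring in the text are escaped (for example as -LRB- and -RRB-), that each tree may be wrapped in an extra unlabeled outer bracket, and that empty categories carry the -NONE- tag, as in standard Penn Treebank II releases.

```python
from typing import Iterator, List, Tuple


def tokenize(text: str) -> Iterator[str]:
    """Split bracketed text into '(', ')' and label/word atoms."""
    yield from text.replace("(", " ( ").replace(")", " ) ").split()


def read_trees(text: str) -> List[list]:
    """Parse bracketed text into nested lists, one per top-level bracket."""
    stack: List[list] = []
    trees: List[list] = []
    for tok in tokenize(text):
        if tok == "(":
            stack.append([])
        elif tok == ")":
            if not stack:
                raise ValueError("unbalanced ')' in bracketed text")
            node = stack.pop()
            if stack:
                stack[-1].append(node)   # attach to the parent node
            else:
                trees.append(node)       # a complete top-level tree
        else:
            if not stack:
                raise ValueError("atom outside of any bracket: " + tok)
            stack[-1].append(tok)
    if stack:
        raise ValueError("unbalanced '(' in bracketed text")
    return trees


def leaves(node: list) -> Iterator[Tuple[str, str]]:
    """Yield (pos_tag, word) pairs in left-to-right order."""
    # A preterminal is a two-element node of the form [tag, word].
    if len(node) == 2 and isinstance(node[0], str) and isinstance(node[1], str):
        yield node[0], node[1]
        return
    for child in node:
        if isinstance(child, list):
            yield from leaves(child)


# Invented example; the index -1 sits on the node labels, as described in 2.1.
sample = (
    "( (S (NP-SBJ-1 (DT The) (NN minister)) "
    "(VP (VBD wanted) (S (NP-SBJ-1 (-NONE- *PRO*)) "
    "(VP (TO to) (VP (VB leave))))) (. .)) )"
)

trees = read_trees(sample)
pairs = list(leaves(trees[0]))
# Drop empty categories to recover the overt words of the sentence.
words = [word for tag, word in pairs if tag != "-NONE-"]
print(pairs)
print(words)
```

To apply this to the corpus, pass the contents of one file from ./data/penntree to read_trees. The file names are listed in docs/file.ids; the file extension is not stated in this document, so check the directory itself. Whether the 461,489 token count includes empty categories is also not stated here, so a count derived from leaves may not match that figure exactly.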