Arabic Treebank Part 1 - V4.1
CatalogID: LDC2010T13
Release date: November 15, 2010
Linguistic Data Consortium

Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Fatma Gaddeche, Wigdan Mekki, Sondos Krouna, Basma Bouziri, Wajdi Zaghouani

1. Introduction

This version of the Arabic Treebank Part 1 - V4.1 is an incremental update to the December 2008 release of Arabic Treebank Part 1 - V4.0 to the GALE community (LDC2008E61), and a significant revision over the previous general catalog release of ATB1-v3.0 (LDC2005T02).

This version of Arabic Treebank Part 1 - V4.1 represents a revision of the ATB1 annotation for the full ATB part 1 (AFP) corpus. The full ATB1 corpus has been revised according to the new Arabic Treebank annotation guidelines, both manually (all of the syntactic tree annotation) and automatically (the MPG annotation).

The revised and updated Arabic Treebank ATB part 1 consists of 734 newswire stories from the Agence France Presse (AFP) (previously released as Arabic Treebank: Part 1 v 3.0 (POS with full vocalization + Syntactic Analysis), LDC Catalog No.: LDC2005T02).

This release includes significant new improvements over both the ATB1-v3.0 and ATB1-v4.0 releases in both the organization of the data and certain aspects of the annotation. These improvements are detailed in docs/readme-files.txt.

In this full ATB1 corpus, there are a total of 145,386 source tokens before clitics are split and 167,280 tree tokens after clitics are separated for the treebank annotation.

This current release contains the part-of-speech/morphology/gloss annotation and the syntactic treebank annotation of these files. The treebank annotation has been revised in accordance with the new Arabic Treebank Annotation Guidelines. In addition to a partial manual revision, certain automatic changes have been made to the part-of-speech/morphology/gloss tags.

This corpus is part of an on-going effort to produce parallel Arabic and English Treebanks at LDC.
224 of the files in this release are parallel with the 224 files in the English Translation Treebank -- EATB Part 5 v2.0 (LDC2010E20).

This release conforms to the format conventions initiated with the releases of Arabic Treebank part 5 - v1.0, LDC2009E72 (ATB5) and Arabic Treebank Part 6 V1.0 - GALE Phase 4 dev09, LDC2009E108 (ATB6), which are detailed in docs/readme-files.txt and in the docs/KulickBiesMaamouri-LREC2010.pdf paper:

  Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank. Seth Kulick, Ann Bies and Mohamed Maamouri. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Malta, May 19-21, 2010.
  Available: docs/KulickBiesMaamouri-LREC2010.pdf

In addition, two papers written about the revision and enhancement process that resulted in the revised annotation guidelines are available on the LDC website:

  Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines. Mohamed Maamouri, Ann Bies, Seth Kulick. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
  Available:
    Paper: http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf
    Poster: http://papers.ldc.upenn.edu/LREC2008/Enhancement-poster.ppt

  Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation. Mohamed Maamouri, Seth Kulick, Ann Bies. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
  Available:
    Paper: http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf
    Poster: http://papers.ldc.upenn.edu/LREC2008/Diacritization-poster.ppt

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases:

  (a) Part-of-Speech (=POS) tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and

  (b) Arabic Treebanking (=ArabicTB), which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.qamus.org/transliteration.htm.

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.

2.2 Annotation Process

In the original annotation, Tim Buckwalter's morphological analyzer (BAMA) was used to generate a candidate list of POS values for each word/token, and our annotators picked the appropriate one manually. (Please note that some words do not exist in this lexicon.) The POS annotation task is just to select the correct POS tag. Once POS annotation was done, we automatically separated the clitics based on the POS selection.

We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for numerical data, PUNC for punctuation, and FOREIGN or LATIN for non-Arabic alphabetic data.

For the current version, we implemented automatic checks on the part-of-speech tags, with consequent further manual revision when necessary, to ensure the consistency of the part-of-speech tags with the current guidelines and with the LDC Standard Arabic Morphological Analyzer currently in use, LDC2009E73 (SAMA 3.1). This is discussed in more detail below, in 7. Data Validation.
The SAMA 3.1 Morphological Analyzer will be released as a general publication in the coming months, with the catalog ID LDC2010L01.

The Arabic morphological tagset was then reduced to a smaller POS set, and the files were automatically parsed. The parses were then hand corrected by human annotators.

For this release, the syntactic treebank annotation from ATB1-v3.0 was manually revised according to the new Arabic Treebank Annotation Guidelines. Significant changes were made to NP structure and to the classification of verbs with clausal arguments, along with improvements to the annotation in general. Annotators for this process were Fatma Gaddeche, Basma Bouziri and Badia Laadioui.

The QC process consists of a series of specific searches for several types of potential inconsistency and annotation error. Any errors found in these searches were hand corrected.

3. Source Data Profile

3.1 Data Selection Process

This corpus of Arabic Treebank Part 1 - v4.1 consists of 734 newswire stories from the Agence France Presse (AFP). There are a total of 145,386 source tokens before clitics are split and 167,280 tree tokens after clitics are separated for the treebank annotation.

This corpus is part of an on-going effort to produce parallel Arabic and English Treebanks at LDC. 224 of the files in this release are parallel with the 224 files in the English Translation Treebank -- EATB Part 5 v2.0 (LDC2010E20).

3.2 Data Sources and Epochs

Source texts were selected from the Agence France Presse (AFP) newswire archives for July-November 2000 (files dated 20000715 to 20001115). There are 734 stories (specified by the file ID), dated on the 15th day of each month ranging from July to November in 2000.

4. Annotated Data Profile

This corpus of Arabic Treebank Part 1 - v4.1 consists of 734 newswire stories from the Agence France Presse (AFP).
There are a total of 145,386 source tokens before clitics are split and 167,280 tree tokens after clitics are separated for the treebank annotation, all of which have been annotated for morphology/part-of-speech and syntactic structure.

The source file IDs are listed in docs/file.ids. A listing of all of the files in this release can be found in docs/file.tbl. The data formats, including the integrated format, are extensively documented in docs/readme-files.txt and in the paper at docs/KulickBiesMaamouri-LREC2010.pdf.

5. Data Directory Structure

The directory structure for this data (and for the /data directory) is in docs/file.tbl.

In the docs/ directory:

- KulickBiesMaamouri-LREC2010.pdf - Paper describing data formats and the integration of Treebank and SAMA tokens.
- ag-1.1.dtd - This is the dtd file for the AG XML.
- errata.txt - Lists errata and the URL where any additional errata from this corpus will be listed.
- file.ids - A list of file ids in the corpus.
- file.tbl - Directory structure for everything in this package.
- readme-files.txt - An extensive description of the new integrated format, also describes the various formats of the data.
- tags-count.txt - A list of the POS/morphological tags after the clitics are separated and after treebank annotation, along with the number of occurrences of each tag.
- atb1-v4.1-taglist-conversion-to-PennPOS-forrelease.lisp - Lisp code mapping the full morphological tags to a much smaller list, similar to the Penn POS tagset, strictly for convenience.

6. File Format Description

An extensive description of the file formats (and the types of files present for each of the IDs in docs/file.ids) is in docs/readme-files.txt and in docs/KulickBiesMaamouri-LREC2010.pdf, including a description of the modifications that have been made to the format of the data in the various .txt and .tree files compared with the ATB1-v3.0 release (and all ATB releases prior to ATB5) and detailed information on the integrated format.

7. Data Validation

The data went through the following annotation procedure:

POS procedure:

- All words went through the morphological analyzer.
- All words are included in the first pass of POS where annotators select one out of many choices provided by the morphological analyzer.
- All files went through a second pass of POS annotation where annotators review the annotation done in the previous POS pass.
- In the current annotation (and for this revision), we then implement automatic checks on the part-of-speech tags, with consequent further manual revision when necessary, to ensure the consistency of the part-of-speech tags with the current guidelines.

TB procedure:

- Words/tokens from the POS annotation are processed to separate clitics in preparation for TB annotation. After clitic separation, the number of tokens increases from 145,386 to 167,280.
- The sentences were pre-parsed to improve productivity, and the parses were then hand corrected.
- Annotators went through one pass of annotation with the help of diagnostic QC searches to catch potential patterns of annotation errors.

For this current ATB1-v4.1 release, the following additional steps were taken:

TB procedure:

- The syntactic annotation guidelines were significantly revised, in particular with respect to noun phrase structure (idafa), verb phrase structure for verbs taking clausal complements, verb phrase structure for non-inflectional verbs, and the structure surrounding the function words with new tokenization. The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.
- The treebank annotation from ATB1-v3.0 was manually revised according to the new Arabic Treebank Annotation Guidelines. Significant changes were made to NP structure and to the classification of verbs with clausal arguments, along with improvements to the annotation in general.
- Additional QC searches were run on the full ATB1, including some relating to the relation between POS tags and TB nodes, and the results were hand corrected.

POS procedure:

- The part-of-speech/morphological guidelines were significantly revised, in particular with respect to the classification and tokenization of closed class function words, classes of nouns (the addition of NOUN_QUANT and NOUN_NUM), classes of adjectives (the addition of ADJ_COMP and ADJ_NUM), and classes of non-inflectional verbs (the addition of PSEUDO_VERB and VERB). The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.
- We have made various further automatic changes to the POS tags, described below.
- A limited number of manual corrections were made to the POS tags for this version as well.

The annotators on this project were Fatma Gaddeche (Lead Annotator), Basma Bouziri, Badia Laadioui, Wigdan El Mekki, Ichraf Amghouz, Zohra Bentaouit, Fatima Chebchoub, Fatima El Himyani, Rachida Fathallah, Alexa Firat, Tasneem Ghandour, Niama Laadioui, Mohamed Mansour, Sarah Tlili, Gordon Witty, and Dalal Zakhary.

Quality assurance & annotation checking for this release:

The current version of SAMA, 3.1, has significant differences from the version of SAMA/BAMA current at the time this Treebank was originally annotated. Therefore, the goal has been to update the morphological annotations to be consistent both with SAMA 3.1 and with the correct part-of-speech/tree interaction as discussed in the guidelines. "Consistency" here means that the morphological solution for a token in the treebank is also one of the solutions for that token in SAMA 3.1.
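
The consistency criterion just defined can be illustrated with a minimal sketch. This is not the procedure actually used for the release (that is described in docs/readme-files.txt); the TreebankToken type, the find_inconsistent function, and the dictionary standing in for the analyzer are all hypothetical names introduced for illustration, and the solution strings are made-up toy values, not real SAMA 3.1 output.

```python
from typing import Dict, List, NamedTuple, Set


class TreebankToken(NamedTuple):
    """Hypothetical record for one annotated token (not the release format)."""
    source: str    # source token as it would be looked up in the analyzer
    solution: str  # morphological solution chosen in the treebank


def find_inconsistent(
    tokens: List[TreebankToken],
    analyzer_solutions: Dict[str, Set[str]],
) -> List[TreebankToken]:
    """Return tokens whose treebank solution is not among the analyzer's
    solutions for the same source token. A token the analyzer does not
    know at all has an empty solution set, so it is always reported."""
    inconsistent = []
    for tok in tokens:
        candidates = analyzer_solutions.get(tok.source, set())
        if tok.solution not in candidates:
            inconsistent.append(tok)
    return inconsistent


if __name__ == "__main__":
    # Toy stand-in for an analyzer lookup table (assumed data, not SAMA output).
    toy_analyzer = {"ktb": {"kataba/VERB", "kutub/NOUN"}}
    toy_tokens = [
        TreebankToken("ktb", "kutub/NOUN"),  # listed by the toy analyzer
        TreebankToken("ktb", "kitAb/NOUN"),  # not listed for this token
        TreebankToken("xyz", "xyz/NOUN"),    # token unknown to the analyzer
    ]
    for tok in find_inconsistent(toy_tokens, toy_analyzer):
        print(f"discrepancy: {tok.source} -> {tok.solution}")
```

In this sketch a discrepancy is only flagged, not corrected, which mirrors the statement in this section that discrepancies were manually inspected before any change was made.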

For the initial revision of this corpus, each treebank token mentioned, explicitly or implicitly (e.g., all the ADJ_COMP words, determined by an examination of vowel patterns and then manual filtering), in the morphological guidelines was annotated with a list of its possible tags. First, for tokens that had only one possible tag, the current tag was modified, if necessary, to be that tag. Second, for those words with more than one tag, we focused on some of the most common and important cases, and did manual annotation of those tokens. In some cases, it was also possible to assign the tag based on the tree context, since the tree annotation had already been done.

This revision has now been carried further by explicitly testing every token in the treebank against the possible SAMA solutions for that token. Where there have been discrepancies, the results have been manually inspected and in many cases changed. See docs/readme-files.txt for a detailed description of this procedure and docs/errata.txt for a listing of some of the remaining discrepancies between the treebank and SAMA.

The Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.

8. DTDs

There is one DTD, for the AG XML files: ag-1.1.dtd.

9. Copyright Information

Portions (c) 2000 Agence France Presse, (c) 2003, 2004, 2005, 2008, 2010 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

  Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
  Ann Bies, bies@ldc.upenn.edu
  Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on July 2, 2010 by Ann Bies.