Arabic Treebank part 3 - v3.2

CatalogID:
Release date: January 28, 2010
Linguistic Data Consortium

Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna, Fatma Gaddeche, Wajdi Zaghouani

1. Introduction

This version of the Arabic Treebank part 3 - v3.2 is an incremental update to the January 2009 release of Arabic Treebank part 3 - v3.1 to the GALE community (LDC2008E22), and a significant revision over the previous general catalog release of ATB3-v2.0 (LDC2005T20).

This version of Arabic Treebank part 3 - v3.2 represents a revision of the ATB3 annotation for the full ATB part 3 (ANNAHAR) corpus. The full ATB3 corpus has been revised according to the new Arabic Treebank annotation guidelines, both manually (all of the syntactic tree annotation) and automatically (the MPG annotation).

The revised and updated Arabic Treebank ATB part 3 consists of 599 newswire stories from the An Nahar News Agency (previously released as Arabic Treebank: Part 3 (full corpus) v 2.0 (MPG + Syntactic Analysis), LDC Catalog No.: LDC2005T20).

This release includes all of the files that were previously released to the GALE community as ATB3-v3.1, with additional quality control added. A number of inconsistencies in the 3.1 release data have been corrected here. In particular, additional corrections have been made to certain POS tags, with the resulting tree changes; as a result, additional clitics have been separated and some previously incorrectly split tokens have now been merged.

This release includes significant new improvements over both the ATB3-v2.0 and ATB3-v3.1 releases in both the organization of the data and certain aspects of the annotation. These improvements are detailed in docs/readme-files.txt.

One file from the original ATB3-v2.0 release has been removed from the corpus (ANN20020715.0063), as the text is an exact duplicate of another file in the corpus (ANN20020715.0018), taking the total number of files down from 600 to 599.

In this full ATB3 corpus, there are a total of 339,710 words/tokens before clitics are split and 402,291 words/tokens after clitics are separated for the treebank annotation.

This current release contains the part-of-speech/morphology/gloss annotation and the syntactic treebank annotation of these files. The treebank annotation has been revised in accordance with the new Arabic Treebank Annotation Guidelines. In addition to a partial manual revision, certain automatic changes have been made to the part-of-speech/morphology/gloss tags.

Two papers written about the revision and enhancement process for ATB3 are available on the LDC website:

Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines. Mohamed Maamouri, Ann Bies, Seth Kulick. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
Available:
  Paper in PDF: http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf
  Poster: http://papers.ldc.upenn.edu/LREC2008/Enhancement-poster.ppt

Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation. Mohamed Maamouri, Seth Kulick, Ann Bies. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
Available:
  Paper in PDF: http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf
  Poster: http://papers.ldc.upenn.edu/LREC2008/Diacritization-poster.ppt

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases:

(a) Part-of-Speech (=POS) tagging, which divides the text into lexical tokens and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and

(b) Arabic Treebanking (=ArabicTB), which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.ldc.upenn.edu/myl/morph/buckwalter.html.

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.

2.2 Annotation Process

In the original annotation, Tim Buckwalter's morphological analyzer (BAMA) was used to generate a candidate list of POS values for each word/token, and our annotators picked the appropriate one manually. (Please note that some words do not exist in this lexicon.) The POS annotation task was just to select the correct POS tag. Once POS annotation was done, we automatically separated the clitics based on the POS selection.

We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for numerical data, PUNC for punctuation, and FOREIGN or LATIN for non-Arabic alphabetic data.

For the current version, we implemented automatic checks on the part-of-speech tags, with further manual revision when necessary, to ensure the consistency of the part-of-speech tags with the current guidelines and with the LDC Standard Arabic Morphological Analyzer currently in use, SAMA 3.1 (LDC2009E73). This is discussed in more detail below, in 7. Data Validation.

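As a rough illustration of the Buckwalter transliteration referred to in 2.1, the following Python sketch converts a Buckwalter string into Arabic script. The table follows the commonly published Buckwalter mapping. The names BUCKWALTER_TO_UNICODE and buckwalter_to_arabic are ours and are not part of this release, and the files in this package may escape or vary some characters (see docs/readme-files.txt), so this is only an aid for reading examples.

```python
# Sketch only: the commonly published Buckwalter-to-Unicode table.
# These names are hypothetical helpers, not part of the ATB3 release.
BUCKWALTER_TO_UNICODE = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A",
    # Diacritics: tanween, short vowels, shadda, sukun, dagger alif.
    "F": "\u064B", "N": "\u064C", "K": "\u064D", "a": "\u064E",
    "u": "\u064F", "i": "\u0650", "~": "\u0651", "o": "\u0652",
    "`": "\u0670", "{": "\u0671",
}


def buckwalter_to_arabic(text: str) -> str:
    """Map each Buckwalter character to Arabic script.

    Characters outside the table (digits, spaces, punctuation not used
    by the scheme) are passed through unchanged.
    """
    return "".join(BUCKWALTER_TO_UNICODE.get(ch, ch) for ch in text)


# Example: the undiacritized and diacritized forms of the word for "book".
print(buckwalter_to_arabic("ktAb"))
print(buckwalter_to_arabic("kitAb"))
```

Because the mapping is one character to one character, it can be inverted by swapping keys and values if Arabic script needs to be converted back to Buckwalter form.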
The Arabic morphological tagset was then reduced to a smaller POS set, and the files were automatically parsed. The parses were then hand corrected by human annotators.

For this release, the syntactic treebank annotation from ATB3-v2.0 was manually revised according to the new Arabic Treebank Annotation Guidelines. Significant changes were made to NP structure and to classification of verbs with clausal arguments, along with improvements to the annotation in general. Annotators for this process were Wigdan Mekki and Fatma Gaddeche.

The QC process consists of a series of specific searches for several types of potential inconsistency and annotation error. Any errors found in these searches were hand corrected.

3. Source Data Profile

3.1 Data Selection Process

This corpus of Arabic Treebank part 3 - v3.2 consists of 599 newswire stories from the An Nahar News Agency. There are a total of 339,710 words/tokens before clitics are split and 402,291 words/tokens after clitics are separated for the treebank annotation.

One file from the original ATB3-v2.0 release has been removed from the corpus (ANN20020715.0063), as the text is an exact duplicate of another file in the corpus (ANN20020715.0018), taking the total number of files down from 600 to 599.

3.2 Data Sources and Epochs

Source texts were selected from An Nahar News Agency in the GIGAWORD ARABIC TEXT CORPUS published by LDC in 2003 (LDC2003T12). For more details, please see that release.

There are 599 stories (specified by the DOC ID), dated on the 15th day of each month ranging from January to December in 2002.

4. Annotated Data Profile

This corpus of Arabic Treebank part 3 - v3.2 consists of 599 newswire stories from the An Nahar News Agency. There are a total of 339,710 words/tokens before clitics are split and 402,291 words/tokens after clitics are separated for the treebank annotation.

The source file IDs are listed in docs/file.ids.

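The relationship between DOC IDs and story dates described in 3.2 can be sketched in Python as follows. The ID layout assumed here (a source prefix, a YYYYMMDD date, a dot, and a four-digit sequence number) is only inferred from the IDs quoted in this file, such as ANN20020715.0018, and the function names are ours, so please check the pattern against docs/file.ids before relying on it.

```python
import re
from collections import Counter
from datetime import date

# Assumed layout, inferred from IDs such as ANN20020715.0018:
# source prefix, YYYYMMDD date, a dot, then a 4-digit sequence number.
DOC_ID_PATTERN = re.compile(
    r"^(?P<source>[A-Z]+)(?P<ymd>\d{8})\.(?P<seq>\d{4})$"
)


def parse_doc_id(doc_id: str):
    """Split a DOC ID into (source, story date, sequence number)."""
    match = DOC_ID_PATTERN.match(doc_id.strip())
    if match is None:
        raise ValueError(f"unrecognized DOC ID: {doc_id!r}")
    ymd = match.group("ymd")
    story_date = date(int(ymd[:4]), int(ymd[4:6]), int(ymd[6:]))
    return match.group("source"), story_date, int(match.group("seq"))


def stories_per_month(doc_ids):
    """Count stories by calendar month of the date embedded in each ID."""
    counts = Counter()
    for doc_id in doc_ids:
        _, story_date, _ = parse_doc_id(doc_id)
        counts[story_date.month] += 1
    return counts


# Example with an ID quoted in this file.
source, story_date, seq = parse_doc_id("ANN20020715.0018")
print(source, story_date.isoformat(), seq)
```

If docs/file.ids holds one ID per line (an assumption on our part), its lines can be passed directly to stories_per_month to check that every story falls on the 15th of a month in 2002.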
We have also somewhat modified the format of the data in the various .txt and .tree files, as extensively documented in docs/readme-files.txt.

5. Data Directory Structure

The directory structure for this data (and for the /data directory) is in docs/file.tbl.

In the docs/ directory:

- ag.dtd, metadata.dtd - These are dtd files for the AG XML.
- file.ids - A list of file ids in the corpus.
- file.tbl - Directory structure for everything in this package.
- readme-files.txt - An extensive description of the modifications that have been made to the format of the data in the various .txt and .tree files.
- tags-count.txt - A list of the POS/morphological tags after the clitics are separated and after treebank annotation, along with the number of occurrences of each tag.
- atb3-v3.0-taglist-conversion-to-PennPOS-forrelease.lisp - Lisp code mapping the full morphological tags to a much smaller list, similar to the Penn POS tagset, strictly for convenience.

6. File Format Description

An extensive description of the file formats (and the types of files present for each of the IDs in docs/file.ids) is in docs/readme-files.txt, including a description of the modifications that have been made to the format of the data in the various .txt and .tree files compared with previous releases.

7. Data Validation

The original ATB3-v2.0 data went through the following annotation procedure:

POS procedure:

- All words went through Tim Buckwalter's morphological analyzer.
- All words were included in the first pass of POS, where annotators selected one out of many choices provided by the morphological analyzer.
- All files went through a second pass of POS annotation, where annotators reviewed the annotation done in the previous POS pass.

TB procedure:

- Words/tokens from the POS annotation were processed to separate clitics in preparation for TB annotation. After clitic separation, the number of words/tokens increases from 339,710 to 402,291.
- The sentences were pre-parsed to improve productivity.
- Annotators went through at least two passes of annotation with the help of diagnostic QC searches to catch potential patterns of annotation errors.

For this current ATB3-v3.2 release, the following additional steps were taken:

TB procedure:

- The syntactic annotation guidelines were significantly revised, in particular with respect to noun phrase structure (idafa), verb phrase structure for verbs taking clausal complements, verb phrase structure for non-inflectional verbs, and the structure surrounding the function words with new tokenization. The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.
- The treebank annotation from ATB3-v2.0 was manually revised according to the new Arabic Treebank Annotation Guidelines. Significant changes were made to NP structure and to classification of verbs with clausal arguments, along with improvements to the annotation in general.
- Additional QC searches were run on the full ATB3, including some relating to the relation between POS tags and TB nodes, and the results were hand corrected.

POS procedure:

- The part-of-speech/morphological guidelines were significantly revised, in particular with respect to the classification and tokenization of closed class function words, classes of nouns (the addition of NOUN_QUANT and NOUN_NUM), classes of adjectives (the addition of ADJ_COMP and ADJ_NUM), and classes of non-inflectional verbs (the addition of PSEUDO_VERB and VERB). The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.
- We have made various further automatic changes to the POS tags, described below.
- A limited number of manual corrections were made to the POS tags for this version as well.

The annotators on this project were Fatma Gaddeche (Lead Annotator), Ichraf Amghouz, Luma Ateyah, Basma Bouziri, Fatima Chebchoub, Rachida Fathallah, Tasneem Ghandour, Badia Laadioui, Niama Laadioui, and Wigdan Mekki.

Quality assurance & annotation checking for this release:

The current version of SAMA, 3.1, has significant differences from the version of SAMA/BAMA current at the time this Treebank was originally annotated. Therefore, the goal has been to update the morphological annotations to be consistent both with SAMA 3.1 and with the correct part-of-speech/tree interaction as discussed in the guidelines. "Consistency" here means that the morphological solution for a token in the treebank is also one of the solutions for that token in SAMA 3.1.

For the initial revision of this corpus, each treebank token mentioned, explicitly or implicitly (e.g., all the ADJ_COMP words, determined by an examination of vowel patterns and then manual filtering), in the morphological guidelines was annotated with a list of its possible tags. First, for tokens that had only one possible tag, the current tag was modified, if necessary, to be that tag. Second, for those words with more than one tag, we focused on some of the most common and important cases, and did manual annotation of those tokens. In some cases, it was also possible to assign the tag based on the tree context, since the tree annotation had already been done.

This revision has now been carried further by explicitly testing every token in the treebank against the possible SAMA solutions for that token. Where there have been discrepancies, the results have been manually inspected and in many cases changed. See docs/readme-files.txt for a detailed description of this procedure and docs/errata.txt for a listing of some of the remaining discrepancies between the treebank and SAMA.

8. DTDs

Two DTDs for the AG XML files (docs/ag.dtd and docs/metadata.dtd).

9. Copyright Information

Portions © 2002 An Nahar, © 2003, 2004, 2005, 2007, 2008, 2009, 2010 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
Ann Bies, bies@ldc.upenn.edu
Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on January 28, 2010 by Ann Bies.