Arabic Treebank - Weblog V1.0

CatalogID:
Release date: November 18, 2014

Linguistic Data Consortium

Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna, Dalila Tabassi, Michael Ciul

1. Introduction

This corpus of Arabic Treebank consists of part-of-speech/morphological annotation and syntactic tree annotation for 243,117 source tokens (308,996 tree tokens, after clitic splitting) in 2349 files of web text from various sources. Most of this data was previously released as subcorpora in earlier versions to the GALE community; this publication consolidates the Arabic Treebank weblog data. The corpora that were released to the GALE community previously had the catalog numbers LDC2009E108 (ATB6, only the weblog portions are included here), LDC2011E16 (ATB11), LDC2011E18 (ATB13), and LDC2012E10 (ATB15). Previously unreleased annotation is also included in this combined release (fileIDs listed in docs/previously-unreleased-files.ids).

This publication contains part-of-speech/morphology/gloss annotation and syntactic treebank annotation that is in accordance with the Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines, available in the docs directory of this release. These are the same annotation guidelines used for the recent updated and revised newswire and broadcast news corpora that have been released (Arabic Treebank Part 1 - V4.1, CatalogID: LDC2010T13; Arabic Treebank Part 2 v 3.1, CatalogID: LDC2011T09; Arabic Treebank part 3 - v3.2, CatalogID: LDC2010T08; Arabic Treebank - Broadcast News v1.0, CatalogID: LDC2012T07).

This release conforms to the format conventions initiated with the releases of Arabic Treebank part 5 - v1.0, LDC2009E72 (ATB5) and Arabic Treebank Part 6 V1.0 - GALE Phase 4 dev09, LDC2009E108 (ATB6), which are detailed in docs/readme-files.txt and in the docs/KulickBiesMaamouri-LREC2010.pdf paper:

Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank.
Seth Kulick, Ann Bies and Mohamed Maamouri. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Malta May 19-21, 2010.
Available: docs/KulickBiesMaamouri-LREC2010.pdf

In addition, two papers written about the revision and enhancement process of the newswire corpora that resulted in the revised annotation guidelines are available on the LDC website:

Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines.
Mohamed Maamouri, Ann Bies, Seth Kulick. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
Available:
  Paper: http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf
  Poster: http://papers.ldc.upenn.edu/LREC2008/Enhancement-poster.ppt

Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation.
Mohamed Maamouri, Seth Kulick, Ann Bies. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
Available:
  Paper: http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf
  Poster: http://papers.ldc.upenn.edu/LREC2008/Diacritization-poster.ppt

This corpus is part of an on-going effort to produce parallel Arabic and English Treebanks at LDC. Most of the files in this release are parallel with the same file IDs in the English Translation Treebank Weblog corpus, which will be published in the near future (the subcorpora of which are currently available to the GALE community).

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases:

(a) Part-of-Speech (=POS) tagging which divides the text into lexical tokens, and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and

(b) Arabic Treebanking (=ArabicTB) which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.qamus.org/transliteration.htm.

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available in the docs directory of this release.

2.2 Annotation Process

The LDC Standard Arabic Morphological Analyzer LDC2009E73 (SAMA 3.1) was used to generate a candidate list of POS values for each word/token, and our annotators picked the appropriate one manually. (Please note that some words do not exist in this lexicon.) The POS annotation task is to select the correct POS tag. Once POS is done, we automatically separate the clitics based on the POS selection.

We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for numerical data, PUNC for punctuation, and FOREIGN or LATIN for non-Arabic alphabetic data.

Dialect forms are uniformly given the POS tag DIALECT. Since this corpus was designed as a Modern Standard Arabic corpus, further morphological analysis of dialect tokens (of the type used for the BOLT dialect corpora, for example) was not used for this data.

We then implemented automatic checks on the part-of-speech tags with consequent further manual revision when necessary to ensure the consistency of the part-of-speech tags with the current guidelines.
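
For readers unfamiliar with the Buckwalter transliteration mentioned in section 2.1, the following Python sketch converts Buckwalter ASCII text to Arabic script. It is only an illustration: the table follows the commonly published Buckwalter mapping, the function name buckwalter_to_arabic is hypothetical, and whether the files in this release use exactly these symbols or a variant is an assumption that should be checked against docs/readme-files.txt.

```python
# A minimal sketch, assuming the commonly published Buckwalter table.
# The names below are illustrative and not part of this release.
BUCKWALTER_TO_ARABIC = {
    # hamza forms and alif variants
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627",
    # consonants
    "b": "\u0628", "p": "\u0629", "t": "\u062a", "v": "\u062b",
    "j": "\u062c", "H": "\u062d", "x": "\u062e", "d": "\u062f",
    "*": "\u0630", "r": "\u0631", "z": "\u0632", "s": "\u0633",
    "$": "\u0634", "S": "\u0635", "D": "\u0636", "T": "\u0637",
    "Z": "\u0638", "E": "\u0639", "g": "\u063a", "_": "\u0640",
    "f": "\u0641", "q": "\u0642", "k": "\u0643", "l": "\u0644",
    "m": "\u0645", "n": "\u0646", "h": "\u0647", "w": "\u0648",
    "Y": "\u0649", "y": "\u064a",
    # diacritics: tanween, short vowels, shadda, sukun
    "F": "\u064b", "N": "\u064c", "K": "\u064d", "a": "\u064e",
    "u": "\u064f", "i": "\u0650", "~": "\u0651", "o": "\u0652",
    # dagger alif and alif wasla
    "`": "\u0670", "{": "\u0671",
}


def buckwalter_to_arabic(text: str) -> str:
    """Map each Buckwalter character to Arabic script.

    Characters outside the table (digits, spaces, punctuation not used
    by the scheme) are passed through unchanged.
    """
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


if __name__ == "__main__":
    # "kitAb" is the vocalized transliteration of the word for "book".
    print(buckwalter_to_arabic("kitAb"))
```

Because the mapping is one character to one character, the reverse direction can be obtained by inverting the dictionary. Note that a plain lookup like this would also rewrite Latin-script tokens (for example those tagged FOREIGN or LATIN), so it should only be applied to fields known to hold transliterated Arabic.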
The Arabic morphological tagset was then reduced to a smaller POS set, and the files were automatically parsed. The parses were then hand corrected by human annotators according to the Arabic Treebank Annotation Guidelines.

The QC process consists of a series of specific searches for several hundred types of potential inconsistency and annotation error. Any errors found in these searches were hand corrected in two passes. The previously unreleased files (fileIDs listed in docs/previously-unreleased-files.ids) received only a single round of QC.

The annotators for this project were Luma Ateyah, Basma Bouziri, Olfa Bayouth, Maha Ben Hadj Aleya, Sameh Benna, Asma Berrima, Faiez Dhieb, Rachida Fathallah, Fatma Gaddeche, Esma Maamouri Ghrib, Aicha Graja, Sondos Krouna, Badia Laadioui, Leila Laghrissi, Soumeya Mekki, Wigdan Mekki, Mouna Rezig, and Dalila Tabassi.

3. Source Data Profile

3.1 Data Selection Process

This corpus of Arabic Treebank - Weblog consists of 2349 files of web text from various sources. There are a total of 243,117 source tokens before clitics are split and 308,996 tree tokens after clitics are separated for the treebank annotation. All of this data has been annotated for morphology/part-of-speech and syntactic structure.

This corpus is part of an on-going effort to produce parallel Arabic and English Treebanks at LDC. Most of the files in this release are parallel with the same file IDs in the English Translation Treebank Weblog corpus, which will be published in the near future (the subcorpora of which are currently available to the GALE community).

3.2 Data Sources and Epochs

The data consists of Arabic web text from various sources collected by LDC.

4. Annotated Data Profile

This data consists of 2349 files of web text from various sources. There are a total of 243,117 source tokens before clitics are split and 308,996 tree tokens after clitics are separated for the treebank annotation.
All of this data has been annotated for morphology/part-of-speech and syntactic structure.

5. Data Directory Structure

The source file IDs are listed in docs/file.ids. A listing of all of the files in this release can be found in docs/file.tbl. The data formats, including the integrated format, are documented in docs/readme-files.txt.

In the data/ directory:

- integrated/ - The goal of this format is to bring together in one place:
    1) the information about the source tokens from the pos/before files, including the explicit mapping between the source and tree tokens,
    2) the information about the tree tokens from the pos/after files,
    3) the tree structure.
  For details about this content, see docs/readme-files.txt.
- penntree/ - the annotation files in Penn Treebank bracketed list style.
- pos/ - the POS annotation for this corpus. For details about this content, see docs/readme-files.txt.
- tdf/ - the SU files converted into the .tdf format necessary for the operation of the POS annotation tool.
- xml/ - the annotation graph files, in the format used by our syntactic annotation tool.

In the docs/ directory:

- ag-1.1.dtd - This is the dtd file for the AG XML.
- ATB-POSGuidelines-v3.8.pdf - Morphological and part-of-speech annotation guidelines.
- ATB-SyntacticGuidelines-v4.95-20110630.pdf - Syntactic annotation guidelines.
- atb-wb-taglist-conversion-to-PennPOS-forrelease.lisp - Lisp code mapping the full morphological tags to a much smaller list, similar to the Penn POS tagset, strictly for convenience.
- file.ids - A list of file ids in the corpus.
- file.tbl - Directory structure for everything in this package.
- KulickBiesMaamouri-LREC2010.pdf - Paper describing data formats and the integration of Treebank and SAMA tokens.
- not-included.txt - A listing of character sequences from the source files that are not included as tokens. See readme-files.txt for further explanation.
- previously-unreleased-files.ids - A list of file ids of the annotated files included in this corpus that have not been previously released.
- readme-files.txt - Additional details about the data and data formats, including information about the data/pos/before content, data/integrated files and data/tdf files, along with information about the relationship with SAMA.
- tags-count.txt - A list of the POS/morphological tags after the clitics are separated and after treebank annotation, along with the number of occurrences of each tag.

6. File Format Description

A description of the file formats (and the types of files present for each of the IDs in docs/file.ids) is in docs/readme-files.txt and in docs/KulickBiesMaamouri-LREC2010.pdf, including a description of the modifications that have been made to the format of the data in the various .txt and .tree files compared with ATB releases prior to ATB5 and detailed information on the integrated format.

7. Data Validation

The data went through the following annotation procedure:

POS procedure:

- All words were submitted to the morphological analyzer. (Note that for some tokens, there was no solution in the analyzer.)
- All words were then included for POS annotation, where annotators either selected one out of many choices provided by the morphological analyzer, or reviewed the annotation done in a previous POS pass.
- All files went through a second pass of POS annotation where annotators reviewed the annotation done in the previous POS pass.

TB procedure:

- Words/tokens from the POS annotation were processed to separate clitics in preparation for TB annotation. After clitic separation, the number of words/tokens increased from 243,117 to 308,996.
- The sentences were pre-parsed to improve productivity, and the parses were then hand corrected.
- All files went through a second pass of TB annotation where annotators reviewed the annotation done in the previous TB pass.
- Annotators went through a stage of annotation with the help of diagnostic QC searches to catch potential patterns of annotation errors.

Quality assurance & annotation checking for this release:

Every token in the treebank has been explicitly tested against the possible SAMA 3.1 solutions for that token. Where there have been discrepancies, the results have been manually inspected and in many cases changed. See docs/readme-files.txt for a detailed description of this procedure and docs/errata.txt for a listing of some of the remaining discrepancies between the treebank and SAMA.

The Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available in the docs/ directory of this release.

8. DTDs

One for the AG XML files, ag-1.1.dtd, located in both docs/ and with the .xml files in data/xml/treebank/.

9. Copyright Information

Portions (c) 2004-2014 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
Ann Bies, bies@ldc.upenn.edu
Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on November 18, 2014 by Ann Bies.