BOLT Egyptian Arabic Treebank - Discussion Forum V1.0
CatalogID:
Release date: January 9, 2017
Linguistic Data Consortium
Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna, Dalila Tabassi, Michael Ciul

1. Introduction

This Egyptian Arabic Treebank corpus consists of part-of-speech/morphological annotation and syntactic tree annotation for 400,448 source tokens (508,548 tree tokens, after clitic splitting) of Egyptian Arabic (ARZ) in 730 files of discussion forum text from various sources. Most of this data was previously released to the BOLT community as subcorpora; this publication consolidates the Egyptian Arabic Treebank discussion forum (DF) data. The corpora previously released to the BOLT community had the following catalog numbers:

- LDC2012E93 (ARZ-DF Part 1)
- LDC2012E98 (ARZ-DF Part 2)
- LDC2012E89 (ARZ-DF Part 3)
- LDC2012E99 (ARZ-DF Part 4)
- LDC2012E107 (ARZ-DF Part 5)
- LDC2012E125 (ARZ-DF Part 6)
- LDC2013E12 (ARZ-DF Part 7)
- LDC2013E21 (ARZ-DF Part 8)

This combined release also updates the synchronization of the tokens in the corpus with the morphological analyzers: SAMA 3.1 Morphological Analyzer (LDC2010L01), for the Modern Standard Arabic (MSA) tokens, and the CALIMA v0.5 Morphological Analyzer, for the Egyptian Arabic (ARZ) tokens. 85.3% of the ARZ source tokens in this combined corpus (294641/345273) are a complete match with CALIMA v0.5. Details can be found in docs/readme-files.txt in this release.

This publication contains part-of-speech/morphology/gloss annotation and syntactic treebank annotation that is in accordance with the Penn Arabic Treebank (PATB) annotation guidelines. The Penn Arabic Treebank MSA Morphological and Syntactic Annotation Guidelines are both available in the docs directory of this release (docs/ATB-POSGuidelines-v3.8.pdf and docs/ATB-SyntacticGuidelines-v4.95-20110630.pdf).
These are the same annotation guidelines used for the recently updated and revised newswire and broadcast news corpora that have been released (Arabic Treebank Part 1 - V4.1, CatalogID: LDC2010T13; Arabic Treebank Part 2 v 3.1, CatalogID: LDC2011T09; Arabic Treebank part 3 - v3.2, CatalogID: LDC2010T08; Arabic Treebank - Broadcast News v1.0, CatalogID: LDC2012T07). The LDC Egyptian Arabic Treebank Morphological and Syntactic Annotation Guidelines are also available in the docs directory of this release (docs/ARZ-POSGuidelines-v1.2.pdf and docs/ARZ-SyntacticGuidelines-v0.2.pdf). The Egyptian Arabic annotation guidelines target informal data, especially SMS/Chat data, but apply to other informal data, such as discussion forum data, as well.

Due to the nature of this Egyptian Arabic corpus, the relationship between the source tokens and the morphological analyzer is now more complicated, containing references to both the SAMA 3.1 Morphological Analyzer (LDC2010L01), for the MSA tokens, and the CALIMA v0.5 Morphological Analyzer, for the ARZ tokens. The POS annotation was done simultaneously with development of the morphological analyzer. As a result, the previously released BOLT e-corpora data contained some inevitable inconsistencies between the part-of-speech/vocalization/lemma solutions and the morphological analyzer solutions. These are now reconciled in this combined release. Detailed information about the correspondence can be found in docs/readme-files.txt.

This release conforms to the format conventions initiated with the releases of Arabic Treebank part 5 - v1.0, LDC2009E72 (ATB5) and Arabic Treebank Part 6 V1.0 - GALE Phase 4 dev09, LDC2009E108 (ATB6), which are detailed in docs/readme-files.txt and in the docs/KulickBiesMaamouri-LREC2010.pdf paper:

  Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank.
  Seth Kulick, Ann Bies and Mohamed Maamouri.
  In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Malta, May 19-21, 2010.
  Available: docs/KulickBiesMaamouri-LREC2010.pdf
  (also available on the LDC website at https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2010-consistent-flexible-integration.pdf)

Two papers written about the revision and enhancement process of the newswire corpora that resulted in the revised ATB annotation guidelines are available on the LDC website:

  Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines.
  Mohamed Maamouri, Ann Bies, Seth Kulick.
  In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
  Available: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2008-enhancing-arabic-treebank.pdf

  Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation.
  Mohamed Maamouri, Seth Kulick, Ann Bies.
  In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
  Available: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2008-diacritic-annotation-atb.pdf

In addition, a paper written about the development of the Egyptian Arabic Treebank is also available on the LDC website:

  Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development.
  Mohamed Maamouri, Ann Bies, Seth Kulick, Michael Ciul, Nizar Habash and Ramy Eskander.
  In Proceedings of LREC 2014: 9th Edition of the Language Resources and Evaluation Conference, Reykjavik, May 26-31.
  Available: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2014-developing-egyptian-arabic-treebank.pdf

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases:

(a) Part-of-Speech (=POS) tagging which divides the text into lexical tokens, and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and

(b) Arabic Treebanking (=ArabicTB) which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.qamus.org/transliteration.htm.

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available in the docs directory of this release. The LDC Egyptian Arabic Treebank Morphological and Syntactic Annotation Guidelines are also available in the docs directory of this release.

2.2 Annotation Process

The original morphological/POS annotation for the BOLT e-corpora was done using SAMA 3.1 as the morphological analyzer for the MSA tokens for all of the subcorpora. The morphological analyzer for the ARZ tokens was CALIMA v0.3 for the original BOLT subcorpora parts 1-6, and CALIMA v0.4.2 for the original BOLT subcorpora parts 7-8. The change in ARZ analyzer is because the annotation and the development of the Egyptian Arabic morphological analyzer were taking place in parallel, and the later subcorpora took advantage of further development in the analyzer.
Once the treebank annotation and CALIMA development were both complete, it was possible to reconcile the annotation to be as consistent as possible with the latest version of the analyzer, CALIMA v0.5, which has been done for the data in this release.

Both the LDC Standard Arabic Morphological Analyzer (LDC2010L01, SAMA 3.1), for the MSA tokens, and the CALIMA Morphological Analyzer (v0.3 or v0.4.2, depending on the subcorpus), for the ARZ tokens, were used to generate a candidate list of POS values for each word/token. Our annotators picked the appropriate one manually, or else manually supplied segmentation and POS information if neither analyzer contained the solution for the token.

Due to the nature of this Egyptian Arabic corpus (primarily ARZ, but unavoidably including also some MSA), the relationship between the source tokens and the morphological analyzer is more complicated than for the entirely MSA ATB corpora, since this Egyptian Arabic Treebank corpus contains references to both the SAMA 3.1 Morphological Analyzer (LDC2010L01), for the MSA tokens, and the CALIMA v0.5 Morphological Analyzer, for the ARZ tokens. The POS annotation was done simultaneously with development of the morphological analyzer. As a result, the previously released BOLT e-corpora data contained some inevitable inconsistencies between the part-of-speech/vocalization/lemma solutions and the morphological analyzer solutions. These are now reconciled in this combined release, and this release also updates the synchronization of the tokens in the corpus with the morphological analyzers: SAMA 3.1 Morphological Analyzer (LDC2010L01), for the Modern Standard Arabic (MSA) tokens, and the CALIMA v0.5 Morphological Analyzer, for the Egyptian Arabic (ARZ) tokens. 85.3% of the ARZ source tokens in this combined corpus (294641/345273) are a complete match with CALIMA v0.5. Details can be found in docs/readme-files.txt in this release.
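The CALIMA v0.5 match rate quoted above can be reproduced from the two counts given in the text. The short Python sketch below is only an arithmetic check; the variable names are illustrative and are not part of the release.

```python
# Counts quoted in this README for ARZ source tokens in the combined corpus.
matched_arz_tokens = 294641  # tokens that are a complete match with CALIMA v0.5
total_arz_tokens = 345273    # all ARZ source tokens

# Share of ARZ source tokens with a complete match, and the remainder.
match_rate = matched_arz_tokens / total_arz_tokens
unmatched_arz_tokens = total_arz_tokens - matched_arz_tokens

print(f"complete match: {match_rate:.1%}")
print(f"ARZ tokens without a complete match: {unmatched_arz_tokens}")
```

The remainder is simply the difference between the two counts; docs/readme-files.txt is the place to look for what those tokens actually are.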
We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for numerical data, PUNC for punctuation, and FOREIGN or LATIN for non-Arabic alphabetic data.

We then implemented automatic checks on the part-of-speech tags, with further manual revision when necessary, to ensure the consistency of the part-of-speech tags with the current guidelines. The morphological tagset was then reduced to a smaller POS set, to facilitate future automatic dialectal parsing.

Once POS annotation was complete, we automatically separated the clitics based on the POS selection. Human annotators provided full syntactic trees manually, according to the MSA and ARZ Treebank Annotation Guidelines.

The QC process consisted of a series of specific searches for several hundred types of potential inconsistency and annotation error. Any errors found in these searches were hand corrected in two passes.

The annotators for this project were Nancy Abdelhalim, Olfa Bayouth, Maha Ben Hadj Aleya, Sameh Benna, Asma Berrima, Faiez Dhieb, Seham El Kareh, Soha Sobhy Ali Abd El-Raheem, Radwa Essam Abd Elmonaem Elsawy, Omnia Abdelmonem Elsayed, Rachida Fathallah, Fatma Gaddeche, Esma Maamouri Ghrib, Aicha Graja, Nadia Hamrouni, Nermine Khalil, Nawred Khazri, Sondos Krouna, Badia Laadioui, Leila Laghrissi, Omnia Taha Mahfouz, Reham Mohamed Marzouk, Soumeya Mekki, Fatma Elaaty Mohamed, Reem Nabil Mohammed, Sherine Hassan Mustapha, Mouna Rezig, Mahytab Mohammed Abbas Shouman, and Dalila Tabassi.

3. Source Data Profile

3.1 Data Selection Process

This Egyptian Arabic Treebank Discussion Forum corpus consists of 730 files of discussion forum text from various sources. There are a total of 400,448 source tokens before clitics are split and 508,548 tree tokens after clitics are separated for the treebank annotation. All of this data has been annotated for morphology/part-of-speech and syntactic structure.
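To put the corpus size figures above in proportion, the following Python sketch derives the effect of clitic separation and the average file size from the quoted totals. It uses only the numbers stated in this README; the variable names are illustrative.

```python
# Totals quoted in this README.
num_files = 730
source_tokens = 400448  # before clitics are split
tree_tokens = 508548    # after clitics are separated

# Tokens added by clitic separation, in absolute and relative terms.
added_tokens = tree_tokens - source_tokens
growth = added_tokens / source_tokens
tree_per_source = tree_tokens / source_tokens

# Average number of source tokens per file.
avg_source_tokens_per_file = source_tokens / num_files

print(f"tokens added by clitic separation: {added_tokens} ({growth:.1%})")
print(f"tree tokens per source token: {tree_per_source:.2f}")
print(f"average source tokens per file: {avg_source_tokens_per_file:.0f}")
```

These are corpus-wide averages only; per-file counts vary and should be taken from the data files themselves.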
The files selected for this treebank corpus were chosen from the files of Egyptian Arabic (ARZ) discussion forum that had already undergone SU annotation at LDC.

3.2 Data Sources and Epochs

The data consists of Arabic discussion forum text from various sources collected by LDC.

4. Annotated Data Profile

This data consists of 730 files of discussion forum text from various sources. There are a total of 400,448 source tokens before clitics are split and 508,548 tree tokens, after clitics are separated for the treebank annotation. All of this data has been annotated for morphology/part-of-speech and syntactic structure.

5. Data Directory Structure

The source file IDs are listed in docs/file.ids. A listing of all of the files in this release can be found in docs/file.tbl. The data formats, including the integrated format, are documented in docs/readme-files.txt.

In the data/ directory:

- integrated/ - The goal of this format is to bring together in one place:
    1) the information about the source tokens from the pos/before files, including the explicit mapping between the source and tree tokens,
    2) the information about the tree tokens from the pos/after files,
    3) the tree structure.
  For details about this content, see docs/readme-files.txt.
- penntree/ - the annotation files in Penn Treebank bracketed list style.
- pos/ - the POS annotation for this corpus. For details about this content, see docs/readme-files.txt.
- su_xml/ - the SU annotated files used to supply the source data and tokens as input for the POS annotation.
- tdf/ - the SU files converted into the .tdf format necessary for the operation of the POS annotation tool.
- xml/ - the annotation graph files, in the format used by our syntactic annotation tool.

In the docs/ directory:

- ag-1.1.dtd - This is the dtd file for the AG XML.
- ATB-POSGuidelines-v3.8.pdf - Morphological and part-of-speech annotation guidelines.
- ATB-SyntacticGuidelines-v4.95-20110630.pdf - Syntactic annotation guidelines.
- atb-arz-df-taglist-conversion-to-PennPOS-forrelease.lisp - Lisp code mapping the full morphological tags to a much smaller list, similar to the Penn POS tagset, strictly for convenience.
- file.ids - A list of file ids in the corpus.
- file.tbl - Directory structure for everything in this package.
- KulickBiesMaamouri-LREC2010.pdf - Paper describing data formats and the integration of Treebank and SAMA tokens.
- not-included.txt - A listing of character sequences from the source files that are not included as tokens. See readme-files.txt for further explanation.
- readme-files.txt - Additional details about the data and data formats, including information about the data/pos/before content, data/integrated files and data/tdf files, along with information about the relationship with SAMA.
- tags-count.txt - A list of the POS/morphological tags after the clitics are separated and after treebank annotation, along with the number of occurrences of each tag.
- token-mapping.txt - A mapping making explicit the linkage between the annotation files, the .tdf files, and the .su.xml file. See readme-files.txt for details.

6. File Format Description

A description of the file formats (and the types of files present for each of the IDs in docs/file.ids) is in docs/readme-files.txt and in docs/KulickBiesMaamouri-LREC2010.pdf, including a description of the modifications that have been made to the format of the data in the various .txt and .tree files compared with ATB releases prior to ATB5 and detailed information on the integrated format.

7. Data Validation

The data went through the following annotation procedure:

POS procedure:

- All words were submitted to the morphological analyzers. (Note that for some tokens, there was no solution in either analyzer; most are addressed by the last step below.)
- All words were then included for POS annotation, where annotators either selected one out of many choices provided by the morphological analyzers, or reviewed the annotation done in a previous POS pass.
- Tokens with no solution in either SAMA or CALIMA were annotated using a new "wildcard" feature in the annotation tool that allows annotators to supply annotation for a stem that is not in the analyzer in accordance with the CALIMA/SAMA scheme.

TB procedure:

- Words/tokens from the POS annotation are processed to separate clitics in preparation for TB annotation. After clitic separation, the number of words/tokens increases from 400,448 to 508,548.
- All sentences were manually annotated for syntactic structure.
- Annotators went through a stage of annotation with the help of diagnostic QC searches to catch potential patterns of annotation errors.

Quality assurance & annotation checking for this release:

Every token in the treebank has been explicitly tested against the possible SAMA 3.1 and CALIMA v0.5 solutions for that token. See docs/readme-files.txt for a detailed analysis of the relationship with the CALIMA and SAMA analyzers.

The Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available in the docs/ directory of this release. The LDC Egyptian Arabic Morphological and Syntactic Annotation Guidelines are also available in the docs/ directory of this release.

8. DTDs

There is one DTD, ag-1.1.dtd, for the AG XML files. It is located both in docs/ and with the .xml files in data/xml/treebank/.

9. Copyright Information

Portions (c) 2011-2017 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
Ann Bies, bies@ldc.upenn.edu
Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on January 9, 2017 by Ann Bies.