BOLT Egyptian Arabic Treebank - SMS/Chat V1.0

CatalogID:
Release date:

Linguistic Data Consortium

Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna, Dalila Tabassi, Michael Ciul

1. Introduction

This Egyptian Arabic Treebank corpus consists of part-of-speech/morphological/gloss annotation and syntactic tree annotation for 349,414 source tokens (435,677 tree tokens after clitic splitting) of Egyptian Arabic (ARZ) in 1367 files of SMS/Chat data from various sources, originally written in Arabizi (Romanized/Latin-character) script. Prior to treebank annotation, this data was transliterated from Arabizi to Arabic script, with manual correction; this corrected transliteration was used as the input for the treebank annotation pipeline.

This data was previously released as subcorpora in earlier versions to the BOLT community; this publication consolidates the Egyptian Arabic Treebank SMS/Chat (SMS) data. The corpora that were released to the BOLT community previously had the following catalog numbers:

- LDC2013E120 (ARZ-SMS Part 1)
- LDC2013E133 (ARZ-SMS Part 2)
- LDC2014E17 (ARZ-SMS Part 3)
- LDC2014E43 (ARZ-SMS Part 4)
- LDC2014E63 (ARZ-SMS Part 5)
- LDC2014E77 (ARZ-SMS Part 6)
- LDC2014E95 (ARZ-SMS Part 7)
- LDC2015E26 (ARZ-SMS Part 8)

This publication contains part-of-speech/morphology/gloss annotation and syntactic treebank annotation that is in accordance with the Penn Arabic Treebank (PATB) annotation guidelines. The Penn Arabic Treebank MSA Morphological and Syntactic Annotation Guidelines are both available in the docs directory of this release (docs/ATB-POSGuidelines-v3.8.pdf and docs/ATB-SyntacticGuidelines-v4.95-20110630.pdf). These are the same annotation guidelines used for other PATB releases.

This publication also includes the LDC Egyptian Arabic Treebank Morphological and Syntactic Annotation Guidelines in the docs directory (docs/ARZ-POSGuidelines-v1.2.pdf and docs/ARZ-SyntacticGuidelines-v0.2.pdf).
These guidelines target informal data, especially SMS/Chat data, but also apply to other informal data such as discussion forum data, and so were also used for the BOLT Egyptian Arabic Treebank Discussion Forum publication (LDC2018T23).

This release conforms to the format conventions detailed in docs/readme-files.txt and in the docs/KulickBiesMaamouri-LREC2010.pdf paper:

Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank.
Seth Kulick, Ann Bies and Mohamed Maamouri.
In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Malta, May 19-21, 2010.
Available: docs/KulickBiesMaamouri-LREC2010.pdf
(and also available on the LDC website at https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2010-consistent-flexible-integration.pdf)

Due to the nature of this Egyptian Arabic corpus, the relationship between the source tokens and the morphological analyzer is more complicated than with the MSA data, containing references to both the SAMA 3.1 Morphological Analyzer (LDC2010L01), for the MSA tokens, and the CALIMA v0.5 Morphological Analyzer, for the ARZ tokens. Detailed information about the correspondence can also be found in docs/readme-files.txt.

Two papers written about the revision and enhancement process of the newswire corpora that resulted in the revised ATB annotation guidelines are available on the LDC website:

Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines.
Mohamed Maamouri, Ann Bies, Seth Kulick.
In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
Available:
Paper: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2008-enhancing-arabic-treebank.pdf

Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation.
Mohamed Maamouri, Seth Kulick, Ann Bies.
In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
Available:
Paper: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2008-diacritic-annotation-atb.pdf

In addition, a paper written about the development of the Egyptian Arabic Treebank is also available on the LDC website:

Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology on Annotation and Tool Development.
Mohamed Maamouri, Ann Bies, Seth Kulick, Michael Ciul, Nizar Habash and Ramy Eskander.
In Proceedings of LREC 2014: 9th Edition of the Language Resources and Evaluation Conference, Reykjavik, May 26-31.
Available:
Paper: https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2014-developing-egyptian-arabic-treebank.pdf

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases:

(a) Part-of-Speech (=POS) tagging, which divides the text into lexical tokens, and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and

(b) Arabic Treebanking (=ArabicTB) which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.qamus.org/transliteration.htm.
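As an illustration of the transliteration scheme mentioned above, the following is a minimal Python sketch that converts between Buckwalter transliteration and Arabic script. It covers only the standard Buckwalter table; the function names are hypothetical, and any corpus-specific extensions to the scheme (if present in this release) are not handled and are simply passed through unchanged.

```python
# Minimal sketch: standard Buckwalter <-> Arabic script conversion.
# Assumption: only the standard Buckwalter table is covered; characters
# outside the table (digits, punctuation, extensions) pass through as-is.

BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A",
    # Diacritics: tanween, short vowels, shadda, sukun, dagger alef
    "F": "\u064B", "N": "\u064C", "K": "\u064D", "a": "\u064E",
    "u": "\u064F", "i": "\u0650", "~": "\u0651", "o": "\u0652",
    "`": "\u0670",
    "{": "\u0671",  # alef wasla
}

# The table is one-to-one, so it can be inverted directly.
ARABIC_TO_BUCKWALTER = {ar: bw for bw, ar in BUCKWALTER_TO_ARABIC.items()}


def buckwalter_to_arabic(text):
    """Map each Buckwalter character to its Arabic-script counterpart."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


def arabic_to_buckwalter(text):
    """Map each Arabic-script character to its Buckwalter counterpart."""
    return "".join(ARABIC_TO_BUCKWALTER.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = "ktAb"  # illustrative example, not taken from the corpus
    arabic = buckwalter_to_arabic(word)
    print(arabic, arabic_to_buckwalter(arabic))
```

Because the mapping is character-by-character, it should only be applied to fields known to contain transliterated Arabic; applying it to Latin-script material (for example tokens tagged FOREIGN) would corrupt that text.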
The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available in the docs directory of this release. The LDC Egyptian Arabic Treebank Morphological and Syntactic Annotation Guidelines are also available in the docs directory of this release.

2.2 Annotation Process

Both the LDC Standard Arabic Morphological Analyzer (LDC2010L01, SAMA 3.1), for the MSA tokens, and the CALIMA Morphological Analyzer v0.5 for the ARZ tokens, were used to generate a candidate list of POS values for each word/token. Our annotators picked the appropriate one manually, or else manually supplied segmentation and POS information if neither analyzer contained the solution for the token.

Due to the nature of this Egyptian Arabic corpus (which contains primarily ARZ, but unavoidably includes also some MSA data), the relationship between the source tokens and the morphological analyzer is more complicated than for the entirely MSA ATB corpora, since this Egyptian Arabic Treebank corpus contains references to both morphological analyzers. Details can be found in docs/readme-files.txt in this release.

We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for numerical data, PUNC for punctuation, and FOREIGN for non-Arabic alphabetic data.

After POS annotation, we then implemented automatic checks on the part-of-speech tags with consequent further manual revision when necessary to ensure the consistency of the part-of-speech tags with the current guidelines.

Once POS annotation was complete, we automatically separated the clitics based on the POS selection.

Human annotators provided full syntactic trees manually, according to the MSA and ARZ Treebank Annotation Guidelines.

The QC process consisted of a series of specific searches for several hundred types of potential inconsistency and annotation error. Any errors found in these searches were hand corrected in two passes.
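The syntactic trees produced by this process are distributed in Penn Treebank bracketed style (data/penntree/, see Section 5). The following is a minimal Python sketch of reading one such bracketed tree and listing its (tag, token) leaves. The function names and the sample tree are hypothetical and are not taken from the corpus; the sketch assumes that literal parentheses never occur inside tags or tokens, and the exact file layout should be checked against docs/readme-files.txt.

```python
# Minimal sketch: parse one Penn Treebank style bracketed tree.
# Assumptions: parentheses are used only for bracketing, and the
# sample tree and tags below are made up for illustration.

def parse_bracketed(text):
    """Return the tree as nested lists, e.g. ['S', ['NP', ...], ...]."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()
    pos = 0

    def read():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok != "(":
            return tok  # a label or a terminal token
        node = []
        while tokens[pos] != ")":
            node.append(read())
        pos += 1  # consume the closing parenthesis
        return node

    return read()


def leaves(node):
    """Collect (tag, token) pairs from preterminal nodes, left to right."""
    if len(node) == 2 and all(isinstance(x, str) for x in node):
        return [(node[0], node[1])]
    pairs = []
    for child in node:
        if isinstance(child, list):
            pairs.extend(leaves(child))
    return pairs


if __name__ == "__main__":
    sample = "(S (VP (PV ktb) (NP-SBJ (NOUN Alwld))))"  # hypothetical tree
    tree = parse_bracketed(sample)
    print(tree)
    print(leaves(tree))
```

Nested lists are used here only to keep the sketch dependency-free; an existing treebank reader could be substituted if one is already part of the user's toolchain.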
The annotators for this project were Nancy Abdelhalim, Olfa Bayouth, Maha Ben Hadj Aleya, Sameh Benna, Asma Berrima, Faiez Dhieb, Seham El Kareh, Soha Sobhy Ali Abd El-Raheem, Radwa Essam Abd Elmonaem Elsawy, Omnia Abdelmonem Elsayed, Rachida Fathallah, Fatma Gaddeche, Esma Maamouri Ghrib, Aicha Graja, Nadia Hamrouni, Nermine Khalil, Nawred Khazri, Sondos Krouna, Badia Laadioui, Leila Laghrissi, Omnia Taha Mahfouz, Reham Mohamed Marzouk, Soumeya Mekki, Fatma Elaaty Mohamed, Reem Nabil Mohammed, Sherine Hassan Mustapha, Mouna Rezig, Mahytab Mohammed Abbas Shouman, and Dalila Tabassi.

3. Source Data Profile

3.1 Data Selection Process

This corpus of Egyptian Arabic Treebank SMS/Chat consists of 1367 files of SMS/Chat data from various sources, originally written in Arabizi (Romanized/Latin-character) script. There are a total of 349,414 source tokens before clitic splitting and 435,677 tree tokens after the clitics are separated for the treebank annotation. All of this data has been annotated for morphology/part-of-speech and syntactic structure.

The files selected for this treebank corpus were chosen from the files of Egyptian Arabic (ARZ) SMS/Chat data that had already undergone SU annotation at LDC.

3.2 Data Sources and Epochs

The data was collected by LDC in 2013, and transliterated from Arabizi to Arabic script, with manual correction. This corrected transliteration was used as the input for the treebank annotation pipeline. The Arabic data has been released as BOLT Egyptian Arabic SMS/Chat and Transliteration (LDC2017T07).

4. Annotated Data Profile

This data consists of 1367 files of SMS/Chat text from various sources. There are a total of 349,414 source tokens before clitic splitting and 435,677 tree tokens after the clitics are separated for the treebank annotation. All of this data has been annotated for morphology/part-of-speech and syntactic structure.

5. Data Directory Structure

The source file IDs are listed in docs/file.ids.
A listing of all of the files in this release can be found in docs/file.tbl. The data formats, including the integrated format, are documented in docs/readme-files.txt.

In the data/ directory:

- integrated/ - The goal of this format is to bring together in one place:
    1) the information about the source tokens from the pos/before files, including the explicit mapping between the source and tree tokens,
    2) the information about the tree tokens from the pos/after files,
    3) the tree structure.
  For details about this content, see docs/readme-files.txt.
- penntree/ - the annotation files in Penn Treebank bracketed list style.
- pos/ - the POS annotation for this corpus. For details about this content, see docs/readme-files.txt.
- su_xml/ - the SU annotated files used to supply the source data and tokens as input for the POS annotation.
- tdf/ - the SU files converted into the .tdf format necessary for the operation of the POS annotation tool.
- xml/ - the annotation graph files, in the format used by our syntactic annotation tool.

In the docs/ directory:

- ag-1.1.dtd - This is the DTD file for the AG XML. Also included in data/xml/treebank.
- ATB-POSGuidelines-v3.8.pdf - Morphological and part-of-speech annotation guidelines.
- ATB-SyntacticGuidelines-v4.95-20110630.pdf - Syntactic annotation guidelines.
- ARZ-POSGuidelines-v1.2.pdf - Egyptian Arabic Treebank Morphological annotation guidelines.
- ARZ-SyntacticGuidelines-v0.2.pdf - Egyptian Arabic Syntactic annotation guidelines.
- atb-arz-sms-taglist-conversion-to-PennPOS-forrelease.txt - A mapping of the full morphological tags to a much smaller list, similar to the Penn POS tagset, strictly for convenience.
- file.ids - A list of file ids in the corpus.
- file.tbl - Directory structure for everything in this package.
- KulickBiesMaamouri-LREC2010.pdf - Paper describing data formats and the integration of Treebank and SAMA tokens.
- README.txt - This file.
- readme-files.txt - Additional details about the data and data formats, including information about the data/pos/before content, data/integrated files and data/tdf files, along with information about the relationship with SAMA.
- tags-count.txt - A list of the POS/morphological tags after the clitics are separated and after treebank annotation, along with the number of occurrences of each tag.

6. File Format Description

A description of the file formats (and the types of files present for each of the IDs in docs/file.ids) is in docs/readme-files.txt and in docs/KulickBiesMaamouri-LREC2010.pdf, including a description of the modifications that have been made to the format of the data in the various .txt and .tree files compared with ATB releases prior to ATB5 and detailed information on the integrated format.

7. Data Validation

The data went through the following annotation procedure:

POS procedure:

- All words were submitted to the morphological analyzers. (Note that for some tokens, there was no solution in either analyzer; most are addressed by the last step below.)
- All words were then included for POS annotation, where annotators either selected one out of many choices provided by the morphological analyzers, or reviewed the annotation done in a previous POS pass.
- Tokens with no solution in either SAMA or CALIMA were annotated using a "wildcard" feature in the annotation tool that allows annotators to supply annotation for a stem that is not in the analyzer in accordance with the CALIMA/SAMA scheme.
- Tags added with the wildcard feature that failed certain QC tests were converted to NO_FUNC. In addition, tokenization problems due to transliteration errors were annotated with TYPO or NO_FUNC.

TB procedure:

- Words/tokens from the POS annotation are processed to separate clitics in preparation for TB annotation. After clitic separation, the number of words/tokens increases from 349,414 to 435,677.
- All sentences were manually annotated for syntactic structure.
- Annotators went through a stage of annotation with the help of diagnostic QC searches to catch potential patterns of annotation errors.
- Tokens with NO_FUNC or otherwise problematic POS tags that required merging or splitting that was not possible for this release were placed under an X node in the tree.

Quality assurance & annotation checking for this release:

Every token in the treebank has been explicitly tested against the possible SAMA 3.1 and CALIMA v0.5 solutions for that token. See docs/readme-files.txt for a detailed analysis of the relationship with the CALIMA and SAMA analyzers.

The Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available in the docs/ directory of this release. The LDC Egyptian Arabic Morphological and Syntactic Annotation Guidelines are also available in the docs/ directory of this release.

8. DTDs

One for the AG XML files, ag-1.1.dtd, located both in docs/ and with the .xml files in data/xml/treebank/.

9. Copyright Information

Portions (c) 2011-2019 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
Ann Bies, bies@ldc.upenn.edu
Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on March 22, 2019 by Seth Kulick