BOLT Egyptian Arabic Treebank - CTS

CatalogID: LDC2021T12
Release date: June 15, 2021
Linguistic Data Consortium

Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna,
         Dalila Tabassi, Michael Ciul

1. Introduction

This release of the Egyptian Arabic Treebank consists of
part-of-speech/morphological annotation and syntactic tree annotation for
153,171 source tokens (182,965 tree tokens after clitic splitting) of
Egyptian Arabic (ARZ) in 176 files of Conversational Telephone Speech (CTS)
data that was collected and transcribed by LDC, annotated for
morphology/part-of-speech/gloss and syntactic structure.

This data was previously released as subcorpora in earlier versions to the
BOLT community; this publication consolidates the Egyptian Arabic Treebank
CTS data. The corpora that were released to the BOLT community previously
had the catalog numbers LDC2014E120 (ARZ-CTS Part 1), LDC2015E04 (ARZ-CTS
Part 2), and LDC2015E16 (ARZ-CTS Part 3).

The source data for this annotation was selected from CTS data that was
originally collected for CallHome, transcribed and SU annotated. This
source data has not been released at the time of this documentation.

This publication contains part-of-speech/morphology/gloss annotation and
syntactic treebank annotation that is in accordance with the Penn Arabic
Treebank (PATB) annotation guidelines. The Penn Arabic Treebank MSA
Morphological and Syntactic Annotation Guidelines are both available in the
docs directory of this release (docs/ATB-POSGuidelines-v3.8.pdf and
docs/ATB-SyntacticGuidelines-v4.95-20110630.pdf). These are the same
annotation guidelines used for the PATB releases.

This publication also includes the LDC Guidelines for Treebank Annotation
of Speech Effects and Disfluency for the Penn Arabic Treebank V1.0
(atb-bn-guidelines-v1.0.pdf). While this is not specifically for Egyptian
Arabic, it was still followed for the annotation of this data.
This release conforms to the format conventions detailed in
docs/readme-files.txt and in the following paper:

  Consistent and Flexible Integration of Morphological Annotation in the
  Arabic Treebank. Seth Kulick, Ann Bies and Mohamed Maamouri. In
  Proceedings of the Seventh International Conference on Language
  Resources and Evaluation (LREC 2010), Malta, May 19-21, 2010.
  Available: docs/KulickBiesMaamouri-LREC2010.pdf (and also available on
  the LDC website).

Due to the nature of this Egyptian Arabic corpus, the relationship between
the source tokens and the morphological analyzer is more complicated than
with the MSA data: the annotation contains references to both the SAMA 3.1
Morphological Analyzer (LDC2010L01), for the MSA tokens, and the CALIMA
v0.5 Morphological Analyzer, for the ARZ tokens. Detailed information about
the correspondence can also be found in docs/readme-files.txt.

Two papers written about the revision and enhancement process of the
newswire corpora that resulted in the revised ATB annotation guidelines are
available on the LDC website:

  Enhancing the Arabic Treebank: A Collaborative Effort toward New
  Annotation Guidelines. Mohamed Maamouri, Ann Bies, Seth Kulick. In
  Proceedings of the Sixth International Conference on Language Resources
  and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.

  Diacritic Annotation in the Arabic Treebank and its Impact on Parser
  Evaluation. Mohamed Maamouri, Seth Kulick, Ann Bies. In Proceedings of
  the Sixth International Conference on Language Resources and Evaluation
  (LREC 2008), Marrakech, Morocco, May 28-30, 2008.

In addition, a paper written about the development of the Egyptian Arabic
Treebank is also available on the LDC website:

  Developing an Egyptian Arabic Treebank: Impact of Dialectal Morphology
  on Annotation and Tool Development. Mohamed Maamouri, Ann Bies, Seth
  Kulick, Michael Ciul, Nizar Habash and Ramy Eskander. In Proceedings of
  LREC 2014: 9th Edition of the Language Resources and Evaluation
  Conference, Reykjavik, May 26-31.

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content
does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases:

  (a) Part-of-Speech (=POS) tagging, which divides the text into lexical
      tokens and gives relevant information about each token, such as
      lexical category, inflectional features, and a gloss. (This phase is
      referred to as POS for convenience, although it includes
      morphological and gloss information not traditionally included with
      part-of-speech annotation.)

  (b) Arabic Treebanking (=ArabicTB), which characterizes the constituent
      structures of word sequences, provides categories for each
      non-terminal node, and identifies null elements, co-reference,
      traces, etc.

This corpus uses Tim Buckwalter's transliteration system.

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic
Annotation Guidelines are available in the docs directory of this release.

2.2 Annotation Process

Both the LDC Standard Arabic Morphological Analyzer (LDC2010L01, SAMA 3.1),
for the MSA tokens, and the CALIMA Morphological Analyzer v0.5, for the ARZ
tokens, were used to generate a candidate list of POS values for each
word/token. Our annotators picked the appropriate one manually, or else
manually supplied segmentation and POS information if neither analyzer
contained the solution for the token.
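For readers unfamiliar with the Buckwalter transliteration mentioned in
Section 2.1, the following is a minimal sketch of converting between
Buckwalter ASCII and Arabic script. It uses the commonly published core
Buckwalter table; whether this corpus uses exactly this table or adds
dialect-specific extensions is an assumption that should be checked against
the documentation in docs/. The names BUCKWALTER_TO_ARABIC,
buckwalter_to_arabic and arabic_to_buckwalter are hypothetical and are not
part of this release.

```python
# Core Buckwalter table (commonly published version). Assumption: the corpus
# may use additional symbols that are not covered here.
BUCKWALTER_TO_ARABIC = {
    "'": "\u0621", "|": "\u0622", ">": "\u0623", "&": "\u0624",
    "<": "\u0625", "}": "\u0626", "A": "\u0627", "b": "\u0628",
    "p": "\u0629", "t": "\u062A", "v": "\u062B", "j": "\u062C",
    "H": "\u062D", "x": "\u062E", "d": "\u062F", "*": "\u0630",
    "r": "\u0631", "z": "\u0632", "s": "\u0633", "$": "\u0634",
    "S": "\u0635", "D": "\u0636", "T": "\u0637", "Z": "\u0638",
    "E": "\u0639", "g": "\u063A", "_": "\u0640", "f": "\u0641",
    "q": "\u0642", "k": "\u0643", "l": "\u0644", "m": "\u0645",
    "n": "\u0646", "h": "\u0647", "w": "\u0648", "Y": "\u0649",
    "y": "\u064A",
    # Diacritics
    "F": "\u064B", "N": "\u064C", "K": "\u064D", "a": "\u064E",
    "u": "\u064F", "i": "\u0650", "~": "\u0651", "o": "\u0652",
    "`": "\u0670", "{": "\u0671",
}

# The table is one-to-one, so it can be inverted directly.
ARABIC_TO_BUCKWALTER = {v: k for k, v in BUCKWALTER_TO_ARABIC.items()}


def buckwalter_to_arabic(text: str) -> str:
    """Map each Buckwalter character to Arabic script; pass others through."""
    return "".join(BUCKWALTER_TO_ARABIC.get(ch, ch) for ch in text)


def arabic_to_buckwalter(text: str) -> str:
    """Map each Arabic character to Buckwalter; pass others through."""
    return "".join(ARABIC_TO_BUCKWALTER.get(ch, ch) for ch in text)


if __name__ == "__main__":
    word = "ktAb"  # illustrative input, not taken from the corpus
    print(buckwalter_to_arabic(word))
    print(arabic_to_buckwalter(buckwalter_to_arabic(word)))
```

Characters outside the table (digits, whitespace, clitic markers such as
"+") are passed through unchanged, so the functions can be applied to whole
annotation fields, but only to fields that actually hold transliterated
Arabic; applying them to tag names or Latin-script tokens would corrupt them.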
Due to the nature of this Egyptian Arabic corpus (which contains primarily
ARZ, but unavoidably also includes some MSA data), the relationship between
the source tokens and the morphological analyzer is more complicated than
for the entirely MSA ATB corpora, since this Egyptian Arabic Treebank
corpus contains references to both morphological analyzers. Details can be
found in docs/readme-files.txt in this release.

We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for
numerical data, PUNC for punctuation, and FOREIGN for non-Arabic alphabetic
data.

We then ran automatic checks on the part-of-speech tags, followed by
further manual revision when necessary, to ensure the consistency of the
part-of-speech tags with the current guidelines.

Once POS annotation was complete, we automatically separated the clitics
based on the POS selection.

Human annotators provided full syntactic trees manually, according to the
MSA and ARZ Treebank Annotation Guidelines.

The QC process consisted of a series of specific searches for several
hundred types of potential inconsistency and annotation error. Any errors
found in these searches were hand corrected in two passes.

The annotators for this project were Nancy Abdelhalim, Olfa Bayouth, Maha
Ben Hadj Aleya, Sameh Benna, Asma Berrima, Faiez Dhieb, Seham El Kareh,
Soha Sobhy Ali Abd El-Raheem, Radwa Essam Abd Elmonaem Elsawy, Omnia
Abdelmonem Elsayed, Rachida Fathallah, Esma Maamouri Ghrib, Aicha Graja,
Nadia Hamrouni, Nermine Khalil, Nawred Khazri, Sondos Krouna, Badia
Laadioui, Leila Laghrissi, Omnia Taha Mahfouz, Reham Mohamed Marzouk,
Soumeya Mekki, Fatma Elaaty Mohamed, Reem Nabil Mohammed, Sherine Hassan
Mustapha, Mouna Rezig, Mahytab Mohammed Abbas Shouman, and Dalila Tabassi.

3. Source Data Profile

3.1 Data Selection Process

This release of the Egyptian Arabic Treebank consists of 153,171 source
tokens (182,965 tree tokens after clitic splitting) of Egyptian Arabic
(ARZ) in 176 files of Conversational Telephone Speech (CTS) data that was
collected and transcribed by LDC for CallHome, annotated for
morphology/part-of-speech/gloss and syntactic structure.

The files selected for this treebank corpus were chosen from the files of
Egyptian Arabic (ARZ) CTS that underwent SU annotation at LDC.

3.2 Data Sources and Epochs

The source data for this annotation was selected from CTS data that was
originally collected for CallHome, transcribed and SU annotated. This
source data has not been released at the time of this documentation.

4. Annotated Data Profile

This data consists of 176 files of CTS text transcribed from data
originally collected for CallHome. There are a total of 153,171 source
tokens before clitic splitting and 182,965 tree tokens after clitics were
separated for the treebank annotation. All of this data has been annotated
for morphology/part-of-speech and syntactic structure.

5. Data Directory Structure

The source file IDs are listed in docs/file.ids. A listing of all of the
files in this release can be found in docs/file.tbl. The data formats,
including the integrated format, are documented in docs/readme-files.txt.

In the data/ directory:

- integrated/ - The goal of this format is to bring together in one place:
    1) the information about the source tokens from the pos/before files,
       including the explicit mapping between the source and tree tokens,
    2) the information about the tree tokens from the pos/after files,
    3) the tree structure.
  For details about this content, see docs/readme-files.txt.
- penntree/ - the annotation files in Penn Treebank bracketed list style.
- pos/ - the POS annotation for this corpus. For details about this
  content, see docs/readme-files.txt.
- su_xml/ - the SU annotated files used to supply the source data and
  tokens as input for the POS annotation.
- tdf/ - the SU files converted into the .tdf format necessary for the
  operation of the POS annotation tool.
- xml/ - the annotation graph files, in the format used by our syntactic
  annotation tool.

In the docs/ directory:

- ag-1.1.dtd - This is the DTD file for the AG XML. Also included in
  data/xml/treebank.
- ATB-POSGuidelines-v3.8.pdf - Morphological and part-of-speech annotation
  guidelines.
- ATB-SyntacticGuidelines-v4.95-20110630.pdf - Syntactic annotation
  guidelines.
- atb-bn-guidelines-v1.0.pdf - Speech Effects guidelines.
- atb-arz-cts-taglist-conversion-to-PennPOS-forrelease.txt - A mapping of
  the full morphological tags to a much smaller list, similar to the Penn
  POS tagset, strictly for convenience.
- file.ids - A list of file ids in the corpus.
- file.tbl - Directory structure for everything in this package.
- KulickBiesMaamouri-LREC2010.pdf - Paper describing data formats and the
  integration of Treebank and SAMA tokens.
- readme-files.txt - Additional details about the data and data formats,
  including information about the data/pos/before content, data/integrated
  files and data/tdf files, along with information about the relationship
  with SAMA.
- tags-count.txt - A list of the POS/morphological tags after the clitics
  are separated and after treebank annotation, along with the number of
  occurrences of each tag.

6. File Format Description

A description of the file formats (and the types of files present for each
of the IDs in docs/file.ids) is in docs/readme-files.txt.

7. Data Validation

The data went through the following annotation procedure:

POS procedure:

- All words were submitted to the morphological analyzers. (Note that for
  some tokens, there was no solution in either analyzer; most are addressed
  by the last step below.)
- All words were then included for POS annotation, where annotators either
  selected one out of many choices provided by the morphological analyzers,
  or reviewed the annotation done in a previous POS pass.
- Tokens with no solution in either SAMA or CALIMA were annotated using a
  "wildcard" feature in the annotation tool that allows annotators to
  supply annotation for a stem that is not in the analyzer, in accordance
  with the CALIMA/SAMA scheme.
- Tags added with the wildcard feature that failed certain QC tests were
  converted to NO_FUNC. In addition, tokenization problems due to
  transliteration errors were annotated with TYPO or NO_FUNC.

TB procedure:

- Words/tokens from the POS annotation were processed to separate clitics
  in preparation for TB annotation. After clitic separation, the number of
  words/tokens increased from 153,171 to 182,965.
- All sentences were manually annotated for syntactic structure.
- Annotators went through a stage of annotation with the help of
  diagnostic QC searches to catch potential patterns of annotation errors.
- Tokens with NO_FUNC or otherwise problematic POS tags that required
  merging or splitting that was not possible for this release were placed
  under an X node in the tree.

Quality assurance & annotation checking for this release:

Every token in the treebank has been explicitly tested against the possible
SAMA 3.1 and CALIMA v0.5 solutions for that token. See
docs/readme-files.txt for a detailed analysis of the relationship with the
CALIMA and SAMA analyzers.

The Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation
Guidelines are available in the docs/ directory of this release.

8. DTDs

There is one DTD, ag-1.1.dtd, for the AG XML files. It is located both in
docs/ and with the .xml files in data/xml/treebank/.

9. Copyright Information

Portions (c) 2011-2021 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

  Mohamed Maamouri, manager and senior researcher
  Ann Bies
  Seth Kulick

11. Update Log

This index was updated on April 10, 2019 by Seth Kulick
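
As a practical footnote to the data/penntree/ directory described in
Section 5, the following is a minimal sketch of reading Penn Treebank
bracketed trees and listing their (tag, token) leaves. It assumes only the
generic bracketed style, including an optional empty outer label; the exact
layout of the files in this release is documented in
docs/readme-files.txt and has not been verified here. The function names
parse_trees and leaves, and the sample tree, are hypothetical and purely
illustrative.

```python
import re
from typing import Iterator, List, Tuple

# A tree is (label, children); each child is either a tree or a str leaf.
Tree = Tuple[str, list]

_TOKEN = re.compile(r"\(|\)|[^\s()]+")


def parse_trees(text: str) -> Iterator[Tree]:
    """Yield every top-level bracketed tree found in text."""
    tokens = _TOKEN.findall(text)
    pos = 0

    def parse() -> Tree:
        nonlocal pos
        pos += 1  # consume "("
        label = ""
        if tokens[pos] not in ("(", ")"):
            label = tokens[pos]  # the outermost label may be empty
            pos += 1
        children: list = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    while pos < len(tokens):
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' but found {tokens[pos]!r}")
        yield parse()


def leaves(tree: Tree) -> List[Tuple[str, str]]:
    """Return (preterminal tag, token) pairs in left-to-right order."""
    label, children = tree
    pairs: List[Tuple[str, str]] = []
    for child in children:
        if isinstance(child, str):
            pairs.append((label, child))
        else:
            pairs.extend(leaves(child))
    return pairs


if __name__ == "__main__":
    # Invented toy tree for illustration; not taken from the corpus.
    sample = "( (S (VP (PV katab) (NP (NOUN kitAb)))))"
    for tree in parse_trees(sample):
        print(leaves(tree))
```

The sketch does no error recovery: unbalanced brackets raise an IndexError
or ValueError rather than being repaired. For real use, an established
treebank reader is likely a better choice than this hand-written parser.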