Arabic Treebank - Broadcast News V1.0
CatalogID: LDC2012T07
Release date: April 16, 2012
Linguistic Data Consortium

Authors: Mohamed Maamouri, Ann Bies, Seth Kulick, Sondos Krouna, Dalila Tabassi, Michael Ciul

1. Introduction

This corpus of Arabic Treebank consists of part-of-speech/morphological annotation and syntactic tree annotation for 432,976 source tokens (517,080 tree tokens, after clitic splitting) of transcribed Arabic broadcast news speech. All of this data was previously released as subcorpora in earlier versions to the GALE community; this publication consolidates the Arabic Treebank broadcast news data, and includes final corrections to some tokens and part-of-speech tags.

This publication contains part-of-speech/morphology/gloss annotation and syntactic treebank annotation that is in accordance with the Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines, available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/. These are the same annotation guidelines used for the updated and revised newswire corpora that have been recently released (Arabic Treebank Part 1 - V4.1, CatalogID: LDC2010T13; Arabic Treebank Part 2 v 3.1, CatalogID: LDC2011T09; Arabic Treebank part 3 - v3.2, CatalogID: LDC2010T08).

The transcription of this broadcast news data was produced by LDC.

This release conforms to the format conventions initiated with the releases of Arabic Treebank part 5 - v1.0, LDC2009E72 (ATB5) and Arabic Treebank Part 6 V1.0 - GALE Phase 4 dev09, LDC2009E108 (ATB6), which are detailed in docs/readme-files.txt and in the docs/KulickBiesMaamouri-LREC2010.pdf paper:

Consistent and Flexible Integration of Morphological Annotation in the Arabic Treebank. Seth Kulick, Ann Bies and Mohamed Maamouri. In Proceedings of the Seventh International Conference on Language Resources and Evaluation (LREC 2010), Malta May 19-21, 2010.
Available: docs/KulickBiesMaamouri-LREC2010.pdf

In addition, two papers written about the revision and enhancement process of the newswire corpora that resulted in the revised annotation guidelines are available on the LDC website:

Enhancing the Arabic Treebank: A Collaborative Effort toward New Annotation Guidelines. Mohamed Maamouri, Ann Bies, Seth Kulick. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
Available:
  Paper: http://papers.ldc.upenn.edu/LREC2008/Enhancing_Arabic_Treebank.pdf
  Poster: http://papers.ldc.upenn.edu/LREC2008/Enhancement-poster.ppt

Diacritic Annotation in the Arabic Treebank and its Impact on Parser Evaluation. Mohamed Maamouri, Seth Kulick, Ann Bies. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC 2008), Marrakech, Morocco, May 28-30, 2008.
Available:
  Paper: http://papers.ldc.upenn.edu/LREC2008/Diacritic_Annotation_ATB.pdf
  Poster: http://papers.ldc.upenn.edu/LREC2008/Diacritization-poster.ppt

This corpus is part of an on-going effort to produce parallel Arabic and English Treebanks at LDC. The files in this release are parallel with the same file IDs in the English Translation Treebank Broadcast News corpus, which will be published in the near future (the subcorpora of which are currently available to the GALE community).

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this publication does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

2. Annotation

2.1 Tasks and Guidelines

The Arabic Treebank project consists of two distinct phases:

  (a) Part-of-Speech (=POS) tagging which divides the text into lexical tokens, and gives relevant information about each token such as lexical category, inflectional features, and a gloss (referred to as POS for convenience, although it includes morphological and gloss information not traditionally included with part-of-speech annotation), and

  (b) Arabic Treebanking (=ArabicTB) which characterizes the constituent structures of word sequences, provides categories for each non-terminal node, and identifies null elements, co-reference, traces, etc.

Tim Buckwalter's transliteration system, which we use for this corpus, is described at http://www.qamus.org/transliteration.htm.

The revised Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.

Guidelines for the annotation of speech data (based on the English Treebank Switchboard Guidelines for speech effects) are also available at http://projects.ldc.upenn.edu/ArabicTreebank/.

2.2 Annotation Process

The LDC Standard Arabic Morphological Analyzer LDC2009E73 (SAMA 3.1) was used to generate a candidate list of POS values for each word/token and our annotators picked the appropriate one manually. (Please note that some words do not exist in this lexicon.) The POS annotation task is to select the correct POS tag. Once POS is done, we automatically separate the clitics based on the POS selection.

We use the following tags for non-Arabic data: NOUN_NUM or ADJ_NUM for numerical data, PUNC for punctuation, and FOREIGN or LATIN for non-Arabic alphabetic data. Dialect forms are given the POS tag DIALECT.

We then implemented automatic checks on the part-of-speech tags with consequent further manual revision when necessary to ensure the consistency of the part-of-speech tags with the current guidelines.
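
As a rough illustration of the clitic separation step described above, the following Python sketch splits one POS-annotated source token into several tree tokens. It is not the procedure used to build this corpus. The input format (morphemes joined by "+", each written as form/TAG), the function names, and the choice of which tag prefixes count as clitics are all assumptions made for the example; the actual formats are documented in docs/readme-files.txt.

```python
# Illustrative only: the analysis format and the clitic tag sets below are
# assumptions made for this sketch, not the procedure used to build the corpus.

# Tag prefixes treated as separable clitics (hypothetical selection).
PROCLITIC_PREFIXES = ("CONJ", "PREP")
ENCLITIC_PREFIXES = ("POSS_PRON", "PRON")


def parse_analysis(analysis):
    """Turn 'wa/CONJ+kitAb/NOUN' into [('wa', 'CONJ'), ('kitAb', 'NOUN')]."""
    morphemes = []
    for piece in analysis.split("+"):
        # Split on the last '/' so the tag is everything after it.
        form, _, tag = piece.rpartition("/")
        morphemes.append((form, tag))
    return morphemes


def split_clitics(analysis):
    """Split one source token into tree tokens, based on its selected POS analysis."""
    morphemes = parse_analysis(analysis)
    start, end = 0, len(morphemes)

    # Peel proclitics off the front, always leaving at least one morpheme as the stem.
    while start < end - 1 and morphemes[start][1].startswith(PROCLITIC_PREFIXES):
        start += 1

    # Peel enclitics off the back, again keeping at least one stem morpheme.
    while end - 1 > start and morphemes[end - 1][1].startswith(ENCLITIC_PREFIXES):
        end -= 1

    # Everything left in the middle stays together as a single tree token.
    stem = morphemes[start:end]
    stem_token = ("".join(form for form, _ in stem),
                  "+".join(tag for _, tag in stem))

    return morphemes[:start] + [stem_token] + morphemes[end:]


if __name__ == "__main__":
    example = "wa/CONJ+bi/PREP+kitAb/NOUN+i/CASE_DEF_GEN+hi/POSS_PRON_3MS"
    for form, tag in split_clitics(example):
        print(form, tag)
```

In this sketch a single source token yields one tree token per separated clitic plus one for the remaining stem, which is the kind of expansion behind the difference between the source token count and the tree token count reported for this corpus.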
The Arabic morphological tagset was then reduced to a smaller POS set, and the files were automatically parsed. The parses were then hand corrected by human annotators according to the Arabic Treebank Annotation Guidelines.

The QC process consists of a series of specific searches for several types of potential inconsistency and annotation error. Any errors found in these searches were hand corrected.

The annotators for this project were Luma Ateyah, Olfa Bayouth, Maha Ben Hadj Aleya, Sameh Benna, Asma Berrima, Basma Bouziri, Faiez Dhieb, Rachida Fathallah, Fatma Gaddeche, Esma Maamouri Ghrib, Aicha Graja, Sondos Krouna, Badia Laadioui, Leila Laghrissi, Soumeya Mekki, Wigdan Mekki, Mouna Rezig, and Dalila Tabassi.

3. Source Data Profile

3.1 Data Selection Process

This corpus of Arabic Treebank - Broadcast News consists of 120 broadcast news (BN) stories. There are a total of 432,976 source tokens before clitics are split and 517,080 tree tokens after clitics are separated for the treebank annotation.

This corpus is part of an on-going effort to produce parallel Arabic and English Treebanks at LDC. The files in this release are parallel with the same file IDs in the English Translation Treebank Broadcast News corpus, which will be published in the near future (the subcorpora of which are currently available to the GALE community).

3.2 Data Sources and Epochs

The data consists of Arabic broadcast news stories dating from 2005-2008 (from Abu Dhabi TV, Al Alam News Channel, Al Arabiya, Al Baghdadya News, Al Fayha, Al Hurra, Al Iraqiyah, Aljazeera, Al Ordiniyah, Al Sharqiya, Dubai News, Iraqiyah, Kuwait TV, Oman TV, PAC, Ltd., Saudi TV, Sawa News, and Syria TV).

4. Annotated Data Profile

This data consists of 120 broadcast news (BN) stories.
There are a total of 432,976 source tokens before clitics are split and 517,080 tree tokens after clitics are separated for the treebank annotation, all of which have been annotated for morphology/part-of-speech and syntactic structure.

The source file IDs are listed in docs/file.ids. A listing of all of the files in this release can be found in docs/file.tbl. The data formats, including the integrated format, are documented in docs/readme-files.txt and in the paper at docs/KulickBiesMaamouri-LREC2010.pdf.

5. Data Directory Structure

The directory structure for this data (and for the /data directory) is in docs/file.tbl.

In the docs/ directory:

- KulickBiesMaamouri-LREC2010.pdf - Paper describing data formats and the integration of Treebank and SAMA tokens.
- ag-1.1.dtd - This is the dtd file for the AG XML.
- errata.txt - Lists errata and the URL where any additional errata from this corpus will be listed.
- file.ids - A list of file ids in the corpus.
- file.tbl - Directory structure for everything in this package.
- readme-files.txt - An extensive description of the new integrated format, also describes the various formats of the data.
- tags-count.txt - A list of the POS/morphological tags after the clitics are separated and after treebank annotation, along with the number of occurrences of each tag.
- atb-bn-taglist-conversion-to-PennPOS-forrelease.lisp - Lisp code mapping the full morphological tags to a much smaller list, similar to the Penn POS tagset, strictly for convenience.

6. File Format Description

A description of the file formats (and the types of files present for each of the IDs in docs/file.ids) is in docs/readme-files.txt and in docs/KulickBiesMaamouri-LREC2010.pdf, including a description of the modifications that have been made to the format of the data in the various .txt and .tree files compared with ATB releases prior to ATB5 and detailed information on the integrated format.

7. Data Validation

The data went through the following annotation procedure:

POS procedure:

- All words went through the morphological analyzer.
- All words are included in the first pass of POS where annotators select one out of many choices provided by the morphological analyzer.
- All files went through a second pass of POS annotation where annotators review the annotation done in the previous POS pass.

TB procedure:

- Words/tokens from the POS annotation are processed to separate clitics in preparation for TB annotation. After clitic separation, the number of words/tokens increases from 432,976 to 517,080.
- The sentences were pre-parsed to improve productivity, and the parses were then hand corrected.
- All files went through a second pass of TB annotation where annotators review the annotation done in the previous TB pass.
- Annotators went through a final pass of annotation with the help of diagnostic QC searches to catch potential patterns of annotation errors.

Quality assurance & annotation checking for this release:

Every token in the treebank has been explicitly tested against the possible SAMA 3.1 solutions for that token. Where there have been discrepancies, the results have been manually inspected and in many cases changed. See docs/readme-files.txt for a detailed description of this procedure and docs/errata.txt for a listing of some of the remaining discrepancies between the treebank and SAMA.

The Penn Arabic Treebank (PATB) Morphological and Syntactic Annotation Guidelines are available on the LDC website at http://projects.ldc.upenn.edu/ArabicTreebank/.

8. DTDs

One for the AG XML files, ag-1.1.dtd.

9. Copyright Information

Portions (c) 2009-2012 Trustees of the University of Pennsylvania

10. Contact Information

Contact info for key project personnel:

Mohamed Maamouri, manager and senior researcher, maamouri@ldc.upenn.edu
Ann Bies, bies@ldc.upenn.edu
Seth Kulick, skulick@ldc.upenn.edu

11. Update Log

This index was updated on April 16, 2012 by Ann Bies.