BOLT Egyptian Arabic PropBank -- DF, SMS/Chat, and CTS

University of Colorado

Authors: Martha Palmer, Jena D. Hwang, Aous Mansouri, Claire Bonial,
         Tim O'Gorman, James Gung


1. Introduction

The DARPA BOLT Program created new techniques for automated translation
and linguistic analysis that can be applied to informal genres of text
and speech common to online and in-person communications. The BOLT data
team, led by the Linguistic Data Consortium, was responsible for
collecting informal data sources, including discussion forums, text
messaging and chat, and conversational telephone speech in English,
Chinese and Egyptian Arabic, and for applying annotations including
translation, word alignment, Treebank, PropBank, co-reference and
queries/responses. This corpus contains the data produced for the
PropBank annotation task led by the University of Colorado.

PropBank annotation provides a layer of semantic annotation on top of
the phrase structure of Treebank. Each predicate type in a tree is
annotated in terms of its sense and semantic roles. This annotation aims
to provide consistent semantic role labels across different syntactic
realizations of the same predicate and to assign functional tags to all
non-core arguments of the predicate.

PropBank annotation for this release was performed on top of the BOLT
Treebank annotation. Tokens resulting from Treebank annotation are used
directly for PropBank annotation. The data covers three genres: DF
(Discussion Forums), SMS/Chat, and CTS (Conversational Telephone Speech).
Annotation data in this release were originally released in the
following e-corpora:

  BOLT Phase 1 Egyptian Arabic Propbank DF Part 1 V3.0 (LDC2012E122)
  BOLT Phase 1 Egyptian Arabic Propbank DF Part 2 V3.0 (LDC2012E129)
  BOLT Phase 1 Egyptian Arabic Propbank DF Part 3 V3.0 (LDC2013E22)
  BOLT Phase 1 Egyptian Arabic Propbank DF Part 4 V3.0 (LDC2013E23)
  BOLT Phase 1 Egyptian Arabic Propbank DF Part 5 V3.0 (LDC2013E24)
  BOLT Phase 1 Egyptian Arabic Propbank DF Part 6 V2.0 (LDC2013E72)
  BOLT Phase 1 Egyptian Arabic Propbank DF Part 7 V2.0 (LDC2013E73)
  BOLT Phase 1 Egyptian Arabic Propbank DF Part 8 (LDC2013E108)
  BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 1 (LDC2014E32)
  BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 2 (LDC2014E62)
  BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 3 (LDC2014E98)
  BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 4 (LDC2014E118)
  BOLT Phase 2 Egyptian Arabic Propbank SMS/Chat Part 5 (LDC2014E128)
  BOLT Phase 3 Egyptian Arabic Propbank CTS Part 1 v2 (LDC2015E32)
  BOLT Phase 3 Egyptian Arabic Propbank CTS Part 2 (LDC2015E55)


2. Source and Annotation Data

2.1 Source Data

DF source data was manually harvested online by native speakers, then
triaged down to a selected portion and sentence-segmented for
translation and annotation. SMS and chat source data were collected via
live collection platforms and donations. CTS source data was originally
collected for the Arabic and Chinese CallHome and CallFriend program,
where the collected audio source files were first transcribed and then
translated by professional transcription/translation agencies.
Source data used for PropBank annotation are tokens from BOLT Egyptian
Arabic Treebank data, which were originally released in the following
e-corpora:

  Arabic Treebank ARZ Part 1 (LDC2012E28)
  BOLT Phase 1 Egyptian Arabic Treebank DF Part 1 (LDC2012E93)
  BOLT Phase 1 Egyptian Arabic Treebank DF Part 2 (LDC2012E98)
  BOLT Phase 1 Egyptian Arabic Treebank DF Part 3 (LDC2012E89)
  BOLT Phase 1 Egyptian Arabic Treebank DF Part 4 (LDC2012E99)
  BOLT Phase 1 Egyptian Arabic Treebank DF Part 5 (LDC2012E107)
  BOLT Phase 1 Egyptian Arabic Treebank DF Part 6 (LDC2012E125)
  BOLT Phase 1 Egyptian Arabic Treebank DF Part 7 (LDC2013E12)
  BOLT Phase 1 Egyptian Arabic Treebank DF Part 8 (LDC2013E21)
  BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 1 (LDC2013E120)
  BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 2 (LDC2013E133)
  BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 3 (LDC2014E17)
  BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 4 (LDC2014E43)
  BOLT Phase 2 Egyptian Arabic Treebank SMS/Chat Part 5 (LDC2014E63)
  BOLT Phase 3 Egyptian Arabic Treebank CTS Part 1 V2.0 (LDC2014E120)
  BOLT Phase 3 Egyptian Arabic Treebank CTS Part 2 V2.0 (LDC2015E04)

2.2 Annotation Data Profile

Language  Genre     .propFile  FrameFiles  Rolesets  Predicates  SourceTokens
-----------------------------------------------------------------------------
Egyptian  SMS/Chat       1083         n/a       n/a       31397        198007
Egyptian  DF              730         n/a       n/a       59127        400448
Egyptian  CTS             112         n/a       n/a       15763         99201
-----------------------------------------------------------------------------
Total                    1925        9626     12866      106287        697656

Note: SourceTokens = tree tokens


3. Annotation

3.1 Annotation Guidelines

The annotation guidelines are included in this package, and can be found
at docs/APB-Annotation-Guidelines.pdf. The guidelines were largely
developed under the OntoNotes effort, which was part of the DARPA GALE
program. They were extended as part of the BOLT program effort to better
cover the new data genres and Egyptian dialect. The annotation data is
stored in data/.
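Because the annotation data is stored in data/, the per-genre .prop file
counts from the Section 2.2 profile can be compared against an unpacked
copy of the package. The Python sketch below is one way this might be
done. It assumes the layout data/annotation/<genre>/<part>/*.prop, which
is inferred from the directory listing in Section 5 rather than
confirmed, and the names count_prop_files and report_mismatches are
illustrative only.

```python
from pathlib import Path

# Expected .prop file counts, copied from the Section 2.2 profile table.
EXPECTED_PROP_FILES = {"sms_chat": 1083, "df": 730, "cts": 112}


def count_prop_files(annotation_root):
    """Count *.prop files under each genre directory.

    Assumes (not confirmed by the README) the layout
    <annotation_root>/<genre>/<part>/*.prop.
    """
    root = Path(annotation_root)
    return {
        genre: sum(1 for _ in (root / genre).glob("*/*.prop"))
        for genre in EXPECTED_PROP_FILES
    }


def report_mismatches(counts):
    """Return {genre: (found, expected)} for genres whose counts differ."""
    return {
        genre: (counts.get(genre, 0), expected)
        for genre, expected in EXPECTED_PROP_FILES.items()
        if counts.get(genre, 0) != expected
    }


# The table's Total row (1925) is the sum of the three genre rows.
assert sum(EXPECTED_PROP_FILES.values()) == 1925
```

For example, report_mismatches(count_prop_files("data/annotation"))
returns an empty dict when the counts found on disk agree with the table.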
PropBank annotation is supported by the Jubilee interface implemented by
the University of Colorado, in which any node in the tree can be
selected and assigned tags.

3.2 Annotator Training

New annotators are trained on a set of trial data and work under careful
supervision until they reach an adequate level of consistency, after
which they start production-level annotation. Additionally, all
annotators are required to attend a bi-weekly meeting to discuss
questions and issues encountered during annotation. These meetings also
serve as help for new annotators and as a refresher course for seasoned
annotators.

3.3 Annotation Stages

For PropBank annotation, predicate argument structure annotation is
carried out in two phases. In the first phase, a frame file for a
predicate is created by examining all instances of the predicate in the
Treebank data and distinguishing its senses, which are called Framesets
or Rolesets. In the second phase, the predicate argument structure of
all instances of the predicate is annotated, using the frame file as a
reference.

The arguments of each predicate receive an argument label in the form of
ArgN, where N is an integer between 0 and 6. These numbered arguments
represent core arguments that are defined in relation to the predicate.
Each core argument plays a unique role with regard to the predicate.
Core arguments are as consistent as possible with respect to thematic
roles. Arg0 is used for the most agentive role a given predicate can
take. Arg1 is used for the proto-patient, or most patient-like argument.
Arg2 is most often used to mark a beneficiary, Arg3 is most often used
to show a start point, and Arg4 is most often used for the end point.
Arg2 through Arg4 are less consistent: not all verbs with more than two
core roles require a start/end point role or a beneficiary, so these
labels are used in other ways as dictated by a given predicate.
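The numbered-argument conventions above can be summarized as a small
lookup. The Python sketch below is illustrative only: the function name
describe_core_argument is hypothetical, and the exact spelling of labels
in the .prop files is defined in docs/APB-data-format.txt and may differ
from the "ArgN" form used in this README.

```python
import re

# Typical uses of the numbered arguments, as summarized in Section 3.3.
# Arg2-Arg4 vary by predicate; no typical use is stated for Arg5 and Arg6.
TYPICAL_ROLE = {
    0: "most agentive role",
    1: "proto-patient (most patient-like argument)",
    2: "beneficiary (most often)",
    3: "start point (most often)",
    4: "end point (most often)",
}

# Matches the "ArgN" form with N between 0 and 6 (assumed spelling).
_ARG_LABEL = re.compile(r"Arg([0-6])")


def describe_core_argument(label):
    """Return the typical role for an ArgN label, or raise ValueError."""
    match = _ARG_LABEL.fullmatch(label)
    if match is None:
        raise ValueError(f"not a core argument label: {label!r}")
    n = int(match.group(1))
    # The frame file for the predicate is always the final authority.
    return TYPICAL_ROLE.get(n, "defined by the predicate's frame file")
```

In practice the roleset in the predicate's frame file determines what
each numbered argument means; the table here only records the most
common uses.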
3.4 Annotation Quality Control

All of the annotations are the result of double-blind annotation
followed by adjudication of disagreements.


4. Data Structure and File Format

4.1 .prop Files

The proposition format is described in docs/APB-data-format.txt.

4.2 Frame Files

The frame files are in XML format. The definition is included in
docs/verb.dtd. ARZ frame files can be distinguished from the MSA frame
files by the topmost comment "EGYPTIAN ARABIC".

4.3 Using Multi-layer Annotation Data

PropBank annotations make use of the corresponding .tree file for each
document, and the annotations use the sentence divisions and
tokenizations from those tree files. The way that the .prop files relate
to the .tree files is detailed in the .prop file description at
docs/APB-data-format.txt.

PropBank annotations in this package can be used together with other
types of BOLT annotation data, because the same source tokens are
annotated at multiple levels, including Treebank, word alignment, and
co-reference annotation. Tokens are numbered in the same way, and
identical filebase/filestem names are used across annotations. Each type
of annotation adds its own file extension, so users can find the other
types of annotation for a document by its filestem name.


5. Package Directory Structure

--docs --README.txt --APB-Annotation-Guidelines.pdf --APB-data-format.txt --filelist.txt --data --annotation --cts/{01,02}/*.prop --df/{01,02,03,04,05,06,07,08}/*.prop --sms_chat/{01,02,03,04,05}/*.prop --metadata --frames/*.xml --dtds --verb.dtd


6. Documentation

- docs/README.txt: this file.
- docs/APB-data-format.txt: explains the data format of the Egyptian
  Arabic PropBank annotation.
- docs/APB-Annotation-Guidelines.pdf: Egyptian Arabic PropBank
  annotation guidelines.
- docs/filelist.txt: the list of files showing the directory structure
  of this package.
- dtds/verb.dtd: specifies the frame file format.


7. Data Validation and Sanity Check

- Validate XML files against the DTD (in docs/)
- Verify that filename stems are consistent with tree filename stems
- Verify encoding as UTF-8
- XML frame files were validated against docs/frameset.dtd and were
  checked for frame-internal consistency (e.g. misspelling, extraneous
  characters, general correctness).


8. Acknowledgements

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145.
The content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

Annotation contributors: James Babani, Yahya Aseri, Maha Foster,
trainers and adjudicators

Stephanie Strassel, Xuansong Li, and Stephen Grimes from LDC contributed
to the PropBank data by drafting documentation, sanity-checking data,
specifying the data format, and streamlining the data release process.


9. Copyright Info

(c) 2012, 2013, 2014, 2015, 2016, 2017 Trustees of the University of
Pennsylvania.


10. Contact Information

If you have questions about this data release, please contact the
following personnel:

  Martha Palmer
  Stephanie Strassel
  Xuansong Li

--------------------------------------------------------------------------
README Created December 21, 2016 by Xuansong Li and Tim O'Gorman