BOLT English PropBank and Sense -- DF, SMS/Chat, and CTS

University of Colorado

Authors: Martha Palmer, Jena D. Hwang, Claire Bonial, Tim O'Gorman, James Gung, Kevin Stowe, Meredith Green

1. Introduction

The DARPA BOLT Program created new techniques for automated translation and linguistic analysis that can be applied to informal genres of text and speech common to online and in-person communications. The BOLT data team, led by the Linguistic Data Consortium, was responsible for collecting informal data sources, including discussion forums, text messaging and chat, and conversational telephone speech in English, Chinese and Egyptian Arabic, and for applying annotations including translation, word alignment, Treebank, PropBank, co-reference and queries/responses. This corpus contains the data produced for the PropBank annotation task led by the University of Colorado.

This release consists of two types of annotation: PropBank annotation and VerbNet sense disambiguation annotation.

PropBank annotation provides a layer of semantic annotation on top of the phrase structure of Treebank. Each predicate verb in a tree is annotated in terms of its sense and semantic roles. This annotation aims to provide consistent semantic role labels across different syntactic realizations of the same verb and to assign functional tags to all non-core arguments of the verb.

VerbNet sense disambiguation provides additional insight into the sense distinctions of a verbal predicate by identifying the VerbNet 3.2 class that the verb fits into.

Annotation in this release was performed on top of the BOLT Treebank annotation, and the tokens resulting from Treebank annotation were used directly for annotation. PropBank data covers three genres: DF (Discussion Forums), SMS/Chat, and CTS (Conversational Telephone Speech). VerbNet sense disambiguation data covers two genres: DF and SMS/Chat.
PropBank data in this release were previously released as the following e-corpora:

  BOLT Phase 1 English Propbank DF Part 1 (LDC2012E123)
  BOLT Phase 1 English Propbank DF Part 2 (LDC2012E128)
  BOLT Phase 1 English Propbank DF Part 3 (LDC2013E05)
  BOLT Phase 1 English Propbank DF Part 4 (LDC2013E34)
  BOLT Phase 1 English Propbank DF Part 5 (LDC2013E74)
  BOLT Phase 1 English Propbank DF Part 6 (LDC2013E102)
  BOLT Phase 1 English Propbank DF Part 7 (LDC2013E129)
  BOLT Phase 2 English Propbank SMS/Chat Part 1 (LDC2014E22)
  BOLT Phase 2 English Propbank SMS/Chat Part 2 (LDC2014E97)
  BOLT Phase 2 English Propbank SMS/Chat Part 3 (LDC2015E35)
  BOLT Phase 2 English Propbank SMS/Chat Part 4 (LDC2015E36)
  BOLT Phase 2 English Propbank SMS/Chat Part 5 (LDC2015E37)
  BOLT Phase 3 English Propbank CTS Part 1 (LDC2015E38)
  BOLT Phase 3 English Propbank CTS Part 2 (LDC2015E56)
  BOLT Phase 3 English Propbank CTS Part 3 (LDC2015E57)

2. Source and Annotation Data

2.1 Source Data

DF source data was manually harvested online by native speakers, then triaged down to a selected portion and sentence-segmented for translation and annotation. SMS and chat source data were collected via live collection platforms and donations. CTS source data was originally collected for the Arabic and Chinese CallHome and CallFriend program, where the collected audio source files were first transcribed and then translated by professional transcription/translation agencies.

The source tokens used for annotation are from the BOLT English Treebank data, which consists of two types of source: English source and English translation source (marked "ECTB" and "EATB" in the list below).
BOLT English Treebank data were originally released as the following e-corpora:

  BOLT Phase 1 English Treebank DF Part 1 (LDC2012E92)
  BOLT Phase 1 English Treebank DF Part 2 (LDC2012E97)
  BOLT Phase 1 English Treebank DF Part 3 (LDC2012E114)
  BOLT Phase 1 English Treebank DF Part 4 (LDC2013E17)
  BOLT Phase 1 English Treebank DF Part 5 (LDC2013E40)
  BOLT Phase 1 English Treebank DF Part 6 -- ECTB (LDC2013E50)
  BOLT Phase 1 English Treebank DF Part 7 -- ECTB (LDC2013E76)
  BOLT Phase 2 English Treebank SMS/Chat Part 1 (LDC2013E127)
  BOLT Phase 2 English Treebank SMS/Chat Part 2 (LDC2014E03)
  BOLT Phase 2 English Treebank SMS/Chat Part 3 -- ECTB (LDC2014E44)
  BOLT Phase 2 English Treebank SMS/Chat Part 4 -- ECTB (LDC2014E78)
  BOLT Phase 2 English Treebank SMS/Chat Part 5 -- EATB (LDC2014E107)
  BOLT Phase 3 English Treebank CTS Part 1 -- ECTB (LDC2015E15)
  BOLT Phase 3 English Treebank CTS Part 2 -- ECTB (LDC2015E25)
  BOLT Phase 3 English Treebank CTS Part 3 -- EATB (LDC2015E30)

Correspondingly, this package is organized into numbered subdirectories (01, 02, 03, ...) under each genre directory, mirroring the parts of the source e-corpus releases.

2.2 PropBank Annotation Profile

Language  Genre     PropFile  Frame  PredicateDecision  Roleset  SourceToken
----------------------------------------------------------------------------
English   SMS/Chat       877    n/a                n/a      n/a       276914
English   DF             850    n/a                n/a      n/a       415159
English   CTS             29    n/a                n/a      n/a       109718
----------------------------------------------------------------------------
Total                   1762   7312             160677    10687       801791

Note: SourceToken = tree tokens

2.3 Sense Annotation Profile

Language  Genre     VNclassFile  SenseFile  PredicateDecision  SourceToken
--------------------------------------------------------------------------
English   SMS/Chat          n/a        151                n/a        50292
English   DF                n/a        774                n/a       415159
--------------------------------------------------------------------------
Total                       326        925               5289       465451

Note: SourceToken = tree tokens

3. Annotation

3.1 Annotation Guidelines

The PropBank annotation guidelines are included in this package and can be found at docs/Propbank-Annotation-Guidelines.pdf. The guidelines were largely developed under the OntoNotes effort, which was part of the DARPA GALE project. They were extended as part of this BOLT effort to better cover new data genres.

The BOLT PropBank effort has focused on expanding predicate annotation beyond the verb and includes annotation of verbs, eventive nouns, adjectives, and light verb constructions. A major focus for English PropBank has been to unify Frame Files across these different parts of speech. This means that the frame used for 'bathe' is always identical to that used for 'bath'. The goal of this expansion is to provide event semantic representations for the entire sentence, including the pieces most often missed when looking solely at verbs.

PropBank annotation of data in the df/ and sms_chat/01 folders was done under the "Propbank 2.0" format used in prior Propbank releases such as OntoNotes. That annotation was converted to the new "unified" ("Propbank 3.0") format described in the docs/ folder, in which predicates are not differentiated by part of speech. Other previously released data has also been converted to Unified Propbank, and the updated versions of those pointers can be found at http://propbank.github.io/.

PropBank annotation is supported by the Jubilee interface, implemented by the University of Colorado, in which any node in the tree can be selected and assigned tags.

The sense disambiguation annotation guidelines are included in this package and can be found at docs/VerbNet_Guidelines.pdf. VerbNet annotation is supported by the STAMP interface, implemented by the University of Colorado.

The annotation data is stored in data/.

3.2 Annotator Training

The majority of our annotators have experience from previous PropBank and VerbNet annotation projects (e.g. GALE OntoNotes and Semlink annotation).
New annotators were trained on a set of trial data until they reached an adequate level of consistency before starting production-level annotation.

3.3 Annotation Stages

For PropBank annotation, predicate argument structure annotation is carried out in two phases. In the first phase, a Frame File for a predicate is created by examining all instances of the predicate in the Treebank data and distinguishing two or more senses, which are called Framesets or Rolesets. In the second phase, the predicate argument structure of all instances of the predicate is annotated, using the Frame File as a reference.

The arguments of each predicate receive an argument label of the form ArgN, where N is an integer between 0 and 6. These numbered arguments represent core arguments that are defined in relation to the predicate. Each core argument plays a unique role with regard to the predicate. Core arguments are as consistent as possible with respect to thematic roles. Arg0 is used for the most agentive role a given predicate can take. Arg1 is used for the proto-patient, or most patient-like argument. Arg2 is most often used to mark a beneficiary, Arg3 is most often used to show a start point, and Arg4 is most often used for the end point. Arg2 through Arg4 are less consistent: not all verbs with more than two core roles require a start point, end point, or beneficiary, so these labels are used in other ways as dictated by a given predicate.

For verb sense disambiguation, annotation follows PropBank annotation of the same texts, which provides adjudicated gold knowledge of the correct verbal predicate. Each predicate is double annotated and adjudicated with the correct VerbNet class.

3.4 Annotation Quality Control

For PropBank annotation, all annotations for verbs other than 'be' are the result of double-blind annotation followed by adjudication of disagreements. All instances of the verb 'be' are first deterministically annotated using a number of heuristics.
The 'be' instances are also manually single annotated, and the adjudicator resolves the disagreements between the human and the automatic annotation. Auxiliary senses of verbs such as "have" and "do" that the gold Treebank annotation unambiguously treats as auxiliaries were automatically tagged as such. Where any ambiguity existed, those instances were double annotated and adjudicated.

In verb sense disambiguation annotation, all data is adjudicated, and all classes that do not achieve 90% inter-annotator agreement (ITA) are re-evaluated and re-annotated.

4. Data Structure and File Format

4.1 .prop Files

The proposition format is described in docs/EPB-data-format.txt.

4.2 .sense Files

The data format for .vn.sense files is described in docs/Verbnet-data-format.txt.

4.3 Frame Files

The frame files are in XML format. The definition is included in dtds/frameset.dtd.

4.4 VerbNet Class Files

The VerbNet class files are in XML format. The definition is included in dtds/vn_class-3.dtd.

4.5 Using Pointers and Scripts

Sufficient information for using pointers is provided in docs/EPB-data-format.txt. Official conversions of the Propbank pointers into a stand-off "CoNLL-style" format, similar to that released in the CoNLL-2012 task, will be provided at http://propbank.github.io/.

4.6 Using Multi-layer Annotation Data

PropBank and Sense annotations in this package can be used together with other types of BOLT annotation data, because the same source tokens are annotated at multiple levels, including treebank, word alignment, and co-reference annotation. Tokens are numbered in the same way, and identical filebase/filestem names are used across annotations. Each type of annotation adds its own file extension, so users can find other types of annotation for a file by looking for the same filebase/filestem name.
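As an illustration of the filestem matching described in Section 4.6, the following Python sketch groups files from several annotation directories by shared filestem. It is not part of this package. It assumes that the filestem is the part of the file name before the first dot, which should be checked against the actual file names; the function names and the example paths are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def file_stem(path):
    """Return the file name up to the first dot.

    ASSUMPTION: stems contain no dots and every annotation layer appends
    its extension(s) after the first dot. Verify against real file names.
    """
    return Path(path).name.split(".", 1)[0]


def group_by_stem(paths):
    """Map each assumed filestem to the sorted list of paths sharing it."""
    groups = defaultdict(list)
    for p in paths:
        groups[file_stem(p)].append(str(p))
    return {stem: sorted(files) for stem, files in groups.items()}


def collect_layers(*roots):
    """Walk each root directory and group all regular files by stem."""
    paths = [p for root in roots for p in Path(root).rglob("*") if p.is_file()]
    return group_by_stem(paths)
```

For example, collect_layers("data/propbank/annotation", "/path/to/treebank") would return a dictionary from each stem to the files found for it. The second path is a placeholder for wherever other BOLT annotation layers are stored locally.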
4.7 Data Complication

One non-public release of Treebank data (part 6 of the BOLT Discussion Forum data, see the filelist included in docs/part6_tree_filelist_EPB.txt) originally included a version of "meta_removed" trees in which many, but not all, META phrase nodes were removed, due to an error in the processing scripts. While this error was caught and has been corrected in the released data, Propbank labels for that portion of the BOLT data were annotated on the original trees containing that error, and therefore the token indices and tree locations in the corresponding .prop files only match those original trees.

In order to maintain usability of that section for Propbank reference, the original versions of those trees are included in this release, so that the corresponding .prop files will have valid references, and those files have been given the special extension "pb_version" rather than "meta_included" or "meta_removed". However, the normal "meta_removed" versions of these trees are a corrected version of those trees, and therefore use of these "pb_version" trees is to be considered deprecated for any purpose other than the use of Propbank data. The file names of the corresponding .prop files have been changed to match this "pb_version" file naming convention. (For details of how Propbank .prop files reference tree files and locations within trees, consult the EPB-data-format.txt file in the documentation.)

5. Package Directory Structure

--docs
    --README.txt
    --EPB-data-format.txt
    --Propbank-Annotation-Guidelines.pdf
    --VerbNet_Guidelines.pdf
    --Verbnet-data-format.txt
    --part6_tree_filelist_EPB.txt
    --filelist.txt
--data
    --propbank
        --annotation
            --cts/{01,02,03}/*.prop
            --df/{01,02,03,04,05,06,07}/*.prop
            --sms_chat/{01,02,03,04,05}/*.prop
        --metadata
            --frames/*.xml
    --sense
        --annotation
            --df/{01,02,03,04,05,06,07}/*.sense
            --sms_chat/01/*.sense
        --metadata
            --verbnet/*.xml
--dtds/
    --frameset.dtd
    --vn_class-3.dtd

6. Documentation

- docs/EPB-data-format.txt: explains the data format of the English Proposition Bank annotation
- docs/Propbank-Annotation-Guidelines.pdf: English Proposition Bank annotation guidelines
- docs/VerbNet_Guidelines.pdf: guidelines for sense annotation
- docs/Verbnet-data-format.txt: specifies the sense file format
- docs/part6_tree_filelist_EPB.txt: list of files affected by the issue described in Section 4.7 (Data Complication)
- docs/filelist.txt: list of files showing the package structure
- dtds/frameset.dtd: specifies the frame file format
- dtds/vn_class-3.dtd: specifies the VerbNet class file format

7. Data Validation and Sanity Check

- Validate XML files against the DTDs (in dtds/)
- Verify tokens used for PropBank match tree tokens from treebank annotation
- Verify filename stems are consistent with tree filename stems
- Verify encoding is UTF-8
- Verify pointers to the tree nodes are valid
- Verify PropBank labels are valid
- Verify PropBank annotation is consistent with the associated frameset
- XML frame files were validated against dtds/frameset.dtd and were checked for frame-internal consistency (e.g. misspelling, extraneous characters, general correctness).

8. Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Stephanie Strassel, Xuansong Li, and Stephen Grimes from LDC contributed to the PropBank data by drafting documentation, sanity-checking data, specifying the data format, and streamlining the data release process.

9. Copyright Info

(c) 2012, 2013, 2014, 2015, 2016, 2017, 2020 Trustees of the University of Pennsylvania.

10. Contact Information

If you have questions about this data release, please contact the following personnel:

  Martha Palmer
  Tim O'Gorman
  Kevin Stowe
  Stephanie Strassel
  Xuansong Li

--------------------------------------------------------------------------
README Created Jan 12, 2017 by Xuansong Li, Tim O'Gorman, and Martha Palmer