BOLT English PropBank and Sense -- DF, SMS/Chat, and CTS

             University of Colorado

 Authors: Martha Palmer, Jena D. Hwang, Claire Bonial, Tim O'Gorman, 
          James Gung, Kevin Stowe, Meredith Green 
         

1. Introduction

 The DARPA BOLT Program created new techniques for automated translation
 and linguistic analysis that can be applied to informal genres of text
 and speech common to online and in-person communications. The BOLT data 
 team, led by the Linguistic Data Consortium, was responsible for collecting 
 informal data sources (discussion forums, text messaging and chat, and 
 conversational telephone speech) in English, Chinese and Egyptian Arabic, 
 and for applying annotations including translation, word alignment, 
 Treebank, PropBank, co-reference and queries/responses.

 This corpus contains the data produced for the PropBank annotation task led 
 by the University of Colorado. This release consists of two types of 
 annotation: PropBank annotation and VerbNet sense disambiguation annotation. 
 PropBank annotation provides a layer of semantic annotation on top of the 
 phrase structure of Treebank. Each predicate verb in a tree is annotated in 
 terms of its sense and semantic roles. This annotation aims to provide 
 consistent semantic role labels across different syntactic realizations of 
 the same verb and to assign functional tags to all non-core arguments of the 
 verb. VerbNet sense disambiguation provides additional insight into the 
 sense distinctions of a verbal predicate by identifying the VerbNet 3.2 
 class that the verb fits into.

 Annotation in this release is performed on top of BOLT Treebank annotation. 
 The tokens produced by Treebank annotation are used directly as the tokens 
 for PropBank and sense annotation.

 PropBank data covers three genres: DF (Discussion Forums), SMS/Chat, and CTS 
 (Conversational Telephone Speech). VerbNet sense disambiguation data covers 
 two genres: DF and SMS/Chat.

 The PropBank data in this release were previously released as the following 
 e-corpora: 

 BOLT Phase 1 English Propbank DF Part 1 (LDC2012E123)
 BOLT Phase 1 English Propbank DF Part 2 (LDC2012E128)
 BOLT Phase 1 English Propbank DF Part 3 (LDC2013E05)
 BOLT Phase 1 English Propbank DF Part 4 (LDC2013E34)
 BOLT Phase 1 English Propbank DF Part 5 (LDC2013E74)
 BOLT Phase 1 English Propbank DF Part 6 (LDC2013E102)
 BOLT Phase 1 English Propbank DF Part 7 (LDC2013E129)
 BOLT Phase 2 English Propbank SMS/Chat Part 1 (LDC2014E22)
 BOLT Phase 2 English Propbank SMS/Chat Part 2 (LDC2014E97)
 BOLT Phase 2 English Propbank SMS/Chat Part 3 (LDC2015E35)
 BOLT Phase 2 English Propbank SMS/Chat Part 4 (LDC2015E36)
 BOLT Phase 2 English Propbank SMS/Chat Part 5 (LDC2015E37)
 BOLT Phase 3 English Propbank CTS Part 1 (LDC2015E38)
 BOLT Phase 3 English Propbank CTS Part 2 (LDC2015E56)
 BOLT Phase 3 English Propbank CTS Part 3 (LDC2015E57)
 
2. Source and Annotation Data

2.1 Source Data 

 DF source data was harvested manually online by native speakers, then 
 triaged down to a selected portion and sentence-segmented for translation 
 and annotation. SMS and chat source data were collected via live collection 
 platforms and donations. CTS source data was originally collected for the 
 Arabic and Chinese CallHome and CallFriend program, where the collected 
 audio source files were first transcribed and then translated by 
 professional transcription/translation agencies. 
 
 The source tokens used for annotation come from the BOLT English Treebank 
 data, which consists of two types of source: the English source and English 
 translation source (as indicated below by "ECTB" and "EATB"). 
 BOLT English Treebank data were originally released as the following 
 e-corpora: 
 
 BOLT Phase 1 English Treebank DF Part 1 (LDC2012E92)
 BOLT Phase 1 English Treebank DF Part 2 (LDC2012E97)
 BOLT Phase 1 English Treebank DF Part 3 (LDC2012E114)
 BOLT Phase 1 English Treebank DF Part 4 (LDC2013E17)
 BOLT Phase 1 English Treebank DF Part 5 (LDC2013E40)
 BOLT Phase 1 English Treebank DF Part 6 -- ECTB (LDC2013E50)
 BOLT Phase 1 English Treebank DF Part 7 -- ECTB (LDC2013E76)
 BOLT Phase 2 English Treebank SMS/Chat Part 1 (LDC2013E127)
 BOLT Phase 2 English Treebank SMS/Chat Part 2 (LDC2014E03)
 BOLT Phase 2 English Treebank SMS/Chat Part 3 -- ECTB (LDC2014E44)
 BOLT Phase 2 English Treebank SMS/Chat Part 4 -- ECTB (LDC2014E78)
 BOLT Phase 2 English Treebank SMS/Chat Part 5 -- EATB (LDC2014E107)
 BOLT Phase 3 English Treebank CTS Part 1 -- ECTB (LDC2015E15)
 BOLT Phase 3 English Treebank CTS Part 2 -- ECTB (LDC2015E25)
 BOLT Phase 3 English Treebank CTS Part 3 -- EATB (LDC2015E30)

 Correspondingly, the data in this package is organized into numbered 
 subdirectories (01, 02, 03, ...) under each genre directory, mirroring the 
 parts of the source e-corpora listed above.
   

2.2 PropBank Annotation Profile

Language  Genre     PropFile  Frame  PredicateDecision  Roleset  SourceToken
----------------------------------------------------------------------------
English   SMS/Chat  877       n/a    n/a                n/a      276914
English   DF        850       n/a    n/a                n/a      415159
English   CTS       29        n/a    n/a                n/a      109718
----------------------------------------------------------------------------
Total               1762      7312   160677             10687    801791

Note: SourceToken = tree tokens

2.3 Sense Annotation Profile

Language  Genre     VNclassFile  SenseFile  PredicateDecision  SourceToken
--------------------------------------------------------------------------
English   SMS/Chat  n/a          151        n/a                50292
English   DF        n/a          774        n/a                415159
--------------------------------------------------------------------------
Total               326          925        5289               465451

Note: SourceToken = tree tokens

3. Annotation

3.1 Annotation Guidelines

The PropBank annotation guidelines are included in this package and can 
be found at docs/Propbank-Annotation-Guidelines.pdf. The guidelines were 
largely developed under the OntoNotes effort, which was part of the DARPA 
GALE project. They were extended as part of this BOLT effort to better cover 
new data genres. 

The BOLT PropBank effort has focused on expanding predicate annotation beyond 
the verb, and includes annotation of verbs, eventive nouns, adjectives, and 
light verb constructions. A major focus for English PropBank has been to 
unify Frame Files across these different parts of speech. This means that the
frame used for 'bathe' is always identical to that used for 'bath'. The goal 
of this expansion is to provide event semantic representations for the entire 
sentence, specifically the pieces that are most often missed when looking 
solely at verbs. 

PropBank annotation of data in the df/ and sms_chat/01 folders was done under 
the "Propbank 2.0" format used in other prior Propbank releases, such as 
OntoNotes. That annotation was converted to the new "unified" ("Propbank 3.0") 
format described in the docs/ folder, in which predicates are not 
differentiated by parts of speech. Other previously released data has also 
been converted to Unified Propbank, and the updated versions of those pointers 
can be found at http://propbank.github.io/ .

PropBank annotation is supported with the Jubilee interface implemented 
by the University of Colorado, where any node in the tree can be selected 
and assigned tags.

The sense disambiguation annotation guidelines are included in this package, 
and can be found at docs/VerbNet_Guidelines.pdf. VerbNet annotation is 
supported with the STAMP interface implemented by the University of Colorado.

The annotation data is stored in data/.


3.2 Annotator Training

The majority of our annotators have experience from previous PropBank and 
VerbNet annotation projects (e.g. GALE OntoNotes and Semlink annotation). 
New annotators were trained on a set of trial data until they reached an 
adequate level of consistency before starting production-level annotation.

3.3 Annotation Stages

For PropBank annotation, predicate argument structure annotation is carried 
out in two phases. In the first phase, a Frame File for a predicate is created 
by examining all instances of the predicate in the Treebank data and 
distinguishing two or more senses, which are called Framesets or Rolesets. 
In the second phase, the predicate argument structure of all instances of 
the predicate is annotated, using the Frame File as a reference. The 
arguments of each predicate receive an argument label in the form of ArgN, 
where N is an integer between 0 and 6. These numbered arguments represent 
core arguments that are defined in relation to the predicate. Each core 
argument plays a unique role with regard to the predicate. Core arguments 
are as consistent as possible with respect to thematic roles. Arg0 is used 
for the most agentive role a given predicate can take. Arg1 is used for the 
proto-patient, or most patient-like argument. Arg2 is most often used to 
mark a beneficiary, Arg3 is most often used to show a start point, and Arg4 
is most often used for the end point. Arg2 through Arg4 are less consistent, 
since not all verbs with more than two core roles require a start/end point 
role or a beneficiary; in such cases these labels are used in other ways, as 
dictated by the given predicate.
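
As an illustration only, the numbered-argument conventions described above 
can be summarized as a small lookup table. The sketch below is not part of 
this release; the names TYPICAL_ROLE and describe_arg are hypothetical, and 
the authoritative role definitions for any predicate are those in its Frame 
File.

```python
# Typical (not guaranteed) readings of PropBank numbered arguments, as
# summarized in Section 3.3. All names in this sketch are illustrative.
TYPICAL_ROLE = {
    0: "most agentive role",
    1: "proto-patient",
    2: "often beneficiary (predicate-specific)",
    3: "often start point (predicate-specific)",
    4: "often end point (predicate-specific)",
}

FALLBACK = "defined only by the predicate's Frame File"


def describe_arg(label: str) -> str:
    """Return the typical reading of a core label such as 'Arg0' or 'ARG2'."""
    suffix = label[3:]
    if not label.upper().startswith("ARG") or not suffix.isdigit():
        raise ValueError(f"not a numbered argument label: {label!r}")
    n = int(suffix)
    if not 0 <= n <= 6:
        raise ValueError(f"argument number out of range 0-6: {label!r}")
    # Arg5 and Arg6 have no typical reading in the description above.
    return TYPICAL_ROLE.get(n, FALLBACK)
```

Because Arg2 through Arg4 vary by predicate, the table returns a hedged 
description for them rather than a fixed thematic role.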

For verb sense disambiguation, annotation follows PropBank annotation of the 
same texts, so that adjudicated gold knowledge of the correct verbal 
predicate is available. Each predicate is double annotated and adjudicated 
with the correct VerbNet class. 

3.4 Annotation Quality Control

For PropBank annotation, all annotations for verbs other than 'be' are the 
result of double-blind annotation followed by adjudication of disagreements. 
All instances of the verb 'be' are first annotated deterministically using 
a number of heuristics, and are also manually annotated by a single 
annotator. The adjudicator then resolves disagreements between the human and 
the automatic annotation. Auxiliary uses of verbs such as "have" and "do" 
that the gold Treebank annotation unambiguously treats as auxiliaries 
were tagged as such automatically. Where any ambiguity exists, those 
instances were double annotated and adjudicated. 

In verb sense disambiguation annotation, all data is adjudicated, and all 
classes that do not achieve 90% ITA are re-evaluated and re-annotated.
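
The 90% threshold can be illustrated with a short sketch. It assumes that 
ITA (agreement between annotators) is measured as simple percent agreement 
between the two annotators of each instance, grouped by the adjudicated 
class. This README does not specify the exact measure, and the function 
name, input layout, and placeholder class labels are hypothetical.

```python
from collections import defaultdict


def classes_below_threshold(decisions, threshold=0.90):
    """Return {class: agreement} for classes whose agreement is below threshold.

    decisions: iterable of (adjudicated_class, label_a, label_b) tuples,
    one per double-annotated instance.
    """
    agree = defaultdict(int)
    total = defaultdict(int)
    for cls, label_a, label_b in decisions:
        total[cls] += 1
        agree[cls] += int(label_a == label_b)
    return {
        cls: agree[cls] / total[cls]
        for cls in total
        if agree[cls] / total[cls] < threshold
    }


# Placeholder data: class-A has 4/4 agreement, class-B has 2/4.
sample = (
    [("class-A", "class-A", "class-A")] * 4
    + [("class-B", "class-B", "class-B")] * 2
    + [("class-B", "class-B", "class-A")] * 2
)
flagged = classes_below_threshold(sample)
```

Under these assumptions only class-B would be flagged for re-evaluation.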

4. Data Structure and File Format

4.1 .prop Files

The proposition format is described in docs/EPB-data-format.txt.

4.2 .sense Files

The data format for .vn.sense files is described in 
docs/Verbnet-data-format.txt.

4.3 Frame Files

The frame files are in XML format. The definition is included in 
dtds/frameset.dtd. 
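
A minimal sketch of reading rolesets from a frame file with Python's standard 
library follows. The element and attribute names used here (predicate, 
roleset, role, n, descr) follow the commonly distributed PropBank frame 
layout and are assumptions; dtds/frameset.dtd is the authoritative 
definition. The embedded sample is invented for illustration and is not 
taken from this release.

```python
import xml.etree.ElementTree as ET

# Invented sample; real frame files are under data/propbank/metadata/frames/.
SAMPLE = """
<frameset>
  <predicate lemma="bathe">
    <roleset id="bathe.01" name="to wash">
      <roles>
        <role n="0" descr="washer"/>
        <role n="1" descr="thing washed"/>
      </roles>
    </roleset>
  </predicate>
</frameset>
"""


def list_rolesets(xml_text):
    """Map each roleset id to a dict of {ArgN: description}."""
    root = ET.fromstring(xml_text)
    rolesets = {}
    for roleset in root.iter("roleset"):
        roles = {f"Arg{role.get('n')}": role.get("descr")
                 for role in roleset.iter("role")}
        rolesets[roleset.get("id")] = roles
    return rolesets


rolesets = list_rolesets(SAMPLE)
```

To read a file from disk rather than a string, ET.parse(path).getroot() can 
be used in place of ET.fromstring.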

4.4 VerbNet Class Files

The VerbNet class files are in XML format. The definition is included in 
dtds/vn_class-3.dtd. 

4.5 Using Pointers and Scripts 

Sufficient information for using pointers is provided in 
docs/EPB-data-format.txt. Official conversions of the Propbank pointers into a 
stand-off "CoNLL-style" format, similar to that released in the CoNLL-2012 
task, will be provided at http://propbank.github.io/. 

4.6 Using Multi-layer Annotation Data 

PropBank and Sense annotations in this package can be used together with 
other types of BOLT annotation data, because the same source tokens are 
annotated at multiple levels, including treebank, word alignment, and 
co-reference annotation. Tokens are numbered in the same way, and identical 
filebase/filestem names are used across annotations; each type of 
annotation adds its own file extension. Users can therefore find other types 
of annotation for a document by its filebase/filestem name.
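
The shared-filestem convention makes it straightforward to collect the 
annotation layers for one document. The sketch below groups file names by 
the portion before the first dot. Treating the first dot as the boundary 
between filestem and extension is an assumption, and the example paths and 
extensions are hypothetical.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def group_by_filestem(paths):
    """Group paths by filestem (text before the first '.' in the file name)."""
    groups = defaultdict(list)
    for p in paths:
        name = PurePosixPath(p).name
        stem = name.split(".", 1)[0]
        groups[stem].append(p)
    return dict(groups)


# Hypothetical file names, for illustration only.
example = [
    "propbank/doc_0001.prop",
    "sense/doc_0001.sense",
    "treebank/doc_0001.tree",
    "propbank/doc_0002.prop",
]
layers = group_by_filestem(example)
```

Each key in the result is a filestem, and each value lists the annotation 
files found for that document.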

4.7 Data Complication

One non-public release of Treebank data (part 6 of the BOLT Discussion 
Forum data, see the filelist included in docs/part6_tree_filelist_EPB.txt) 
originally included a version of "meta_removed" trees in which many, but 
not all, META phrase nodes were removed, due to an error in the processing 
scripts.  While this error was caught and has been corrected in the 
released data, Propbank labels of that portion of the BOLT data were 
annotated on those original trees containing that error, and therefore the 
token indices and tree locations of the corresponding .prop files only match 
those original trees.  In order to maintain usability of that section for 
Propbank reference, the original versions of those trees are included in this 
release, so that the corresponding .prop files will have valid references, 
and those files have been given the special extension of "pb_version" rather 
than "meta_included" or "meta_removed".  However, the normal "meta_removed" 
versions of these trees are the corrected versions, and therefore 
use of the "pb_version" trees is to be considered deprecated for any 
purpose other than use with Propbank data.  The file names of the 
corresponding .prop files have been changed to match this "pb_version" file 
naming convention. (For details of how Propbank .prop files reference tree 
files and locations within trees, consult the EPB-data-format.txt file in 
the documentation.) 

5. Package Directory Structure

--docs
   --README.txt
   --EPB-data-format.txt
   --Propbank-Annotation-Guidelines.pdf
   --VerbNet_Guidelines.pdf
   --Verbnet-data-format.txt
   --part6_tree_filelist_EPB.txt
   --filelist.txt
--data
  --propbank
    --annotation
      --cts/{01,02,03}/*.prop 
      --df/{01,02,03,04,05,06,07}/*.prop
      --sms_chat/{01,02,03,04,05}/*.prop    
    --metadata
      --frames/*.xml
  --sense 
    --annotation
      --df/{01,02,03,04,05,06,07}/*.sense
      --sms_chat/01/*.sense 
    --metadata
      --verbnet/*.xml 
--dtds/
   --frameset.dtd
   --vn_class-3.dtd
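
Given the layout above, a short script can tally PropBank annotation files 
per genre and part. This is a sketch under the assumption that the package 
has been unpacked so that data/propbank/annotation/<genre>/<part>/*.prop 
exists relative to a root directory; the function name is hypothetical.

```python
from collections import Counter
from pathlib import Path


def count_prop_files(root):
    """Count .prop files per (genre, part) under the package root."""
    base = Path(root) / "data" / "propbank" / "annotation"
    counts = Counter()
    # Pattern matches <genre>/<part>/<file>.prop, e.g. df/01/<file>.prop
    for path in base.glob("*/*/*.prop"):
        genre, part = path.parts[-3], path.parts[-2]
        counts[(genre, part)] += 1
    return counts
```

For example, count_prop_files("/path/to/package") would return a Counter 
keyed by pairs such as ("df", "01"), where the path is a placeholder for the 
local unpacked location.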

6. Documentation

 -docs/EPB-data-format.txt: data format of the English Proposition Bank 
  annotation
 -docs/Propbank-Annotation-Guidelines.pdf: English Proposition Bank 
  annotation guidelines
 -docs/VerbNet_Guidelines.pdf: guidelines for sense annotation
 -docs/Verbnet-data-format.txt: sense file format
 -docs/part6_tree_filelist_EPB.txt: list of files affected by the issue 
  described in Section 4.7 (Data Complication)
 -docs/filelist.txt: list of files showing the package structure
 -dtds/frameset.dtd: frame file format
 -dtds/vn_class-3.dtd: VerbNet class file format

7. Data Validation and Sanity Check

 - Validate XML files against the DTDs (in dtds/)
 - Verify tokens used for PropBank match tree tokens from treebank annotation
 - Verify filename stems are consistent with tree filename stems
 - Verify encoding as UTF-8
 - Verify pointers to the tree nodes are valid
 - Verify PropBank labels are valid
 - Verify PropBank annotation is consistent with the associated frameset
 - Validate XML frame files against dtds/frameset.dtd and check them for 
   frame-internal consistency (e.g. misspellings, extraneous characters, 
   general correctness)
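
 One of the checks listed above, UTF-8 encoding, can be reproduced with a 
 few lines of Python. This is a sketch rather than the validation procedure 
 actually used for this release; the function names are hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def non_utf8_files(paths):
    """Return the subset of paths whose contents are not valid UTF-8."""
    bad = []
    for path in paths:
        with open(path, "rb") as fh:
            if not is_valid_utf8(fh.read()):
                bad.append(path)
    return bad
```

 An empty list returned by non_utf8_files indicates that every file checked 
 decodes as UTF-8.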

8. Acknowledgements

 This material is based upon work supported by the Defense Advanced Research
 Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does
 not necessarily reflect the position or the policy of the Government, and no
 official endorsement should be inferred.

 Stephanie Strassel, Xuansong Li, and Stephen Grimes Li from LDC 
 contributed to the PropBank data by drafting documentation, sanity-checking 
 data, specifying the data format, and streamlining the data release process.

9. Copyright Info

(c) 2012, 2013, 2014, 2015, 2016, 2017, 2020 Trustees of the University of 
Pennsylvania.

10. Contact Information

If you have questions about this data release, please contact the following 
personnel:

Martha Palmer <martha.palmer@colorado.edu>
Tim O'Gorman <ogormant@colorado.edu>
Kevin Stowe <kevin.stowe@colorado.edu>
Stephanie Strassel <strassel@ldc.upenn.edu>
Xuansong Li <xuansong@ldc.upenn.edu>

--------------------------------------------------------------------------
README Created Jan 12, 2017 by Xuansong Li, Tim O'Gorman, and Martha Palmer