English Web Treebank Propbank
LDC2017T15
October 25, 2016
University of Colorado at Boulder
Linguistic Data Consortium

Authors: Martha Palmer, Jena D. Hwang, Claire Bonial, Tim O'Gorman, James Gung

1 Introduction

This file contains documentation for the English Web Treebank Propbank annotations. This release contains proposition annotations for 49,118 predicate instances found in the 254,830-token source Treebank corpus (LDC2012T13). All annotations follow the English PropBank guidelines (docs/Propbank-Annotation-Guidelines.pdf).

2 Source Data and Selection

Source files used for English PropBank annotation were taken from the treebank annotations in the English Web Treebank (LDC2012T13).

3 Annotation Data Profile

Language: English
Genre: Webtext, with five subgenres (reviews, weblogs, email, newsgroups, and answers)
Source Word Count: 254,830 tokens
Total Predicates in Release: 49,118 predicates

4 Annotation

4.1 Annotation Guidelines and Updates

The annotation guidelines for the English Proposition Bank are included in this package. They can be found at docs/Propbank-Annotation-Guidelines.pdf. Instances were annotated with rolesets unified across parts of speech, using the Propbank v3.0 frames, and were updated to the Propbank v3.1 frames before release.

4.2 Annotator Training

The majority of our annotators have experience from previous PropBank annotation projects (e.g., Gale OntoNotes PropBank, BOLT Propbank annotation, medical propbanking). New annotators were trained on a set of trial data until they reached an adequate level of consistency before starting production-level annotation.

4.3 Annotation Process and Quality Control

All of the annotations for non-'be' verbs in this release are the result of double-blind annotation followed by adjudication of disagreements. All instances of the verb 'be' were manually annotated in a single pass and then adjudicated against deterministic, heuristic-based SRL output.
The adjudicator resolves the disagreements between the human and the non-human annotation. Auxiliary senses are also automatically labeled as such when the gold Treebank annotation is unambiguous regarding their function; all edge cases are manually double-annotated.

5 File Format

The proposition format is described in docs/EPB-data-format.txt. The frame files are in XML format; their definition is included in docs/frameset.dtd.

6 Data Directory Structure

data/annotation/frames: frame files
data/annotation/props: files of PropBank annotation
data/source: parse files from the English Web Treebank (LDC2012T13) used in annotation

7 Documentation

docs/Propbank-Annotation-Guidelines.pdf: English Proposition Bank annotation guidelines
docs/EPB-data-format.txt: explanation of the data format of the English Proposition Bank annotation
docs/Description-of-PB3-changes.md: overview of changes enacted in the recent switch to "unified" frames

8 Data Validation

8.1 Data Consistency

The annotated propositions went through automatic validation to ensure that (1) pointers to the tree nodes are valid, (2) PropBank labels are valid, and (3) the PropBank annotation is consistent with the associated frameset. Additionally, the XML frame files were validated against docs/frameset.dtd and checked for frame-internal consistency (e.g., misspellings, extraneous characters, general correctness).

8.2 General Sanity Checks

Directories were checked for consistent file counts and for the presence of empty files or backup files.

9 Known Problems

None

10 Copyright Info

(c) 2013, 2014, 2015 Trustees of the University of Pennsylvania.

11 Acknowledgement

Katie Conger and Julia Bonn, trainers and adjudicators

12 Contact Information

If you have questions about this data release, please contact the following personnel:

Martha Palmer
Tim O'Gorman

---------------------------------------------
README created on October 25, 2016 by Tim O'Gorman