Introduction

The PDTB-3 is the third version of the Penn Discourse TreeBank. From its start, the goal of the project has been to demonstrate textual coherence through reliably annotating a large corpus with low-level discourse relations holding between eventualities and propositions mentioned in a text (in sentences, clauses and some noun phrases), which then serve as the arguments to the relation. The corpus over which the annotation has been done is the 1 million word Wall Street Journal corpus, distributed by the LDC as Treebank-2 (LDC95T7).

Version 2.0 of the PDTB, developed with NSF support and released by the LDC in 2008 (LDC2008T05), contains over 40600 tokens of annotated relations. Largely because the PDTB was based on the simple idea that discourse relations are grounded in an identifiable set of explicit words or phrases (discourse connectives) or simply in the adjacency of two sentences, it has been taken up and used by many researchers in the NLP community and, more recently, by researchers in psycholinguistics as well. It has also stimulated the development of similar resources in other languages (Chinese, Czech, Hindi, Modern Standard Arabic, Turkish and French) and domains (biomedical texts, conversational dialogues), the organization of community-level shared tasks on shallow discourse parsing [Xue et al., 2015, 2016], and a cross-lingual discourse annotation of parallel texts, the TED-MDB Corpus [Zeyrek et al., 2018], to support both linguistic understanding of coherence in different languages and improvements in machine translation of discourse connectives. Further references to this and other work can be found on the PDTB website.

While version 3.0 of the PDTB (the PDTB-3) contains a variety of corrections to PDTB-2 annotation, its primary contribution lies in the annotation of ∼13K additional relation tokens, ∼10K of which hold within the same sentence (intra-sentential relations) and ∼2700 of which hold across sentences (inter-sentential relations). The additional intra-sentential relations comprise ∼5K tokens that are signalled by an explicit discourse connective (explicit relations), ∼4200 tokens with no explicit connective (implicit relations), ∼780 tokens in which the relations are signalled by phrases and/or lexico-syntactic constructions other than discourse connectives (alternative lexicalizations), and ∼250 tokens of intra-sentential entity relations. Of the additional inter-sentential relations, ∼900 have an explicit discourse connective (some are tokens that were missed in annotating the PDTB-2 and some are new to the PDTB-3), ∼1400 are implicit relations, ∼200 are additional alternative lexicalizations, and ∼70 are additional entity relations.

Documentation

The documentation directory for this release includes a manual describing what is new in the PDTB-3, how the PDTB-3 differs from the PDTB-2, the methods and guidelines used in annotating the PDTB-3, and a range of statistics on the tokens, including the frequency of each connective, its sense labels and its modifiers. More information about the corpus and research carried out by the developers and others using the corpus can be found on the PDTB website.

Data

The annotation is provided in the form of separate text files (standoff annotation) that are byte-indexed into the raw text files of the Penn TreeBank. Samples of the annotation of different types of discourse relations, along with their visualization in the Annotator tool, are available for:

Explicit relations
Implicit relations
AltLex and AltLexC relations
Entity relations
Hypophora relations
NoRel (annotated only between adjacent sentences within a paragraph that are not linked to each other by a discourse relation)
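
Because the annotation is standoff and byte-indexed, recovering the text of an argument or connective means slicing the raw file by byte offsets. The sketch below illustrates that idea only. It assumes a span notation of the form "start..end", with ";" separating the parts of a discontinuous span, an exclusive end offset, and a Latin-1 encoding for the raw text; all of these should be checked against the annotation manual. The function names are hypothetical and are not part of the release.

```python
def parse_span_list(span_list: str) -> list[tuple[int, int]]:
    """Parse 'start..end;start..end' into (start, end) pairs.

    The notation is an assumption for this sketch; an empty string gives [].
    """
    spans = []
    for part in span_list.split(";"):
        part = part.strip()
        if not part:
            continue
        start, end = part.split("..")
        spans.append((int(start), int(end)))
    return spans


def extract_text(raw: bytes, spans: list[tuple[int, int]],
                 encoding: str = "latin-1") -> str:
    """Slice the raw bytes at each span (end assumed exclusive) and join the pieces."""
    return " ".join(raw[start:end].decode(encoding) for start, end in spans)


# Toy data standing in for a raw text file; not taken from the corpus.
raw = b"Although prices fell, investors stayed calm."
spans = parse_span_list("0..8;22..31")
print(extract_text(raw, spans))
```

Reading the raw file in binary mode matters here: offsets counted in bytes can drift from character offsets once a file contains any multi-byte characters.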

Tools

This release includes the tool used in annotating the PDTB-3 (Annotator_v4.8.jar), which can also be used for viewing the corpus. Both the input to and the output from the tool are files of pipe-delimited records whose structure is specified in the annotation manual. The release also includes a tool for converting PDTB-2 annotation files into the PDTB-3 format.
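
A file of pipe-delimited records can be read with very little code. The following is a minimal sketch of that step, not the Annotator's own reader: it assumes one record per line and makes no claim about how many fields a record has or what each position means, since that layout is defined in the annotation manual. The function name and the sample records are invented for illustration.

```python
import io


def read_records(stream) -> list[list[str]]:
    """Split each non-blank line on '|' and keep empty fields in place."""
    records = []
    for line in stream:
        line = line.rstrip("\r\n")
        if not line:
            continue  # skip blank lines
        records.append(line.split("|"))
    return records


# Invented sample records; the real field layout is given in the manual.
sample = io.StringIO("Explicit|10..13||but\nImplicit|||\n")
for fields in read_records(sample):
    print(len(fields), fields)
```

Keeping empty fields, rather than filtering them out, preserves field positions, which is what a position-based record format depends on.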

Acknowledgment

This work has been funded by the National Science Foundation, under grant NSF IIS 1422186 to the University of Pennsylvania and grant NSF IIS 1421067 to the University of Wisconsin, Milwaukee. The content of this publication does not necessarily reflect the position or policy of the Government, and no official endorsement should be inferred.
