The PDTB-3 is the third version of the Penn Discourse TreeBank. From its start, the goal of the project has been to demonstrate textual coherence by reliably annotating a large corpus with low-level discourse relations that hold between eventualities and propositions mentioned in a text (in sentences, clauses and some noun phrases), which then serve as the arguments to the relation. The annotated corpus is the 1-million-word Wall Street Journal corpus, distributed by the LDC as Treebank-2 (LDC95T7).
Version 2.0 of the PDTB, developed with NSF support and released by the LDC in 2008 (LDC2008T05), contains over 40,600 tokens of annotated relations. The PDTB is based on the simple idea that discourse relations are grounded in an identifiable set of explicit words or phrases (discourse connectives) or simply in the adjacency of two sentences. Largely for this reason, it has been taken up and used by many researchers in the NLP community and, more recently, by researchers in psycholinguistics as well. It has also stimulated the development of similar resources in other languages (Chinese, Czech, Hindi, Modern Standard Arabic, Turkish and French) and domains (biomedical texts, conversational dialogues), the organization of community-level shared tasks on shallow discourse parsing [Xue et al., 2015, 2016], and a cross-lingual discourse annotation of parallel texts, the TED-MDB Corpus [Zeyrek et al., 2018], to support both linguistic understanding of coherence in different languages and improvements in machine translation of discourse connectives. Further references to this and other work can be found on the PDTB website.
While version 3.0 of the PDTB (the PDTB-3) contains a variety of corrections to the PDTB-2 annotation, its primary contribution lies in the annotation of ∼13K additional relation tokens, ∼10K of which hold within a single sentence (intra-sentential relations) and ∼2700 of which hold across sentences (inter-sentential relations). The additional intra-sentential relations comprise ∼5K tokens signalled by an explicit discourse connective (explicit relations), ∼4200 tokens with no explicit connective (implicit relations), ∼780 tokens in which the relation is signalled by phrases and/or lexico-syntactic constructions other than discourse connectives (alternative lexicalizations), and ∼250 tokens of intra-sentential entity relations. Of the additional inter-sentential relations, ∼900 have an explicit discourse connective (some are tokens that were missed in annotating the PDTB-2 and some are new to the PDTB-3), ∼1400 are implicit relations, ∼200 are additional alternative lexicalizations, and ∼70 are additional entity relations.
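As a rough check on the breakdown above, the approximate counts can be tallied in a few lines of Python. This is only an illustrative sketch: the dictionary names and category labels are chosen here for readability, and the inputs are the rounded figures quoted in the text, so the sums are approximate as well.

```python
# Rounded counts of additional PDTB-3 relation tokens, as quoted in the text.
# The variable names and category labels are illustrative, not corpus field names.
intra_sentential = {
    "explicit": 5000,
    "implicit": 4200,
    "alternative lexicalization": 780,
    "entity relation": 250,
}
inter_sentential = {
    "explicit": 900,
    "implicit": 1400,
    "alternative lexicalization": 200,
    "entity relation": 70,
}

# Sum each group, then combine them into an overall approximate total.
intra_total = sum(intra_sentential.values())
inter_total = sum(inter_sentential.values())
grand_total = intra_total + inter_total

print(f"intra-sentential: ~{intra_total}")
print(f"inter-sentential: ~{inter_total}")
print(f"all additional tokens: ~{grand_total}")
```

With these rounded inputs the sums come to roughly 10.2K intra-sentential and 2.6K inter-sentential tokens, about 12.8K in all, which is in line with the ∼13K total given above.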
The annotation is provided as separate text files (standoff annotation) that are byte-indexed into the raw text files of the Penn TreeBank. Samples of the annotation of different types of discourse relations, along with their visualization in the Annotator tool, can be seen at: