Item Name: RST Signalling Corpus
Author(s): Debopam Das, Maite Taboada, Paul McFetridge
LDC Catalog No.: LDC2015T10
ISBN: 1-58563-719-X
ISLRN: 256-234-245-630-4
Release Date: June 15, 2015
Member Year(s): 2015
DCMI Type(s): Text
Data Source(s): newswire
Application(s): discourse analysis
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2015T10 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Das, Debopam, Maite Taboada, and Paul McFetridge. RST Signalling Corpus LDC2015T10. Web Download. Philadelphia: Linguistic Data Consortium, 2015.
RST Signalling Corpus was developed at Simon Fraser University and contains annotations for signalling information added to RST Discourse Treebank (LDC2002T07). RST Discourse Treebank (RST-DT) is a collection of English news texts annotated for rhetorical relations under the RST (Rhetorical Structure Theory) framework. In RST Signalling Corpus, information about textual signals such as "although", "because" and "thus", as well as signals such as tense, lexical chains or punctuation, was added as an annotation layer to examine how rhetorical relations are signalled in discourse.


The source data consists of 385 Wall Street Journal news articles from the Penn Treebank annotated for rhetorical relations in RST Discourse Treebank. As in RST-DT, the data in this release is divided into a training set (347 articles) and a test set (38 articles).

The signalling annotation in this data set was performed using UAM CorpusTool version 2.8.12. Files are presented as UTF-8 encoded XML and plain text. The corpus is divided into three annotation sub-directories: training, test and full. Each sub-directory includes source, metadata, signalling annotation, and DTD files.
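
As a rough illustration of working with UTF-8 encoded XML files of this kind, the following Python sketch counts the element names found in every .xml file under a directory. It deliberately assumes nothing about the corpus's actual XML schema or file naming; the function names count_xml_tags and summarize_directory are hypothetical helpers introduced here, not part of the release or of UAM CorpusTool.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def count_xml_tags(xml_path):
    """Return a Counter of element tag names found in one XML file."""
    # ElementTree reads the encoding from the XML declaration (UTF-8 by default).
    tree = ET.parse(xml_path)
    return Counter(element.tag for element in tree.iter())


def summarize_directory(root_dir):
    """Aggregate tag counts over every .xml file under root_dir (recursively).

    Assumption: annotation files carry a .xml extension; other files are skipped.
    """
    totals = Counter()
    for xml_file in sorted(Path(root_dir).rglob("*.xml")):
        totals.update(count_xml_tags(xml_file))
    return totals
```

Calling summarize_directory on a local copy of one sub-directory (for example a path ending in "test", assuming that is how the download is unpacked) would give a first overview of which elements the annotation files contain before writing a schema-specific reader.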


