RST Signalling Corpus

Debopam Das and Maite Taboada
Simon Fraser University
September 2014
ddas@sfu.ca, mtaboada@sfu.ca

The RST Signalling Corpus, which is built over the RST Discourse Treebank, is a collection of over 20,000 coherence relations annotated for signalling information. The RST Discourse Treebank (RST-DT) is a collection of newspaper texts already annotated for coherence relations. In the RST Signalling Corpus, we have added a new layer of signalling information to the existing RST-DT with the aim of finding out how coherence relations are signalled in discourse.

More information about the annotation project can be found in Debopam Das' PhD dissertation "Signalling of Coherence Relations in Discourse", completed at Simon Fraser University (SFU) in Summer 2014. The dissertation is available through the SFU library (http://www.lib.sfu.ca/help/publication-types/finding-sfu-theses).

The annotation in the RST Signalling Corpus was performed using version 2.8.12 of UAM CorpusTool (http://www.wagsoft.com/CorpusTool/). The signalling annotations can also be opened with later versions of UAM CorpusTool, but you may need to convert the .ctpr file to the newer version's format.

===============================================================

A description of the directories, sub-directories and data follows:

The root directory, Annotation, includes three sub-directories:

(1) Training_Annotation
(2) Test_Annotation
(3) Full_Annotation

(1) Training_Annotation
========================

This directory contains the signalling annotations of 347 articles from the training subset of the RST-DT. The directory includes four subdirectories: (1.1) Analyses, (1.2) Corpus, (1.3) Results and (1.4) Schemes. It also includes a UAM CorpusTool project file named "Training_Annotation.ctpr", which can be used to open, view and edit the signalling annotations for those 347 training articles.
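
The layout just described can be checked before opening the project in UAM CorpusTool. The following is a minimal Python sketch and is not part of the corpus distribution. The function name missing_entries and the example path are assumptions made for illustration, and the check covers only the names given in this README.

```python
from pathlib import Path

# Subdirectories this README lists for an annotation directory.
EXPECTED_SUBDIRS = ("Analyses", "Corpus", "Results", "Schemes")


def missing_entries(annotation_dir):
    """Return the expected entries that are absent from annotation_dir.

    annotation_dir is a path such as "Annotation/Training_Annotation"
    (hypothetical location; adjust to where the corpus was unpacked).
    """
    root = Path(annotation_dir)
    missing = [name for name in EXPECTED_SUBDIRS if not (root / name).is_dir()]
    # The README names "Training_Annotation.ctpr"; using the directory name
    # plus ".ctpr" for any other directory is an assumption.
    project_file = root / f"{root.name}.ctpr"
    if not project_file.is_file():
        missing.append(project_file.name)
    return missing
```

An empty list returned by missing_entries("Annotation/Training_Annotation") would suggest that the directory matches the description above.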

(1.1) Analyses: This directory includes a subdirectory, (1.1.1) Training, which further includes 347 subdirectories containing the signalling annotations of the 347 training articles. Each of these subdirectories begins with a name of the following form: .txt in which represents the number of the source article in the RST-DT for which the signalling annotation is provided. A .txt directory includes three files: (1.1.1.1) Metadata.xml, (1.1.1.2) Signal.xml and (1.1.1.3) Signal.xml.old.

(1.1.1.1) Metadata.xml: This XML file contains the metadata of the annotation (language, encoding format, font type and font size).

(1.1.1.2) Signal.xml: This XML file contains the actual signalling annotation.

(1.1.1.3) Signal.xml.old: This .old file is a back-up of the previous annotation, and is created when a file is opened for annotation. The .old files have been removed from the distribution version of the corpus.

The Training (1.1.1) directory also includes two more files, METADATA.dtd and document.dtd, which are used to validate the (1.1.1.1) Metadata.xml and (1.1.1.2) Signal.xml files, respectively.

(1.2) Corpus: This directory contains the source corpus for the signalling annotation. It includes a subdirectory, (1.2.1) Training, which further includes the 347 text files on which the signalling annotation is performed. Each file begins with a name of the following form: .txt in which represents the number of the source article in the RST-DT for which the signalling annotation is performed.

(1.3) Results: This directory is empty, but it can be used to store search results and statistics for the RST Signalling Corpus produced by UAM CorpusTool.

(1.4) Schemes: This directory includes four files: (1.4.1) ACRuleList.xml, (1.4.2) Signal.xml, (1.4.3) Network.dtd and (1.4.4) rules.dtd.

(1.4.1) ACRuleList.xml: This XML file is produced automatically, and does not contain any information about the signalling annotation.

(1.4.2) Signal.xml: This file is different from the (1.1.1.2) Signal.xml file in the (1.1) Analyses directory. This XML file contains the signalling annotation scheme used in the RST Signalling Corpus.

The other two files, (1.4.3) Network.dtd and (1.4.4) rules.dtd, are used to validate the (1.4.2) Signal.xml file and the (1.4.1) ACRuleList.xml file, respectively.

(2) Test_Annotation
========================

This directory contains the signalling annotations of 38 articles, the test subset, from the RST-DT. The organization of data in this directory is identical to that of the (1) Training_Annotation directory.

(3) Full_Annotation
========================

This directory contains the complete annotation of 385 articles from the RST-DT, combining data from the (1) Training_Annotation and (2) Test_Annotation directories. We have included the Full_Annotation directory so that search results and statistics can be produced for the entire corpus. The organization of data in this directory is identical to that of the (1) Training_Annotation directory.
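
For users who want to inspect the annotations outside UAM CorpusTool, the following Python sketch lists the article directories under an Analyses subdirectory and counts the XML element tags in one Signal.xml file. It is an illustration only: the function names article_dirs and tag_counts and the example path are assumptions, and the code deliberately assumes nothing about the element names used inside Signal.xml. It relies on the standard library XML parser, which does not validate against document.dtd.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def article_dirs(analyses_split_dir):
    """List article directories (named *.txt) that contain a Signal.xml file."""
    root = Path(analyses_split_dir)
    return sorted(
        d for d in root.iterdir()
        if d.is_dir() and d.name.endswith(".txt") and (d / "Signal.xml").is_file()
    )


def tag_counts(signal_xml):
    """Count element tags in one Signal.xml file, without assuming its schema."""
    tree = ET.parse(signal_xml)
    return Counter(elem.tag for elem in tree.iter())


if __name__ == "__main__":
    # Hypothetical location; adjust to where the corpus was unpacked.
    training = Path("Annotation/Training_Annotation/Analyses/Training")
    if training.is_dir():
        dirs = article_dirs(training)
        # This README states that the training subset has 347 articles.
        print(f"{len(dirs)} annotated articles found")
        if dirs:
            print(tag_counts(dirs[0] / "Signal.xml").most_common(5))
```

Comparing the number of directories found with the article counts given above (347 training, 38 test, 385 in total) is a quick way to confirm that a copy of the corpus is complete.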