Penn Discourse Treebank Version 2.0 - German Translation (GermanPDTB)
==========

Henny Sluyter-Gäthje, Peter Bourgonje, Manfred Stede 

### DESCRIPTION
The GermanPDTB is a German corpus annotated for shallow discourse relations in the (financial) news domain. The corpus was produced automatically on the basis of a subset of the PDTB 2.0 (Penn Discourse Treebank [1]) that was used in the 2016 CoNLL shared task on discourse parsing. The PDTB consists of articles from the Wall Street Journal which were manually annotated for discourse relations.
The GermanPDTB was created in two steps: automatic translation of the PDTB with DeepL [2], followed by projection of the annotations onto the translation using word alignments produced with GIZA++ [3]. The final version of the corpus was produced in October 2019.

Since only a subset of the corpus was evaluated manually, it should be considered silver-standard data and may contain errors.

The corpus can be used to train systems for shallow discourse parsing.
 

### ANNOTATION SCHEME
In the PDTB, discourse relations are divided into five relation types:

1. Explicit relations consist of an overtly realised discourse connective (e.g. if, because), one external (arg1) and one internal (arg2) argument. The internal argument is syntactically integrated with the discourse connective. A relation sense (e.g. Contingency.Condition) is allocated to the discourse connective. 

2. Implicit relations consist of two arguments only. No discourse connective is overtly realised, but annotators suggest a discourse connective that matches the relation sense. Both the suggested connective and the sense are annotated on the first word of arg2.

3.-5. If neither an explicit nor an implicit relation could be assigned, annotators could choose between AltLex, EntRel and NoRel relations.
    a) If the relation between two segments is explicitly expressed, but not in the form of a discourse connective, an AltLex relation is assigned. A sense is allocated to these relations and annotated on the words in arg2 that express the relation (e.g. this is that).
    b) If two segments refer to the same entity, an EntRel relation is assigned, but no relation sense is allocated.
    c) If no relation holds between two segments, NoRel is assigned.

In the GermanPDTB, this scheme is adopted and all relations (except for NoRel) are projected onto the German translation text.


### FORMAT
The PDTB used in the 2016 CoNLL shared task was provided in the CoNLL format. The corpus was split into several files, with each file containing connected discourse relations (at least two sentences, at most 178 sentences).

In the CoNLL format, information is represented in a tab-separated, tabular format. The number of columns is consistent within each file.

The first five columns convey the following information:
1. Word ID within the file 
2. Sentence ID within the file
3. Word ID within the sentence
4. Word 
5. PoS-Tag 
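
The column layout above could be read along the following lines. This is only a minimal sketch: it assumes that the three ID columns hold integers, and the sample line (including the `_` filler for an empty relation column) is hypothetical and not taken from the corpus.

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    doc_word_id: int      # column 1: word ID within the file
    sent_id: int          # column 2: sentence ID within the file
    sent_word_id: int     # column 3: word ID within the sentence
    word: str             # column 4: word
    pos: str              # column 5: PoS tag
    relations: List[str]  # remaining columns, one per discourse relation


def parse_line(line: str) -> Token:
    """Split one tab-separated line into its five fixed columns plus relation columns."""
    cols = line.rstrip("\n").split("\t")
    if len(cols) < 5:
        raise ValueError(f"expected at least 5 columns, got {len(cols)}")
    # Assumption: the ID columns are integers.
    return Token(int(cols[0]), int(cols[1]), int(cols[2]), cols[3], cols[4], cols[5:])


# Hypothetical sample line; "_" as an empty-cell marker is an assumption.
sample = "0\t0\t0\tDas\tART\targ1\t_"
token = parse_line(sample)
print(token.word, token.pos, token.relations)
```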

In the following columns, the discourse relations are indicated (one column per relation). 

Annotation keywords:
Argument 1: arg1
Argument 2: arg2
Explicit discourse connective: conn|sense
Implicit discourse relation: arg2|proposed word|sense
AltLex discourse relation: arg2|altlex|sense
EntRel discourse relation: arg2|EntRel
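
A single relation cell could be interpreted according to the keywords above roughly as follows. This is a sketch under assumptions: it treats `arg1`, `arg2`, `conn`, `altlex` and `EntRel` as literal strings, treats any other three-part `arg2` cell as an implicit relation, and returns an empty role for anything else. The helper name `classify_cell` and the example senses are illustrative only; the exact strings in the file may differ.

```python
from typing import Dict, Optional


def classify_cell(cell: str) -> Dict[str, Optional[str]]:
    """Map one relation-column cell to its role and, where present, type and sense."""
    parts = cell.strip().split("|")
    head = parts[0]
    if head == "arg1" and len(parts) == 1:
        return {"role": "arg1"}
    if head == "conn" and len(parts) == 2:
        # Explicit connective: conn|sense
        return {"role": "conn", "type": "Explicit", "sense": parts[1]}
    if head == "arg2":
        if len(parts) == 1:
            return {"role": "arg2"}
        if len(parts) == 2 and parts[1] == "EntRel":
            return {"role": "arg2", "type": "EntRel"}
        if len(parts) == 3 and parts[1].lower() == "altlex":
            return {"role": "arg2", "type": "AltLex", "sense": parts[2]}
        if len(parts) == 3:
            # Implicit relation: arg2|proposed word|sense
            return {"role": "arg2", "type": "Implicit",
                    "connective": parts[1], "sense": parts[2]}
    # Anything else (e.g. a filler for "no annotation") is left unclassified.
    return {"role": None}


print(classify_cell("conn|Contingency.Condition"))
print(classify_cell("arg2|weil|Contingency.Cause"))
```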

The GermanPDTB follows this format. In contrast to the PDTB, the GermanPDTB is provided as one single file. The IDs allow for straightforward splitting into several files. The PoS-Tags were produced with MarMoT [4].
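
One possible way to split the single file back into documents is sketched below. It rests on an assumption that should be checked against the data: that the word ID in the first column restarts at the beginning of each original document, so a value that does not increase marks a document boundary. The function name `split_documents` is hypothetical, and blank lines are skipped.

```python
from typing import Iterable, Iterator, List


def split_documents(lines: Iterable[str]) -> Iterator[List[str]]:
    """Yield one list of lines per document, based on resets of the first column."""
    current: List[str] = []
    prev_id = None
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        word_id = int(line.split("\t", 1)[0])
        # Assumption: the file-level word ID restarts for every new document.
        if prev_id is not None and word_id <= prev_id:
            yield current
            current = []
        current.append(line)
        prev_id = word_id
    if current:
        yield current


# Hypothetical usage, assuming UTF-8 encoding:
# with open("GermanPDTB.conll", encoding="utf-8") as f:
#     for i, doc in enumerate(split_documents(f)):
#         with open(f"doc_{i:04d}.conll", "w", encoding="utf-8") as out:
#             out.write("\n".join(doc) + "\n")
```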


### KEY CHARACTERISTICS
Total relations			 39,311 
Explicit relations		 16,670 
Implicit relations 		 15,533 
EntRel relations		  4,783 
AltLex relations		    602

Unique connectives (in explicit relations) 	185 
Arg1 token length (average) 			17.91 
Arg2 token length (average) 			16.58 


The corpus, its creation and its evaluation are described in more detail in this paper:
Sluyter-Gäthje, H., Bourgonje, P. and Stede, M. (2020). Shallow Discourse Parsing for Under-Resourced Languages: Combining Machine Translation and Annotation Projection. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC’20), Marseille, France, May. European Language Resources Association (ELRA).

### CHECKSUMS
GermanPDTB.conll: sha1 28ee3c2443e790d013e4a0833630614d8a9c7d2b
PDTB_deepL_translation.txt: sha1 260d556cb84eb9ab9796d4a1ed63996f119511f9
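
The checksums can be verified with any SHA-1 tool. As an illustration, a small sketch using Python's standard `hashlib` module follows; the helper names `sha1_of_file` and `verify` are hypothetical, and the files are assumed to be in the current working directory.

```python
import hashlib

# Expected SHA-1 digests as listed above.
EXPECTED = {
    "GermanPDTB.conll": "28ee3c2443e790d013e4a0833630614d8a9c7d2b",
    "PDTB_deepL_translation.txt": "260d556cb84eb9ab9796d4a1ed63996f119511f9",
}


def sha1_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-1 digest of a file, read in chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(path: str, expected: str) -> bool:
    """Compare the file's digest with the expected hex string."""
    return sha1_of_file(path) == expected.lower()


# Hypothetical usage:
# for name, expected in EXPECTED.items():
#     print(name, "OK" if verify(name, expected) else "MISMATCH")
```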

[1] Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., and Webber, B. (2008). The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, May. European Language Resources Association (ELRA).
[2] https://www.deepl.com/en/translator
[3] Och, F. J. and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.
[4] Mueller, T., Schmid, H. and Schütze, H. (2013). Efficient Higher-Order CRFs for Morphological Tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, USA, October. Association for Computational Linguistics.