Penn Discourse Treebank Version 2.0 - German Translation (GermanPDTB)
==========

Henny Sluyter-Gäthje, Peter Bourgonje, Manfred Stede

### DESCRIPTION

The GermanPDTB is a German corpus annotated for shallow discourse relations in the (financial) news domain. The corpus was produced automatically on the basis of a subset of the PDTB 2.0 (Penn Discourse Treebank [1]) that was used in the 2016 CoNLL shared task on discourse parsing. The PDTB consists of articles from the Wall Street Journal which were manually annotated for discourse relations.

The GermanPDTB was created in two steps: automatic translation of the PDTB with DeepL [2], followed by projection of the annotations using word alignments produced with GIZA++ [3]. The final version of the corpus was produced in October 2019. Since only a subset of the corpus was evaluated manually, it should be considered silver-standard data and may contain errors.

The corpus can be used to train systems for shallow discourse parsing.

### ANNOTATION SCHEME

In the PDTB, discourse relations are divided into five relation types:

1. Explicit relations consist of an overtly realised discourse connective (e.g. if, because), one external (arg1) and one internal (arg2) argument. The internal argument is syntactically integrated with the discourse connective. A relation sense (e.g. Contingency.Condition) is allocated to the discourse connective.
2. Implicit relations consist of two arguments only. No discourse connective is overtly realised, but annotators suggest a discourse connective that matches the relation sense. Both the suggestion and the sense are annotated on the first word of arg2.
3.-5. If neither an explicit nor an implicit relation could be assigned, annotators could choose between AltLex, EntRel and NoRel relations.
   a) If the relation between two segments is explicitly expressed, but not in the form of a discourse connective, an AltLex relation is assigned.
      A sense is allocated to these relations. The sense is annotated on those words in arg2 that express the relation (e.g. this is that).
   b) If two segments speak of the same entity, an EntRel relation is assigned, but no relation sense can be allocated.
   c) If no relation holds between two segments, NoRel is assigned.

In the GermanPDTB, this scheme is adopted and all relations (except for NoRel) are projected onto the German translation.

### FORMAT

The PDTB used in the 2016 CoNLL shared task was provided in the CoNLL format. The corpus was split into several files, with each file containing connected discourse relations (at least two sentences, at most 178 sentences). In the CoNLL format, information is represented in a tab-separated, tabular format with one token per row. The number of columns is consistent within a file. The first five columns convey the following information:

1. Word ID within the file
2. Sentence ID within the file
3. Word ID within the sentence
4. Word
5. PoS-Tag

The remaining columns indicate the discourse relations (one column per relation).

Annotation keywords:

    Argument 1:                      arg1
    Argument 2:                      arg2
    Explicit discourse connective:   conn|sense
    Implicit discourse relation:     arg2|proposed word|sense
    AltLex discourse relation:       arg2|altlex|sense
    EntRel discourse relation:       arg2|EntRel

The GermanPDTB follows this format. In contrast to the PDTB, the GermanPDTB is provided as one single file. The IDs allow for straightforward splitting into several files. The PoS-Tags were produced with MarMoT [4].

### KEY CHARACTERISTICS

    Total relations                               39,311
    Explicit relations                            16,670
    Implicit relations                            15,533
    EntRel relations                               4,783
    AltLex relations                                 602
    Unique connectives (in explicit relations)       185
    Arg1 token length (average)                    17.91
    Arg2 token length (average)                    16.58

The corpus, its creation and its evaluation are described in more detail in this paper:

Sluyter-Gäthje, H., Bourgonje, P. and Stede, M. (2020).
Shallow Discourse Parsing for Under-Resourced Languages: Combining Machine Translation and Annotation Projection. In Proceedings of the 12th International Conference on Language Resources and Evaluation (LREC’20), Marseille, France, May. European Language Resources Association (ELRA).

### CHECKSUMS

GermanPDTB.conll: sha1 28ee3c2443e790d013e4a0833630614d8a9c7d2b
PDTB_deepL_translation.txt: sha1 260d556cb84eb9ab9796d4a1ed63996f119511f9

[1] Prasad, R., Dinesh, N., Lee, A., Miltsakaki, E., Robaldo, L., Joshi, A., and Webber, B. (2008). The Penn Discourse TreeBank 2.0. In Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC’08), Marrakech, Morocco, May. European Language Resources Association (ELRA).
[2] https://www.deepl.com/en/translator
[3] Och, F. J. and Ney, H. (2003). A Systematic Comparison of Various Statistical Alignment Models. Computational Linguistics, 29(1):19–51.
[4] Mueller, T., Schmid, H., and Schütze, H. (2013). Efficient Higher-Order CRFs for Morphological Tagging. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 322–332, Seattle, USA, October. Association for Computational Linguistics.
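
### EXAMPLE: READING THE FORMAT

The column layout described under FORMAT can be read with a few lines of Python. The following is a minimal sketch and not part of the corpus distribution. It makes several assumptions that should be checked against the actual file: that tokens without an annotation carry a placeholder such as `_`, that the file-level word ID restarts at the beginning of each original PDTB file (which is how the sketch splits the single file into documents), and that the cell of an explicit connective starts with the literal keyword `conn`. The sample sentence, its STTS-style tags and the helper names (`parse_conll`, `split_documents`, `relation_spans`) are invented for illustration.

```python
from collections import namedtuple

# One row of the file: the five fixed columns plus the relation columns.
Token = namedtuple("Token", "doc_tok_id sent_id sent_tok_id word pos relations")


def parse_conll(lines):
    """Parse tab-separated token lines into Token tuples."""
    tokens = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines, if the file contains any
        fields = line.split("\t")
        tokens.append(Token(int(fields[0]), int(fields[1]), int(fields[2]),
                            fields[3], fields[4], fields[5:]))
    return tokens


def split_documents(tokens):
    """Start a new document whenever the file-level word ID does not increase.

    ASSUMPTION: the word ID within the file restarts for each original file.
    """
    docs = []
    for tok in tokens:
        if not docs or tok.doc_tok_id <= docs[-1][-1].doc_tok_id:
            docs.append([])
        docs[-1].append(tok)
    return docs


def relation_spans(doc, column):
    """Collect the words of one relation column, keyed by the keyword
    that precedes the first '|' in the cell (arg1, arg2 or conn)."""
    spans = {}
    for tok in doc:
        if column >= len(tok.relations):
            continue  # this row has no such relation column
        label = tok.relations[column].split("|")[0]
        if label in ("arg1", "arg2", "conn"):
            spans.setdefault(label, []).append(tok.word)
    return spans


# Invented toy data: one explicit relation, then a second document.
ROWS = [
    ("0", "0", "0", "Er", "PPER", "arg1"),
    ("1", "0", "1", "bleibt", "VVFIN", "arg1"),
    ("2", "0", "2", "zu", "APPR", "arg1"),
    ("3", "0", "3", "Hause", "NN", "arg1"),
    ("4", "0", "4", ",", "$,", "_"),
    ("5", "0", "5", "wenn", "KOUS", "conn|Contingency.Condition"),
    ("6", "0", "6", "es", "PPER", "arg2"),
    ("7", "0", "7", "regnet", "VVFIN", "arg2"),
    ("8", "0", "8", ".", "$.", "_"),
    ("0", "0", "0", "Danke", "ITJ", "_"),
    ("1", "0", "1", ".", "$.", "_"),
]
SAMPLE = "\n".join("\t".join(row) for row in ROWS)

docs = split_documents(parse_conll(SAMPLE.splitlines()))
for label, words in relation_spans(docs[0], 0).items():
    print(label, " ".join(words))
```

For the real corpus, the same functions could be applied to the lines of GermanPDTB.conll opened with UTF-8 encoding. Note that an implicit relation marks its proposed connective and sense on an `arg2` cell, so recovering them would require inspecting the full cell rather than only the keyword before the first `|`.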