======================================================================
   German corpus for the CoNLL-2009 shared task
   "Syntactic and Semantic Dependencies in Multiple Languages"

   Version 1.1: January 14, 2009

   Organizers of this corpus:
     Yi Zhang
     Sebastian Pado
======================================================================

* GENERAL

This file contains the basic information regarding the German corpus
provided for the CoNLL-2009 shared task on "Syntactic and Semantic
Dependencies in Multiple Languages"
(http://ufal.mff.cuni.cz/conll2009-st/).

The data of this distribution is derived from the TIGER Treebank and
the SALSA Corpus, converted into the syntactic and semantic
dependencies compatible with the CoNLL-2009 shared task. Please refer
to the following COPYRIGHT section for detailed license agreements.

* COPYRIGHT

This data set is derived from the TIGER Treebank and SALSA Corpus. The
text of this distribution in turn comes from the Frankfurter Rundschau
newspaper and its Copyright is held by:

   Druck- und Verlagshaus Frankfurt am Main GmbH
   Verlag der Frankfurter Rundschau
   Große Eschenheimer Straße 16-18
   D-60313 Frankfurt am Main

In addition, the users of this corpus are requested to conform to the
following license agreements of the original corpora:

   i)  License Agreement for the TIGER Corpus for non-commercial use
       (license-tiger.html)
   ii) License Agreement for the SALSA Corpus for non-commercial use
       (license-salsa.html)

* LIST OF CHANGES

2009-02-12:
** changed umlauts in the frame filenames
** repackaged frame files using zip (instead of tar/gzip)

* CONTENT OF THIS DISTRIBUTION

The following files are included in this distribution:

** README.TXT - this file
** CoNLL2009-ST-German-development.txt - the development corpus
** CoNLL2009-ST-German-train.txt - the training corpus
** CoNLL2009-ST-German-trial.txt - the trial corpus.
   It contains the first 400 sentences from the training corpus
** salsa-frames.tar.gz - definition of verb frames and mappings to
   SALSA/FrameNet frames
** license-tiger.html - license agreement for the TIGER Corpus for
   non-commercial use
** license-salsa.html - license agreement for the SALSA Corpus for
   non-commercial use

* SPECIFICS OF THE GERMAN DATASET

** Syntactic dependency trees can be non-projective.

** Predicates are not exhaustively annotated in SALSA. Additionally,
   some SALSA annotation types (such as metaphorical or non-literal
   usages) do not map straightforwardly to the CoNLL annotation
   scheme. Thus, annotation in this dataset is only partial. Please
   refer to the FILLPRED field to decide whether SRL should be
   performed for an instance of a predicate.

** To make the German semantic role annotation similar to the datasets
   of other languages, the original SALSA FrameNet SRL annotations
   were mapped semi-automatically onto PropBank-style
   rolesets/arguments. Evaluation will also take place against this
   PropBank annotation.

   We provide all mappings from the original SALSA frames/frame
   elements to PropBank rolesets/arguments, as well as the definitions
   of the original frames, in XML format in the file
   salsa-frames.tar.gz. Participants are free to use this information
   in their systems.

   Note, however, that the original SALSA annotations make use of
   German "proto-frames" that have not been officially integrated into
   FrameNet and differ from FrameNet frames in terms of granularity.
   Proto-frames are recognisable by the source="SALSA" attributes of
   the "frame" elements in the XML files.

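As a rough illustration of the FILLPRED note above, the following
Python sketch selects the tokens for which SRL is expected. It assumes
the general CoNLL-2009 column layout (one token per line, tab-separated
fields, a blank line between sentences, FILLPRED as the 13th column,
with the value "Y" marking predicates to be labelled). That layout is
defined by the shared task, not by this README, and the function names
read_sentences and predicates_to_label are purely illustrative.

```python
# Column order assumed from the CoNLL-2009 shared task format
# description (an assumption; this README does not list the columns).
COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
           "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED"]


def read_sentences(lines):
    """Yield sentences as lists of token dicts; blank lines separate sentences."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        fields = line.split("\t")
        token = dict(zip(COLUMNS, fields))
        # Any remaining columns are the APRED columns, one per predicate.
        token["APREDS"] = fields[len(COLUMNS):]
        sentence.append(token)
    if sentence:
        yield sentence


def predicates_to_label(sentence):
    """Return the tokens whose FILLPRED field is 'Y' (assumed marker)."""
    return [tok for tok in sentence if tok.get("FILLPRED") == "Y"]
```

An open file handle for one of the corpus files (for example
CoNLL2009-ST-German-trial.txt, opened with a UTF-8 encoding) can be
passed directly to read_sentences. Tokens not returned by
predicates_to_label should be left without a predicate sense, since
the annotation in this dataset is only partial.
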
* PREPROCESSING SYSTEMS

** The Part-of-Speech (PoS) tagset used in this dataset is the
   Stuttgart-Tuebingen tagset (STTS), as used in the TIGER corpus and
   described in
   http://www.ims.uni-stuttgart.de/ftp/pub/corpora/stts_guide.ps.gz

** The predicted lemmas and Part-of-Speech tags (PLEMMA and PPOS) are
   generated using Helmut Schmid's TreeTagger tool:
   http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

** The predicted morphological features (PFEAT) are produced using
   Morphisto, a finite-state-transducer-based morphological analyzer
   for German (http://www.ids-mannheim.de/ll/TextGrid/morphisto.html),
   and converted into TIGER-compatible morphological marks. Since
   Morphisto is not sequence-based and no syntax-based disambiguation
   has been performed, the morphological information in PFEAT is, as a
   rule, less specific than the gold standard.

** The predicted syntactic dependency columns (PHEAD and PDEPREL) are
   generated using the MSTParser (McDonald et al. 2005) with the
   non-projective decoder.
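
To get a feeling for how far the predicted columns deviate from the
gold standard, one can count per-column agreement. The Python sketch
below is a minimal example under the assumption that the files follow
the general CoNLL-2009 column order (LEMMA/PLEMMA, POS/PPOS,
FEAT/PFEAT, HEAD/PHEAD and DEPREL/PDEPREL as adjacent tab-separated
columns 3 to 12). The column positions and the function name
column_agreement are not taken from this README.

```python
# Gold/predicted column pairs as 0-based indices under the assumed
# CoNLL-2009 column order (an assumption, see the text above).
COLUMN_PAIRS = {
    "LEMMA/PLEMMA": (2, 3),
    "POS/PPOS": (4, 5),
    "FEAT/PFEAT": (6, 7),
    "HEAD/PHEAD": (8, 9),
    "DEPREL/PDEPREL": (10, 11),
}


def column_agreement(lines):
    """Return, per column pair, the fraction of tokens where gold == predicted."""
    matches = {name: 0 for name in COLUMN_PAIRS}
    total = 0
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:  # skip blank separator lines and short rows
            continue
        total += 1
        for name, (gold, pred) in COLUMN_PAIRS.items():
            if fields[gold] == fields[pred]:
                matches[name] += 1
    if total == 0:
        return {name: 0.0 for name in COLUMN_PAIRS}
    return {name: count / total for name, count in matches.items()}
```

The comparison is an exact string match, so the FEAT/PFEAT figure will
understate the usefulness of PFEAT: a less specific but compatible
analysis counts as a mismatch here.
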