======================================================================
             German corpus for the CoNLL-2009 shared task
      "Syntactic and Semantic Dependencies in Multiple Languages"

                    Version 1.1: January 14, 2009


  Organizers of this corpus:
     Yi Zhang <yzhang@coli.uni-sb.de>
     Sebastian Pado <pado@stanford.edu>

======================================================================


 * GENERAL

This file contains basic information about the German corpus
provided for the CoNLL-2009 shared task on "Syntactic and Semantic
Dependencies in Multiple Languages"
(http://ufal.mff.cuni.cz/conll2009-st/). The data in this distribution
is derived from the TIGER Treebank and the SALSA Corpus, converted
into syntactic and semantic dependencies compatible with the
CoNLL-2009 shared task. Please refer to the COPYRIGHT section below
for the detailed license agreements.


 * COPYRIGHT

This data set is derived from the TIGER Treebank and SALSA Corpus. The
text of this distribution in turn comes from the Frankfurter Rundschau
newspaper, and its copyright is held by:

    Druck- und Verlagshaus Frankfurt am Main GmbH
    Verlag der Frankfurter Rundschau
    Große Eschenheimer Straße 16-18
    D-60313 Frankfurt am Main

In addition, users of this corpus are requested to comply with the
following license agreements of the original corpora:

  i) License Agreement for the TIGER Corpus for non-commercial use
     (license-tiger.html)

 ii) License Agreement for the SALSA Corpus for non-commercial use
     (license-salsa.html)


 * LIST OF CHANGES

2009-02-12: 
  ** changed umlauts in the frame filenames
  ** repackaged frame files using zip (instead of tar/gzip)


 * CONTENT OF THIS DISTRIBUTION

The following files are included in this distribution:

  ** README.TXT - this file
  ** CoNLL2009-ST-German-development.txt - the development corpus
  ** CoNLL2009-ST-German-train.txt - the training corpus
  ** CoNLL2009-ST-German-trial.txt - the trial corpus. It contains the
     first 400 sentences from the training corpus
  ** salsa-frames.tar.gz - definition of verb frames and mappings to 
     SALSA/FrameNet frames
  ** license-tiger.html - license agreement for the TIGER Corpus for 
     non-commercial use
  ** license-salsa.html - license agreement for the SALSA Corpus for
     non-commercial use


 * SPECIFICS OF THE GERMAN DATASET

  ** Syntactic dependency trees can be non-projective.

  ** Predicates are not exhaustively annotated in SALSA. Additionally,
     some SALSA annotation types (such as metaphorical or non-literal
     usages) do not map straightforwardly to the CoNLL annotation
     scheme. Thus, annotation in this dataset is only partial. Please
     refer to the FILLPRED field to decide whether SRL should be
     performed for an instance of a predicate.

  ** To make the German semantic role annotation similar to the
     datasets of other languages, the original SALSA FrameNet 
     SRL annotations were mapped semi-automatically onto 
     PropBank-style rolesets/arguments. Evaluation will also 
     take place against this PropBank annotation.
 
     We provide all mappings from the original SALSA 
     frames/frame elements to PropBank rolesets/arguments,
     as well as the definitions of the original frames, in 
     XML format in the file salsa-frames.tar.gz. Participants 
     are free to use this information in their systems. 
     Note, however, that the original SALSA annotations make use
     of German "proto-frames" that have not been officially
     integrated into FrameNet and differ from FrameNet frames
     in terms of granularity. Proto-frames can be recognised by
     the source="SALSA" attribute of the "frame" elements in
     the XML files.
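
The FILLPRED note above can be sketched in Python as follows. This is
a minimal illustration, not part of the official tooling. It assumes
the tab-separated column layout of the CoNLL-2009 shared task format
(ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL
FILLPRED PRED APREDs), which this README does not spell out. The
function names, the sample sentence and the roleset label "kommen.1"
are invented for the example.

```python
# Zero-based column positions, assuming the CoNLL-2009 column layout:
# ID FORM LEMMA PLEMMA POS PPOS FEAT PFEAT HEAD PHEAD DEPREL PDEPREL
# FILLPRED PRED APRED1..APREDn
FORM_COL = 1
FILLPRED_COL = 12


def read_sentences(lines):
    """Yield sentences as lists of token rows; blank lines end a sentence."""
    sentence = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        sentence.append(line.split("\t"))
    if sentence:
        yield sentence


def predicates_to_label(sentence):
    """Return (token id, form) for tokens whose FILLPRED column is 'Y'."""
    return [
        (int(tok[0]), tok[FORM_COL])
        for tok in sentence
        if len(tok) > FILLPRED_COL and tok[FILLPRED_COL] == "Y"
    ]


# Invented two-token sentence; not taken from the corpus.
SAMPLE = "\n".join(
    "\t".join(cols)
    for cols in [
        ["1", "Er", "er", "er", "PPER", "PPER", "_", "_",
         "2", "2", "SB", "SB", "_", "_", "A0"],
        ["2", "kommt", "kommen", "kommen", "VVFIN", "VVFIN", "_", "_",
         "0", "0", "ROOT", "ROOT", "Y", "kommen.1", "_"],
    ]
)

if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        print(predicates_to_label(sent))
```

Tokens whose FILLPRED column is not "Y" are skipped, so a system
built this way would only attempt SRL where the dataset marks a
predicate as annotated.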


 * PREPROCESSING SYSTEMS

  ** The Part-of-Speech (PoS) tagset used in this dataset is the 
     Stuttgart-Tuebingen tagset (STTS), as used in the TIGER corpus
     and described in
     http://www.ims.uni-stuttgart.de/ftp/pub/corpora/stts_guide.ps.gz

  ** The predicted lemmas and Part-of-Speech tags (PLEMMA and
     PPOS) are generated using Helmut Schmid's TreeTagger tool:
     http://www.ims.uni-stuttgart.de/projekte/corplex/TreeTagger/

  ** The predicted morphological features (PFEAT) are produced using
     Morphisto, a finite-state-transducer-based morphological analyzer
     for German (http://www.ids-mannheim.de/ll/TextGrid/morphisto.html),
     and converted into TIGER-compatible morphological tags. Since
     Morphisto analyzes each token in isolation (it is not
     sequence-based) and no syntax-based disambiguation has been
     performed, the morphological information in PFEAT is, as a rule,
     less specific than the gold standard.

  ** The predicted syntactic dependency columns (PHEAD and PDEPREL)
     are generated using the MSTParser (McDonald et al. 2005) with the
     non-projective decoder.
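
Since both the gold and the predicted trees can be non-projective, it
may be useful to check a sentence before feeding it to a parser that
assumes projectivity. The Python sketch below uses the standard
definition: an arc is projective if every token strictly between the
head and the dependent is dominated by the head. The function name is
hypothetical. The input is a list of head indices as they would be
read from the HEAD or PHEAD column, assuming 1-based token ids with 0
for the root, as in the shared task format.

```python
def is_projective(heads):
    """Return True if the dependency tree has no non-projective arc.

    heads[i] is the head id of token i + 1; 0 denotes the artificial root.
    """
    n = len(heads)

    def is_dominated_by(tok, ancestor):
        # Follow the head chain upwards from tok. The step counter
        # guards against malformed input that contains a cycle.
        steps = 0
        while tok != ancestor and tok != 0 and steps <= n:
            tok = heads[tok - 1]
            steps += 1
        return tok == ancestor

    for dep in range(1, n + 1):
        head = heads[dep - 1]
        lo, hi = min(head, dep), max(head, dep)
        # Every token strictly inside the arc span must be under the head.
        for between in range(lo + 1, hi):
            if not is_dominated_by(between, head):
                return False
    return True


if __name__ == "__main__":
    print(is_projective([2, 0, 2]))   # token 2 is the root, 1 and 3 attach to it
    print(is_projective([3, 0, 2]))   # arc 3 -> 1 spans the root token 2
```

The check is quadratic in sentence length in the worst case, which is
usually acceptable for corpus statistics. A sentence-level count over
the training file would show how often non-projectivity occurs.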