This is the readme file for the Dutch part of the CONLL 2006 Shared Task.

Version: $Id: README,v 1.4 2006/01/09 16:51:56 erwin Exp $


1. Preamble

1.1 Source

The Alpino Treebank is available from
http://odur.let.rug.nl/~vannoord/trees/

1.2 Copyright

The Dutch data is derived from the Alpino Treebank.
Copyright 2002-2005 Leonoor van der Beek, Gosse Bouma, Geert
Kloosterman, Robert Malouf, Gertjan van Noord, NWO, RUG.

1.3 License

The Dutch data is derived from the Alpino Treebank, and is therefore
subject to the GPL. You can redistribute it and/or modify it under the
terms of the GNU General Public License as published by the Free
Software Foundation.


2. Documentation

2.1 Data format

Data adheres to the following rules:

* Data files contain one or more sentences separated by a blank line.

* A sentence consists of one or more tokens, each one starting on a
  new line.

* A token consists of ten fields described in the table below. Fields
  are separated by one tab character.

* All data files contain these ten fields, although only the ID,
  FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are guaranteed to
  contain non-underscore values for all languages.

* Data files are UTF-8 encoded (Unicode).

----------------------------------------------------------------------
Field number:  Field name:  Description:
----------------------------------------------------------------------
1              ID           Token counter, starting at 1 for each new
                            sentence.
2              FORM         Word form or punctuation symbol.
3              LEMMA        Stem of word form, or a concatenation of
                            stems in case of a multi-word unit, or an
                            underscore if not available.
4              CPOSTAG      Coarse-grained part-of-speech tag; see the
                            file tagset.txt.
5              POSTAG       Fine-grained part-of-speech tag, identical
                            to the coarse-grained part-of-speech tag
                            except for multi-word units, where it is
                            the concatenation of the coarse-grained
                            part-of-speech tags of the words.
6              FEATS        List of set-valued syntactic and/or
                            morphological features, separated by a
                            vertical bar (|), or an underscore if not
                            available; see the file tagset.txt.
7              HEAD         Non-projective head of current token,
                            which is either a value of ID or zero
                            ('0').
8              DEPREL       Dependency relation to the non-projective
                            head, which is 'ROOT' when the value of
                            HEAD is zero; see below for the set of
                            dependency relations.
9              PHEAD        Projective head of current token, which is
                            always an underscore because it is not
                            available from the Dutch treebank.
10             PDEPREL      Dependency relation to projective head,
                            which is always an underscore because it
                            is not available from the Dutch treebank.
----------------------------------------------------------------------

3.2 Text

The text material comes from the following sources:

* 7153 sentences from the cdbl (Newspaper) part of the Eindhoven
  Corpus
* 425 sentences from the Spoken Dutch Corpus (CGN) annotation
  guidelines
* 450 questions from CLEF 2003
* 700 questions from CLEF 2004
* 200 questions from CLEF 2005
* 500+ sentences from the EANS
* 1000 sentences constructed during the development of the Alpino
  Grammar and Lexicon
* 350+ sentences also constructed during the development of the
  Alpino Grammar and Lexicon
* 330 sentences from the CGN Leuven Yellow Pages document
* A set of 18 sentences used in the Battle of the Parsers during the
  2001 LOT Winterschool
* 1000 quiz questions

3.3 Part-of-Speech tags

The original POS tags from the Alpino Treebank were replaced by POS
tags from the Memory-based part-of-speech tagger using the WOTAN
tagset, which is described in the file tagset.txt.

3.5 Dependency relations

The syntactic annotation is mostly identical to that of the Corpus
Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the
file syn_prot.pdf (Dutch only). An attempt to describe a number of
differences between the CGN and Alpino annotation practice is given in
the file diff.pdf (which is heavily out of date, although the number
of differences has recently been reduced considerably).

----------------------------------------------------------------------
Relation:      Description:
----------------------------------------------------------------------
ROOT
app
body
cnj
crd
det
hd
hdf
ld
me
mod
obcomp
obj1
obj2
pc
pobj1
predc
predm
punct
sat
se
su
sup
svp
vc
----------------------------------------------------------------------

3.6 Conversion

Issues:

- head selection
- multi-word units
- discourse units


4. Acknowledgements

Gertjan van Noord and all the other people at the University of
Groningen for creating the Alpino Treebank and releasing it for free.

Gertjan van Noord for answering all my questions and for providing
extra test material.

Antal van den Bosch for help with the memory-based tagger.
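

Example: reading the data format

As an illustration of the rules in section 2.1, the following Python
sketch splits a data file into sentences and tokens. It is not part of
the shared task distribution. The names Token, parse_token and
parse_conll are hypothetical, and the two sample sentences (including
their tag values) are invented for illustration and are not taken from
the treebank.

```python
from typing import Iterator, List, NamedTuple


class Token(NamedTuple):
    """One token line: the ten tab-separated fields of section 2.1."""
    id: int
    form: str
    lemma: str
    cpostag: str
    postag: str
    feats: List[str]   # empty list when the column is an underscore
    head: int          # 0 means the token is attached to the root
    deprel: str
    phead: str         # always "_" in the Dutch data
    pdeprel: str       # always "_" in the Dutch data


def parse_token(line: str) -> Token:
    """Split one line on tabs and convert the numeric columns."""
    cols = line.split("\t")
    if len(cols) != 10:
        raise ValueError("expected 10 fields, got %d: %r" % (len(cols), line))
    feats = [] if cols[5] == "_" else cols[5].split("|")
    return Token(int(cols[0]), cols[1], cols[2], cols[3], cols[4],
                 feats, int(cols[6]), cols[7], cols[8], cols[9])


def parse_conll(text: str) -> Iterator[List[Token]]:
    """Yield sentences (lists of tokens) separated by blank lines."""
    sentence: List[Token] = []
    for line in text.splitlines():
        if line.strip() == "":
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(parse_token(line))
    if sentence:  # last sentence may lack a trailing blank line
        yield sentence


# Invented sample data, for illustration only.
SAMPLE = (
    "1\tDe\tde\tArt\tArt\t_\t2\tdet\t_\t_\n"
    "2\tkat\tkat\tN\tN\t_\t3\tsu\t_\t_\n"
    "3\tslaapt\tslaap\tV\tV\t_\t0\tROOT\t_\t_\n"
    "4\t.\t.\tPunc\tPunc\t_\t3\tpunct\t_\t_\n"
    "\n"
    "1\tJa\tja\tInt\tInt\t_\t0\tROOT\t_\t_\n"
)

sentences = list(parse_conll(SAMPLE))
roots = [tok.form for sent in sentences for tok in sent if tok.head == 0]
```

To read a real data file, its contents can be obtained with
open(path, encoding="utf-8").read() and passed to parse_conll, since
the files are UTF-8 encoded.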