This is the README file for the Dutch part of the CoNLL 2006 Shared Task.

Version: $Id: README,v 1.4 2006/01/09 16:51:56 erwin Exp $


1. Preamble

    1.1 Source

        The Alpino Treebank is available from
        http://odur.let.rug.nl/~vannoord/trees/

    1.2 Copyright

        The Dutch data is derived from the Alpino Treebank. Copyright
        2002-2005 Leonoor van der Beek, Gosse Bouma, Geert Kloosterman, Robert
        Malouf, Gertjan van Noord, NWO, RUG.

    1.3 License

        The Dutch data is derived from the Alpino Treebank, and is therefore
        subject to the GPL. You can redistribute it and/or modify it under the
        terms of the GNU General Public License as published by the Free
        Software Foundation.


3. Documentation

    3.1 Data format

    	Data adheres to the following rules:

    	* Data files contain one or more sentences separated by a
	  blank line.

    	* A sentence consists of one or more tokens, each one starting on a
	  new line.

    	* A token consists of ten fields described in the table
	  below. Fields are separated by one tab character.

	* All data files contain these ten fields, although only
          the ID, FORM, CPOSTAG, POSTAG, HEAD and DEPREL columns are
          guaranteed to contain non-underscore values for all
          languages.

	* Data files are UTF-8 encoded (Unicode).

	----------------------------------------------------------------------
	Field number: 	Field name: 	Description:
	----------------------------------------------------------------------
	1 	ID 	Token counter, starting at 1 for each new sentence.
	2 	FORM 	Word form or punctuation symbol
	3 	LEMMA 	Stem of word form, or a concatenation of stems in 
			case of a multi-word unit, or an underscore if not 
			available
	4 	CPOSTAG Coarse-grained part-of-speech tag; 
			see the file tagset.txt
	5 	POSTAG 	Fine-grained part-of-speech tag, identical to the 
	                coarse-grained part-of-speech except for multi-word
			units, where it is the concatenation of the 
			coarse-grained part-of-speech tags of the words
	6 	FEATS 	List of set-valued syntactic and/or morphological 
			features; separated by a vertical bar (|), 
			or an underscore if not available;
			see the file tagset.txt
	7 	HEAD 	Non-projective head of current token, 
			which is either a value of ID or zero ('0')
	8 	DEPREL 	Dependency relation to the non-projective-head, 
			which is 'ROOT' when the value of HEAD is zero;
			see below for the set of dependency relations.
	9 	PHEAD 	Projective head of current token,
			which is always an underscore because it is not 
			available from the Dutch treebank
	10 	PDEPREL Dependency relation to projective head, 
			which is always an underscore because it is not 
			available from the Dutch treebank
	----------------------------------------------------------------------
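
	The rules above can be sketched as a small reader. This is only
	an illustration, not part of the shared task tools: the function
	name read_sentences and the two-token SAMPLE are made up for this
	example (SAMPLE is not taken from the treebank, and its tags are
	placeholders).

```python
from typing import Dict, Iterable, Iterator, List

# Field names in column order, as listed in the table above.
FIELDS = ["ID", "FORM", "LEMMA", "CPOSTAG", "POSTAG",
          "FEATS", "HEAD", "DEPREL", "PHEAD", "PDEPREL"]


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, str]]]:
    """Yield each sentence as a list of tokens (dicts keyed by field name)."""
    sentence: List[Dict[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # A blank line separates sentences.
            if sentence:
                yield sentence
                sentence = []
            continue
        # Fields are separated by exactly one tab character.
        values = line.split("\t")
        if len(values) != len(FIELDS):
            raise ValueError(
                f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
        sentence.append(dict(zip(FIELDS, values)))
    # The last sentence may not be followed by a blank line.
    if sentence:
        yield sentence


# Hypothetical sample data: one sentence of two tokens.
SAMPLE = (
    "1\tJan\tJan\tN\tN\t_\t2\tsu\t_\t_\n"
    "2\tslaapt\tslaap\tV\tV\t_\t0\tROOT\t_\t_\n"
    "\n"
)

for sent in read_sentences(SAMPLE.split("\n")):
    for tok in sent:
        print(tok["ID"], tok["FORM"], tok["HEAD"], tok["DEPREL"])
```

	To read a real data file, pass an open file object instead, for
	example open(path, encoding="utf-8"), since the data is UTF-8
	encoded.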

    3.2 Text

	The text material comes from the following sources:

	* 7153 sentences from the cdbl (Newspaper) part of the Eindhoven Corpus
	* 425 sentences from the Spoken Dutch Corpus (CGN) annotation guidelines
	* 450 questions from CLEF 2003
	* 700 questions from CLEF 2004
	* 200 questions from CLEF 2005
	* 500+ sentences from the EANS
	* 1000 sentences constructed during the development of 
	  the Alpino Grammar and Lexicon
	* 350+ sentences also constructed during the development of 
	  the Alpino Grammar and Lexicon
	* 330 sentences from the CGN Leuven Yellow Pages document
	* Set of 18 sentences used in the Battle of the Parsers during 
	  the 2001 LOT Winterschool
	* 1000 quiz questions

    3.3 Part-of-Speech tags 

        The original POS tags from the Alpino Treebank were replaced by POS 
	tags assigned by the memory-based part-of-speech tagger, which uses 
	the WOTAN tagset described in the file tagset.txt.

    3.5 Dependency relations

        The syntactic annotation is mostly identical to that of the Corpus
        Gesproken Nederlands (CGN, Spoken Dutch Corpus) as described in the
        file syn_prot.pdf (Dutch only). The file diff.pdf describes a number
        of differences between the CGN and Alpino annotation practices. It
        is badly out of date, and the number of differences has since been
        reduced considerably.

	----------------------------------------------------------------------
	Relation:	Description:
	----------------------------------------------------------------------
	ROOT
	app
	body
	cnj
	crd
	det
	hd
	hdf
	ld
	me
	mod
	obcomp
	obj1
	obj2
	pc
	pobj1
	predc
	predm
	punct
	sat
	se
	su
	sup
	svp
	vc
	----------------------------------------------------------------------
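
	As an illustration of how HEAD and DEPREL relate, the sketch
	below checks one sentence against the data format description
	and the relation set listed above. The function name
	check_sentence and the (ID, HEAD, DEPREL) tuple layout are
	assumptions made for this example; they are not part of the
	shared task tools.

```python
from typing import List, Tuple

# The dependency relations listed in the table above.
DEPRELS = {
    "ROOT", "app", "body", "cnj", "crd", "det", "hd", "hdf", "ld",
    "me", "mod", "obcomp", "obj1", "obj2", "pc", "pobj1", "predc",
    "predm", "punct", "sat", "se", "su", "sup", "svp", "vc",
}


def check_sentence(tokens: List[Tuple[int, int, str]]) -> List[str]:
    """Return a list of problems; an empty list means none were found.

    Each token is an (ID, HEAD, DEPREL) tuple.
    """
    problems: List[str] = []
    ids = {tid for tid, _, _ in tokens}
    for tid, head, deprel in tokens:
        # HEAD is either zero or the ID of a token in the same sentence.
        if head != 0 and head not in ids:
            problems.append(f"token {tid}: HEAD {head} is not a token ID")
        # DEPREL is 'ROOT' when HEAD is zero.
        if head == 0 and deprel != "ROOT":
            problems.append(f"token {tid}: HEAD is 0 but DEPREL is {deprel!r}")
        if deprel not in DEPRELS:
            problems.append(f"token {tid}: unknown DEPREL {deprel!r}")
    return problems


# Hypothetical input: token 1 depends on token 2, which is the root.
print(check_sentence([(1, 2, "su"), (2, 0, "ROOT")]))
```

	The check only covers what this file states explicitly. It does
	not test, for instance, whether the resulting graph is a tree.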

    3.6 Conversion

	Issues:
	- head selection
	- multi-word units
	- discourse units


4. Acknowledgements

    	Gertjan van Noord and all the other people at the University
	of Groningen for creating the Alpino Treebank and releasing it
	for free.

	Gertjan van Noord for answering all my questions and for
	providing extra test material.

    	Antal van den Bosch for help with the memory-based tagger.