========================================================================

Twelfth Conference on Computational Natural Language Learning (CoNLL 2008)
            Shared Task Distribution -- Official Release
                   http://www.yr-bcn.es/conll2008/

Created February 28, 2008

Organizers:
Mihai Surdeanu
Richard Johansson
Adam Meyers
Lluis Marquez
Joakim Nivre

========================================================================

WARNING

The data in this distribution uses portions of the Penn Treebank II
collection. For participants who do not hold a valid license for the
Penn Treebank II collection, LDC is providing an "evaluation license",
valid for the duration of the competition, which allows free download
and use of the CoNLL-2008 shared task datasets. See the shared task
website for details.


GENERAL

This is the 20080228 release of the shared task corpus. 
This release is intended to be stable, but minor changes and updates
may follow if errors are found. Please inform the organizers as soon as
possible if you notice anything wrong with the datasets.

The 2008 CoNLL shared task focuses on the identification of syntactic 
dependencies (from the Penn Treebank [TB]) and semantic dependencies 
(from PropBank [PB] and NomBank [NB]). This year's shared task is
monolingual: only English is covered. The syntactic dependencies follow
the format and description of the previous shared tasks (with some
notable exceptions; see the website for details). To identify semantic
dependencies, systems must first identify the semantic predicates in
each sentence. For each target predicate, all corresponding roles must
then be identified.
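
As an illustration of the predicate/role structure, the sketch below
collects the roles of each predicate from already-split token rows. It
assumes the column layout documented on the shared task format page
(PRED in column 11, then one argument column per predicate, in textual
order of the predicates); the constants, the function name, and the
sample sentence are made up for this example and are not part of the
distribution. It applies only to data that carries these columns, not
to the test corpora.

```python
from typing import Dict, List, Tuple

# Assumed 0-based positions; check them against the format page.
PRED_COL = 10        # column 11: predicate label, "_" if the token is not a predicate
FIRST_ARG_COL = 11   # column 12: arguments of the first predicate in the sentence

Predicate = Tuple[int, str]   # (token id, predicate label)
Argument = Tuple[int, str]    # (token id, role label)


def predicate_arguments(sentence: List[List[str]]) -> Dict[Predicate, List[Argument]]:
    """Map each predicate of a sentence to the tokens that fill its roles."""
    # Predicates in textual order: rows whose PRED column is not "_".
    predicates = [
        (int(row[0]), row[PRED_COL]) for row in sentence if row[PRED_COL] != "_"
    ]
    result: Dict[Predicate, List[Argument]] = {}
    for k, predicate in enumerate(predicates):
        arg_col = FIRST_ARG_COL + k  # the k-th predicate owns the k-th argument column
        result[predicate] = [
            (int(row[0]), row[arg_col]) for row in sentence if row[arg_col] != "_"
        ]
    return result


# Made-up example sentence with a single predicate.
SAMPLE = [
    "1 John john NNP NNP John john NNP 2 SBJ _ A0".split(),
    "2 loves love VBZ VBZ loves love VBZ 0 ROOT love.01 _".split(),
    "3 Mary mary NNP NNP Mary mary NNP 2 OBJ _ A1".split(),
]

if __name__ == "__main__":
    print(predicate_arguments(SAMPLE))
    # {(2, 'love.01'): [(1, 'A0'), (3, 'A1')]}
```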

Please consult the shared task webpage for a detailed description of
the task, instructions on how to participate, calendar, and updates of 
the task and data.


DIRECTORY STRUCTURE

The following directories are included in this distribution:
    * trial/      : Contains the trial corpus
    * train/      : The complete training corpus (covers Sections 02-21 
                    of TreeBank)
    * devel/      : Development corpus (Section 24 of TreeBank)
    * test.wsj/   : In-domain test corpus (Section 23 of TreeBank)
    * test.brown/ : Out-of-domain test corpus (Sections ck01, ck02, and 
                    ck03 of the Brown corpus)

Each data directory contains two files, distinguished by extension:
    * .closed : Contains the data relevant for the closed challenge.
    * .open   : Contains the additional data for the open challenge.


DATA FORMAT 

The format of the file for the closed challenge is detailed on the
shared task website:
http://www.yr-bcn.es/dokuwiki/doku.php?id=conll2008:format
Note that in the test corpora the GPOS column is filled with "_" and no
syntactic or semantic dependency information (columns 9+) is provided.
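
A minimal reading sketch for the closed-challenge file follows. It
assumes one token per line, whitespace-separated columns, and a blank
line between sentences, as in previous CoNLL shared tasks. The column
names are taken from the format page and should be verified there; the
function names are hypothetical. Rows from the test corpora stop before
column 9, so the lookup returns None for missing columns.

```python
from typing import Iterable, Iterator, List, Optional

# Assumed column names for the closed-challenge file (columns 1 to 11).
# Any further columns are per-predicate argument columns.
CLOSED_COLUMNS = [
    "ID", "FORM", "LEMMA", "GPOS", "PPOS",
    "SPLIT_FORM", "SPLIT_LEMMA", "PPOSS",
    "HEAD", "DEPREL", "PRED",
]


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of token rows (lists of column values)."""
    sentence: List[List[str]] = []
    for line in lines:
        fields = line.split()
        if fields:
            sentence.append(fields)
        elif sentence:
            # A blank line closes the current sentence.
            yield sentence
            sentence = []
    if sentence:
        # The last sentence may not be followed by a blank line.
        yield sentence


def column(sentence: List[List[str]], name: str) -> List[Optional[str]]:
    """Return one named column of a sentence, with None where a row lacks it."""
    index = CLOSED_COLUMNS.index(name)
    return [row[index] if index < len(row) else None for row in sentence]
```

Any iterable of lines works as input, so an open file object can be
passed directly to read_sentences.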

The additional data provided for the open challenge (e.g., trial.open)
follows the same column-based format as the data for the closed challenge.
For the open challenge, five additional columns are provided:

1. Named entity (NE) labels using the tag set from the CoNLL-2003 shared
   task (Tjong Kim Sang and De Meulder 2003).
2. NE labels using the tag set from the BBN Wall Street Journal Entity
   Corpus [BBN].
3. WordNet [WN] super senses (Ciaramita and Altun 2006).
4 and 5. Syntactic dependencies generated by the MALT parser 
         (Nivre et al 2006).
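
The sketch below attaches the five open-challenge columns to the
corresponding closed-challenge token rows. It assumes that the .open
file holds exactly these five columns per token and is aligned line by
line with the .closed file; this layout is inferred from the
description above, not confirmed by it. The field names, including the
reading of columns 4 and 5 as head and dependency label, are
hypothetical.

```python
from typing import Dict, List

# Hypothetical names for the five open-challenge columns, in the listed order.
OPEN_FIELDS = ["NE_CONLL03", "NE_BBN", "WN_SUPERSENSE", "MALT_HEAD", "MALT_DEPREL"]


def merge_open(
    closed_sentence: List[List[str]],
    open_sentence: List[List[str]],
) -> List[Dict[str, object]]:
    """Pair each closed-challenge token row with its open-challenge annotations."""
    if len(closed_sentence) != len(open_sentence):
        raise ValueError("closed and open sentences differ in token count")
    merged: List[Dict[str, object]] = []
    for closed_row, open_row in zip(closed_sentence, open_sentence):
        if len(open_row) != len(OPEN_FIELDS):
            raise ValueError(
                "expected %d open columns, got %d" % (len(OPEN_FIELDS), len(open_row))
            )
        token: Dict[str, object] = {"closed": closed_row}
        token.update(zip(OPEN_FIELDS, open_row))
        merged.append(token)
    return merged
```

Failing loudly on a length mismatch is a deliberate choice here: a
silent misalignment between the two files would attach annotations to
the wrong tokens.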


PREPROCESSING SYSTEMS

The input annotations provided for both closed and open challenges are
generated using the following state-of-the-art systems:

*) The predicted Part-of-Speech (PoS) tags (i.e., the PPOS and PPOSS 
   columns in the closed-challenge file) are generated using the PoS 
   tagger of (Gimenez and Marquez 2004). 

*) The lemmas (LEMMA and SPLIT_LEMMA columns) are extracted from WordNet
   using the most common sense for the corresponding predicted PoS tag.

*) Columns 1 to 3 in the open-challenge file are generated using the 
   semantic tagger of (Ciaramita and Altun 2006).

*) Columns 4 and 5 in the open-challenge file are generated using the
   MALT parser (Nivre et al 2006).


REFERENCES

(Ciaramita and Altun 2006)
    M. Ciaramita and Y. Altun
    "Broad Coverage Sense Disambiguation and Information Extraction
        with a Supersense Sequence Tagger"
    Proc. of EMNLP, 2006

(Gimenez and Marquez 2004)
    J. Gimenez and L. Marquez
    "SVMTool: A general POS tagger generator based on 
        Support Vector Machines" 
    Proc. of LREC, 2004

(Nivre et al 2006)
    J. Nivre, J. Hall, J. Nilsson and G. Eryigit
    "Labeled Pseudo-Projective Dependency Parsing with 
        Support Vector Machines"
    Proc. of the CoNLL-X Shared Task, 2006

(Tjong Kim Sang and De Meulder 2003)
    E. F. Tjong Kim Sang and F. De Meulder
    "Introduction to the CoNLL-2003 Shared Task: 
        Language-Independent Named Entity Recognition"
    Proc. of CoNLL-2003, 2003

[PB] PropBank Project: http://verbs.colorado.edu/~mpalmer/projects/ace.html

[NB] NomBank Project: http://nlp.cs.nyu.edu/meyers/NomBank.html

[TB] Penn TreeBank II Project: http://www.cis.upenn.edu/~treebank

[BBN] Pronoun coreference and entity type corpus:
      LDC catalog number LDC2005T33

[WN] WordNet: http://wordnet.princeton.edu/


ACKNOWLEDGMENTS

The organizers thank Massimiliano Ciaramita for help with his semantic
tagger and Jesus Gimenez for PoS tagging the corpus.