========================================================================

   Thirteenth Conference on Computational Natural Language Learning
   (CoNLL 2009) Shared Task Distribution -- Official Release
   http://ufal.mff.cuni.cz/conll2009-st/

   Version 1.0: January 11, 2009

   Organizers of this corpus:
      Richard Johansson
      Adam Meyers
      Mihai Surdeanu
      Lluis Marquez
      Joakim Nivre

========================================================================

WARNING

The data in this distribution uses portions of the Penn Treebank II
collection. For participants who do not own a valid license for the
Penn Treebank II collection, LDC is providing an "evaluation license",
valid during competition time, which allows the free download and use
of this CoNLL-2009 shared task dataset. See the shared task website for
details.

(1) GENERAL

This distribution includes the data set for the English language, part
of the CoNLL 2009 shared task. The corpus includes syntactic
dependencies (from the Penn Treebank [TB]) and semantic dependencies
(from PropBank [PB] and NomBank [NB]).

(2) LIST OF CHANGES

(3) CONTENT OF THIS DISTRIBUTION

The following files are included in this distribution:

* README.TXT - this file

* CoNLL2009-ST-English-trial.txt - the trial corpus. It contains the
  first 506 lines of the development file.

* CoNLL2009-ST-English-train.txt - the training corpus. It matches
  sections 2 through 21 of Treebank. Note that a few sentences from the
  original Treebank were removed due to merging problems between the
  input corpora.

* CoNLL2009-ST-English-development.txt - the development corpus. It
  matches section 24 of Treebank.

* pb_frames.tar.gz - English verbal lexicon. Contains the set of
  accepted argument frames for each verbal predicate in the training
  and development corpora.

* nb_frames.tar.gz - English nominal lexicon. Contains the set of
  accepted argument frames for each nominal predicate in the training
  and development corpora.

(4) SPECIFICS OF THE ENGLISH DATA SET

The special features of this corpus are:

* Dependency trees are not always projective, although the vast
  majority of trees are projective.

* Both verbal predicates (from PropBank) and nominal predicates (from
  NomBank) are annotated.

* The same word can be an argument to multiple predicates.

For more details please see Section 3 of (Surdeanu et al 2008).

(5) PREPROCESSING SYSTEMS

The input annotations provided for both closed and open challenges are
generated using the following state-of-the-art systems:

*) The predicted Part-of-Speech (PoS) tags (i.e., the PPOS column) are
   generated using the PoS tagger of (Gimenez and Marquez 2004).

*) The lemmas (LEMMA and PLEMMA columns) are extracted from WordNet [WN]
   using the most common sense for the corresponding tag. The
   difference between LEMMA and PLEMMA is that LEMMA is generated using
   the POS tag from the POS column, whereas PLEMMA is generated using
   the PPOS column.

*) Columns PHEAD and PDEPREL are generated using the MALT parser
   (Nivre et al 2006).

(6) DIFFERENCES FROM 2008

Most of the data in this corpus is extracted from the CoNLL 2008 data
set. The columns FORM, PLEMMA, PPOS, HEAD, DEPREL, PRED, and APRED are
copied from the closed-challenge data set of 2008 (note that this year
we discard the original Treebank tokenization and use only the SPLIT
tokenization). The PHEAD and PDEPREL columns are copied from last
year's open-challenge data set, i.e., they are the output of the MALT
parser.

The POS column contains the gold POS tags from NomBank. Note that we
use NomBank rather than Treebank POS data because NomBank fixes some
annotation errors in Treebank. Additionally, for the split cases where
Treebank annotation is not available, we set the POS tags using rules
based on the immediate syntactic environment and lexical/morphological
factors.

The LEMMA column is generated as the lemma of the most common WN sense
for the gold POS tag (the POS column).

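As an illustration of the columns named in sections (5) and (6) and of
the projectivity remark in section (4), the following is a minimal
sketch of a reader for the corpus files. It is not part of the official
distribution. The column order, the ID and FILLPRED columns, the use of
"_" for empty values, and blank lines as sentence separators are
assumptions taken from the shared task format description rather than
from this file. The function names and the two-token sample sentence
are hypothetical and are not drawn from the corpus.

```python
from typing import Any, Dict, Iterable, Iterator, List

# Assumed column order (from the shared task format description, not
# from this README). Any fields after PRED are treated as APRED columns.
COLUMNS = ["ID", "FORM", "LEMMA", "PLEMMA", "POS", "PPOS", "FEAT", "PFEAT",
           "HEAD", "PHEAD", "DEPREL", "PDEPREL", "FILLPRED", "PRED"]

# Invented two-token sentence, for illustration only.
SAMPLE = """\
1 Mary mary mary NNP NNP _ _ 2 2 SBJ SBJ _ _ A0
2 sleeps sleep sleep VBZ VBZ _ _ 0 0 ROOT ROOT Y sleep.01 _

"""


def read_sentences(lines: Iterable[str]) -> Iterator[List[Dict[str, Any]]]:
    """Yield each sentence as a list of token dicts keyed by column name."""
    sentence: List[Dict[str, Any]] = []
    for raw in lines:
        fields = raw.split()  # assumes no field contains whitespace
        if not fields:  # a blank line closes the current sentence
            if sentence:
                yield sentence
                sentence = []
            continue
        token: Dict[str, Any] = dict(zip(COLUMNS, fields))
        token["APREDS"] = fields[len(COLUMNS):]  # one entry per predicate
        sentence.append(token)
    if sentence:  # the file may not end with a blank line
        yield sentence


def is_projective(heads: List[int]) -> bool:
    """Return True if no two dependency arcs cross.

    heads[i] is the head of token i + 1; 0 is the artificial root.
    """
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a, b in arcs:
        for c, d in arcs:
            if a < c < b < d:  # the two arcs overlap without nesting
                return False
    return True


if __name__ == "__main__":
    for sent in read_sentences(SAMPLE.splitlines()):
        gold_heads = [int(tok["HEAD"]) for tok in sent]
        print([tok["FORM"] for tok in sent], is_projective(gold_heads))
```

To run the same check on predicted rather than gold trees, read the
PHEAD column instead of HEAD. The quadratic arc comparison is chosen for
clarity and should be adequate for sentence-length inputs.
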
The FEAT and PFEAT columns are empty for English.

REFERENCES

(Ciaramita and Altun 2006)
   M. Ciaramita and Y. Altun
   "Broad Coverage Sense Disambiguation and Information Extraction with
   a Supersense Sequence Tagger"
   Proc. of EMNLP, 2006

(Gimenez and Marquez 2004)
   Gimenez J. and Marquez L.
   "SVMTool: A general POS tagger generator based on Support Vector
   Machines"
   Proc. of LREC, 2004

(Nivre et al 2006)
   Nivre J., Hall J., Nilsson J. and Eryigit G.
   "Labeled Pseudo-Projective Dependency Parsing with Support Vector
   Machines"
   Proc. of the CoNLL-X Shared Task, 2006

(Surdeanu et al 2008)
   Surdeanu M., Johansson R., Meyers A., Marquez L. and Nivre J.
   "The CoNLL-2008 Shared Task on Joint Parsing of Syntactic and
   Semantic Dependencies"
   Proc. of CoNLL, 2008

(Tjong Kim Sang and De Meulder 2003)
   Erik F. Tjong Kim Sang and Fien De Meulder
   "Introduction to the CoNLL-2003 Shared Task: Language-Independent
   Named Entity Recognition"
   Proc. of CoNLL-2003, 2003

[PB]  PropBank Project:
      http://verbs.colorado.edu/~mpalmer/projects/ace.html
[NB]  NomBank Project:
      http://nlp.cs.nyu.edu/meyers/NomBank.html
[TB]  Penn Treebank II Project:
      http://www.cis.upenn.edu/~treebank
[BBN] Pronoun coreference and entity type corpus:
      LDC catalog number LDC2005T33
[WN]  WordNet:
      http://wordnet.princeton.edu/

ACKNOWLEDGMENTS

The organizers thank Massimiliano Ciaramita for the help with his
semantic tagger and Jesus Gimenez for PoS tagging the corpus.