==============================================================
Catalan corpus for the CoNLL-2009 shared task
"Syntactic and Semantic Dependencies in Multiple Languages"

Version 2.1: January 23, 2009
==============================================================

This file contains the basic information regarding the Catalan corpus
provided for the CoNLL-2009 shared task on "Syntactic and Semantic
Dependencies in Multiple Languages".

The current version (2.1, January 23, 2009) corresponds to the release
of the training data sets. All changes and updates to these data sets
are reported in Section 1 of this document.


(1) LIST OF VERSIONS

v2.1 [2009/01/23]: incorporates bug fixes to the verbal lexicons.
  Changes from version 2.0 are:

  - Several character coding problems have been fixed in the verbal
    lexicon. First, the file names have been simplified and reduced
    to ASCII characters. Non-ASCII characters have been converted
    into similar ASCII characters (e.g., by removing accents), so
    that file names resemble the verbal lemma as much as possible.
    The content of the entry files is now fully UTF-8 encoded. The
    verb lemma included in the first line of each entry file is the
    string to be found in the "LEMMA" columns of the corpora.

  - A file containing the mapping between participle verbs and their
    infinitive form is provided in order to facilitate the match
    between lemmas encountered in the corpora and the verbal lexicon
    (see the extended explanation below).

v2.0 [2009/01/19]: initial distribution of the TRAINING data sets.
  The following changes have been made with respect to distribution 1.1:

  - Training and development sets have been added

  - New versions of the UTF-8-coded verbal lexicons have been provided

  - The LSS tags for special cases (".0" and "._") have been
    eliminated. In the first case, the annotation of the
    adjectives/past participles has been completed in the corpus
    with respect to LSS tags. The second case corresponded to errors.
    The annotation of those "predicates" has been completely
    eliminated. The "tagsets.pdf" file has been updated accordingly.

v1.1 [2009/01/09]: several updates have been made to distribution 1.0:

  - Erroneous '\t' characters have been eliminated from the trial
    data set "CoNLL2009-ST-Catalan-trial.txt"

  - The description of the verbal lexicon has been extended in the
    README.TXT file (section 2), including details on the difference
    between senses and semantic classes.

  - Verbal lexicon files have been corrected in order to convert LSS
    tags into the format described in the tagset document
    "tagsets.pdf" (e.g., "1.1" => "a1", "2.2" => "b2", etc.)

v1.0 [2009/01/05]: initial distribution of the TRIAL data sets


(2) CONTENTS OF THE DISTRIBUTION 2.1

We are providing the following documents:

* README.TXT
    this file

* datasets/CoNLL2009-ST-Catalan-train.txt
    training data set for Catalan; 13,200 sentences

* datasets/CoNLL2009-ST-Catalan-development.txt
    development data set for Catalan; 1,724 sentences

* datasets/CoNLL2009-ST-Catalan-trial.txt
    trial data set for Catalan; contains the first 50 sentences of
    datasets/CoNLL2009-ST-Catalan-development.txt. Included only for
    completeness with respect to previous distributions.

* documentation/tagsets.pdf
    PDF document describing the tagsets of all levels of linguistic
    annotation: PoS tags and additional features, syntactic
    dependencies (syntactic functions), semantic dependencies
    (arguments and thematic roles) and predicate semantic classes
    (Lexical Semantic Structure, LSS). Tag sets are shared by the
    two languages.

* documentation/verbal-lexicon.ca
    Catalan verbal lexicon. This lexicon contains, for each verbal
    predicate in the corpus, the mapping from syntactic functions to
    thematic roles and the corresponding semantic class (LSS, ELS in
    Catalan).
    In the lexicon, each verbal predicate may be divided into
    different numbered senses (01, 02, 03, ...), where each sense is
    related to one or more semantic classes. These classes are
    basically differentiated according to the four event classes
    (accomplishments (a), achievements (b), states (c) and
    activities (d)) and to the diathesis alternations in which a
    sense can occur. The included "EXAMPLE.pdf" file shows an example
    of a verbal entry in the lexicon.

    The file "mapping-participles2infinitives.ca.txt" contains a list
    of equivalent pairs to facilitate the matching between verb
    participles and the infinitive forms from the LEMMA column of the
    corpus (e.g., "acompanyat" => "acompanyar").

    More information on the verbal lexicons can be obtained at the
    AnCora website: http://clic.ub.edu/ancora


(3) ON THE CATALAN AND SPANISH DATA SETS

The Catalan and Spanish corpora for the CoNLL-2009 shared task are
compliant with the standard formatting described in the shared task
web site (http://ufal.mff.cuni.cz/conll2009-st/).

The sizes of the corpora will be:

  Catalan: 496,672 lexical tokens
     training:     390,302
     development:   53,015
     test:          53,355

  Spanish: 528,440 lexical tokens
     training:     427,442
     development:   50,368
     test:          50,630

The special features of these corpora are:

* Dependency trees are projective

* Only verbal predicates are annotated (with exceptional cases
  referring to words that can be both adjectives and past participles)

* No word can be the argument of more than one predicate in a sentence

* Semantic dependency labels are composed of a numeric argument plus
  a thematic role label (see tagsets.pdf for details)

* Predicate senses correspond to a Lexical Semantic Structure label
  (see tagsets.pdf for details)

* The corpus is segmented so that multi-words, named entities,
  temporal expressions, compounds, etc.
  are grouped together

* Segmentation also accounts for elliptical pronouns (they are marked
  as empty lexical tokens "_" with a pronoun POS tag)

The following tools have been used to generate the Predicted (P-)
columns:

* PLEMMA, PPOS and PFEAT are generated with the FreeLing open-source
  suite of language analyzers (http://www.lsi.upc.es/~nlp/freeling/).
  The accuracy in the PLEMMA and PPOS columns is around 95%.
  Thanks to Lluís Padró (UPC) for helping with the annotation of the
  morphosyntactic information.

* PHEAD and PDEPREL are generated using MaltParser
  (http://w3.msi.vxu.se/~jha/maltparser/). Parsing accuracy (LAS) is
  around 86.5%.
  Thanks to Xavier Lluís (UPC) for helping with the annotation of
  this part.

Sources of the Catalan and Spanish data sets:

The Catalan and Spanish data sets are extracted from the AnCora
corpora (see http://clic.ub.edu/ancora). AnCora-ES (the Spanish part)
contains 75,000 words from the Lexesp Spanish balanced 6-million-word
corpus, 225,000 words from the EFE Spanish news agency, and 200,000
words from the Spanish version of the `El Periódico' newspaper.
AnCora-CA (the Catalan part) consists of 75,000 words from the EFE
news agency, 225,000 words from the ACN Catalan news agency, and
200,000 words from the Catalan version of the `El Periódico'
newspaper. The subset of 200,000 words coming from `El Periódico'
corresponds to the same news in Catalan and Spanish, spanning from
January to December 2000.


(4) ORGANIZATION

  Lluís Màrquez
  Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
  lluism@lsi.upc.edu
  http://www.lsi.upc.edu/~lluism

  Ma. Antònia Martí
  Universitat de Barcelona (UB), Barcelona, Spain
  amarti@ub.edu
  http://clic.ub.edu

Other people behind the preparation of the corpora:

  Mariona Taulé, CLiC, UB
  Manuel Bertran, CLiC, UB
  Oriol Borrega, CLiC, UB
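

APPENDIX: EXAMPLE OF A SANITY CHECK ON THE DATA SETS

The properties listed in Section 3 (sentence and token counts,
elliptical pronouns marked as "_", projective trees) can be checked
with a few lines of code. The following Python sketch is not part of
the official distribution. It assumes the column order published on
the shared task web site (ID, FORM, LEMMA, PLEMMA, POS, PPOS, FEAT,
PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED, PRED, APREDs),
tab-separated columns, blank lines between sentences, and "_" in the
FORM column for elliptical pronouns. The function names are
hypothetical, and whether the resulting token count matches the
figures of Section 3 depends on how empty tokens were counted there.

```python
from typing import Dict, Iterable, Iterator, List

# Assumed 0-based column positions (CoNLL-2009 layout from the task web site).
FORM, HEAD, FILLPRED = 1, 8, 12


def read_sentences(lines: Iterable[str]) -> Iterator[List[List[str]]]:
    """Yield each sentence as a list of rows, each row a list of columns."""
    sentence: List[List[str]] = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():          # a blank line closes the sentence
            if sentence:
                yield sentence
                sentence = []
        else:
            sentence.append(line.split("\t"))
    if sentence:                      # file may not end with a blank line
        yield sentence


def is_projective(heads: List[int]) -> bool:
    """Return True if no two dependency arcs cross.

    heads[i] is the head of token i+1; 0 is the artificial root.
    Arcs from the root are included in the test; drop them if only
    crossings between real tokens are of interest.
    """
    arcs = [(min(dep, head), max(dep, head))
            for dep, head in enumerate(heads, start=1)]
    for left1, right1 in arcs:
        for left2, right2 in arcs:
            if left1 < left2 < right1 < right2:
                return False
    return True


def summarize(lines: Iterable[str]) -> Dict[str, int]:
    """Count sentences, tokens, empty tokens, predicates and
    non-projective sentences in CoNLL-2009 formatted input."""
    stats = {"sentences": 0, "tokens": 0, "empty_tokens": 0,
             "predicates": 0, "non_projective": 0}
    for sentence in read_sentences(lines):
        stats["sentences"] += 1
        stats["tokens"] += len(sentence)
        stats["empty_tokens"] += sum(1 for row in sentence
                                     if row[FORM] == "_")
        stats["predicates"] += sum(1 for row in sentence
                                   if len(row) > FILLPRED
                                   and row[FILLPRED] == "Y")
        if not is_projective([int(row[HEAD]) for row in sentence]):
            stats["non_projective"] += 1
    return stats


# Usage (path taken from Section 2 of this file):
#   with open("datasets/CoNLL2009-ST-Catalan-train.txt",
#             encoding="utf-8") as handle:
#       print(summarize(handle))
```

The quadratic arc comparison is chosen for readability; it is
adequate for sentence-length inputs but is not the fastest possible
projectivity test.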