==============================================================
Catalan corpus for the CoNLL-2009 shared task
"Syntactic and Semantic Dependencies in Multiple Languages"
Version 2.1: January 23, 2009
==============================================================

This file contains the basic information regarding the Catalan corpus
provided for the CoNLL-2009 shared task on "Syntactic and Semantic
Dependencies in Multiple Languages". The current version (2.1, January
23, 2009) corresponds to the release of the training data sets. All
changes and updates to these data sets are reported in Section 1 of
this document.

(1) LIST OF VERSIONS

  v2.1 [2009/01/23]: incorporates bug fixes to the verbal lexicons.
      Changes from version 2.0 are:

    - Several character encoding problems have been fixed in the verbal
      lexicon. First, the file names have been simplified and reduced
      to ASCII characters. Non-ASCII characters have been converted
      into similar ASCII characters (e.g., by removing accents), so
      that file names resemble the verbal lemma as much as
      possible. The content of the entry files is now fully
      UTF-8 encoded. The verb lemma included in the first line of
      each entry file is the string to be found in the "LEMMA"
      columns of the corpora.
    - A file containing the mapping between participle verbs and their
      infinitive form is provided in order to facilitate the match
      between lemmas encountered in the corpora and the verbal lexicon
      (see the extended explanation below).

  v2.0 [2009/01/19]: initial distribution of the TRAINING data sets. 
      The following changes are observed from distribution 1.1:
    - Training and development sets have been added
    - New versions of the UTF-8-encoded verbal lexicons have been provided
    - The LSS tags for special cases (".0" and "._") have been
      eliminated. In the first case, the annotation of the
      adjectives/past-participles has been completed in the corpus
      with respect to LSS tags. The second case corresponded to
      errors. The annotation of those "predicates" has been completely
      eliminated. The "tagsets.pdf" file has been updated accordingly.

  v1.1 [2009/01/09]: several updates have been made to distribution 1.0:
    - Erroneous '\t' characters have been eliminated from the
      trial data set "CoNLL2009-ST-Catalan-trial.txt" 
    - The description of the verbal lexicon has been extended in the
      README.TXT file (section 2), including details on the difference
      between senses and semantic classes.  
    - Verbal lexicon files have been corrected in order to convert LSS
      tags into the format described in the tagset document
      "tagsets.pdf" (e.g., "1.1" => "a1", "2.2" => "b2", etc.)

  v1.0 [2009/01/05]: initial distribution of the TRIAL data sets


(2) CONTENTS OF DISTRIBUTION 2.1

We are providing the following documents:

* README.TXT 
  this file

* datasets/CoNLL2009-ST-Catalan-train.txt
  training data set for Catalan; 13,200 sentences
  
* datasets/CoNLL2009-ST-Catalan-development.txt
  development data set for Catalan; 1,724 sentences

* datasets/CoNLL2009-ST-Catalan-trial.txt
  trial data set for Catalan; contains the first 50 sentences of
  datasets/CoNLL2009-ST-Catalan-development.txt. Included just for
  completeness with respect to previous distributions.

* documentation/tagsets.pdf 
  PDF document describing the tagsets of all levels of linguistic
  annotation: PoS tags and additional features, syntactic dependencies
  (syntactic functions), semantic dependencies (arguments and thematic
  roles) and predicate semantic classes (Lexical Semantic
  Structure, LSS). The tagsets are shared by the two languages
  (Catalan and Spanish).

* documentation/verbal-lexicon.ca
  Catalan verbal lexicon. This lexicon contains, for each verbal
  predicate in the corpus, the mapping from syntactic functions to
  thematic roles and the corresponding semantic class (LSS, ELS in
  Catalan). In the lexicon, each verbal predicate may be divided into
  different numbered senses (01, 02, 03, ...), where each sense is
  related to one or more semantic classes. These classes are
  distinguished mainly by two criteria: the four event classes, namely
  accomplishments (a), achievements (b), states (c) and activities (d),
  and the diathesis alternations in which a sense can occur. The
  included "EXAMPLE.pdf" file shows an example of a verbal entry in the
  lexicon. The
  file "mapping-participles2infinitives.ca.txt" contains a list of
  equivalent pairs to facilitate the matching between verb participles
  and the infinitive forms from the LEMMA column of the corpus (e.g.,
  "acompanyat" => "acompanyar"). More information on the verbal
  lexicons can be obtained at the AnCora website:
  http://clic.ub.edu/ancora
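
Matching a LEMMA value from the corpus against the verbal lexicon
involves the two steps described above: mapping a participle to its
infinitive, then reducing the lemma to ASCII to obtain the entry file
name. The Python sketch below illustrates these steps; it is not part
of the distribution. It assumes that the mapping file holds one
whitespace-separated "participle infinitive" pair per line (the exact
format is not specified in this README) and that dropping accents is
the only conversion needed; any remaining non-ASCII character is simply
discarded, which may not match the actual file names. All function
names are hypothetical.

```python
import unicodedata


def ascii_file_stem(lemma):
    """Strip accents from a lemma, e.g. 'avançar' -> 'avancar'."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFKD", lemma)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # ASSUMPTION: whatever is still non-ASCII is dropped.
    return no_marks.encode("ascii", "ignore").decode("ascii")


def load_participle_map(lines):
    """Build {participle: infinitive} from an iterable of text lines.

    ASSUMPTION: each line holds two whitespace-separated fields.
    """
    mapping = {}
    for line in lines:
        fields = line.split()
        if len(fields) >= 2:  # skip blank or malformed lines
            mapping[fields[0]] = fields[1]
    return mapping


def lexicon_key(lemma, participle_map):
    """Return the ASCII stem under which a LEMMA value would be looked up."""
    infinitive = participle_map.get(lemma, lemma)
    return ascii_file_stem(infinitive)


if __name__ == "__main__":
    # The pair below is the example given in this README.
    pmap = load_participle_map(["acompanyat acompanyar"])
    print(lexicon_key("acompanyat", pmap))
```

In practice load_participle_map would be given the open file
"mapping-participles2infinitives.ca.txt", read as UTF-8.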


(3) ON THE CATALAN AND SPANISH DATA SETS

The Catalan and Spanish corpora for the CoNLL-2009 shared task are
compliant with the standard formatting described in the shared task
web site (http://ufal.mff.cuni.cz/conll2009-st/). The sizes of the
corpora will be:

   Catalan: 496,672 lexical tokens
      training: 390,302 
      development: 53,015
      test: 53,355

   Spanish: 528,440 lexical tokens
      training: 427,442 
      development: 50,368
      test: 50,630

The special features of these corpora are: 

* Dependency trees are projective
* Only verbal predicates are annotated (the exceptions are words that
  can be either adjectives or past participles)
* No word can be the argument of more than one predicate in a sentence
* Semantic dependency labels are composed of a numeric argument plus a
  thematic role label (see tagsets.pdf for details)
* Predicate senses correspond to a Lexical Semantic Structure label
  (see tagsets.pdf for details)
* The corpus is segmented so that multi-words, named entities, temporal
  expressions, compounds, etc. are grouped together
* Segmentation also accounts for elliptical pronouns (these are marked
  as empty lexical tokens "_" with a pronoun POS tag)
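
Two of the properties listed above (projective trees, and no word
filling an argument slot for more than one predicate) can be checked
mechanically. The Python sketch below is a minimal illustration and is
not part of the distribution. It assumes the column layout described
on the shared task web site: tab-separated columns ID, FORM, LEMMA,
PLEMMA, POS, PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, FILLPRED,
PRED, followed by one APRED column per predicate, with sentences
separated by blank lines. All function names are hypothetical, and the
toy rows contain invented values.

```python
HEAD = 8          # 0-based index of the gold HEAD column (assumed layout)
FIRST_APRED = 14  # 0-based index of the first APRED column (assumed layout)


def read_sentences(lines):
    """Yield each sentence as a list of rows (lists of column strings)."""
    rows = []
    for line in lines:
        line = line.rstrip("\n")
        if line.strip():
            rows.append(line.split("\t"))
        elif rows:
            yield rows
            rows = []
    if rows:
        yield rows


def is_projective(heads):
    """heads[i] is the head of token i+1; 0 is the artificial root."""
    arcs = [(min(h, d), max(h, d)) for d, h in enumerate(heads, start=1)]
    for a_lo, a_hi in arcs:
        for b_lo, b_hi in arcs:
            # Arcs cross when b starts strictly inside a and ends beyond it.
            # Looping over ordered pairs also covers the mirrored case.
            if a_lo < b_lo < a_hi < b_hi:
                return False
    return True


def multi_predicate_arguments(rows):
    """Return IDs of tokens labeled as argument of more than one predicate."""
    offenders = []
    for row in rows:
        labels = [value for value in row[FIRST_APRED:] if value != "_"]
        if len(labels) > 1:
            offenders.append(row[0])
    return offenders


def make_row(tok_id, head, apreds):
    """Build a toy row: only ID, HEAD and the APRED columns carry values."""
    row = ["_"] * FIRST_APRED
    row[0], row[HEAD] = str(tok_id), str(head)
    return row + list(apreds)


if __name__ == "__main__":
    # Invented 3-token sentence, for illustration only (not corpus data).
    toy = ["\t".join(make_row(1, 2, ["A"])),
           "\t".join(make_row(2, 0, ["_"])),
           "\t".join(make_row(3, 2, ["B"])),
           ""]
    for sentence in read_sentences(toy):
        heads = [int(row[HEAD]) for row in sentence]
        print(is_projective(heads), multi_predicate_arguments(sentence))
```

To run the checks on the real data, pass an open handle on one of the
files under datasets/ (read as UTF-8) to read_sentences.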

The following tools have been used to generate the Predicted (P-)
columns:

* PLEMMA, PPOS, PFEAT are generated with the FreeLing open-source
  suite of language analyzers (http://www.lsi.upc.es/~nlp/freeling/).
  The accuracy of the PLEMMA and PPOS columns is around 95%. Thanks to
  Lluís Padró (UPC) for helping with the annotation of the
  morphosyntactic information.

* PHEAD and PDEPREL are generated using MaltParser
  (http://w3.msi.vxu.se/~jha/maltparser/). Parsing accuracy (LAS) is
  around 86.5%. Thanks to Xavier Lluís (UPC) for helping with the
  annotation of this part.
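
Figures of this kind can be estimated by comparing each predicted
column with its gold counterpart. The Python sketch below is a minimal
illustration, not the official scorer. It assumes the shared-task
column order (LEMMA, PLEMMA, POS, PPOS in columns 3 to 6 and HEAD,
PHEAD, DEPREL, PDEPREL in columns 9 to 12, counting from 1) and counts
every token; the official evaluation may treat some tokens differently.
Function names are hypothetical and the toy rows contain invented
values.

```python
# 0-based column indices, assuming the shared-task column layout.
LEMMA, PLEMMA, POS, PPOS = 2, 3, 4, 5
HEAD, PHEAD, DEPREL, PDEPREL = 8, 9, 10, 11


def column_accuracy(rows, gold_col, pred_col):
    """Fraction of tokens whose predicted column equals the gold column."""
    if not rows:
        return 0.0
    hits = sum(1 for row in rows if row[gold_col] == row[pred_col])
    return hits / len(rows)


def labeled_attachment_score(rows):
    """Fraction of tokens whose predicted head AND label both match gold."""
    if not rows:
        return 0.0
    hits = sum(
        1
        for row in rows
        if row[HEAD] == row[PHEAD] and row[DEPREL] == row[PDEPREL]
    )
    return hits / len(rows)


def toy_row(lemma, plemma, head, phead, deprel, pdeprel):
    """Build a 12-column toy row; unused columns are filled with '_'."""
    row = ["_"] * 12
    row[LEMMA], row[PLEMMA] = lemma, plemma
    row[HEAD], row[PHEAD] = head, phead
    row[DEPREL], row[PDEPREL] = deprel, pdeprel
    return row


if __name__ == "__main__":
    # Invented values, for illustration only (not corpus data).
    rows = [
        toy_row("x", "x", "2", "2", "L1", "L1"),
        toy_row("y", "z", "0", "0", "L2", "L3"),
    ]
    print(column_accuracy(rows, LEMMA, PLEMMA))
    print(labeled_attachment_score(rows))
```

Here rows is a flat list of token rows from all sentences, each row
being the list of tab-separated column values of one corpus line.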

Sources of the Catalan and Spanish data sets:

  The Catalan and Spanish data sets are extracted from the AnCora
  corpora (see http://clic.ub.edu/ancora). AnCora-ES (the Spanish
  part) contains 75,000 words from the Lexesp Spanish balanced
  6-million-word corpus, 225,000 words from the EFE Spanish news
  agency, and 200,000 words from the Spanish version of the "El
  Periódico" newspaper. AnCora-CA (the Catalan part) consists of
  75,000 words from the EFE news agency, 225,000 words from the ACN
  Catalan news agency, and 200,000 words from the Catalan version of
  the "El Periódico" newspaper. The 200,000-word subsets from "El
  Periódico" correspond to the same news items in Catalan and Spanish,
  spanning January to December 2000.


(4) ORGANIZATION

  Lluís Màrquez 
  Universitat Politècnica de Catalunya (UPC), Barcelona, Spain 
  lluism@lsi.upc.edu
  http://www.lsi.upc.edu/~lluism

  Ma. Antònia Martí
  Universitat de Barcelona (UB), Barcelona, Spain
  amarti@ub.edu
  http://clic.ub.edu

  Other people behind the preparation of the corpora:

  Mariona Taulé, CLiC, UB 
  Manuel Bertran, CLiC, UB
  Oriol Borrega, CLiC, UB