==============================================================
SemEval-2010 Task 1 OntoNotes English Corpus:
"Coreference Resolution in Multiple Languages"
http://stel.ub.edu/semeval2010-coref

Created: March 29, 2012
Current version: 3.0 (2011/04/15)
==============================================================

This file contains the specific information regarding the English
corpus provided for the SemEval-2010 task #1 on "Coreference Resolution
in Multiple Languages".


(1) SOURCE

The English data set is extracted from the OntoNotes Corpus Release 2.0
(see http://www.bbn.com/ontonotes). The OntoNotes project is a
collaborative effort between BBN Technologies, the University of
Colorado, the University of Pennsylvania, and the University of
Southern California's Information Sciences Institute to annotate a
one-million-word English corpus with structural information (syntax and
predicate-argument structure) and shallow semantics (named entities and
coreference). The corpus comprises various genres of text, news among
them, from which the excerpts selected for SemEval Task 1 were
extracted. OntoNotes builds on the Penn Treebank for syntax and the
Penn PropBank for predicate-argument structure.

Authors: Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Martha
Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Craig Greenberg,
Eduard Hovy, Robert Belvin, Ann Houston.


(2) LICENSE AGREEMENT

The OntoNotes corpus used at SemEval-2010 is freely distributed by LDC.
The LDC catalog entry for this corpus is LDC2011T01
(http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T01).


(3) CONTENTS OF THE DISTRIBUTION

We provide the following files:

  * /docs/en.info.txt    This file

  * /data/en.devel.txt   The development set
                         39 documents; 741 sentences; 17,044 tokens

  * /data/en.train.txt   The training set
                         229 documents; 3,648 sentences; 79,060 tokens

  * /data/en.test.txt    The test set
                         85 documents; 1,141 sentences; 24,206 tokens

Note: the complete material for training systems is the union of the
development and training sets.

The official scorer for the task is also available from the task
website (http://stel.ub.edu/semeval2010-coref/download). Check the
website regularly for updates (or subscribe to the official mailing
list).


(4) DATA FORMATTING

General formatting is shared by all languages in the task. Data formats
are inspired by the previous CoNLL shared tasks on syntactic and
semantic dependencies (2008/2009 editions:
http://ufal.mff.cuni.cz/conll2009-st).

Each data set is provided as a single file per language. Each file
contains several documents, each introduced and closed by comment
lines:

  #begin document CESS-CAT-AAP/95694_20030723.tbf.xml
    ... sentences in the document ...
  #end document CESS-CAT-AAP/95694_20030723.tbf.xml

Inside a document, the information of each sentence is organized
vertically, one word per line. The information associated with each
word is described by several fields (columns) representing different
layers of linguistic annotation. Columns are separated by TAB
characters. Sentences are separated by a blank line.
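
For concreteness, the following Python sketch shows one way to load a
file in this layout into documents, sentences, and token rows. It is
only an illustration (the function name, the return structure, and the
assumption of UTF-8 encoding are ours, not part of the task
distribution, and it is not taken from the official scorer):

  from typing import Dict, List

  def read_documents(path: str) -> Dict[str, List[List[List[str]]]]:
      """Return {document id: list of sentences}; each sentence is a list
      of token rows, and each row is the list of TAB-separated columns."""
      documents: Dict[str, List[List[List[str]]]] = {}
      doc_id = None
      sentences: List[List[List[str]]] = []
      sentence: List[List[str]] = []
      with open(path, encoding="utf-8") as handle:
          for line in handle:
              line = line.rstrip("\n")
              if line.startswith("#begin document"):
                  doc_id = line[len("#begin document"):].strip()
                  sentences, sentence = [], []
              elif line.startswith("#end document"):
                  if sentence:              # close a sentence not followed by a blank line
                      sentences.append(sentence)
                      sentence = []
                  documents[doc_id] = sentences
                  doc_id = None
              elif not line.strip():        # a blank line ends the current sentence
                  if sentence:
                      sentences.append(sentence)
                      sentence = []
              else:
                  # one token per line, columns separated by TAB characters
                  sentence.append(line.split("\t"))
      return documents

With a reader like this, the COREF annotation of the i-th token of a
sentence is simply the last element of the i-th row.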

The following columns are provided: ID, TOKEN, LEMMA, PLEMMA, POS,
PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, NE, PNE, PRED, PPRED,
APREDs, PAPREDs, and COREF, with the following interpretation:

Column 1:

   1  ID       word identifiers in the sentence

Columns 2--8: words and morphosyntactic information

   2  TOKEN    word forms
   3  LEMMA    word lemmas (gold standard manual annotation)
   4  PLEMMA   word lemmas predicted by an automatic analyzer
   5  POS      coarse part of speech
   6  PPOS     same as 5 but predicted by an automatic analyzer
   7  FEAT     morphological features (part of speech type, number,
               gender, case, tense, aspect, degree of comparison, etc.,
               separated by the character "|")
   8  PFEAT    same as 7 but predicted by an automatic analyzer

Columns 9--12: syntactic dependency tree

   9  HEAD     for each word, the ID of its syntactic head ('0' if the
               word is the root of the tree)
  10  PHEAD    same as 9 but predicted by an automatic analyzer
  11  DEPREL   dependency relation labels corresponding to the
               dependencies described in 9 ("sentence" if the word is
               the root of the tree)
  12  PDEPREL  same as 11 but predicted by an automatic analyzer

Columns 13--14: named entities

  13  NE       named entities
  14  PNE      same as 13 but predicted by a named entity recognizer

Columns 15--16+N+M: semantic role labeling

  15  PRED     predicates are marked and annotated with a semantic
               class label
  16  PPRED    same as 15 but predicted by an automatic analyzer
      APREDs   N columns, one for each predicate in 15, containing the
               semantic roles/dependencies of that particular predicate
      PAPREDs  M columns, one for each predicate in 16, with the same
               information as APREDs but predicted by an automatic
               analyzer

Last column: output to be predicted

      COREF    coreference annotation in open-close notation, using "|"
               to separate multiple annotations (see details below)

Notes:

- All but the last column are to be considered input information. When
  available, the predicted (P-) columns are always provided. The gold
  standard manual annotations may be used at test time only in the
  "gold standard" evaluation setting; in the regular setting,
  participants are not allowed to use the gold standard columns at test
  time. The last column (COREF) is the output information, that is, the
  annotation that has to be predicted by the systems.

- Not all input columns are available for all languages. Whenever a
  language lacks some layer of linguistic information, the
  corresponding columns contain only underscore characters ('_').


DETAILS ON THE COREF ANNOTATION

Coreference is annotated in the last column in a numerical-bracketed
format. Every entity has an ID number, and every mention is marked with
the ID of the entity it refers to. An opening parenthesis (before the
entity ID) marks the beginning of the mention (first token), and a
closing parenthesis (after the entity ID) marks the end of the mention
(last token). Mentions can be nested but cannot cross, so the resulting
annotation is a well-formed nested structure (a context-free language).

The following examples are extracted from this Catalan sentence
(English gloss: "The remodeled Market square was inaugurated yesterday
with events in homage to Josep Roura i Estrada (1787-1860), known for
introducing public gas lighting in Spain. An old gas lamp was installed
at Roura's birthplace, on the square."):

  [La remodelada plaça del [Mercat]_2]_1 es va inaugurar ahir amb actes
  d'homenatge a [Josep_Roura_i_Estrada]_3 (1787-1860), conegut per la
  introducció de l'enllumenat públic de gas a Espanya. A la casa natal
  de [Roura]_3, a [la plaça]_1, s'[hi]_1 va instal·lar un fanal antic
  de gas.

Using the open-close notation from the task datasets:

  la      [...]  (1
  plaça   [...]  1)

Mentions with one single token show the entity ID within parentheses:

  Roura   [...]  (3)

Tokens belonging to more than one mention separate the respective
entity IDs with a pipe symbol "|". For instance:

  La          [...]  (1
  remodelada  [...]
  plaça       [...]
  del         [...]
  Mercat      [...]  (2)|1)

Since the two mentions "la plaça" and "hi" corefer with "La remodelada
plaça del Mercat", the COREF column shows the same entity ID for both
of them:

  la      [...]  (1
  plaça   [...]  1)
  [...]
  hi      [...]  (1)

Note: the formatting of the named entity columns (NE and PNE) follows
exactly the same rules as the COREF annotation.
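
The open-close notation can be decoded mechanically. The following
Python sketch is only illustrative (the function name, the returned
span format, and the assumption that unannotated tokens carry no
parentheses in their COREF cell are ours, not part of the official
distribution or scorer); since the NE and PNE columns follow the same
rules, the same function applies to them:

  from typing import Dict, List, Tuple

  def decode_mentions(coref_column: List[str]) -> List[Tuple[int, int, int]]:
      """Decode one COREF (or NE/PNE) cell per token into
      (entity_id, first_token, last_token) spans, 0-based in the sentence."""
      mentions: List[Tuple[int, int, int]] = []
      open_spans: Dict[int, List[int]] = {}        # entity ID -> stack of pending starts
      for index, cell in enumerate(coref_column):
          if "(" not in cell and ")" not in cell:  # no mention boundary on this token
              continue
          for part in cell.split("|"):             # "|" separates multiple annotations
              entity = int(part.strip("()"))
              if part.startswith("("):             # "(ID" -> a mention of this entity starts here
                  open_spans.setdefault(entity, []).append(index)
              if part.endswith(")"):               # "ID)" -> the innermost open mention ends here
                  mentions.append((entity, open_spans[entity].pop(), index))
      return mentions

  # COREF cells for the tokens "La remodelada plaça del Mercat" above
  # (assuming '_' for tokens without annotation):
  print(decode_mentions(["(1", "_", "_", "_", "(2)|1)"]))
  # -> [(2, 4, 4), (1, 0, 4)]

Nesting is handled by the per-entity stack: when mentions of the same
entity are nested, the innermost one is closed first.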

(5) GOLD-STANDARD ANNOTATION

The annotation follows the Penn Treebank
(http://www.cis.upenn.edu/~treebank), PropBank
(http://verbs.colorado.edu/~mpalmer/projects/ace.html) and NomBank
(http://nlp.cs.nyu.edu/meyers/NomBank.html) annotated datasets.
Information on the annotation style and tagsets for POS/parsing/SRL can
be found in the "Format of the data" and "Examples" sections of the
CoNLL-2008 shared task website (http://www.yr-bcn.es/conll2008/).

In order to make the data conform to the column-based format of SemEval
Task 1, some changes were needed. The Penn syntactic trees and the
NomBank predicate information were converted to the dependency format
using the automatic tools of the CoNLL-2008 and CoNLL-2009 shared
tasks, developed by Richard Johansson and Mihai Surdeanu. Also, to make
the gold-standard information as similar as possible to the predicted
information, null elements (traces, ellipsed material, etc.) and
pseudo-attach elements were removed, as they were in the past CoNLL
shared tasks.

LEMMA column: OntoNotes contains no gold-standard lemmas.

FEAT columns: contain a repetition of the POS tags.

PRED column: a repeated ARG tag means that the argument extends over
the two or more corresponding syntactic nodes carrying the same tag.
Tags of the argX/Y type indicate different instances of the same
argument in the sentence.

COREF column: only nominal mentions and identity (IDENT) relations were
taken from the OntoNotes coreference annotation, thus excluding
coreference relations involving verbs and appositives. Since OntoNotes
only annotates multi-mention entities, singleton referential elements
were identified heuristically: all NPs and possessive determiners were
annotated as singletons, excluding those functioning as appositives or
as premodifiers (with the exception of NPs in the possessive case,
which were kept). In coordinated NPs, both the individual constituents
and the entire NP were considered to be mentions. There is no reliable
heuristic to automatically detect English expletive pronouns, so they
were (although inaccurately) also annotated as singletons.


(6) AUTOMATIC ANNOTATION

The following tools were used to generate the predicted (P-) columns in
the data sets:

  * The English PLEMMA, PPOS, and PFEAT columns were generated using
    SVMTagger (http://www.lsi.upc.edu/~nlp/SVMTool/), trained on the
    Penn Treebank (WSJ), and the WordNet lemmatizer. The accuracy of
    the PLEMMA and PPOS columns is expected to be around 97%. Thanks to
    Jesús Giménez (UPC) for helping with the annotation of the
    morphosyntactic information.

  * The PHEAD, PDEPREL and PAPREDs columns were generated with
    JointParser (http://www.lsi.upc.edu/%7Exlluis/?x=cat:5), a
    syntactic-semantic parser developed during the CoNLL-2008 and
    CoNLL-2009 shared tasks. Thanks to Xavier Lluís (UPC) for helping
    with the annotation of the syntactic and semantic parts. The
    accuracy of the automatic annotation is around 82% (labeled
    attachment score) for the syntactic dependencies and 76.5%
    (F1 measure) for the semantic dependencies.

Notes:

1. No automatically predicted named entities (PNE column) are provided
   in the English datasets.

2. (Simplification) In the PPRED column, verbal senses are predicted by
   assigning the most frequent sense. The identification of predicates
   is copied from the gold standard column.

(7) ORGANIZATION

  * Marta Recasens, Ma. Antònia Martí, Mariona Taulé
    Universitat de Barcelona (UB), Barcelona, Spain
    {mrecasens, amarti, mtaule}@ub.edu
    http://clic.ub.edu

  * Lluís Màrquez, Emili Sapena
    Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
    {lluism, esapena}@lsi.upc.edu
    http://www.lsi.upc.edu/~lluism

  * Massimo Poesio
    University of Essex, UK / Università di Trento, Italy
    http://cswww.essex.ac.uk/staff/poesio/

  * Véronique Hoste
    Hogeschool Gent, Belgium
    http://webs.hogent.be/~vhos368/

  * Yannick Versley
    University of Tübingen, Germany
    http://www.versley.de/

Other people behind the preparation of the corpora: Manuel Bertran
(UB), Oriol Borrega (UB), Jesús Giménez (UPC), Richard Johansson
(U. Trento), Xavier Lluís (UPC), Montse Nofre (UB), Lluís Padró (UPC),
Kepa Rodríguez (U. Trento), Mihai Surdeanu (Stanford), Olga Uryupina,
Lente Van Leuven (UB), Rita Zaragoza (UB).