==============================================================
SemEval-2010 Task 1 OntoNotes English Corpus:
"Coreference Resolution in Multiple Languages"
http://stel.ub.edu/semeval2010-coref

Created: March 29, 2012
Current version: 3.0 (2011/04/15)
==============================================================

This file contains the specific information regarding the English
corpus provided for the SemEval-2010 task #1 on "Coreference Resolution
in Multiple Languages".


(1) SOURCE

The English data set is extracted from the OntoNotes Corpus Release 2.0
(see http://www.bbn.com/ontonotes). The OntoNotes project is a
collaborative effort between BBN Technologies, the University of
Colorado, the University of Pennsylvania, and the University of
Southern California's Information Sciences Institute to annotate a
one-million-word English corpus with structural information (syntax and
predicate-argument structure) and shallow semantics (named entities and
coreference). The corpus comprises various genres of text, news among
them, from which the excerpts selected for SemEval Task 1 were
extracted. OntoNotes builds on the Penn Treebank for syntax and the
Penn PropBank for predicate-argument structure.

Authors: Ralph Weischedel, Sameer Pradhan, Lance Ramshaw, Martha
Palmer, Nianwen Xue, Mitchell Marcus, Ann Taylor, Craig Greenberg,
Eduard Hovy, Robert Belvin, Ann Houston.


(2) LICENSE AGREEMENT

The OntoNotes corpus used at SemEval-2010 is freely distributed by LDC.
The LDC catalog entry for this corpus is LDC2011T01
(http://www.ldc.upenn.edu/Catalog/catalogEntry.jsp?catalogId=LDC2011T01).


(3) CONTENTS OF THE DISTRIBUTION

We provide the following files:

  * /docs/en.info.txt    This file

  * /data/en.devel.txt   The development set
                         39 documents; 741 sentences; 17,044 tokens

  * /data/en.train.txt   The training set
                         229 documents; 3,648 sentences; 79,060 tokens

  * /data/en.test.txt    The test set
                         85 documents; 1,141 sentences; 24,206 tokens

Note: the complete material for training systems is the union of the
development and training sets.

The official scorer for the task is also available from the task
website (http://stel.ub.edu/semeval2010-coref/download). Check the
website regularly for updates (or subscribe to the official mailing
list).


(4) DATA FORMATTING

General formatting is shared by all languages in the task. Data formats
are inspired by the previous CoNLL shared tasks on syntactic and
semantic dependencies (2008/2009 editions:
http://ufal.mff.cuni.cz/conll2009-st).

Each data set is provided as a single file per language. Each file
contains several documents, each introduced and closed by comment
lines:

  #begin document CESS-CAT-AAP/95694_20030723.tbf.xml
    ... sentences in the document ...
  #end document CESS-CAT-AAP/95694_20030723.tbf.xml

Inside a document, the information of each sentence is organized
vertically, one word per line. The information associated with each
word is described by several fields (columns) representing different
layers of linguistic annotation. Columns are separated by TAB
characters. Sentences are separated by a blank line.
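
For concreteness, the following Python sketch shows one way to load a
file in this layout into documents, sentences, and token rows. It is
only an illustration (the function name, the return structure, and the
assumption of UTF-8 encoding are ours, not part of the task
distribution, and it is not taken from the official scorer):

  from typing import Dict, List

  def read_documents(path: str) -> Dict[str, List[List[List[str]]]]:
      """Return {document id: list of sentences}; each sentence is a list
      of token rows, and each row is the list of TAB-separated columns."""
      documents: Dict[str, List[List[List[str]]]] = {}
      doc_id = None
      sentences: List[List[List[str]]] = []
      sentence: List[List[str]] = []
      with open(path, encoding="utf-8") as handle:
          for line in handle:
              line = line.rstrip("\n")
              if line.startswith("#begin document"):
                  doc_id = line[len("#begin document"):].strip()
                  sentences, sentence = [], []
              elif line.startswith("#end document"):
                  if sentence:              # close a sentence not followed by a blank line
                      sentences.append(sentence)
                      sentence = []
                  documents[doc_id] = sentences
                  doc_id = None
              elif not line.strip():        # a blank line ends the current sentence
                  if sentence:
                      sentences.append(sentence)
                      sentence = []
              else:
                  # one token per line, columns separated by TAB characters
                  sentence.append(line.split("\t"))
      return documents

With a reader like this, the COREF annotation of the i-th token of a
sentence is simply the last element of the i-th row.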

The following columns are provided: ID, TOKEN, LEMMA, PLEMMA, POS,
PPOS, FEAT, PFEAT, HEAD, PHEAD, DEPREL, PDEPREL, NE, PNE, PRED, PPRED,
APREDs, PAPREDs, and COREF, with the following interpretation:

Column 1:

   1  ID       word identifiers in the sentence

Columns 2--8: words and morphosyntactic information

   2  TOKEN    word forms
   3  LEMMA    word lemmas (gold standard manual annotation)
   4  PLEMMA   word lemmas predicted by an automatic analyzer
   5  POS      coarse part of speech
   6  PPOS     same as 5 but predicted by an automatic analyzer
   7  FEAT     morphological features (part of speech type, number,
               gender, case, tense, aspect, degree of comparison, etc.,
               separated by the character "|")
   8  PFEAT    same as 7 but predicted by an automatic analyzer

Columns 9--12: syntactic dependency tree

   9  HEAD     for each word, the ID of its syntactic head ('0' if the
               word is the root of the tree)
  10  PHEAD    same as 9 but predicted by an automatic analyzer
  11  DEPREL   dependency relation labels corresponding to the
               dependencies described in 9 ("sentence" if the word is
               the root of the tree)
  12  PDEPREL  same as 11 but predicted by an automatic analyzer

Columns 13--14: named entities

  13  NE       named entities
  14  PNE      same as 13 but predicted by a named entity recognizer

Columns 15--16+N+M: semantic role labeling

  15  PRED     predicates are marked and annotated with a semantic
               class label
  16  PPRED    same as 15 but predicted by an automatic analyzer
      APREDs   N columns, one for each predicate in 15, containing the
               semantic roles/dependencies of that particular predicate
      PAPREDs  M columns, one for each predicate in 16, with the same
               information as APREDs but predicted by an automatic
               analyzer

Last column: output to be predicted

      COREF    coreference annotation in open-close notation, using "|"
               to separate multiple annotations (see details below)

Notes:

- All but the last column are to be considered input information. When
  available, the predicted (P-) columns are always provided. The gold
  standard manual annotations may be used at test time only in the
  "gold standard" evaluation setting; in the regular setting,
  participants are not allowed to use the gold standard columns at test
  time. The last column (COREF) is the output information, that is, the
  annotation that has to be predicted by the systems.

- Not all input columns are available for all languages. Whenever a
  language lacks some layer of linguistic information, the
  corresponding columns contain only underscore characters ('_').


DETAILS ON THE COREF ANNOTATION

Coreference is annotated in the last column in a numerical-bracketed
format. Every entity has an ID number, and every mention is marked with
the ID of the entity it refers to. An opening parenthesis (before the
entity ID) marks the beginning of the mention (first token), and a
closing parenthesis (after the entity ID) marks the end of the mention
(last token). Mentions can be nested but cannot cross, so the resulting
annotation is a well-formed nested structure (a context-free language).

The following examples are extracted from this Catalan sentence
(English gloss: "The remodeled Market square was inaugurated yesterday
with events in homage to Josep Roura i Estrada (1787-1860), known for
introducing public gas lighting in Spain. An old gas lamp was installed
at Roura's birthplace, on the square."):

  [La remodelada plaça del [Mercat]_2]_1 es va inaugurar ahir amb actes
  d'homenatge a [Josep_Roura_i_Estrada]_3 (1787-1860), conegut per la
  introducció de l'enllumenat públic de gas a Espanya. A la casa natal
  de [Roura]_3, a [la plaça]_1, s'[hi]_1 va instal·lar un fanal antic
  de gas.

Using the open-close notation from the task datasets:

  la      [...]  (1
  plaça   [...]  1)

Mentions with one single token show the entity ID within parentheses:

  Roura   [...]  (3)

Tokens belonging to more than one mention separate the respective
entity IDs with a pipe symbol "|". For instance:

  La          [...]  (1
  remodelada  [...]
  plaça       [...]
  del         [...]
  Mercat      [...]  (2)|1)

Since the two mentions "la plaça" and "hi" corefer with "La remodelada
plaça del Mercat", the COREF column shows the same entity ID for both
of them:

  la      [...]  (1
  plaça   [...]  1)
  [...]
  hi      [...]  (1)

Note: the formatting of the named entity columns (NE and PNE) follows
exactly the same rules as the COREF annotation.
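
The open-close notation can be decoded mechanically. The following
Python sketch is only illustrative (the function name, the returned
span format, and the assumption that unannotated tokens carry no
parentheses in their COREF cell are ours, not part of the official
distribution or scorer); since the NE and PNE columns follow the same
rules, the same function applies to them:

  from typing import Dict, List, Tuple

  def decode_mentions(coref_column: List[str]) -> List[Tuple[int, int, int]]:
      """Decode one COREF (or NE/PNE) cell per token into
      (entity_id, first_token, last_token) spans, 0-based in the sentence."""
      mentions: List[Tuple[int, int, int]] = []
      open_spans: Dict[int, List[int]] = {}        # entity ID -> stack of pending starts
      for index, cell in enumerate(coref_column):
          if "(" not in cell and ")" not in cell:  # no mention boundary on this token
              continue
          for part in cell.split("|"):             # "|" separates multiple annotations
              entity = int(part.strip("()"))
              if part.startswith("("):             # "(ID" -> a mention of this entity starts here
                  open_spans.setdefault(entity, []).append(index)
              if part.endswith(")"):               # "ID)" -> the innermost open mention ends here
                  mentions.append((entity, open_spans[entity].pop(), index))
      return mentions

  # COREF cells for the tokens "La remodelada plaça del Mercat" above
  # (assuming '_' for tokens without annotation):
  print(decode_mentions(["(1", "_", "_", "_", "(2)|1)"]))
  # -> [(2, 4, 4), (1, 0, 4)]

Nesting is handled by the per-entity stack: when mentions of the same
entity are nested, the innermost one is closed first.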

(5) GOLD-STANDARD ANNOTATION

The annotation follows the Penn Treebank
(http://www.cis.upenn.edu/~treebank), PropBank
(http://verbs.colorado.edu/~mpalmer/projects/ace.html) and NomBank
(http://nlp.cs.nyu.edu/meyers/NomBank.html) annotated datasets.
Information on the annotation style and tagsets for POS/parsing/SRL can
be found in the "Format of the data" and "Examples" sections of the
CoNLL-2008 shared task website (http://www.yr-bcn.es/conll2008/).

In order to make the data conform to the column-based format of SemEval
Task 1, some changes were needed. The Penn syntactic trees and the
NomBank predicate information were converted to the dependency format
using the automatic tools of the CoNLL-2008 and CoNLL-2009 shared
tasks, developed by Richard Johansson and Mihai Surdeanu. Also, to make
the gold-standard information as similar as possible to the predicted
information, null elements (traces, ellipsed material, etc.) and
pseudo-attach elements were removed, as they were in the past CoNLL
shared tasks.

LEMMA column: OntoNotes contains no gold-standard lemmas.

FEAT columns: contain a repetition of the POS tags.

PRED column: a repeated ARG tag means that the argument extends over
the two or more corresponding syntactic nodes carrying the same tag.
Tags of the argX/Y type indicate different instances of the same
argument in the sentence.

COREF column: only nominal mentions and identity (IDENT) relations were
taken from the OntoNotes coreference annotation, thus excluding
coreference relations involving verbs and appositives. Since OntoNotes
only annotates multi-mention entities, singleton referential elements
were identified heuristically: all NPs and possessive determiners were
annotated as singletons, excluding those functioning as appositives or
as premodifiers (with the exception of NPs in the possessive case,
which were kept). In coordinated NPs, both the individual constituents
and the entire NP were considered to be mentions. There is no reliable
heuristic to automatically detect English expletive pronouns, so they
were (although inaccurately) also annotated as singletons.


(6) AUTOMATIC ANNOTATION

The following tools were used to generate the predicted (P-) columns in
the data sets:

  * The English PLEMMA, PPOS, and PFEAT columns were generated using
    SVMTagger (http://www.lsi.upc.edu/~nlp/SVMTool/), trained on the
    Penn Treebank (WSJ), and the WordNet lemmatizer. The accuracy of
    the PLEMMA and PPOS columns is expected to be around 97%. Thanks to
    Jesús Giménez (UPC) for helping with the annotation of the
    morphosyntactic information.

  * The PHEAD, PDEPREL and PAPREDs columns were generated with
    JointParser (http://www.lsi.upc.edu/%7Exlluis/?x=cat:5), a
    syntactic-semantic parser developed during the CoNLL-2008 and
    CoNLL-2009 shared tasks. Thanks to Xavier Lluís (UPC) for helping
    with the annotation of the syntactic and semantic parts. The
    accuracy of the automatic annotation is around 82% (labeled
    attachment score) for the syntactic dependencies and 76.5%
    (F1 measure) for the semantic dependencies.

Notes:

1. No automatically predicted named entities (PNE column) are provided
   in the English datasets.

2. (Simplification) In the PPRED column, verbal senses are predicted by
   assigning the most frequent sense. The identification of predicates
   is copied from the gold standard column.

(7) ORGANIZATION

  * Marta Recasens, Ma. Antònia Martí, Mariona Taulé
    Universitat de Barcelona (UB), Barcelona, Spain
    {mrecasens, amarti, mtaule}@ub.edu
    http://clic.ub.edu

  * Lluís Màrquez, Emili Sapena
    Universitat Politècnica de Catalunya (UPC), Barcelona, Spain
    {lluism, esapena}@lsi.upc.edu
    http://www.lsi.upc.edu/~lluism

  * Massimo Poesio
    University of Essex, UK / Università di Trento, Italy
    http://cswww.essex.ac.uk/staff/poesio/

  * Véronique Hoste
    Hogeschool Gent, Belgium
    http://webs.hogent.be/~vhos368/

  * Yannick Versley
    University of Tübingen, Germany
    http://www.versley.de/

Other people behind the preparation of the corpora: Manuel Bertran
(UB), Oriol Borrega (UB), Jesús Giménez (UPC), Richard Johansson
(U. Trento), Xavier Lluís (UPC), Montse Nofre (UB), Lluís Padró (UPC),
Kepa Rodríguez (U. Trento), Mihai Surdeanu (Stanford), Olga Uryupina,
Lente Van Leuven (UB), Rita Zaragoza (UB).