SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages

Item Name:	SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
Author(s):	Marta Recasens, Lluis Marquez, Emili Sapena, M. Antònia Martí, Mariona Taulé
LDC Catalog No.:	LDC2011T01
ISBN:	1-58563-572-3
ISLRN:	365-198-419-802-6
DOI:	https://doi.org/10.35111/bmpd-n944
Release Date:	January 24, 2011
Member Year(s):	2011
DCMI Type(s):	Text
Data Source(s):	broadcast news, newswire
Project(s):	SemEval
Application(s):	information extraction, information retrieval, semantic role labelling
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2011T01 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Recasens, Marta, et al. SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages LDC2011T01. Web Download. Philadelphia: Linguistic Data Consortium, 2011.
Related Works:	isOutcomeOf LDC2008T04 OntoNotes Release 2.0; relatesTo LDC2017T08 Phrase Detectives Corpus

Introduction

SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages is a subset of OntoNotes Release 2.0 LDC2008T04 used in SemEval-2010 Task 1, Coreference Resolution in Multiple Languages. OntoNotes Release 2.0 consists of roughly 500,000 words of English broadcast and newswire data annotated with structural information (syntax and predicate argument structure) and shallow semantics (word sense linked to an ontology and coreference). This SemEval-2010 Task 1 release contains approximately 120,000 words extracted from the OntoNotes corpus and formatted for the SemEval task.

SemEval (Semantic Evaluation) is an ongoing series of evaluations of computational semantic analysis systems. The goal of SemEval-2010 Task 1 was to evaluate and compare automatic coreference resolution systems for six languages (Catalan, Dutch, English, German, Italian and Spanish) in four evaluation settings using four metrics. Further information about Task 1 can be found on the task description website. The task organizers included researchers from Universitat de Barcelona (Spain), Universitat Politècnica de Catalunya (Spain), University of Essex (United Kingdom), Università di Trento (Italy), Hogeschool Gent (Belgium), University of Tübingen (Germany) and Stanford University (USA).

Data

The data is divided into three sets: the development set (*/data/en.devel.txt) which contains 39 documents, 741 sentences and 17,044 tokens; the training set (*/data/en.train.txt) which contains 229 documents, 3,648 sentences and 79,060 tokens; and the test set (*/data/en.test.txt) which contains 85 documents, 1,141 sentences and 24,206 tokens. The complete material for training systems is the sum of the development and training sets. Details of the SemEval task formatting applied to the data can be found in the documentation file, en.info.txt.
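As a quick sanity check, the per-set figures above can be added up and compared against a downloaded copy. The Python sketch below is illustrative only: `SplitStats` and `count_split` are hypothetical names, and the counting logic assumes a layout that this page does not confirm, namely that documents are delimited by `#begin document` and `#end document` lines, that each token sits on its own line, and that a blank line ends a sentence. Check en.info.txt before relying on it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitStats:
    """Document, sentence and token counts for one data set."""
    documents: int
    sentences: int
    tokens: int

    def __add__(self, other):
        return SplitStats(
            self.documents + other.documents,
            self.sentences + other.sentences,
            self.tokens + other.tokens,
        )


# Figures as stated in the Data section.
DEVEL = SplitStats(documents=39, sentences=741, tokens=17_044)
TRAIN = SplitStats(documents=229, sentences=3_648, tokens=79_060)
TEST = SplitStats(documents=85, sentences=1_141, tokens=24_206)

# Complete training material = development set + training set.
FULL_TRAIN = DEVEL + TRAIN
RELEASE = FULL_TRAIN + TEST


def count_split(lines):
    """Count documents, sentences and tokens in an iterable of text lines.

    ASSUMPTION (verify against en.info.txt): '#begin document' opens a
    document, '#end document' closes it, every other non-blank line is
    one token, and a blank line ends the current sentence.
    """
    documents = sentences = tokens = 0
    in_sentence = False
    for raw in lines:
        line = raw.strip()
        is_begin = line.startswith("#begin document")
        is_end = line.startswith("#end document")
        if is_begin or is_end or not line:
            # Any boundary line closes a sentence that is still open.
            if in_sentence:
                sentences += 1
                in_sentence = False
            if is_begin:
                documents += 1
            continue
        tokens += 1
        in_sentence = True
    if in_sentence:  # file ended without a trailing blank line
        sentences += 1
    return SplitStats(documents, sentences, tokens)


if __name__ == "__main__":
    print("training material:", FULL_TRAIN)
    print("whole release:", RELEASE)
```

Passing an open file handle for one of the three data files to `count_split` and comparing the result with `DEVEL`, `TRAIN` or `TEST` would show whether the assumed layout matches the release.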

Scorer

The official scorer is available from the task download page.

Updates

This corpus was updated on March 30, 2012 to fix a bug that caused one annotation error in every document. Any copy downloaded after this date is the corrected release. Contact ldc@ldc.upenn.edu with any questions.

Samples

For an example of the data in this publication, please review this text file excerpt.

Copyright

Portions © 2000-2001 American Broadcasting Company, © 2000-2001 Cable News Network, LP, LLP, © 1989 Dow Jones & Company, Inc., © 2000-2001 National Broadcasting Company, Inc., © 2000-2001 Public Radio International, © 1995, 2005, 2006, 2007, 2008, 2011 Trustees of the University of Pennsylvania

The World is a co-production of Public Radio International and the British Broadcasting Corporation and is produced at WGBH Boston.