Phrase Detectives Corpus

Item Name:	Phrase Detectives Corpus
Author(s):	Jon Chamberlain, Massimo Poesio, Udo Kruschwitz
LDC Catalog No.:	LDC2017T08
ISBN:	1-58563-798-X
ISLRN:	052-688-100-874-5
DOI:	https://doi.org/10.35111/9890-p128
Release Date:	May 15, 2017
Member Year(s):	2017
DCMI Type(s):	Text
Data Source(s):	fiction, web collection
Application(s):	information detection, parsing, information extraction, tagging
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2017T08 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Chamberlain, Jon, Massimo Poesio, and Udo Kruschwitz. Phrase Detectives Corpus LDC2017T08. Web Download. Philadelphia: Linguistic Data Consortium, 2017.
Related Works:
hasVersion:	LDC2019T10 Phrase Detectives Corpus Version 2
relatesTo:	LDC2011T01 SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages
relatesTo:	LDC2013T22 The ARRAU Corpus of Anaphoric Information
relatesTo:	LDC2013T19 OntoNotes Release 5.0
isCreatedBy:	Phrase Detectives http://anawiki.essex.ac.uk/phrasedetectives/

Introduction

Phrase Detectives Corpus was developed by the School of Computer Science and Electronic Engineering at the University of Essex. It consists of approximately 19,012 words across 40 documents, anaphorically annotated through the Phrase Detectives game, an online interactive "game-with-a-purpose" (GWAP) designed to collect data about English anaphoric coreference.

The use of GWAPs for creating language resources is growing. In general, they rely on non-monetary incentives, such as entertainment, to motivate participation, and they can be successful for large-scale, persistent annotation efforts.

Data

The documents in the corpus are taken from Wikipedia articles and from narrative text in Project Gutenberg. Wikipedia articles and annotation files are presented as XML, and Project Gutenberg source files are presented as plain text. All text is encoded as UTF-8. The annotations comprise a gold standard version created by multiple experts, as well as a set created by a large non-expert crowd (via the Phrase Detectives game).
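As a rough illustration of working with this mix of XML and UTF-8 plain-text files, the following minimal Python sketch walks a local copy of the corpus and reports which XML element names occur and how many whitespace-delimited words the plain-text files contain. The function name `summarize_corpus`, the directory layout, and the `.xml`/`.txt` file extensions are assumptions made for illustration only; the actual file organization and annotation schema are described in the corpus documentation.

```python
from collections import Counter
from pathlib import Path
import xml.etree.ElementTree as ET


def summarize_corpus(root_dir):
    """Summarize the XML and plain-text files found under root_dir.

    Hypothetical helper: the directory layout and the .xml/.txt
    extensions are assumptions, not part of the corpus specification.
    """
    tag_counts = Counter()  # occurrences of each XML element name
    word_count = 0          # whitespace-delimited tokens in plain-text files
    for path in sorted(Path(root_dir).rglob("*")):
        if not path.is_file():
            continue
        if path.suffix == ".xml":
            # Count element names without assuming any particular schema.
            for element in ET.parse(str(path)).iter():
                tag_counts[element.tag] += 1
        elif path.suffix == ".txt":
            # The corpus description states that all text is UTF-8.
            word_count += len(path.read_text(encoding="utf-8").split())
    return tag_counts, word_count
```

Calling `summarize_corpus("path/to/corpus")` on an unpacked copy would return a `Counter` of element names and a word total, which may be a convenient first step for discovering the annotation element names before writing a schema-specific parser.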

The data was annotated according to a prevalent, linguistically oriented approach to anaphora used in several tasks, including OntoNotes Release 5.0 (LDC2013T19), SemEval-2010 Task 1 OntoNotes English: Coreference Resolution in Multiple Languages (LDC2011T01), and The ARRAU Corpus of Anaphoric Information (LDC2013T22).