****************************** Phrase Detectives Corpus 2.1.4 ****************************** Jon Chamberlain, Silviu Paun, Juntao Yu, Udo Kruschwitz and Massimo Poesio Release date: January 2019 This is a corpus of anaphorically annotated documents, consisting of all the documents completely annotated as of 12th October 2018 using the Phrase Detectives game-with-a-purpose: https://www.phrasedetectives.org **************************** The Documents **************************** A total of 537 documents are included, for a total of 407K tokens and 108K markables. The documents are divided in: - a `gold' subset of 40 documents (22K tokens, 6K markables) annotated both by Phrase Detectives players and by two experts. This subset is a cleaned-up version of the Phrase Detectives 1.0 corpus released in 2016. In addition to the gold annotation, a `silver' annotation is included for this subset, extracted using the Mention Pair Annotation (MPA) probabilistic aggregation method for coreference discussed in the paper: Silviu Paun, Jon Chamberlain, Udo Kruschwitz, Juntao Yu and Massimo Poesio, 2018. A Probabilistic Annotation Model for Crowdsourcing Coreference. In Proc. of EMNLP. https://aclanthology.coli.uni-saarland.de/papers/D18-1218/d18-1218 The gold subset consists of documents from three genres: - Wikipedia pages (35 docs, 15287 tokens, 3957 markables) - fiction from the Project Gutenberg collection (5 documents, 7536 tokens, 1947 markables) - a `silver' subset of 497 documents (384 tokens, 101K markables) for which only the silver MPA annotation is provided. The silver subset consists mainly of documents from two separate genres: - Wikipedia pages (348 documents, 204K tokens, 54045 markables) - fiction from the Project Gutenberg collection (147 documents, 173014 tokens, 45622 markables) - in addition, 2 extra docs are included (the Monty Python parrot sketch and one fiction produced by a Phrase Detectives player) (2 documents, 7294 tokens, 2126 markables) A document is considered `complete' is all its markables have at least 8 annotations, and each interpretation has been validated by at least 4 players. All annotations and validations are distributed, included those from recognized cheaters. On average, 20.6 annotations per markable are included. **************************** Markup formats **************************** The documents are provided in three different formats: MAS-XML, CONLL, and CRAC. MAS-XML This is the format used for the original release of the corpus. It is an inline XML format, in which each markable is identified with an element, and anaphoric annotation is stored using elements containing in turn elements specifiying annotations and validations (all annotations and validations are provided) and elements with additional information provided by the players. Details for and are as follows: id = id of the markable visibility = whether the administrator marked this to be "hidden" start_chr = the markable span was edited and the new start character is this end_chr = the markable span was edited and the new end character is this timestamp = time the annotation was created interface = interface the annotation was created on (0 = standalone PD, 1 = Facebook PD) mode = mode of annotation (a = annotation, v = validation, e = expert) agree = was this annotation in agreement with the interpretation annotation_time = time to create the annotation response from page load user_rating = rating of the user user_id = encrypted user_id type = interpretation type (DN = Discourse New, DO = Discourse Old, NR = Non-referring, PR = Property, SE = same entitiy) ante = antecedent markable id favoured = the expert indicates whether this is the best interpretation (y) or another might be better type/type_id = the type of error detected: 1. not_selectable 2. out_of_context_window 3. parse_error 4. discourse_diexis 5. ambiguous 6. non_referring 7. nearest_mention_embedding 8. bridge 9. quantifier CONLL This is the format used in the CONLL 2011 and 2012 Shared Tasks on Coreference. Unlike the MAS-XML format, where all interpretations produced by players are included, only one interpretation is provided for each markable in a doc in this format. For the docs in the silver subset, this is the silver interpretation. For the docs in the gold subset, both the gold and the silver interpretations are included, in separate versions of each document. CRAC This is the format used in the CRAC 2018 Shared Task on Anaphora Resolution with the ARRAU corpus. As in the case of the CONLL format, only one interpretation is provided for each markable. **************************** Annotation scheme **************************** Markables are annotated according to a scheme including the following labels: - NR: non-referring. Used for expletives, as in '[It] rains'. - PR: used for predicatives NPs, as in 'John is [a policeman]' - DN: Used for discourse-new mentions. - DO: Used for discourse-old mentions. Split-antecedent plurals are marked. **************************** Folder structure **************************** The gold and silver subsets are contained in subfolders called 'gold' and 'silver', respectively. In turn, the 'gold' folder consists of a single subfolder, 'test', whereas the 'silver' subfolder contains two subfolders, 'dev' and 'train'. The 'test' subfolder contains two subfolders: gutenberg and wikipedia. Each of these two subfolders contains six subfolders: CONLL (gold anno in CONLL format) CRAC (gold anno in CRAC format), masxml (gold anno in the same MASXML format as PD1.0, with all annotations and validations; .dtd provided in subfolder dtd), silver_CONLL (silver labels in CONLL format), silver_CRAC (silver labels in CRAC format), and source (original documents, see below). The 'dev' subfolder contains two subfolders, 'gutenberg' and 'wikipedia'. The 'train' subfolder contains three subfolders: 'gutenberg', 'other', and 'wikipedia'. All these five subfolders contain four subfolders: CONLL, CRAC, masxml contain silver label in the format specified source contains the original sources. **************************** File naming conventions **************************** The source files included in this release have the following types: gutenberg and 'other': .txt (with 'noheader' suffix) Wikipedia: .htm The affixes have the following meaning: -game suffix: docs exported. **************************** Papers **************************** Full documentation and preliminary analysis of the data has been published in a submitted paper. Earlier papers on the corpus include: Chamberlain, J., and Poesio, M., and Kruschwitz, U. (2016). Phrase Detectives Corpus 1.0: Crowdsourced Anaphoric Coreference. In Proceedings of the 2016 Language Resources and Evaluation Conference (LREC'16). Poesio, M., Chamberlain, J., Kruschwitz, U., Robaldo, L., and Ducceschi, L. (2013). Phrase Detectives: Utilizing collective intelligence for internet-scale language resource creation. ACM Transactions on Interactive Intelligent Systems. **************************** Contact **************************** Please report any errors or issues to jchamb@essex.ac.uk **************************** Change Log **************************** January 2019 :: Version 2.1 created.