TAC KBP English Regular Slot Filling Comprehensive Training and Evaluation Data 2009-2014

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of the TAC KBP Slot Filling evaluation track conducted from 2009 to 2014.

The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base.

The regular English Slot Filling (SF) evaluation track involves mining information about entities from text. SF can be viewed as a more traditional Information Extraction (IE) task or, alternatively, as a Question Answering (QA) task in which the questions are static but the targets change. In completing the task, participating systems and LDC annotators searched a corpus for information on certain attributes (slots) of person (PER) and organization (ORG) entities and attempted to return all valid answers (slot fillers) in the source collection. For more information about English Slot Filling, please refer to the 2014 track home page (2014 was the last year in which the regular Slot Filling evaluation was conducted) at http://www.nist.gov/tac.

This package contains all evaluation and training data developed in support of TAC KBP Slot Filling during the six years in which the track was conducted, 2009-2014. This includes queries, the 'manual runs' (human-produced responses to the queries), and the final rounds of assessment results. The corresponding source document collections for this release are included in LDC2018T03: TAC KBP Comprehensive English Source Corpora 2009-2014. The corresponding Knowledge Base (KB) for much of the data - a 2008 snapshot of Wikipedia - can be obtained via LDC2014T16: TAC KBP Reference Knowledge Base.
The data included in this package were originally released by LDC to TAC KBP coordinators and performers under the following ecorpora catalog IDs and titles:

  LDC2009E56:  TAC KBP 2009 Evaluation Generic Infoboxes V2.0
  LDC2009E65:  TAC KBP 2009 Evaluation Slot Filling List
  LDC2009E90:  TAC KBP 2009 Assessment Results
  LDC2009E110: TAC KBP 2009 Evaluation NIL Link Assessment
  LDC2010E18:  TAC 2010 KBP Training Slot Filling Annotation V2.1
  LDC2010E24:  TAC 2010 KBP Generic Infoboxes
  LDC2010E61:  TAC 2010 KBP Assessment Results V1.2
  LDC2010E32:  TAC 2010 KBP Evaluation Slot Filling Annotation
  LDC2011E48:  TAC 2011 KBP English Training Regular Slot Filling Annotation
  LDC2011E88:  TAC 2011 KBP English Regular Slot Filling Assessment Results V1.2
  LDC2011E89:  TAC 2011 KBP English Evaluation Regular Slot Filling Annotation V1.2
  LDC2012E91:  TAC 2012 KBP English Regular Slot Filling Evaluation Annotations V1.1
  LDC2012E115: TAC 2012 KBP English Regular Slot Filling Assessment Results V1.2
  LDC2013E60:  TAC 2013 KBP English Regular Slot Filling per:title Training Data
  LDC2013E77:  TAC 2013 KBP English Regular Slot Filling Evaluation Queries and Annotations V1.1
  LDC2013E91:  TAC 2013 KBP English Regular Slot Filling Evaluation Assessment Results V1.1
  LDC2014E66:  TAC 2014 KBP English Regular Slot Filling Evaluation Queries and Annotations V1.1
  LDC2014E75:  TAC 2014 KBP English Regular Slot Filling Evaluation Assessment Results V2.0
  LDC2015E46:  TAC KBP English Regular Slot Filling - Comprehensive Training and Evaluation Data 2009-2014

Summaries of data included in this package (for more details see ./data/*/contents.txt):

Query Data:
+------+------------+-----+-----+-----+-------+
| year | set        | PER | ORG | GPE | total |
+------+------------+-----+-----+-----+-------+
| 2009 | evaluation |  17 |  31 |   5 |    53 |
| 2010 | training   |  42 |  56 | n/a |    98 |
| 2010 | evaluation |  50 |  50 | n/a |   100 |
| 2011 | training   |  92 | 106 | n/a |   198 |
| 2011 | evaluation |  50 |  50 | n/a |   100 |
| 2012 | evaluation |  40 |  40 | n/a |    80 |
| 2013 | evaluation |  50 |  50 | n/a |   100 |
| 2014 | evaluation |  50 |  50 | n/a |   100 |
+------+------------+-----+-----+-----+-------+

Manual Run Data:
+------+------------+------------------+
| year | set        | manual responses |
+------+------------+------------------+
| 2010 | training   |              336 |
| 2010 | evaluation |              799 |
| 2011 | training   |            1,627 |
| 2011 | evaluation |              796 |
| 2012 | evaluation |            1,553 |
| 2013 | evaluation |            2,383 |
| 2014 | evaluation |            2,216 |
+------+------------+------------------+

Assessment Data:
+------+------------+-----------+
| eval |            | assessed  |
| year | set        | responses |
+------+------------+-----------+
| 2009 | evaluation |    10,416 |
| 2010 | evaluation |    24,515 |
| 2011 | evaluation |    28,041 |
| 2012 | evaluation |    22,885 |
| 2013 | evaluation |    27,655 |
| 2013 | training   |     4,660 |
| 2014 | evaluation |    21,956 |
+------+------------+-----------+

2. Contents

./README.txt
  This file.

./data/20*/contents.txt
  The data in this package are organized by the year of original release in order to clarify dependencies, highlight occasional differences in formats from one year to another, and to increase readability in documentation. The contents.txt file within each year's root directory provides a list of the contents for all subdirectories as well as specific details about file formats and contents.
./dtd/sf_queries_2009-2010-2011.dtd
  The dtd against which to validate these files:
    ./data/2009/eval/tac_kbp_2009_regular_sf_evaluation_queries.xml
    ./data/2010/eval/tac_kbp_2010_regular_sf_evaluation_queries.xml
    ./data/2010/training/tac_kbp_2009_regular_sf_evaluation_queries.xml
    ./data/2010/training/tac_kbp_2010_regular_sf_training_queries.xml
    ./data/2011/eval/tac_kbp_2011_regular_sf_evaluation_queries.xml

./dtd/sf_queries_2012-2013.dtd
  The dtd against which to validate these files:
    ./data/2012/eval/tac_kbp_2012_regular_sf_evaluation_queries.xml
    ./data/2013/eval/tac_kbp_2013_regular_sf_evaluation_queries.xml

./dtd/sf_queries_2014.dtd
  The dtd against which to validate this file:
    ./data/2014/eval/tac_kbp_2014_regular_sf_evaluation_queries.xml
  (For one way to check the query files against their DTDs, see the sketch at the end of this Contents section.)

./docs/all_files.md5
  Paths (relative to the root of the corpus) and md5 checksums for all files included in the package.

./docs/guidelines/*/*.pdf
  The guidelines used by annotators in developing slot filling queries, manual run annotation, and assessment data contained in this corpus.

./docs/task_descriptions/KBP2009-TaskDefinition-0218.pdf
./docs/task_descriptions/KBP2010_TaskDefinition_Aug31.pdf
./docs/task_descriptions/KBP2011_TaskDefinition.pdf
./docs/task_descriptions/KBP2012_TaskDefinition_1.1.pdf
  Task descriptions for the respective years covering all of the TAC KBP tracks, written by evaluation track coordinators. Note that these documents also describe tasks not relevant to this specific package.

./docs/task_descriptions/KBP2013_TaskDefinition_EnglishSlotFilling_1.1.pdf
./docs/task_descriptions/KBP2014_TaskDefinition_EnglishSlotFilling_1.1.pdf
  Task descriptions for the 2013 and 2014 English Regular Slot Filling evaluation tracks, written by track coordinators.

./tools/scorers/KBP20*_English_SF_slot-list.txt
  Slot list files to be used with the 2013 and 2014 scorers, respectively.

./tools/scorers/SFScore20*.java
  Scorers for regular slot filling files for 2009-2014, respectively, as provided to LDC by evaluation track coordinators, with no further testing.

./tools/validators/check_kbp_20*_slot-filling.pl
  Validators for regular slot filling files for 2009-2014, respectively, as provided to LDC by evaluation track coordinators, with no further testing.
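As a convenience, the following is a minimal, unofficial sketch of how the queries.xml files listed above might be checked against the DTDs in ./dtd/. It is not part of the released tooling; it assumes Python 3 with the third-party lxml package installed and is run from the root of the corpus. The three pairings shown follow the listing above; the remaining query files pair with their DTDs in the same way.

  # validate_queries.py -- unofficial sketch: check query files against the
  # DTDs shipped in ./dtd/. Assumes Python 3 and the third-party lxml package
  # (pip install lxml), run from the root of the corpus.
  from lxml import etree

  PAIRS = [
      # (DTD, queries.xml) pairings taken from the Contents listing above
      ("./dtd/sf_queries_2009-2010-2011.dtd",
       "./data/2011/eval/tac_kbp_2011_regular_sf_evaluation_queries.xml"),
      ("./dtd/sf_queries_2012-2013.dtd",
       "./data/2013/eval/tac_kbp_2013_regular_sf_evaluation_queries.xml"),
      ("./dtd/sf_queries_2014.dtd",
       "./data/2014/eval/tac_kbp_2014_regular_sf_evaluation_queries.xml"),
  ]

  for dtd_path, xml_path in PAIRS:
      dtd = etree.DTD(dtd_path)        # parse the DTD
      doc = etree.parse(xml_path)      # parse the queries file
      if dtd.validate(doc):
          print("OK       " + xml_path)
      else:
          print("INVALID  " + xml_path)
          for error in dtd.error_log.filter_from_errors():
              print("  " + str(error))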
3. Annotation Tasks

The tasks conducted by LDC annotators in support of regular SF included entity selection/query development, manual run development, slot mapping, and assessment of system- and human-produced responses to queries. Each of these subtasks is explained below.

3.1 Query Development

Entities, which are the basis of SF queries, were selected based primarily on their level of non-confusability and productivity. A candidate query entity was considered non-confusable if there were one or more references to it in the source corpus that were "canonical", meaning that they were not an alias and, for persons, included more than just a first or last name. Productivity for candidate queries was determined by searching the source corpus to find whether it contained at least two slot fillers (i.e. answers) for the entity.

Entities with well-populated Knowledge Base (KB) entries (either in the official TAC KBP KB or in online resources such as Wikipedia) were also generally avoided as query entities. Such entities were dispreferred both to reduce the advantage gained by using online resources and because there was a restriction against returning fillers that were redundant with information already in the official KB.

Linking query entities to the KB was discontinued from SF in 2014, which removed the redundancy restriction on responses (though duplicate responses were still considered incorrect). However, query developers in 2014 were still required to check live Wikipedia when considering potential query entities so as to continue avoiding any for which the online resource already indicated numerous correct responses.

The final set of SF queries for each evaluation was also selected with the goal of an approximately balanced representation of entity types (person, organization, and - in 2009 only - geo-political entity) and of response types for slots (i.e., slots that take named entities as fillers, those that take values (dates and numbers) as fillers, and those that take strings as fillers).

Following initial query development, a quality control pass was conducted to flag any fillers that did not have adequate justification in the source document, or that might be at variance with the guidelines in any way. These flagged fillers were then adjudicated by senior annotators, who updated, removed, or replaced them as appropriate.

3.2 Manual Run Development

LDC developed "manual runs", or human-produced sets of annotated responses for each of the evaluation queries, for all SF evaluation cycles except 2009. For each query, annotators were given up to two hours to search the corpus and locate all valid fillers. Note that, unlike systems, annotators producing the manual runs were instructed to return duplicate fillers from separate source documents if time permitted, in order to provide more training data for systems in the future.

Justification - the minimum extents of provenance supporting the validity of a slot filler - was first added to responses in 2012 in order to pinpoint the sources of assertions and, thereby, reduce the effort required for assessment. Valid justification strings were required to clearly identify all three elements of a relation (i.e. the subject entity, the predicate slot, and the object filler) with minimal extraneous text. In 2013, justification was modified to allow for up to two discontiguous strings selected from as many separate documents, up from one string in 2012. In 2014, justification was again altered to allow for up to four justification strings. This facilitated a greater potential for inferred relations that would be difficult to justify with just a single document.

Following the initial round of annotation for manual runs, a quality control pass was conducted to flag any fillers that did not have adequate justification in the source document, or that might be at variance with the guidelines in any way. These flagged fillers were then adjudicated by senior annotators, who updated or removed them as appropriate.

3.3 Slot Mapping

For the 2009-2013 evaluations, a senior annotator performed a slot-mapping process before assessment in order to indicate how existing attribute labels in the KB for non-NIL query entities mapped to the set of TAC KBP SF slots. This process was necessary because attribute labels for the same type of information varied widely in Wikipedia (the source of the TAC KBP KB) depending on entity type. For example, an actor's birth date might be labeled as 'actor-birth-date' while a golfer's could be indicated by 'date-of-birth-golfer'. During the slot-mapping process, both of these would be linked to the TAC KBP slot 'per:date_of_birth'.
These mappings were then imported into the assessment tool so that they could be coreferenced with responses marked as correct (with respect to the slot definition), thereby indicating that those responses were redundant with the KB.

3.4 Assessment

In assessment, annotators first judged the validity of anonymized human- and system-produced responses returned for the query set and then coreferenced those marked as correct. Fillers were assessed as correct if they were found to be both compatible with the slot descriptions and supported in the text. Fillers were assessed as wrong if they did not meet both of the conditions for correctness, or as inexact if insufficient or extraneous text had been selected for an otherwise correct answer.

For the years in which it was produced, justification was assessed as correct if it succinctly and completely supported the relation, wrong if it did not support the relation at all (or if the corresponding filler was marked wrong), inexact-short if part but not all of the information necessary to support the relation was provided, or inexact-long if it contained all information necessary to support the relation but also a great deal of extraneous text. In 2014, responses with justification comprising more than 600 characters in total were automatically ignored and removed from the pool of responses for assessment.

After first passes of assessment were completed, quality control was performed on the data by senior annotators. During quality control, the text extents of annotated fillers and justifications were checked for correctness, equivalence classes for entities assessed as correct were checked for accuracy, and potentially problematic assessments were either corrected or flagged for additional review.

4. Using the Data

As mentioned in the overview, note that the corresponding source document collections for this release are included in LDC2018T03: TAC KBP Comprehensive English Source Corpora 2009-2014. Also, the corresponding Knowledge Base (KB) for much of the data - a 2008 snapshot of Wikipedia - can be obtained via LDC2014T16: TAC KBP Reference Knowledge Base.

4.1 Text Normalization and Offset Calculation

Text normalization of queries, consisting of a 1-for-1 substitution of newline (0x0A) and tab (0x09) characters with space (0x20) characters, was performed on the document text input to the response field.

The values of the beg and end XML elements in the later queries.xml files indicate character offsets that identify text extents in the source. Offset counting starts from the initial opening angle bracket of the <DOC> element (<doc> in DF sources), which is usually the initial character (character 0) of the source. Note as well that character counting includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included.

Note that although strings included in the annotation files (queries and gold standard mentions) generally match source documents, a few characters are normalized in order to enhance readability: newlines are converted to spaces, except where the preceding character was a hyphen ("-"), in which case the newline was removed, and multiple spaces are collapsed to a single space.
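The following is a minimal, unofficial sketch of how the offsets and normalization described above might be applied when pulling an annotated extent out of a source document. It assumes Python 3; the example file path and the treatment of the end offset as inclusive are assumptions for illustration rather than something this README specifies.

  # offset_check.py -- unofficial sketch: extract a text extent from a source
  # document using beg/end character offsets, then apply the whitespace
  # normalization described in section 4.1. Assumes Python 3; the example
  # path and the end-inclusive reading of the offsets are assumptions.
  import re

  def read_raw(path):
      # Offsets are computed over the file as "raw text", markup included,
      # so read the whole file without stripping any tags.
      with open(path, encoding="utf-8") as f:
          return f.read()

  def normalize(text):
      # Approximation of the normalization described above:
      #   - a newline that follows a hyphen is removed
      #   - any other newline (or tab) becomes a space
      #   - runs of multiple spaces collapse to a single space
      text = re.sub(r"-\n", "-", text)
      text = re.sub(r"[\n\t]", " ", text)
      text = re.sub(r" {2,}", " ", text)
      return text

  def extract(source_path, beg, end):
      # beg/end are 0-based character offsets into the raw file;
      # end is treated here as inclusive.
      return read_raw(source_path)[beg:end + 1]

  if __name__ == "__main__":
      # Hypothetical document and offsets, for illustration only.
      span = extract("example_source_doc.xml", 1525, 1538)
      print(repr(normalize(span)))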
4.2 Proper Ingesting of XML Queries

While the character offsets are calculated based on treating the source document as "raw text", the "name" strings being referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up in a source document file as "AT&amp;T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 8 characters (i.e., the character offsets, when provided, will point to a string of length 8). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&amp;amp;T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&amp;T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data.
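To make the escaping behavior concrete, the following unofficial sketch reads query names with a real XML parser and compares them against the raw source text at the given offsets. The "name", "beg", and "end" elements are described in this README; the "query" and "docid" element names, the source-document location, and the docid-to-filename scheme are assumptions for illustration (the exact per-year schema is given by the DTDs under ./dtd/, and the source documents themselves are released separately in LDC2018T03).

  # parse_queries.py -- unofficial sketch: read query entity names with an
  # XML parser and compare them to the raw source text at the given offsets.
  # <name>, <beg>, and <end> are described in this README; <query>, <docid>,
  # the SOURCE_DIR location, and the docid-to-filename scheme are assumptions.
  import xml.etree.ElementTree as ET

  QUERIES_FILE = "./data/2013/eval/tac_kbp_2013_regular_sf_evaluation_queries.xml"
  SOURCE_DIR = "/path/to/LDC2018T03/data"   # hypothetical location of source docs

  def raw_span(doc_path, beg, end):
      # Offsets count every character of the raw file, markup included
      # (end treated as inclusive; see section 4.1).
      with open(doc_path, encoding="utf-8") as f:
          return f.read()[beg:end + 1]

  tree = ET.parse(QUERIES_FILE)
  for query in tree.getroot().iter("query"):
      # The XML parser un-escapes the doubly-escaped name exactly once, so a
      # <name> stored in queries.xml as "AT&amp;amp;T" comes back here as
      # "AT&amp;T" -- exactly the characters found in the raw source file.
      name = query.findtext("name")
      docid = query.findtext("docid")
      beg = query.findtext("beg")
      end = query.findtext("end")
      if None in (name, docid, beg, end):
          continue  # earlier query sets may not carry offsets
      span = raw_span(SOURCE_DIR + "/" + docid + ".xml", int(beg), int(end))
      status = "match" if span == name else "MISMATCH"
      print(docid + "\t" + status + "\t" + name)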
5. Acknowledgments

This material is based on research sponsored by Air Force Research Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:
  Dave Graff (LDC)
  Heather Simpson (LDC)
  Robert Parker (LDC)
  Neil Kuster (LDC)
  Hoa Dang (NIST)
  Heng Ji (RPI)
  Ralph Grishman (NYU)
  James Mayfield (JHU)
  Mihai Surdeanu (UA)
  Paul McNamee (JHU)
  Boyan Onyshkevych (DARPA)

6. References

Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014.
  Overview of Linguistic Resources for the TAC KBP 2014 Evaluations: Planning, Execution, and Results
  https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-2014-overview.pdf
  TAC KBP 2014 Workshop: National Institute of Standards and Technology, Gaithersburg, Maryland, November 17-18

Joe Ellis, Jeremy Getman, Justin Mott, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, Jonathan Wright. 2013.
  Linguistic Resources for 2013 Knowledge Base Population Evaluations
  https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-workshop2013-linguistic-resources-kbp-eval.pdf
  TAC KBP 2013 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 18-19

Joe Ellis, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, Jonathan Wright. 2012.
  Linguistic Resources for 2012 Knowledge Base Population Evaluations
  https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-workshop2012-linguistic-resources-kbp-eval.pdf
  TAC KBP 2012 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 5-6

Xuansong Li, Joe Ellis, Kira Griffitt, Stephanie Strassel, Robert Parker, Jonathan Wright. 2011.
  Linguistic Resources for 2011 Knowledge Base Population Evaluation
  https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tac2011-linguistic-resources-kbp.pdf
  TAC 2011: Proceedings of the Fourth Text Analysis Conference, Gaithersburg, Maryland, November 14-15

Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, Joe Ellis. 2010.
  Overview of the TAC 2010 Knowledge Base Population Track
  TAC 2010 Workshop: Proceedings of the Third Text Analysis Conference, Gaithersburg, MD, November 15-16

P. McNamee, H.T. Dang. 2009.
  Overview of the TAC 2009 Knowledge Base Population Track
  TAC 2009: Proceedings of the Second Text Analysis Conference, Gaithersburg, MD, November 16-17

7. Copyright Information

(c) 2018 Trustees of the University of Pennsylvania

8. Contact Information

For further information about this data release, or the TAC KBP project, contact the following project staff at LDC:

  Joe Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

------------------------------------------------------------------------
README created by Neil Kuster on January 25, 2016
  updated by Neil Kuster on March 28, 2016
  updated by Joe Ellis on April 21, 2016
  updated by Neil Kuster on September 14, 2016
  updated by Joe Ellis on September 19, 2016
  updated by Joe Ellis on January 3, 2017
  updated by Joe Ellis on February 15, 2017
  updated by Jeremy Getman on September 27, 2018