TAC KBP English Entity Linking Comprehensive Training and Evaluation Data 2009-2013 Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel 1. Overview This package contains training and evaluation data produced in support of the TAC KBP English Entity Linking evaluation track from 2009 to 2013. Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base and extract novel information about entities from a document collection and add it to a new or existing knowledge base. The goal of the English Entity Linking track (EL) is to measure systems' ability to determine whether an entity, specified by a query, has a matching node in a reference knowledge base (KB) and, if so, to create a link between the two. If there is no matching node for a query entity in the KB, EL systems are required to cluster the mention together with others referencing the same entity. More information about the TAC KBP Entity Linking track and other TAC KBP evaluations can be found on the NIST TAC website, http://www.nist.gov/tac/. This package contains all evaluation and training data developed in support of EL during the five years in which the evaluation track was conducted (2009-2013). This includes queries and gold standard entity type information, KB links, and equivalence class clusters for NIL entities (those for which there was no matching node in the KB). Source documents for the queries are also included in this corpus. The corresponding KB is available as LDC2014T16: TAC KBP Reference Knowledge Base. Also included in this package are the results of an Entity Linking IAA study conducted in 2010. The data included in this package were originally released to TAC KBP as: LDC2009E64: TAC KBP 2009 Evaluation Entity Linking List LDC2009E86: TAC KBP 2009 Gold Standard Entity Linking Entity Type List LDC2010E31: TAC 2010 KBP Training Entity Linking V2.0 LDC2010E82: TAC 2010 KBP Evaluation Entity Linking Gold Standard V1.0 LDC2012E31: TAC 2010 KBP Entity Linking IAA Study Results LDC2012E29: TAC 2011 KBP English Evaluation Entity Linking Annotation LDC2012E102: TAC 2012 KBP English Entity Linking Evaluation Annotations V1.1 LDC2013E90: TAC 2013 KBP English Entity Linking Evaluation Queries and Knowledge Base Links V1.1 LDC2015E19: TAC KBP English Entity Linking Comprehensive Training and Evaluation Data Summary of data included in this package (for more details see /docs/tac_kbp_2009-2013_english_entity_linking_query_distribution_table.tsv): +----------+------------------+---------+ | Year | Source Documents | Queries | +----------+------------------+---------+ | 2009 | 3688 | 3904 | | 2010 | 3684 | 3750 | | 2010 IAA | 200 | 600 | | 2011 | 2231 | 2250 | | 2012 | 2016 | 2226 | | 2013 | 1820 | 2190 | +----------+------------------+---------+ 2. Contents ./README.txt This file ./data/{2009,2010,2011,2012,2013}/contents.txt The data in this package are organized by the year of original release in order to clarify dependencies, highlight differences in formats from one year to another, and to increase readability in documentation. The contents.txt file within each year's root directory provides a list of the contents for all subdirectories as well as details about file formats and contents. ./docs/all_files.md5 Paths (relative to the root of the corpus) and md5 checksums for all files included in the package. ./docs/tac_kbp_2009-2013_english_entity_linking_query_distribution_table.tsv Tab-delimited table containing the query distribution quantities for all years and datasets, further broken down by language, source type, KB-Link, and entity type. ./docs/guidelines/2009-2010/* The guidelines used by annotators in developing the 2009 and 2010 Entity Linking queries and gold standard data contained in this corpus. ./docs/guidelines/2011/* The guidelines used by annotators in developing the 2011 Entity Linking queries and gold standard data contained in this corpus. ./docs/guidelines/2012/TAC_KBP_2012_Entity_Selection_V1.1.pdf The guidelines used by annotators in developing the 2012 Entity Linking queries and gold standard data contained in this corpus. ./docs/guidelines/2013/TAC_KBP_2013_EL_Query_Development_Guidelines_V1.0.pdf The guidelines used by annotators in developing the 2013 Entity Linking queries and gold standard data contained in this corpus. ./docs/task_descriptions/090601-KBPTaskGuidelines.pdf Task Description for all of the 2009 TAC KBP tracks, written by track coordinators. ./docs/task_descriptions/KBP2010_TaskDefinition_Aug31.pdf Task Description for all of the 2010 TAC KBP tracks, written by track coordinators. ./docs/task_descriptions/KBP2011_TaskDefinition.pdf Task Description for all of the 2011 TAC KBP tracks, written by track coordinators. ./docs/task_descriptions/KBP2012_TaskDefinition_1.1.pdf Task Description for all of the 2012 TAC KBP tracks, written by track coordinators. ./docs/task_descriptions/KBP2013_EntityLinkingTaskDescription_1.0.pdf Task Description for the 2013 Entity Linking evaluation tracks, written by track coordinators. ./dtd/el_queries_2009-2011.dtd DTD for: ./data/2009/eval/tac_kbp_2009_english_entity_linking_evaluation_queries.xml ./data/2010/eval/tac_kbp_2010_english_entity_linking_evaluation_queries.xml ./data/2011/eval/tac_kbp_2011_english_entity_linking_evaluation_queries.xml ./dtd/el_queries_2010_training.dtd DTD for: ./data/2010/IAA_study_results/tac_kbp_2010_english_entity_linking_IAA_queries_ann1.xml ./data/2010/IAA_study_results/tac_kbp_2010_english_entity_linking_IAA_queries_ann2.xml ./data/2010/IAA_study_results/tac_kbp_2010_english_entity_linking_IAA_queries_ann3.xml ./data/2010/training/tac_kbp_2010_english_entity_linking_training_queries.xml Note: The DTD for 2010 training data and IAA data is slightly different from the evaluation DTD because the queries files contained the KB link for ease of use. ./dtd/el_queries_2012-2013.dtd DTD for: ./data/2012/eval/tac_kbp_2012_english_entity_linking_evaluation_queries.xml ./data/2013/eval/tac_kbp_2013_english_entity_linking_evaluation_queries.xml ./tools/check_kbp2009_entity-linking.pl Validator for 2009 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing. ./tools/check_kbp_2010_entity-linking.pl Validator for 2010 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing. ./tools/check_kbp_2011_english-entity-linking.pl Validator for 2011 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing. ./tools/check_kbp_2012_english-entity-linking.pl Validator for 2012 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing. ./tools/check_kbp_2013_english-entity-linking.pl Validator for 2013 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing. ./tools/el_scorer_2009-2010_kbpenteval.pl Scorer for 2009 and 2010 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing. ./tools/el_scorer_2011-2012.py Scorer for 2011 and 2012 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing. ./tools/el_scorer_2013.py Scorer for 2013 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing. 3. Query Development Annotation and Quality Control Query development for Entity Linking (EL) begins with Entity Selection, which has three stages: Namestring Annotation, KB Linking, and NIL Coreference (where a NIL entity is an entity without a node in the KB). Annotators searched the corpus for entities that would make suitable queries, using an interface created by LDC for this task. Each set of queries was roughly balanced across entity type, KB-link status (NIL versus non-NIL), and source document genre. In Namestring Annotation, annotators search for and select named mentions of entities in text. Annotators focused on creating queries using confusable named entity mentions. Confusability was measured both by the number of distinct entities in the full query set referred to by the same name string (polysemy) as well as the number of distinct entities in the set that were referred to by multiple, unique named mentions (synonymy). For example, the string "Smith" would make a polysemous query because an annotator could probably find it in the corpus referring to different entities, while "Barack Obama" would make a synonymous query because the entity is also referred to in the corpus as "B. Hussein Obama" or "Bam Bam". In KB Linking, annotators search the KB and indicate whether or not it includes pages on the entities they selected during Namestring Annotation. Annotators created a link between the query and the matching KB node ID. If no matching node was found, the query was marked as NIL and later coreferenced with other NIL entities. Annotators were allowed to use online searching to assist in determining the KB link/NIL status. Queries for which an annotator could not confidently determine the KB link status were removed from the final data sets. For NIL Coreference, selected entities that were not included in the KB (i.e., NIL entities) were grouped into equivalence classes by annotators. Mentions referring to the same entity were grouped into one equivalence class. Senior annotators conducted quality control on queries to correct errors and identify areas of difficulty to use in improving guidelines and annotator training. Annotators performing quality control made sure that the extent of each selected namestring was correct and checked that each entity was linked to the correct KB node or was properly identified as NIL and coreferenced correctly. 4. Source Documents The source data contained in this release comprises all documents from which queries were drawn and is the complete data set used in the English EL evaluations. The source data was drawn from existing LDC holdings, with no additional validation. An overall scan of character content in the source collections indicates some relatively small quantities of various problems, especially in the web and discussion forum data, including language mismatch (characters from Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding errors (some documents have apparently undergone "double encoding" into UTF-8, and others may have been "noisy" to begin with, or may have gone through an improper encoding conversion, yielding occurrences of the Unicode "replacement character" (U+FFFD) throughout the corpus); the web collection also has characters whose Unicode code points lie outside the "Basic Multilanguage Plane" (BMP), i.e. above U+FFFF. All source documents were originally released as XML but have been converted to text files for this release. This change was made primarily because the documents were used as text files during data development but also because some fail XML parsing. All documents that have filenames beginning with "eng-NG" are Web Document data (WB) and some of these fail XML parsing (see below for details). All files that start with "bolt-" are Discussion Forum threads (DF) and have the XML structure described below. All other files are Newswire data (NW) and have the newswire markup pattern detailed below. Note as well that some source documents are duplicated across a few of the separated source_documents directories, indicating that some queries from different data sets originated from the same source documents. As it is acceptable for source to be reused for Entity Linking queries, this duplication is intentional and expected. The subsections below go into more detail regarding the markup and other properties of the three source data types: 4.1 Newswire Data Newswire data use the following markup framework: ... ...

...

...
where the HEADLINE and DATELINE tags are optional (not always present), and the TEXT content may or may not include "

...

" tags (depending on whether or not the "doc_type_label" is "story"). All the newswire files, if converted back to XML, are parseable. 4.2 Discussion Forum Data Discussion forum files use the following markup framework: ... ... ... ... ... where there may be arbitrarily deep nesting of quote elements, and other elements may be present (e.g. "..." anchor tags). As mentioned in section 2 above, each unit contains at least five post elements. All the discussion forum files, if converted back to XML, are parseable. 4.3 Web Document Data "Web" files use the following markup framework: {doc_id_string} ... ... ... ... ... ... Other kinds of tags may be present ("", "", etc). Some of the web source documents contain material that interferes with XML parsing (e.g. unescaped "&", or "" tags that lack a corresponding ""). 5. Using the Data 5.1 Offset calculation The values of the beg and end XML elements in the later queries.xml files indicate character offsets to identify text extents in the source. Offset counting starts from the initial character (character 0) of the source document and includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included. 5.2 Proper ingesting of XML queries While the character offsets are calculated based on treating the source document as "raw text", the "name" strings being referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up a source document file as "AT&T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 7 characters (i.e., the character offsets, when provided, will point to a string of length 7). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data. 6. Acknowledgements This material is based on research sponsored by Air Force Research Laboratory and Defense Advance Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authoized to reporoduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government. The authors acknowledge the following contributors to this data set: Dana Fore (LDC) Dave Graff (LDC) Heather Simpson (LDC) Robert Parker (LDC) Heng Ji (RPI) Hoa Dang (NIST) Ralph Grishman (NYU) Paul McNamee (JHU) Javier Artiles (Slice Technologies) Boyan Onyshkevych (DARPA) 7. References Joe Ellis, Jeremy Getman, Justin Mott, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, Jonathan Wright. 2013 Linguistic Resources for 2013 Knowledge Base Population Evaluations https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-workshop2013-linguistic-resources-kbp-eval.pdf TAC KBP 2013 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 18-19 Joe Ellis, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, Jonathan Wright. 2012 Linguistic Resources for 2012 Knowledge Base Population Evaluations https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-workshop2012-linguistic-resources-kbp-eval.pdf TAC KBP 2012 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 5-6 Xuansong Li, Joe Ellis, Kira Griffit, Stephanie Strassel, Robert Parker, Jonathan Wright. 2011 Linguistic Resources for 2011 Knowledge Base Population Evaluation TAC 2011: Proceedings of the Fourth Text Analysis Conference, Gaithersburg, Maryland, November 14-15 https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tac2011-linguistic-resources-kbp.pdf Heng Ji, Ralph Grishman, Hoa Trang Dang, Kira Griffitt, Joe Ellis. 2010 Overview of the TAC 2010 Knowledge Base Population Track TAC 2010 Workshop: Proceedings of the Third Text Analysis Conference, Gaithersburg, MD, November 15-16 P. McNamee, H.T. Dang. 2009 Overview of the TAC 2009 Knowledge Base Population Track TAC 2009: Proceedings of the Second Text Analysis Conference Gaithersburg, MD, November 16-17 8. Copyright Information (c) 2016 Trustees of the University of Pennsylvania 9. Contact Information For further information about this data release, contact the following project staff at LDC: Joseph Ellis, Project Manager Jeremy Getman, Lead Annotator Stephanie Strassel, PI -------------------------------------------------------------------------- README created by Dana Fore on December 15, 2015 updated by Dana Fore on February 5, 2016 updated by Dana Fore on March 18, 2016 updated by Joe Ellis on April 22, 2016