TAC KBP Entity Discovery and Linking Comprehensive Evaluation Data 2016-2017

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of the
TAC KBP Entity Discovery and Linking evaluation track in 2016 and 2017.

The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base, extract novel
information about entities from a document collection, and add it to a new or
existing knowledge base.

The goal of the Entity Discovery and Linking (EDL) track is to conduct
end-to-end entity extraction, linking, and clustering. Given a document
collection, an EDL system is required to automatically extract (identify and
classify) entity mentions (queries), link them to nodes in a reference
Knowledge Base (KB), and cluster NIL mentions (those that do not have
corresponding KB entries) into equivalence classes. More information about
the TAC KBP Entity Discovery and Linking track and other TAC KBP evaluations
can be found on the NIST TAC website, http://www.nist.gov/tac/.

This package contains all of the evaluation and training data developed in
support of TAC KBP Entity Discovery & Linking from 2016 to 2017. This
includes queries, KB links, equivalence class clusters for NIL entities
(those that could not be linked to an entity in the knowledge base), and
entity type information for each of the queries.
The EDL reference KB, to which EDL data are linked, is available separately
in LDC2016TX5 TAC KBP Entity Discovery and Linking Comprehensive Training and
Evaluation Data 2014-2015. Source documents referenced by the files in this
package are available separately in LDC2018TXX TAC KBP Evaluation Source
Corpora 2016-2017.

Note: The 2016/2017 EDL scorer can be obtained at
http://nlp.cs.rpi.edu/kbp/2017/scoring.html

The data included in this package were originally released by LDC to TAC KBP
coordinators and performers under the following eCorpora catalog IDs and
titles:

LDC2016E68: TAC KBP 2016 Entity Discovery and Linking Evaluation Gold
            Standard Entity Mentions and Knowledge Base Links
LDC2017E52: TAC KBP 2017 Entity Discovery and Linking Evaluation Gold
            Standard Entity Mentions and Knowledge Base Links

Summary of data included in this package:

+------+----------+----------+
| Year | Task     | Mentions |
+------+----------+----------+
| 2016 | eval     |    24373 |
| 2017 | eval     |    25040 |
+------+----------+----------+

2. Contents

./data/2016/tac_kbp_2016_edl_evaluation_gold_standard_entity_mentions.tab

This file contains 24,373 gold standard responses. Each response consists of
the following fields:

Column 1: system run ID (always "LDC" in these data)
Column 2: mention (query) ID: unique for each entity name mention and in the
          format of 'EDL16_EVAL_XXXXX'
Column 3: mention string: the full string of the query entity mention
Column 4: (document ID):(mention head start offset)-(mention head end
          offset): an ID for a document in the source corpus from which the
          mention head was extracted, the starting offset of the mention
          head, and the ending offset of the mention head (e.g.
          AFP_ENG_20080610.0052:244-252). These are character offsets (not
          byte offsets); the first character in a document is at offset 0.
Column 5: Knowledge Base (KB) link ID or NIL cluster ID: If the ID begins
          with "m", the text refers to an entity in the KB (e.g. 'm.09b6zr').
          If the given query is not linked to an entity in the KB, then it is
          given a NIL ID, which consists of "NIL" plus a five-digit,
          zero-padded integer (e.g. 'NIL00001', 'NIL00002'). Entity mentions
          pointing to equivalent referents are indicated by shared KB link
          IDs or NIL cluster IDs; otherwise, all the IDs are distinct from
          one another.
Column 6: entity type: {GPE, ORG, PER, LOC, FAC} type indicator for the
          entity
Column 7: mention type: {NAM, NOM} type indicator for the entity mention
Column 8: confidence value: always "1.0" in LDC responses
Column 9: web-search: (Y/N) indicating whether the annotator made use of web
          searches in order to make the linking judgment
Column 10: wiki text: (Y/N) indicating whether the annotator made use of the
          wiki text in the knowledge base (as opposed to just the infobox
          information) in order to make the linking judgment
Column 11: unknown: (Y/N) indicating whether the source document contained
          insufficient information about the query entity to determine
          whether or not it existed in the KB. Note that for entity mentions
          with a KB ID, this value is always 'N'. NIL entities can have
          either 'Y' or 'N' in this column.

./data/2017/tac_kbp_2017_edl_evaluation_gold_standard_entity_mentions.tab

This file contains 25,040 gold standard responses. Each response consists of
the following fields:

Column 1: system run ID (always "LDC" in these data)
Column 2: mention (query) ID: unique for each entity name mention and in the
          format of 'EDL17_EVAL_XXXXX'
Column 3: mention string: the full string of the query entity mention
Column 4: (document ID):(mention head start offset)-(mention head end
          offset): an ID for a document in the source corpus from which the
          mention head was extracted, the starting offset of the mention
          head, and the ending offset of the mention head (e.g.
          AFP_ENG_20080610.0052:244-252). These are character offsets (not
          byte offsets); the first character in a document is at offset 0.
Column 5: Knowledge Base (KB) link ID or NIL cluster ID: If the ID begins
          with "m", the text refers to an entity in the KB (e.g. 'm.09b6zr').
          If the given query is not linked to an entity in the KB, then it is
          given a NIL ID, which consists of "NIL" plus a five-digit,
          zero-padded integer (e.g. 'NIL00001', 'NIL00002'). Entity mentions
          pointing to equivalent referents are indicated by shared KB link
          IDs or NIL cluster IDs; otherwise, all the IDs are distinct from
          one another.
Column 6: entity type: {GPE, ORG, PER, LOC, FAC} type indicator for the
          entity
Column 7: mention type: {NAM, NOM} type indicator for the entity mention
Column 8: confidence value: always "1.0" in LDC responses
Column 9: web-search: (Y/N) indicating whether the annotator made use of web
          searches in order to make the linking judgment
Column 10: wiki text: (Y/N) indicating whether the annotator made use of the
          wiki text in the knowledge base (as opposed to just the infobox
          information) in order to make the linking judgment
Column 11: unknown: (Y/N) indicating whether the source document contained
          insufficient information about the query entity to determine
          whether or not it existed in the KB. Note that for entity mentions
          with a KB ID, this value is always 'N'. NIL entities can have
          either 'Y' or 'N' in this column.

./docs/TAC_KBP_2016_EDL_Guidelines_V1.1.pdf

The version of the Entity Discovery & Linking annotation guidelines used by
annotators in 2016.

./docs/TAC_KBP_2017_EDL_Guidelines_V1.0.pdf

The version of the Entity Discovery & Linking annotation guidelines used by
annotators in 2017.

./docs/TAC_KBP_2016_2017_Entity_Discovery_and_Linking_Task_Description.pdf

Task Description for the 2016 and 2017 Entity Discovery and Linking
evaluation tracks, written by track coordinators. Note that the same task
description was used for both years of the task.
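Each response in the .tab files described above is a single tab-separated
line. As a concrete illustration, the following sketch parses one line into a
named tuple. The sample row, the Mention class, and the pairing of the string
"Obama" with the KB ID 'm.09b6zr' are fabricated for illustration; only the
column layout follows the descriptions above.

```python
# Sketch of a parser for the 11-column gold standard .tab files.
# The sample row is fabricated; only the column layout follows the README.
from typing import NamedTuple

class Mention(NamedTuple):
    run_id: str        # Column 1: always "LDC" in these data
    query_id: str      # Column 2: e.g. EDL17_EVAL_XXXXX
    mention: str       # Column 3: full mention string
    doc_id: str        # from Column 4: source document ID
    beg: int           # from Column 4: offset of first character of head
    end: int           # from Column 4: offset of last character of head
    kb_or_nil_id: str  # Column 5: 'm.*' KB ID or 'NIL' + 5 digits
    entity_type: str   # Column 6: GPE, ORG, PER, LOC, or FAC
    mention_type: str  # Column 7: NAM or NOM
    confidence: float  # Column 8: always 1.0 in LDC responses
    web_search: str    # Column 9: Y/N
    wiki_text: str     # Column 10: Y/N
    unknown: str       # Column 11: Y/N

def parse_line(line: str) -> Mention:
    cols = line.rstrip("\n").split("\t")
    # Column 4 packs document ID and head offsets as docid:beg-end;
    # split on the last ':' in case a document ID contains one.
    doc_id, span = cols[3].rsplit(":", 1)
    beg, end = (int(x) for x in span.split("-"))
    return Mention(cols[0], cols[1], cols[2], doc_id, beg, end,
                   cols[4], cols[5], cols[6], float(cols[7]),
                   cols[8], cols[9], cols[10])

# Hypothetical example row (not taken from the corpus):
row = ("LDC\tEDL17_EVAL_00001\tObama\tAFP_ENG_20080610.0052:244-248"
       "\tm.09b6zr\tPER\tNAM\t1.0\tN\tY\tN")
m = parse_line(row)
print(m.doc_id, m.beg, m.end, m.kb_or_nil_id.startswith("NIL"))
```

Given the raw text of the corresponding source document, the mention head can
then be recovered with a slice such as text[m.beg:m.end + 1], on the
assumption that the end offset is inclusive (the example offsets 244-252
above span nine characters).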
3. Annotation Tasks

Once a set of source documents is identified, two separate annotation tasks
are conducted to produce gold standard data for Entity Discovery & Linking
(EDL). In the first, annotators exhaustively extract and cluster entity
mentions from a single document and then attempt to link each entity cluster
to a node in a reference knowledge base (KB). In the second task, annotators
perform cross-document coreference on within-document entity clusters that
could not be linked to the KB.

3.1 Entity Discovery & Linking

There are two distinct phases to the first annotation task used to develop
gold standard EDL data: Entity Discovery (ED) and Entity Linking (EL).

ED involves exhaustive extraction and clustering of valid entity mentions
from a single source document. In this part of the task, annotators review a
single document and select text extents to indicate valid entity mentions (in
more recent versions of the task, labels were also applied to indicate
mention type). Every time the first mention of a new entity is selected,
annotators also create a new entity cluster, a "bucket" into which all
subsequent mentions of the entity are collected and to which an entity type
label is applied. Thus, within-document coreference of entities is performed
concurrently with mention selection.

Having selected, clustered, and labeled all valid entity mentions, annotators
proceed to EL. In this part of the task, annotators use a specialized search
engine to review the contents of the reference KB, looking for entries whose
primary subjects match the referents of the entity clusters they just
created. When a matching node is found, its ID is associated with all
mentions in the entity cluster, thus establishing a link. If no matching
entry in the KB is found, but annotators feel the source document includes
sufficient information to identify the entity referent, the entity cluster is
marked as NIL.
If annotators do not feel that the source document includes sufficient
information to identify an entity referent (and, thereby, link it to a node
in the KB), the entity cluster is marked as NIL-Unknown. Annotators are
allowed to use online searching to assist in determining the KB link/NIL
status.

Additional information is added to NIL entity mention clusters from
non-English source documents (Chinese and Spanish). This includes an English
translation of the name and, if available, English summaries of information
about the entity from the source document and/or links to online
English-language references about the entity. This information is used by
English annotators in the downstream task of creating cross-document,
cross-lingual NIL entity clusters from the within-document, monolingual
clusters. As a result, EDL annotation on non-English source documents is
performed by native speakers of the target language who are also fluent
English speakers. Note that English language fluency is also necessary for
effectively searching the KB, as most of its entries are only in English.

Following a first pass of EDL annotation, senior annotators conduct quality
control on annotated entity mentions to correct errors and identify areas of
difficulty for improving guidelines and annotator training. Annotators
performing quality control check the text extents of selected mentions, the
coreferenced clusters of entity mentions, the mention and entity type labels
that were applied, and KB links. Some NIL entities are also checked through
new searches in the KB.

3.2 Cross-document NIL-entity Coreference

Following completion of the annotation and quality control processes
described above over all source documents, senior annotators conduct
cross-document coreference for all of the within-document entity clusters
marked as NIL.
For this task, clusters are split up by entity type and then annotators use
sorting and searching techniques to identify clusters that might require
further collapsing. For example, clusters that include mentions with strings
or substrings matching those in other clusters are reviewed. As mentioned
earlier, cross-document NIL-entity coreference is conducted by senior-level
English annotators who, for cross-lingual versions of the task, use provided
English translations and online references to collapse English, Chinese, and
Spanish entity mention clusters. In the vast majority of cases, the
additional information provided by non-English annotators is sufficient for
collapsing cross-lingual clusters. However, Spanish and Chinese annotators
are available to further disambiguate non-English entity cluster referents if
needed.

4. Using the Data

4.1 Offset calculation

The beg and end offset values in Column 4 of the two tabular files in ./data
indicate character offsets that identify text extents in the source
documents. Offset counting starts from the initial opening angle bracket of
the first element in the document (<doc> in DF sources), which is usually the
initial character (character 0) of the source. Note as well that character
counting includes newlines and all markup characters - that is, the offsets
are based on treating the source document file as "raw text", with all its
markup included.

Note that although strings included in the annotation files generally match
source documents, a few characters are normalized in order to enhance
readability: newlines are converted to spaces, except where the preceding
character was a hyphen ("-"), in which case newlines were removed; and
multiple spaces are converted to a single space.

5. Acknowledgments

This material is based on research sponsored by Air Force Research Laboratory
and Defense Advanced Research Projects Agency under agreement number
FA8750-13-2-0045. The U.S.
Government is authorized to reproduce and distribute reprints for
Governmental purposes notwithstanding any copyright notation thereon. The
views and conclusions contained herein are those of the authors and should
not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of Air Force Research Laboratory
and Defense Advanced Research Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:

Heng Ji (RPI)
Hoa Dang (NIST)
Boyan Onyshkevych (DARPA)

6. References

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, & Stephanie M.
Strassel. 2016. Overview of Linguistic Resources for the TAC KBP 2016
Evaluations: Methodologies and Results. TAC KBP 2016 Workshop: National
Institute of Standards and Technology, Gaithersburg, MD, November 14-15.

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, & Stephanie M.
Strassel. 2017. Overview of Linguistic Resources for the TAC KBP 2017
Evaluations: Methodologies and Results. TAC KBP 2017 Workshop: National
Institute of Standards and Technology, Gaithersburg, MD, November 13-14.

7. Copyright Information

(c) 2018 Trustees of the University of Pennsylvania

8. Contact Information

For further information about this data release, contact the following
project staff at LDC:

Jeremy Getman, Project Manager
Stephanie Strassel, PI

--------------------------------------------------------------------------
README created by Joseph Carlough on March 26, 2018
  updated by Jeremy Getman on May 10, 2018
  updated by Jeremy Getman on November 18, 2019