TAC KBP Entity Discovery and Linking Comprehensive Training and Evaluation Data 2014-2015

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking evaluation track in 2014 and 2015.

The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base, extract novel information about entities from a document collection, and add it to a new or existing knowledge base.

The goal of the Entity Discovery and Linking (EDL) track is to conduct end-to-end entity extraction, linking, and clustering. Given a document collection, an EDL system is required to automatically extract (identify and classify) entity mentions (queries), link them to nodes in a reference Knowledge Base (KB), and cluster NIL mentions (those that do not have corresponding KB entries) into equivalence classes. More information about the TAC KBP Entity Discovery and Linking track and other TAC KBP evaluations can be found on the NIST TAC website, http://www.nist.gov/tac/.

This package contains all of the evaluation and training data developed in support of TAC KBP Entity Discovery & Linking from 2014 to 2015. This includes queries, KB links, equivalence class clusters for NIL entities (those that could not be linked to an entity in the knowledge base), and entity type information for each of the queries. Also included in this data set are all necessary source documents as well as BaseKB, the second reference KB, which was adopted for use by EDL in 2015. The first EDL reference KB, to which the 2014 EDL data are linked, is available separately as LDC2014T16: TAC KBP Reference Knowledge Base.
The data included in this package were originally released by LDC to TAC KBP coordinators and performers under the following ecorpora catalog IDs and titles:

  LDC2014E54:  TAC 2014 KBP English Entity Discovery and Linking Training Data V2.0
  LDC2014E81:  TAC 2014 KBP English Entity Discovery and Linking Evaluation Queries and Knowledge Base Links V2.0
  LDC2015E20:  TAC KBP English Entity Discovery and Linking Comprehensive Training and Evaluation Data 2014 V2.0
  LDC2015E42:  TAC KBP Reference Knowledge Base II - Base KB V1.1
  LDC2015E44:  TAC KBP 2015 Tri-Lingual Entity Discovery and Linking Pilot Gold Standard Knowledge Base Links_V1.1
  LDC2015E75:  TAC KBP 2015 Tri-Lingual Entity Discovery and Linking Training Data V2.1
  LDC2015E103: TAC KBP 2015 Tri-Lingual Entity Discovery and Linking Evaluation Gold Standard Entity Mentions and Knowledge Base Links
  LDC2016E38:  TAC KBP English Entity Discovery and Linking Comprehensive Training and Evaluation Data 2014-2015

Summary of data included in this package (for more details see ./docs/tac_kbp_edl_2014-2015_mention_distribution_table.tsv):

  +------+----------+------------------+----------+
  | Year | Task     | Source Documents | Mentions |
  +------+----------+------------------+----------+
  | 2014 | eval     |              138 |    5,598 |
  | 2014 | training |              160 |    6,349 |
  | 2015 | pilot    |               15 |      686 |
  | 2015 | training |              444 |   30,834 |
  | 2015 | eval     |              500 |   32,459 |
  +------+----------+------------------+----------+

2. Contents

./README.txt
  This file.

./data/{2014,2015}/contents.txt
  The data in this package are organized by evaluation year in order to clarify dependencies, highlight occasional differences in formats from one year to another, and increase readability in documentation. The contents.txt file within each year's root directory provides a list of the contents of all subdirectories as well as details about file formats and contents.

./data/BaseKB/*
  This directory contains 348 nt.gz files, which together make up the knowledge base to which entities discovered in the 2015 EDL data were linked. This KB contains over a billion facts about more than 40 million subjects, including people, places, creative works, and everything that had a page in Wikipedia at the time the KB was produced. The data files included in this directory were acquired as-is from :BaseKB Gold (basekb.com/gold/) and received no additional processing after being downloaded in January 2015. Note that BaseKB contains data from Freebase (developers.google.com/freebase/). A short sketch of reading these files appears at the end of this contents listing.

./docs/all_files.md5
  Paths (relative to the root of the corpus) and md5 checksums for all files included in the package.

./docs/tac_kbp_edl_2014-2015_mention_distribution_table.tsv
  Tab-delimited table containing the mention distribution quantities for all years and datasets, further broken down by language, source type, KB link, and entity type.

./docs/guidelines/TAC_KBP_2014_EDL_Query_Development_Guidelines_V1.5.pdf
  The most up-to-date version of the 2014 Entity Discovery & Linking annotation guidelines.

./docs/guidelines/TAC_KBP_2015_EDL_Guidelines_V1.2.pdf
  The most up-to-date version of the 2015 Entity Discovery & Linking annotation guidelines, encompassing Tri-lingual Entity Discovery & Linking.

./docs/task_descriptions/KBP2014EL_taskspec_v1.1.pdf
  Task Description for the 2014 Entity Linking evaluation tracks, written by track coordinators. Note that, as the Entity Discovery and Linking task was an extension of Entity Linking, this document describes Entity Discovery & Linking as well as tasks not relevant to this specific package.
./docs/task_descriptions/KBP2015EDL_taskspec_v1.0.pdf
  Task Description for the 2015 Entity Discovery and Linking evaluation track, written by track coordinators.

./dtd/edl_queries_2014.dtd
  The DTD against which the files ./data/2014/eval/tac_kbp_2014_english_EDL_evaluation_queries.xml and ./data/2014/training/tac_kbp_2014_english_EDL_training_queries.xml validate.

./dtd/ltf.v1.5.dtd
  The DTD against which all ltf.xml files in ./data/2015/{eval|training}/source_docs/{cmn|eng|spa}/ validate.

./tools/check_kbp_EDL2014.pl
  Validator for 2014 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing.

./tools/check_kbp_EDL2015.pl
  Validator for 2015 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing.

./tools/scorer/*
  Scorer for 2014 and 2015 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing.
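As referenced in the ./data/BaseKB/* entry above, the BaseKB files are gzipped N-Triples, one fact per line. Below is a minimal reading sketch, assuming Python 3 and only the standard library; the function name and the five-fact limit are purely illustrative and are not part of this package's tools.

  import glob
  import gzip

  # Stream every N-Triples line from the gzipped BaseKB shards without
  # unpacking them to disk.  Each non-comment line is one fact of the form
  #   <subject> <predicate> <object> .
  def iter_triples(basekb_dir="./data/BaseKB"):
      for path in sorted(glob.glob(basekb_dir + "/*.nt.gz")):
          with gzip.open(path, "rt", encoding="utf-8") as fh:
              for line in fh:
                  line = line.strip()
                  if line and not line.startswith("#"):
                      yield line

  # Example: look at the first five facts.
  for triple, _ in zip(iter_triples(), range(5)):
      print(triple)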
3. Annotation Tasks

Once a set of source documents is identified, two separate annotation tasks are conducted to produce gold standard data for Entity Discovery & Linking (EDL). In the first, annotators exhaustively extract and cluster entity mentions from a single document and then attempt to link each entity cluster to a node in a reference knowledge base (KB). In the second task, annotators perform cross-document coreference on within-document entity clusters that could not be linked to the KB.

3.1 Entity Discovery & Linking

There are two distinct phases to the first annotation task used to develop gold standard EDL data: Entity Discovery (ED) and Entity Linking (EL). ED involves exhaustive extraction and clustering of valid entity mentions from a single source document. In this part of the task, annotators review a single document and select text extents to indicate valid entity mentions (in more recent versions of the task, labels were also applied to indicate mention type). Every time the first mention of a new entity is selected, annotators also create a new entity cluster, a "bucket" into which all subsequent mentions of the entity are collected and to which an entity type label is applied. Thus, within-document coreference of entities is performed concurrently with mention selection.

Having selected, clustered, and labeled all valid entity mentions, annotators proceed to EL. In this part of the task, annotators use a specialized search engine to review the contents of the reference KB, looking for entries with primary subjects matching the referents of the entity clusters they just created. When a matching node is found, its ID is associated with all mentions in the entity cluster, thus establishing a link. If no matching entry in the KB is found, but annotators feel the source document includes sufficient information to identify the entity referent, the entity cluster is marked as NIL. If annotators do not feel that the source document includes sufficient information to identify an entity referent (and, thereby, link it to a node in the KB), the entity cluster is marked as NIL-Unknown. Annotators are allowed to use online searching to assist in determining the KB link/NIL status.

For the more recent, cross-lingual version of EDL annotation, additional information is added to NIL entity mention clusters from non-English source documents (Chinese and Spanish for EDL 2015). This includes an English translation of the name and, if available, English summaries of information about the entity from the source document and/or links to online English-language references about the entity. This information is used by English annotators in the downstream task of creating cross-document, cross-lingual NIL entity clusters from the within-document, monolingual clusters. As a result, EDL annotation on non-English source documents is performed by native speakers of the target language who are also fluent English speakers. Note that English language fluency is also necessary for effectively searching the KB, as most of its entries are only in English.

Following a first pass of EDL annotation, senior annotators conduct quality control on annotated entity mentions to correct errors and identify areas of difficulty for improving guidelines and annotator training. Annotators performing quality control check the text extents of selected mentions, the coreferenced clusters of entity mentions, the mention and entity type labels that were applied, and the KB links. Some NIL entities are also checked through new searches in the KB.

For 2014, when EDL was a monolingual task in English, only named persons, organizations, and geo-political entities were considered valid. In 2015, the task became cross-lingual, so Chinese and Spanish source documents were added, along with two new entity types: facilities and locations. Also for 2015, extraction of entity mentions from English documents was expanded to include nominal mentions. Note that quote regions in discussion forum threads, which are indicated as distinct elements in the XML, were ignored for the purposes of annotation in order to ensure that manual effort was not spent on producing redundant annotations.

3.2 Cross-document NIL-entity Coreference

Following completion of the annotation and quality control processes described above over all source documents, senior annotators conduct cross-document coreference for all of the within-document entity clusters marked as NIL. For this task, clusters are split up by entity type, and annotators then use sorting and searching techniques to identify clusters that might require further collapsing. For example, clusters that include mentions with strings or substrings matching those in other clusters are reviewed. As mentioned earlier, cross-document NIL-entity coreference is conducted by senior-level English annotators who, for cross-lingual versions of the task, use the provided English translations and online references to collapse English, Chinese, and Spanish entity mention clusters. In the vast majority of cases, the additional information provided by non-English annotators is sufficient for collapsing cross-lingual clusters. However, Spanish and Chinese annotators are available to further disambiguate non-English entity cluster referents if needed.

4. Source Documents

All the text data in the source files have been taken directly from previous LDC corpus releases and are provided here essentially "as-is", with little or no additional quality control.
An overall scan of character content in the source collections indicates relatively small quantities of various problems, especially in the web and discussion forum data, including language mismatch (characters from Chinese, Korean, Japanese, Arabic, Russian, etc.) and encoding errors (some documents have apparently undergone "double encoding" into UTF-8, and others may have been "noisy" to begin with or may have gone through an improper encoding conversion, yielding occurrences of the Unicode "replacement character" (U+FFFD) throughout the corpus); the web collection also has characters whose Unicode code points lie outside the Basic Multilingual Plane (BMP), i.e. above U+FFFF.

All documents with filenames beginning with "cmn-NG" and "eng-NG" are Web Document data (WB), and some of these fail XML parsing (see below for details). All files that start with "bolt-" are Discussion Forum threads (DF) and have the XML structure described below. All other files are Newswire data (NW) and have the newswire markup pattern detailed below.

Note as well that some source documents are duplicated across a few of the separate source_documents directories, indicating that some queries from different data sets originated from the same source documents. As it is acceptable for sources to be reused for Entity Linking queries, this duplication is intentional and expected.

The subsections below go into more detail regarding the markup and other properties of the three source data types.

4.1 Newswire Data

Newswire data use the following markup framework:

  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present), and the TEXT content may or may not include "<P> ... </P>" tags (depending on whether or not the "doc_type_label" is "story"). All the newswire files are parseable as XML.
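Because the newswire files parse as XML, they can be read with any standard XML library. The following is a minimal sketch, assuming Python 3 and the element layout shown in the framework above; it is not part of this package's tools.

  import xml.etree.ElementTree as ET

  # Read one newswire source document: the root element is <DOC>, with
  # optional <HEADLINE> and <DATELINE> children and a <TEXT> body that may
  # or may not be divided into <P> paragraphs.
  def read_newswire(path):
      root = ET.parse(path).getroot()
      doc_id = root.get("id")
      headline = (root.findtext("HEADLINE") or "").strip()
      text = root.find("TEXT")
      paragraphs = []
      if text is not None:
          paragraphs = [p.text.strip() for p in text.findall("P") if p.text]
          if not paragraphs and text.text and text.text.strip():
              paragraphs = [text.text.strip()]   # "non-story" docs: no <P> tags
      return doc_id, headline, paragraphs

  # Example call (the path is illustrative):
  # doc_id, headline, paragraphs = read_newswire("path/to/a/newswire/source_document")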
4.2 Discussion Forum Data

Discussion forum files use the following markup framework:

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post author="{author_string}" datetime="{date_time_string}" id="{post_id_string}">
  ...
  <quote orig_author="{author_string}">
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

where there may be arbitrarily deep nesting of quote elements, and other elements may be present (e.g. "<a href=...> ... </a>" anchor tags). Note that each discussion forum unit contains at least five post elements. All the discussion forum files are parseable as XML.

4.3 Web Document Data

"Web" files use the following markup framework:

  <DOC>
  <DOCID> {doc_id_string} </DOCID>
  <DOCTYPE> ... </DOCTYPE>
  <DATETIME> ... </DATETIME>
  <BODY>
  <HEADLINE>
  ...
  </HEADLINE>
  <TEXT>
  <POST>
  <POSTER> ... </POSTER>
  <POSTDATE> ... </POSTDATE>
  ...
  </POST>
  </TEXT>
  </BODY>
  </DOC>

Other kinds of tags may be present ("<QUOTE ...>", "<A ...>", etc.). Some of the web source documents contain material that interferes with XML parsing (e.g. unescaped "&", or "<QUOTE>" tags that lack a corresponding "</QUOTE>").
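Since some of the web files are not well-formed, a cautious reader should not assume that XML parsing will succeed. Below is a minimal sketch of one possible approach, assuming Python 3: attempt the parse and fall back to the raw text, which is in any case what the annotation offsets refer to (see section 5.1).

  import xml.etree.ElementTree as ET

  # Try to parse a source document as XML; fall back to raw text if the
  # file is not well-formed (e.g. an unescaped "&" or an unmatched tag).
  def read_source(path):
      with open(path, encoding="utf-8") as fh:
          raw = fh.read()
      try:
          tree = ET.fromstring(raw)
      except ET.ParseError:
          tree = None
      return tree, raw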
5. Using the Data

5.1 Offset calculation

The values of the beg and end XML elements in the later queries.xml files indicate character offsets that identify text extents in the source. Offset counting starts from the initial opening angle bracket of the <DOC> element (<doc> in DF sources), which is usually the initial character (character 0) of the source. Note as well that character counting includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included.

Note that although strings included in the annotation files (queries and gold standard mentions) generally match source documents, a few characters are normalized in order to enhance readability: newlines are converted to spaces, except where the preceding character was a hyphen ("-"), in which case the newline was removed; and runs of multiple spaces are converted to a single space.

5.2 Proper ingesting of XML queries

While the character offsets are calculated by treating the source document as "raw text", the "name" strings referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up in a source document file as "AT&amp;T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 8 characters (i.e., the character offsets, when provided, will point to a string of length 8). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&amp;amp;T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&amp;T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data.
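As a concrete illustration of sections 5.1 and 5.2, here is a minimal sketch (Python 3, standard library only) that parses a query element from the 2014 queries.xml, reads the corresponding source document as raw text, and compares the two. The "name", "beg", and "end" element names come from this README; the "query" element name, the treatment of the end offset as inclusive, and the path_to_source_doc helper are assumptions made only for illustration.

  import xml.etree.ElementTree as ET

  # Compare a query's "name" with the source text its offsets point at.
  # queries.xml is parsed as XML (so "AT&amp;amp;T" in the file comes back
  # as "AT&amp;T"), while the source document is read as raw text, markup
  # and all, because that is how the beg/end offsets were computed.
  def check_query(query_el, source_path):
      name = query_el.findtext("name")
      beg = int(query_el.findtext("beg"))
      end = int(query_el.findtext("end"))           # assumed inclusive here
      with open(source_path, encoding="utf-8") as fh:
          raw = fh.read()
      span = raw[beg:end + 1]
      # A mention that crosses a line break may still differ slightly because
      # of the newline/space normalization described in section 5.1.
      return span, name, span == name

  # Example (the "query" element name and the helper are illustrative):
  # root = ET.parse("./data/2014/eval/tac_kbp_2014_english_EDL_evaluation_queries.xml").getroot()
  # for query in root.iter("query"):
  #     span, name, ok = check_query(query, path_to_source_doc(query))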
6. Acknowledgments

This material is based on research sponsored by Air Force Research Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:

  Dana Fore (LDC)
  Dave Graff (LDC)
  Neil Kuster (LDC)
  Heng Ji (RPI)
  Hoa Dang (NIST)
  Boyan Onyshkevych (DARPA)
7. References

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, Stephanie Strassel. 2015. Overview of Linguistic Resources for the TAC KBP 2015 Evaluations: Methodologies and Results. TAC KBP 2015 Workshop, National Institute of Standards and Technology, Gaithersburg, Maryland, November 16-17.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp2015_overview.pdf

Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014. Overview of Linguistic Resources for the TAC KBP 2014 Evaluations: Planning, Execution, and Results. TAC KBP 2014 Workshop, National Institute of Standards and Technology, Gaithersburg, Maryland, November 17-18.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-2014-overview.pdf

8. Copyright Information

Portions © 2008-2010 Agence France Presse, © 2009-2010 The Associated Press, © 2009 Los Angeles Times - Washington Post News Service, Inc., © 2009-2010 New York Times, © 2010 The Washington Post Service with Bloomberg News, © 2008-2010, 2014-2015 Xinhua News Agency, © 2008, 2009, 2010, 2014, 2015, 2019 Trustees of the University of Pennsylvania

9. Contact Information

For further information about this data release, contact the following project staff at LDC:

  Joseph Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

--------------------------------------------------------------------------
README created by Neil Kuster on February 3, 2016
  updated by Jeremy Getman on February 11, 2016
  updated by Joe Ellis on April 22, 2016
  updated by Jeremy Getman on September 14, 2016
  updated by Joe Ellis on December 6, 2016
  updated by Joe Ellis on December 8, 2016