TAC KBP Entity Discovery and Linking
Comprehensive Training and Evaluation Data 2014-2015
Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel
1. Overview
This package contains training and evaluation data produced in support of
the TAC KBP Entity Discovery and Linking evaluation track in 2014 and 2015.
The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base, extract novel
information about entities from a document collection, and add that
information to a new or existing knowledge base.
The goal of the Entity Discovery and Linking (EDL) track is to conduct
end-to-end entity extraction, linking and clustering. Given a document
collection, an EDL system is required to automatically extract (identify
and classify) entity mentions (queries), link them to nodes in a
reference Knowledge Base (KB), and cluster NIL mentions (those that do
not have corresponding KB entries) into equivalence classes. More
information about the TAC KBP Entity Discovery and Linking track and
other TAC KBP evaluations can be found on the NIST TAC website,
http://www.nist.gov/tac/.
This package contains all of the evaluation and training data developed
in support of TAC KBP Entity Discovery & Linking from 2014 to 2015.
This includes queries, KB links, equivalence class clusters for
NIL entities (those that could not be linked to an entity in the
knowledge base), and entity type information for each of the queries.
Also included in this data set are all necessary source documents as well
as BaseKB - the second reference KB that was adopted for use by EDL in
2015. The first EDL reference KB, to which 2014 EDL data are linked, is
available separately as LDC2014T16: TAC KBP Reference Knowledge Base.
The data included in this package were originally released by LDC
to TAC KBP coordinators and performers under the following ecorpora
catalog IDs and titles:
LDC2014E54: TAC 2014 KBP English Entity Discovery
and Linking Training Data V2.0
LDC2014E81: TAC 2014 KBP English Entity Discovery and Linking Evaluation
Queries and Knowledge Base Links V2.0
LDC2015E20: TAC KBP English Entity Discovery and Linking Comprehensive
Training and Evaluation Data 2014 V2.0
LDC2015E42: TAC KBP Reference Knowledge Base II - Base KB V1.1
LDC2015E44: TAC KBP 2015 Tri-Lingual Entity Discovery and Linking Pilot
Gold Standard Knowledge Base Links_V1.1
LDC2015E75: TAC KBP 2015 Tri-Lingual Entity Discovery
and Linking Training Data V2.1
LDC2015E103: TAC KBP 2015 Tri-Lingual Entity Discovery and Linking Evaluation
Gold Standard Entity Mentions and Knowledge Base Links
LDC2016E38: TAC KBP English Entity Discovery and Linking Comprehensive
Training and Evaluation Data 2014-2015
Summary of data included in this package (for more details see
/docs/tac_kbp_edl_2014-2015_mention_distribution_table.tsv):
+------+----------+------------------+----------+
| Year | Task     | Source Documents | Mentions |
+------+----------+------------------+----------+
| 2014 | eval     |              138 |    5,598 |
| 2014 | training |              160 |    6,349 |
| 2015 | pilot    |               15 |      686 |
| 2015 | training |              444 |   30,834 |
| 2015 | eval     |              500 |   32,459 |
+------+----------+------------------+----------+
2. Contents
./README.txt
This file.
./data/{2014,2015}/contents.txt
The data in this package are organized by evaluation year in order to
clarify dependencies, highlight occasional differences in formats from one
year to another, and increase the readability of the documentation. The
contents.txt file within each year's root directory provides a list of the
contents for all subdirectories as well as details about file formats and
contents.
./data/BaseKB/*
This directory contains 348 nt.gz files, which together make up
the knowledge base to which entities discovered in the 2015 EDL data were
linked. This KB contains over a billion facts about more than 40 million
subjects, including people, places, creative works, and everything that had
a page in Wikipedia at the time the KB was produced. The data files included
in this directory were acquired as-is from :BaseKB Gold (basekb.com/gold/)
and received no additional processing after being downloaded in January 2015.
Note that BaseKB contains data from Freebase (developers.google.com/freebase/).
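Given the size of the KB, the shards are best streamed rather than loaded
into memory. The following is a minimal Python sketch of such streaming;
the shard name and the Freebase-style predicate filter are illustrative,
not actual file or predicate names from the package:

  import gzip

  shard = "data/BaseKB/part-000.nt.gz"  # illustrative shard name

  with gzip.open(shard, "rt", encoding="utf-8") as f:
      for line in f:
          line = line.strip()
          if not line or line.startswith("#"):
              continue
          # N-Triples: "<subject> <predicate> <object> ." - a real parser
          # (e.g. rdflib) is safer for literals that contain spaces.
          subj, pred, rest = line.split(" ", 2)
          if "type.object.name" in pred:  # illustrative predicate filter
              print(subj, rest.rstrip(" ."))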
./docs/all_files.md5
Paths (relative to the root of the corpus) and md5 checksums for all files
included in the package.
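The checksums can be recomputed to verify the package. A minimal Python
sketch, assuming the common "checksum<whitespace>path" line layout (check
the file itself for the exact format):

  import hashlib

  with open("docs/all_files.md5", encoding="utf-8") as f:
      for line in f:
          expected, path = line.split(None, 1)
          path = path.strip()
          md5 = hashlib.md5()
          # Read in chunks; some files (e.g. BaseKB shards) are large.
          with open(path, "rb") as data:
              for chunk in iter(lambda: data.read(1 << 20), b""):
                  md5.update(chunk)
          if md5.hexdigest() != expected:
              print("MISMATCH:", path)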
./docs/tac_kbp_edl_2014-2015_mention_distribution_table.tsv
Tab-delimited table containing the mention distribution quantities for
all years and datasets, further broken down by language, source type,
KB-Link, and entity type.
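The table can be read with Python's standard csv module; a minimal sketch
(rows are read generically via the header row, since the exact column
names should be taken from the file itself):

  import csv

  path = "docs/tac_kbp_edl_2014-2015_mention_distribution_table.tsv"
  with open(path, encoding="utf-8", newline="") as f:
      for row in csv.DictReader(f, delimiter="\t"):
          print(row)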
./docs/guidelines/TAC_KBP_2014_EDL_Query_Development_Guidelines_V1.5.pdf
The most up-to-date version of the 2014 Entity Discovery & Linking
annotation guidelines.
./docs/guidelines/TAC_KBP_2015_EDL_Guidelines_V1.2.pdf
The most up-to-date version of the 2015 Entity Discovery & Linking
annotation guidelines, encompassing Tri-lingual Entity Discovery & Linking.
./docs/task_descriptions/KBP2014EL_taskspec_v1.1.pdf
Task Description for the 2014 Entity Linking evaluation tracks, written
by track coordinators. Note that, as the Entity Discovery and Linking
task was an extension of Entity Linking, this document describes Entity
Discovery & Linking as well as tasks not relevant to this specific package.
./docs/task_descriptions/KBP2015EDL_taskspec_v1.0.pdf
Task Description for the 2015 Entity Discovery and Linking evaluation
track, written by track coordinators.
./dtd/edl_queries_2014.dtd
The dtd against which the files
./data/2014/eval/tac_kbp_2014_english_EDL_evaluation_queries.xml
and
./data/2014/training/tac_kbp_2014_english_EDL_training_queries.xml
validate.
./dtd/ltf.v1.5.dtd
The dtd against which all ltf.xml files in
./data/2015/{eval|training}/source_docs/{cmn|eng|spa}/
validate.
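These DTDs can be used to re-validate the corresponding XML files. A
minimal Python sketch using the third-party lxml library, shown here for
the 2014 evaluation queries:

  from lxml import etree

  # Load the DTD, then parse and validate one of the queries files.
  dtd = etree.DTD(open("dtd/edl_queries_2014.dtd"))
  doc = etree.parse(
      "data/2014/eval/tac_kbp_2014_english_EDL_evaluation_queries.xml")
  if not dtd.validate(doc):
      print(dtd.error_log.filter_from_errors())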
./tools/check_kbp_EDL2014.pl
Validator for 2014 entity linking submission files, as provided to
LDC by evaluation track coordinators, with no further testing.
./tools/check_kbp_EDL2015.pl
Validator for 2015 entity linking submission files, as provided to
LDC by evaluation track coordinators, with no further testing.
./tools/scorer/*
Scorer for 2014 and 2015 entity linking submission files, as provided to
LDC by evaluation track coordinators, with no further testing.
3. Annotation Tasks
Once a set of source documents is identified, two separate annotation
tasks are conducted to produce gold standard data for Entity Discovery
& Linking (EDL). In the first, annotators exhaustively extract and cluster
entity mentions from a single document and then attempt to link each
entity cluster to a node in a reference knowledge base (KB). In the second
task, annotators perform cross-document coreference on within-document
entity clusters that could not be linked to the KB.
3.1 Entity Discovery & Linking
There are two distinct phases to the first annotation task used to
develop gold standard EDL data - Entity Discovery (ED) and Entity
Linking (EL). ED involves exhaustive extraction and clustering of valid
entity mentions from a single source document. In this part of the task,
annotators review a single document and select text extents to indicate
valid entity mentions (in more recent versions of the task, labels were
also applied to indicate mention type). Every time the first mention of
a new entity is selected, annotators also create a new entity cluster, a
"bucket" into which all subsequent mentions of the entity are collected
and to which an entity type label is applied. Thus, within-document
coreference of entities is performed concurrently with mention selection.
Having selected, clustered, and labeled all valid entity mentions,
annotators proceed to EL. In this part of the task, annotators
use a specialized search engine to review the contents of the reference KB,
looking for entries with primary subjects matching the referents of the
entity clusters they just created. When a matching node is found, its ID is
associated with all mentions in the entity cluster, thus establishing a
link. If no matching entry in the KB is found, but annotators feel the source
document includes sufficient information to identify the entity referent, the
entity cluster is marked as NIL. If annotators do not feel that the source
document includes sufficient information to identify an entity referent (and,
thereby, link it to a node in the KB), the entity cluster is marked as
NIL-Unknown. Annotators are allowed to use online searching to assist in
determining the KB link/NIL status.
For the more recent, cross-lingual version of EDL annotation, additional
information is added to NIL entity mention clusters from non-English source
documents (Chinese and Spanish for EDL 2015). This includes an English
translation of the name and, if available, English summaries of information
about the entity from the source document and/or links to online English-
language references about the entity. This information is used by English
annotators in the downstream task of creating cross-document, cross-lingual
NIL entity clusters from the within-document, monolingual clusters. As a
result, EDL annotation on non-English source documents is performed by
native speakers of the target language who are also fluent English speakers.
Note that English language fluency is also necessary for effectively
searching the KB, as most of its entries are only in English.
Following a first pass of EDL annotation, senior annotators conduct quality
control on annotated entity mentions to correct errors and identify areas
of difficulty for improving guidelines and annotator training. Annotators
performing quality control check the text extents of selected mentions,
the coreferenced clusters of entity mentions, the mention and entity type
labels that were applied, and KB links. Some NIL entities are also checked
through new searches in the KB.
For 2014, when EDL was a monolingual task in English, only named persons,
organizations, and geo-political entities were considered valid. In 2015,
the task became cross-lingual, adding Chinese and Spanish source documents
as well as two new entity types - facilities and locations. Also for 2015,
extraction of entity mentions from English documents was expanded to
include nominal mentions.
Note that quote regions in discussion forum threads, which are indicated
as distinct elements in the XML, were ignored for the purposes of
annotation in order to ensure that manual effort was not spent on
producing redundant annotations.
3.2 Cross-document NIL-entity Coreference
Following completion of the annotation and quality control processes
described above over all source documents, senior annotators conduct
cross-document coreference for all of the within-document entity clusters
marked as NIL. For this task, clusters are split up by entity type and
then annotators use sorting and searching techniques to identify clusters
that might require further collapsing. For example, clusters that include
mentions with strings or substrings matching those in other clusters are
reviewed. As mentioned earlier, cross-document NIL-entity coreference is
conducted by senior-level English annotators who, for cross-lingual
versions of the task, use the provided English translations and online
references to collapse English, Chinese, and Spanish entity mention
clusters. In the vast majority of cases, the additional information
provided by non-English annotators is sufficient for collapsing
cross-lingual clusters. However, Spanish and Chinese annotators are
available to further disambiguate non-English entity cluster referents if
needed.
4. Source Documents
All the text data in the source files have been taken directly from
previous LDC corpus releases, and are being provided here essentially
"as-is", with little or no additional quality control. An overall scan
of character content in the source collections indicates some relatively
small quantities of various problems, especially in the web and
discussion forum data, including language mismatch (characters from
Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding errors
(some documents have apparently undergone "double encoding" into UTF-8,
and others may have been "noisy" to begin with, or may have gone through
an improper encoding conversion, yielding occurrences of the
Unicode "replacement character" (U+FFFD) throughout the corpus); the web
collection also has characters whose Unicode code points lie
outside the "Basic Multilanguage Plane" (BMP), i.e. above U+FFFF.
All documents that have filenames beginning with "cmn-NG" or "eng-NG"
are Web Document data (WB) and some of these fail XML parsing (see below
for details). All files that start with "bolt-" are Discussion Forum
threads (DF) and have the XML structure described below. All other files are
Newswire data (NW) and have the newswire markup pattern detailed below.
Note as well that some source documents are duplicated across a few of
the separated source_documents directories, indicating that some queries
from different data sets originated from the same source documents. As
it is acceptable for sources to be reused for Entity Linking queries, this
duplication is intentional and expected.
The subsections below go into more detail regarding the markup and
other properties of the three source data types:
4.1 Newswire Data
Newswire data use the following markup framework:
  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>
where the HEADLINE and DATELINE tags are optional (not always present), and
the TEXT content may or may not include "<P> ... </P>" tags (depending on
whether or not the "doc_type_label" is "story"). All the newswire files are
parseable as XML.
4.2 Discussion Forum Data
Discussion forum files use the following markup framework:
  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post ...>
  ...
  <quote ...>
  ...
  </quote>
  ...
  </post>
  ...
  </doc>
where there may be arbitrarily deep nesting of quote elements, and other
kinds of tags may be present (e.g. anchor tags). All the discussion forum
files are parseable as XML.
", "", etc). Some of the web source documents contain material that interferes with XML parsing (e.g. unescaped "&", or "" tags that lack a corresponding ""). 5. Using the Data 5.1 Offset calculation The values of the beg and end XML elements in the later queries.xml files indicate character offsets to identify text extents in the source. Offset counting starts from the initial opening angle bracket of theelement ( in DF sources), which is usually the initial character (character 0) of the source. Note as well that character counting includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included. Note that although strings included in the annotation files (queries and gold standard mentions) generally match source documents, a few characters are normalized in order to enhance readability: Conversion of newlines to spaces, except where preceding characters were hyphens ("-"), in which case newlines were removed; and conversion of multiple spaces to a single space. 5.2 Proper ingesting of XML queries While the character offsets are calculated based on treating the source document as "raw text", the "name" strings being referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up a source document file as "AT&T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 7 characters (i.e., the character offsets, when provided, will point to a string of length 7). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data. 6. Acknowledgments This material is based on research sponsored by Air Force Research Laboratory and Defense Advance Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authoized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government. The authors acknowledge the following contributors to this data set: Dana Fore (LDC) Dave Graff (LDC) Neil Kuster (LDC) Heng Ji (RPI) Hoa Dang (NIST) Boyan Onyshkevych (DARPA) 7. References Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, Stephanie Strassel. 2015 Overview of Linguistic Resources for the TAC KBP 2015 Evaluations: Methodologies and Results https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp2015_overview.pdf TAC KBP 2015 Workshop: National Institute of Standards and Technology, Gaithersburg, Maryland, November 16-17 Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 
6. Acknowledgments
This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under agreement
number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and
distribute reprints for Governmental purposes notwithstanding any copyright
notation thereon. The views and conclusions contained herein are those of
the authors and should not be interpreted as necessarily representing the
official policies or endorsements, either expressed or implied, of Air
Force Research Laboratory and Defense Advanced Research Projects Agency or
the U.S. Government.
The authors acknowledge the following contributors to this data set:
  Dana Fore (LDC)
  Dave Graff (LDC)
  Neil Kuster (LDC)
  Heng Ji (RPI)
  Hoa Dang (NIST)
  Boyan Onyshkevych (DARPA)
7. References
Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies,
Stephanie Strassel. 2015. Overview of Linguistic Resources for the TAC KBP
2015 Evaluations: Methodologies and Results. TAC KBP 2015 Workshop:
National Institute of Standards and Technology, Gaithersburg, Maryland,
November 16-17.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp2015_overview.pdf
Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014. Overview of
Linguistic Resources for the TAC KBP 2014 Evaluations: Planning, Execution,
and Results. TAC KBP 2014 Workshop: National Institute of Standards and
Technology, Gaithersburg, Maryland, November 17-18.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-2014-overview.pdf
8. Copyright Information
Portions © 2008-2010 Agence France Presse, © 2009-2010 The Associated
Press, © 2009 Los Angeles Times - Washington Post News Service, Inc.,
© 2009-2010 New York Times, © 2010 The Washington Post Service with
Bloomberg News, © 2008-2010, 2014-2015 Xinhua News Agency, © 2008, 2009,
2010, 2014, 2015, 2019 Trustees of the University of Pennsylvania
9. Contact Information
For further information about this data release, contact the following
project staff at LDC:
  Joseph Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI
--------------------------------------------------------------------------
README created by Neil Kuster on February 3, 2016
  updated by Jeremy Getman on February 11, 2016
  updated by Joe Ellis on April 22, 2016
  updated by Jeremy Getman on September 14, 2016
  updated by Joe Ellis on December 6, 2016
  updated by Joe Ellis on December 8, 2016