TAC KBP Entity Discovery and Linking Comprehensive Training and Evaluation Data 2014-2015

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of the TAC KBP Entity Discovery and Linking evaluation track in 2014 and 2015.

The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base, extract novel information about entities from a document collection, and add it to a new or existing knowledge base.

The goal of the Entity Discovery and Linking (EDL) track is to conduct end-to-end entity extraction, linking, and clustering. Given a document collection, an EDL system is required to automatically extract (identify and classify) entity mentions (queries), link them to nodes in a reference Knowledge Base (KB), and cluster NIL mentions (those that do not have corresponding KB entries) into equivalence classes. More information about the TAC KBP Entity Discovery and Linking track and other TAC KBP evaluations can be found on the NIST TAC website, http://www.nist.gov/tac/.

This package contains all of the evaluation and training data developed in support of TAC KBP Entity Discovery & Linking from 2014 to 2015. This includes queries, KB links, equivalence class clusters for NIL entities (those that could not be linked to an entity in the knowledge base), and entity type information for each of the queries. Also included in this data set are all necessary source documents as well as BaseKB, the second reference KB, which was adopted for use by EDL in 2015. The first EDL reference KB, to which the 2014 EDL data are linked, is available separately as LDC2014T16: TAC KBP Reference Knowledge Base.
The data included in this package were originally released by LDC to TAC KBP coordinators and performers under the following ecorpora catalog IDs and titles:

  LDC2014E54:  TAC 2014 KBP English Entity Discovery and Linking Training Data V2.0
  LDC2014E81:  TAC 2014 KBP English Entity Discovery and Linking Evaluation Queries and Knowledge Base Links V2.0
  LDC2015E20:  TAC KBP English Entity Discovery and Linking Comprehensive Training and Evaluation Data 2014 V2.0
  LDC2015E42:  TAC KBP Reference Knowledge Base II - Base KB V1.1
  LDC2015E44:  TAC KBP 2015 Tri-Lingual Entity Discovery and Linking Pilot Gold Standard Knowledge Base Links_V1.1
  LDC2015E75:  TAC KBP 2015 Tri-Lingual Entity Discovery and Linking Training Data V2.1
  LDC2015E103: TAC KBP 2015 Tri-Lingual Entity Discovery and Linking Evaluation Gold Standard Entity Mentions and Knowledge Base Links
  LDC2016E38:  TAC KBP English Entity Discovery and Linking Comprehensive Training and Evaluation Data 2014-2015

Summary of data included in this package (for more details see ./docs/tac_kbp_edl_2014-2015_mention_distribution_table.tsv):

  +------+----------+------------------+----------+
  | Year | Task     | Source Documents | Mentions |
  +------+----------+------------------+----------+
  | 2014 | eval     |              138 |    5,598 |
  | 2014 | training |              160 |    6,349 |
  | 2015 | pilot    |               15 |      686 |
  | 2015 | training |              444 |   30,834 |
  | 2015 | eval     |              500 |   32,459 |
  +------+----------+------------------+----------+

2. Contents

./README.txt
  This file.

./data/{2014,2015}/contents.txt
  The data in this package are organized by evaluation year in order to clarify dependencies, highlight occasional differences in formats from one year to another, and increase readability in documentation. The contents.txt file within each year's root directory provides a list of the contents of all subdirectories as well as details about file formats and contents.

./data/BaseKB/*
  This directory contains 348 nt.gz files, which together make up the knowledge base to which entities discovered in the 2015 EDL data were linked. This KB contains over a billion facts about more than 40 million subjects, including people, places, creative works, and everything that had a page in Wikipedia at the time the KB was produced. The data files included in this directory were acquired as-is from :BaseKB Gold (basekb.com/gold/) and received no additional processing after being downloaded in January 2015. Note that BaseKB contains data from Freebase (developers.google.com/freebase/). A short sketch of reading these files appears at the end of this contents listing.

./docs/all_files.md5
  Paths (relative to the root of the corpus) and md5 checksums for all files included in the package.

./docs/tac_kbp_edl_2014-2015_mention_distribution_table.tsv
  Tab-delimited table containing the mention distribution quantities for all years and datasets, further broken down by language, source type, KB link, and entity type.

./docs/guidelines/TAC_KBP_2014_EDL_Query_Development_Guidelines_V1.5.pdf
  The most up-to-date version of the 2014 Entity Discovery & Linking annotation guidelines.

./docs/guidelines/TAC_KBP_2015_EDL_Guidelines_V1.2.pdf
  The most up-to-date version of the 2015 Entity Discovery & Linking annotation guidelines, encompassing Tri-lingual Entity Discovery & Linking.

./docs/task_descriptions/KBP2014EL_taskspec_v1.1.pdf
  Task Description for the 2014 Entity Linking evaluation tracks, written by track coordinators. Note that, as the Entity Discovery and Linking task was an extension of Entity Linking, this document describes Entity Discovery & Linking as well as tasks not relevant to this specific package.
./docs/task_descriptions/KBP2015EDL_taskspec_v1.0.pdf
  Task Description for the 2015 Entity Discovery and Linking evaluation track, written by track coordinators.

./dtd/edl_queries_2014.dtd
  The DTD against which the files ./data/2014/eval/tac_kbp_2014_english_EDL_evaluation_queries.xml and ./data/2014/training/tac_kbp_2014_english_EDL_training_queries.xml validate.

./dtd/ltf.v1.5.dtd
  The DTD against which all ltf.xml files in ./data/2015/{eval|training}/source_docs/{cmn|eng|spa}/ validate.

./tools/check_kbp_EDL2014.pl
  Validator for 2014 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing.

./tools/check_kbp_EDL2015.pl
  Validator for 2015 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing.

./tools/scorer/*
  Scorer for 2014 and 2015 entity linking submission files, as provided to LDC by evaluation track coordinators, with no further testing.
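As referenced in the ./data/BaseKB/* entry above, the BaseKB files are gzipped N-Triples, one fact per line. Below is a minimal reading sketch, assuming Python 3 and only the standard library; the function name and the five-fact limit are purely illustrative and are not part of this package's tools.

  import glob
  import gzip

  # Stream every N-Triples line from the gzipped BaseKB shards without
  # unpacking them to disk.  Each non-comment line is one fact of the form
  #   <subject> <predicate> <object> .
  def iter_triples(basekb_dir="./data/BaseKB"):
      for path in sorted(glob.glob(basekb_dir + "/*.nt.gz")):
          with gzip.open(path, "rt", encoding="utf-8") as fh:
              for line in fh:
                  line = line.strip()
                  if line and not line.startswith("#"):
                      yield line

  # Example: look at the first five facts.
  for triple, _ in zip(iter_triples(), range(5)):
      print(triple)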
3. Annotation Tasks

Once a set of source documents is identified, two separate annotation tasks are conducted to produce gold standard data for Entity Discovery & Linking (EDL). In the first, annotators exhaustively extract and cluster entity mentions from a single document and then attempt to link each entity cluster to a node in a reference knowledge base (KB). In the second task, annotators perform cross-document coreference on within-document entity clusters that could not be linked to the KB.

3.1 Entity Discovery & Linking

There are two distinct phases to the first annotation task used to develop gold standard EDL data: Entity Discovery (ED) and Entity Linking (EL). ED involves exhaustive extraction and clustering of valid entity mentions from a single source document. In this part of the task, annotators review a single document and select text extents to indicate valid entity mentions (in more recent versions of the task, labels were also applied to indicate mention type). Every time the first mention of a new entity is selected, annotators also create a new entity cluster, a "bucket" into which all subsequent mentions of the entity are collected and to which an entity type label is applied. Thus, within-document coreference of entities is performed concurrently with mention selection.

Having selected, clustered, and labeled all valid entity mentions, annotators proceed to EL. In this part of the task, annotators use a specialized search engine to review the contents of the reference KB, looking for entries with primary subjects matching the referents of the entity clusters they just created. When a matching node is found, its ID is associated with all mentions in the entity cluster, thus establishing a link. If no matching entry in the KB is found, but annotators feel the source document includes sufficient information to identify the entity referent, the entity cluster is marked as NIL. If annotators do not feel that the source document includes sufficient information to identify an entity referent (and, thereby, link it to a node in the KB), the entity cluster is marked as NIL-Unknown. Annotators are allowed to use online searching to assist in determining the KB link/NIL status.

For the more recent, cross-lingual version of EDL annotation, additional information is added to NIL entity mention clusters from non-English source documents (Chinese and Spanish for EDL 2015). This includes an English translation of the name and, if available, English summaries of information about the entity from the source document and/or links to online English-language references about the entity. This information is used by English annotators in the downstream task of creating cross-document, cross-lingual NIL entity clusters from the within-document, monolingual clusters. As a result, EDL annotation on non-English source documents is performed by native speakers of the target language who are also fluent English speakers. Note that English language fluency is also necessary for effectively searching the KB, as most of its entries are only in English.

Following a first pass of EDL annotation, senior annotators conduct quality control on annotated entity mentions to correct errors and identify areas of difficulty for improving guidelines and annotator training. Annotators performing quality control check the text extents of selected mentions, the coreferenced clusters of entity mentions, the mention and entity type labels that were applied, and the KB links. Some NIL entities are also checked through new searches in the KB.

For 2014, when EDL was a monolingual task in English, only named persons, organizations, and geo-political entities were considered valid. In 2015, the task became cross-lingual, so Chinese and Spanish source documents were added, along with two new entity types: facilities and locations. Also for 2015, extraction of entity mentions from English documents was expanded to include nominal mentions. Note that quote regions in discussion forum threads, which are indicated as distinct elements in the XML, were ignored for the purposes of annotation in order to ensure that manual effort was not spent on producing redundant annotations.

3.2 Cross-document NIL-entity Coreference

Following completion of the annotation and quality control processes described above over all source documents, senior annotators conduct cross-document coreference for all of the within-document entity clusters marked as NIL. For this task, clusters are split up by entity type, and annotators then use sorting and searching techniques to identify clusters that might require further collapsing. For example, clusters that include mentions with strings or substrings matching those in other clusters are reviewed. As mentioned earlier, cross-document NIL-entity coreference is conducted by senior-level English annotators who, for cross-lingual versions of the task, use the provided English translations and online references to collapse English, Chinese, and Spanish entity mention clusters. In the vast majority of cases, the additional information provided by non-English annotators is sufficient for collapsing cross-lingual clusters. However, Spanish and Chinese annotators are available to further disambiguate non-English entity cluster referents if needed.

4. Source Documents

All the text data in the source files have been taken directly from previous LDC corpus releases and are provided here essentially "as-is", with little or no additional quality control.
An overall scan of character content in the source collections indicates relatively small quantities of various problems, especially in the web and discussion forum data, including language mismatch (characters from Chinese, Korean, Japanese, Arabic, Russian, etc.) and encoding errors (some documents have apparently undergone "double encoding" into UTF-8, and others may have been "noisy" to begin with or may have gone through an improper encoding conversion, yielding occurrences of the Unicode "replacement character" (U+FFFD) throughout the corpus); the web collection also has characters whose Unicode code points lie outside the Basic Multilingual Plane (BMP), i.e. above U+FFFF.

All documents with filenames beginning with "cmn-NG" and "eng-NG" are Web Document data (WB), and some of these fail XML parsing (see below for details). All files that start with "bolt-" are Discussion Forum threads (DF) and have the XML structure described below. All other files are Newswire data (NW) and have the newswire markup pattern detailed below.

Note as well that some source documents are duplicated across a few of the separate source_documents directories, indicating that some queries from different data sets originated from the same source documents. As it is acceptable for sources to be reused for Entity Linking queries, this duplication is intentional and expected.

The subsections below go into more detail regarding the markup and other properties of the three source data types.

4.1 Newswire Data

Newswire data use the following markup framework:

  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present), and the TEXT content may or may not include "<P> ... </P>" tags (depending on whether or not the "doc_type_label" is "story"). All the newswire files are parseable as XML.
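Because the newswire files parse as XML, they can be read with any standard XML library. The following is a minimal sketch, assuming Python 3 and the element layout shown in the framework above; it is not part of this package's tools.

  import xml.etree.ElementTree as ET

  # Read one newswire source document: the root element is <DOC>, with
  # optional <HEADLINE> and <DATELINE> children and a <TEXT> body that may
  # or may not be divided into <P> paragraphs.
  def read_newswire(path):
      root = ET.parse(path).getroot()
      doc_id = root.get("id")
      headline = (root.findtext("HEADLINE") or "").strip()
      text = root.find("TEXT")
      paragraphs = []
      if text is not None:
          paragraphs = [p.text.strip() for p in text.findall("P") if p.text]
          if not paragraphs and text.text and text.text.strip():
              paragraphs = [text.text.strip()]   # "non-story" docs: no <P> tags
      return doc_id, headline, paragraphs

  # Example call (the path is illustrative):
  # doc_id, headline, paragraphs = read_newswire("path/to/a/newswire/source_document")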
4.2 Discussion Forum Data

Discussion forum files use the following markup framework:

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post author="{author_string}" datetime="{date_time_string}" id="{post_id_string}">
  ...
  <quote orig_author="{author_string}">
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

where there may be arbitrarily deep nesting of quote elements, and other elements may be present (e.g. "<a href=...> ... </a>" anchor tags). Note that each discussion forum unit contains at least five post elements. All the discussion forum files are parseable as XML.

4.3 Web Document Data

"Web" files use the following markup framework:

  <DOC>
  <DOCID> {doc_id_string} </DOCID>
  <DOCTYPE> ... </DOCTYPE>
  <DATETIME> ... </DATETIME>
  <BODY>
  <HEADLINE>
  ...
  </HEADLINE>
  <TEXT>
  <POST>
  <POSTER> ... </POSTER>
  <POSTDATE> ... </POSTDATE>
  ...
  </POST>
  </TEXT>
  </BODY>
  </DOC>

Other kinds of tags may be present ("<QUOTE ...>", "<A ...>", etc.). Some of the web source documents contain material that interferes with XML parsing (e.g. unescaped "&", or "<QUOTE>" tags that lack a corresponding "</QUOTE>").
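Since some of the web files are not well-formed, a cautious reader should not assume that XML parsing will succeed. Below is a minimal sketch of one possible approach, assuming Python 3: attempt the parse and fall back to the raw text, which is in any case what the annotation offsets refer to (see section 5.1).

  import xml.etree.ElementTree as ET

  # Try to parse a source document as XML; fall back to raw text if the
  # file is not well-formed (e.g. an unescaped "&" or an unmatched tag).
  def read_source(path):
      with open(path, encoding="utf-8") as fh:
          raw = fh.read()
      try:
          tree = ET.fromstring(raw)
      except ET.ParseError:
          tree = None
      return tree, raw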
5. Using the Data

5.1 Offset calculation

The values of the beg and end XML elements in the later queries.xml files indicate character offsets that identify text extents in the source. Offset counting starts from the initial opening angle bracket of the <DOC> element (<doc> in DF sources), which is usually the initial character (character 0) of the source. Note as well that character counting includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included.

Note that although strings included in the annotation files (queries and gold standard mentions) generally match source documents, a few characters are normalized in order to enhance readability: newlines are converted to spaces, except where the preceding character was a hyphen ("-"), in which case the newline was removed; and runs of multiple spaces are converted to a single space.

5.2 Proper ingesting of XML queries

While the character offsets are calculated by treating the source document as "raw text", the "name" strings referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up in a source document file as "AT&amp;T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 8 characters (i.e., the character offsets, when provided, will point to a string of length 8). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&amp;amp;T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&amp;T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data.
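As a concrete illustration of sections 5.1 and 5.2, here is a minimal sketch (Python 3, standard library only) that parses a query element from the 2014 queries.xml, reads the corresponding source document as raw text, and compares the two. The "name", "beg", and "end" element names come from this README; the "query" element name, the treatment of the end offset as inclusive, and the path_to_source_doc helper are assumptions made only for illustration.

  import xml.etree.ElementTree as ET

  # Compare a query's "name" with the source text its offsets point at.
  # queries.xml is parsed as XML (so "AT&amp;amp;T" in the file comes back
  # as "AT&amp;T"), while the source document is read as raw text, markup
  # and all, because that is how the beg/end offsets were computed.
  def check_query(query_el, source_path):
      name = query_el.findtext("name")
      beg = int(query_el.findtext("beg"))
      end = int(query_el.findtext("end"))           # assumed inclusive here
      with open(source_path, encoding="utf-8") as fh:
          raw = fh.read()
      span = raw[beg:end + 1]
      # A mention that crosses a line break may still differ slightly because
      # of the newline/space normalization described in section 5.1.
      return span, name, span == name

  # Example (the "query" element name and the helper are illustrative):
  # root = ET.parse("./data/2014/eval/tac_kbp_2014_english_EDL_evaluation_queries.xml").getroot()
  # for query in root.iter("query"):
  #     span, name, ok = check_query(query, path_to_source_doc(query))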
6. Acknowledgments

This material is based on research sponsored by Air Force Research Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:

  Dana Fore (LDC)
  Dave Graff (LDC)
  Neil Kuster (LDC)
  Heng Ji (RPI)
  Hoa Dang (NIST)
  Boyan Onyshkevych (DARPA)
7. References

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, Stephanie Strassel. 2015. Overview of Linguistic Resources for the TAC KBP 2015 Evaluations: Methodologies and Results. TAC KBP 2015 Workshop, National Institute of Standards and Technology, Gaithersburg, Maryland, November 16-17.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp2015_overview.pdf

Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014. Overview of Linguistic Resources for the TAC KBP 2014 Evaluations: Planning, Execution, and Results. TAC KBP 2014 Workshop, National Institute of Standards and Technology, Gaithersburg, Maryland, November 17-18.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-2014-overview.pdf

8. Copyright Information

Portions © 2008-2010 Agence France Presse, © 2009-2010 The Associated Press, © 2009 Los Angeles Times - Washington Post News Service, Inc., © 2009-2010 New York Times, © 2010 The Washington Post Service with Bloomberg News, © 2008-2010, 2014-2015 Xinhua News Agency, © 2008, 2009, 2010, 2014, 2015, 2019 Trustees of the University of Pennsylvania

9. Contact Information

For further information about this data release, contact the following project staff at LDC:

  Joseph Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

--------------------------------------------------------------------------
README created by Neil Kuster on February 3, 2016
  updated by Jeremy Getman on February 11, 2016
  updated by Joe Ellis on April 22, 2016
  updated by Jeremy Getman on September 14, 2016
  updated by Joe Ellis on December 6, 2016
  updated by Joe Ellis on December 8, 2016