TAC KBP Entity Discovery and Linking Comprehensive Evaluation Data 2016-2017

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of the
TAC KBP Entity Discovery and Linking evaluation track in 2016 and 2017.

The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base, extract novel
information about entities from a document collection, and add it to a new or
existing knowledge base.

The goal of the Entity Discovery and Linking (EDL) track is to conduct
end-to-end entity extraction, linking, and clustering. Given a document
collection, an EDL system is required to automatically extract (identify and
classify) entity mentions (queries), link them to nodes in a reference
Knowledge Base (KB), and cluster NIL mentions (those that do not have
corresponding KB entries) into equivalence classes. More information about
the TAC KBP Entity Discovery and Linking track and other TAC KBP evaluations
can be found on the NIST TAC website, http://www.nist.gov/tac/.

This package contains all of the evaluation and training data developed in
support of TAC KBP Entity Discovery & Linking from 2016 to 2017. This
includes queries, KB links, equivalence class clusters for NIL entities
(those that could not be linked to an entity in the knowledge base), and
entity type information for each of the queries.
The EDL reference KB, to which EDL data are linked, is available separately
in LDC2016TX5 TAC KBP Entity Discovery and Linking Comprehensive Training and
Evaluation Data 2014-2015. Source documents referenced by the files in this
package are available separately in LDC2018TXX TAC KBP Evaluation Source
Corpora 2016-2017.

Note: The 2016/2017 EDL scorer can be obtained at
http://nlp.cs.rpi.edu/kbp/2017/scoring.html

The data included in this package were originally released by LDC to TAC KBP
coordinators and performers under the following eCorpora catalog IDs and
titles:

LDC2016E68: TAC KBP 2016 Entity Discovery and Linking Evaluation Gold
            Standard Entity Mentions and Knowledge Base Links
LDC2017E52: TAC KBP 2017 Entity Discovery and Linking Evaluation Gold
            Standard Entity Mentions and Knowledge Base Links

Summary of data included in this package:

+------+----------+----------+
| Year | Task     | Mentions |
+------+----------+----------+
| 2016 | eval     |    24373 |
| 2017 | eval     |    25040 |
+------+----------+----------+

2. Contents

./data/2016/tac_kbp_2016_edl_evaluation_gold_standard_entity_mentions.tab

This file contains 24,373 gold standard responses. Each response consists of
the following fields:

Column 1: system run ID (always "LDC" in these data)
Column 2: mention (query) ID: unique for each entity name mention and in the
          format of 'EDL16_EVAL_XXXXX'
Column 3: mention string: the full string of the query entity mention
Column 4: (document ID):(mention head start offset)-(mention head end
          offset): an ID for a document in the source corpus from which the
          mention head was extracted, the starting offset of the mention
          head, and the ending offset of the mention head (e.g.
          AFP_ENG_20080610.0052:244-252). These are character offsets (not
          byte offsets); the first character in a document is at offset 0.
Column 5: Knowledge Base (KB) link ID or NIL cluster ID: If the ID begins
          with "m", the text refers to an entity in the KB (e.g. 'm.09b6zr').
          If the given query is not linked to an entity in the KB, then it is
          given a NIL ID, which consists of "NIL" plus a five-digit,
          zero-padded integer (e.g. 'NIL00001', 'NIL00002'). Entity mentions
          pointing to equivalent referents are indicated by shared KB link
          IDs or NIL cluster IDs; otherwise, all the IDs are distinct from
          one another.
Column 6: entity type: {GPE, ORG, PER, LOC, FAC} type indicator for the
          entity
Column 7: mention type: {NAM, NOM} type indicator for the entity mention
Column 8: confidence value: always "1.0" in LDC responses
Column 9: web-search: (Y/N) indicating whether the annotator made use of web
          searches in order to make the linking judgment
Column 10: wiki text: (Y/N) indicating whether the annotator made use of the
          wiki text in the knowledge base (as opposed to just the infobox
          information) in order to make the linking judgment
Column 11: unknown: (Y/N) indicating whether the source document contained
          insufficient information about the query entity to determine
          whether or not it existed in the KB. Note that for entity mentions
          with a KB ID, this value is always 'N'. NIL entities can have
          either 'Y' or 'N' in this column.

./data/2017/tac_kbp_2017_edl_evaluation_gold_standard_entity_mentions.tab

This file contains 25,040 gold standard responses. Each response consists of
the following fields:

Column 1: system run ID (always "LDC" in these data)
Column 2: mention (query) ID: unique for each entity name mention and in the
          format of 'EDL17_EVAL_XXXXX'
Column 3: mention string: the full string of the query entity mention
Column 4: (document ID):(mention head start offset)-(mention head end
          offset): an ID for a document in the source corpus from which the
          mention head was extracted, the starting offset of the mention
          head, and the ending offset of the mention head (e.g.
          AFP_ENG_20080610.0052:244-252). These are character offsets (not
          byte offsets); the first character in a document is at offset 0.
Column 5: Knowledge Base (KB) link ID or NIL cluster ID: If the ID begins
          with "m", the text refers to an entity in the KB (e.g. 'm.09b6zr').
          If the given query is not linked to an entity in the KB, then it is
          given a NIL ID, which consists of "NIL" plus a five-digit,
          zero-padded integer (e.g. 'NIL00001', 'NIL00002'). Entity mentions
          pointing to equivalent referents are indicated by shared KB link
          IDs or NIL cluster IDs; otherwise, all the IDs are distinct from
          one another.
Column 6: entity type: {GPE, ORG, PER, LOC, FAC} type indicator for the
          entity
Column 7: mention type: {NAM, NOM} type indicator for the entity mention
Column 8: confidence value: always "1.0" in LDC responses
Column 9: web-search: (Y/N) indicating whether the annotator made use of web
          searches in order to make the linking judgment
Column 10: wiki text: (Y/N) indicating whether the annotator made use of the
          wiki text in the knowledge base (as opposed to just the infobox
          information) in order to make the linking judgment
Column 11: unknown: (Y/N) indicating whether the source document contained
          insufficient information about the query entity to determine
          whether or not it existed in the KB. Note that for entity mentions
          with a KB ID, this value is always 'N'. NIL entities can have
          either 'Y' or 'N' in this column.

./docs/TAC_KBP_2016_EDL_Guidelines_V1.1.pdf

The version of the Entity Discovery & Linking annotation guidelines used by
annotators in 2016.

./docs/TAC_KBP_2017_EDL_Guidelines_V1.0.pdf

The version of the Entity Discovery & Linking annotation guidelines used by
annotators in 2017.

./docs/TAC_KBP_2016_2017_Entity_Discovery_and_Linking_Task_Description.pdf

Task Description for the 2016 and 2017 Entity Discovery and Linking
evaluation tracks, written by track coordinators. Note that the same task
description was used for both years of the task.
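Each response in the .tab files described above is a single tab-separated
line. As a concrete illustration, the following sketch parses one line into a
named tuple. The sample row, the Mention class, and the pairing of the string
"Obama" with the KB ID 'm.09b6zr' are fabricated for illustration; only the
column layout follows the descriptions above.

```python
# Sketch of a parser for the 11-column gold standard .tab files.
# The sample row is fabricated; only the column layout follows the README.
from typing import NamedTuple

class Mention(NamedTuple):
    run_id: str        # Column 1: always "LDC" in these data
    query_id: str      # Column 2: e.g. EDL17_EVAL_XXXXX
    mention: str       # Column 3: full mention string
    doc_id: str        # from Column 4: source document ID
    beg: int           # from Column 4: offset of first character of head
    end: int           # from Column 4: offset of last character of head
    kb_or_nil_id: str  # Column 5: 'm.*' KB ID or 'NIL' + 5 digits
    entity_type: str   # Column 6: GPE, ORG, PER, LOC, or FAC
    mention_type: str  # Column 7: NAM or NOM
    confidence: float  # Column 8: always 1.0 in LDC responses
    web_search: str    # Column 9: Y/N
    wiki_text: str     # Column 10: Y/N
    unknown: str       # Column 11: Y/N

def parse_line(line: str) -> Mention:
    cols = line.rstrip("\n").split("\t")
    # Column 4 packs document ID and head offsets as docid:beg-end;
    # split on the last ':' in case a document ID contains one.
    doc_id, span = cols[3].rsplit(":", 1)
    beg, end = (int(x) for x in span.split("-"))
    return Mention(cols[0], cols[1], cols[2], doc_id, beg, end,
                   cols[4], cols[5], cols[6], float(cols[7]),
                   cols[8], cols[9], cols[10])

# Hypothetical example row (not taken from the corpus):
row = ("LDC\tEDL17_EVAL_00001\tObama\tAFP_ENG_20080610.0052:244-248"
       "\tm.09b6zr\tPER\tNAM\t1.0\tN\tY\tN")
m = parse_line(row)
print(m.doc_id, m.beg, m.end, m.kb_or_nil_id.startswith("NIL"))
```

Given the raw text of the corresponding source document, the mention head can
then be recovered with a slice such as text[m.beg:m.end + 1], on the
assumption that the end offset is inclusive (the example offsets 244-252
above span nine characters).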
3. Annotation Tasks

Once a set of source documents is identified, two separate annotation tasks
are conducted to produce gold standard data for Entity Discovery & Linking
(EDL). In the first, annotators exhaustively extract and cluster entity
mentions from a single document and then attempt to link each entity cluster
to a node in a reference knowledge base (KB). In the second task, annotators
perform cross-document coreference on within-document entity clusters that
could not be linked to the KB.

3.1 Entity Discovery & Linking

There are two distinct phases to the first annotation task used to develop
gold standard EDL data: Entity Discovery (ED) and Entity Linking (EL).

ED involves exhaustive extraction and clustering of valid entity mentions
from a single source document. In this part of the task, annotators review a
single document and select text extents to indicate valid entity mentions (in
more recent versions of the task, labels were also applied to indicate
mention type). Every time the first mention of a new entity is selected,
annotators also create a new entity cluster, a "bucket" into which all
subsequent mentions of the entity are collected and to which an entity type
label is applied. Thus, within-document coreference of entities is performed
concurrently with mention selection.

Having selected, clustered, and labeled all valid entity mentions, annotators
proceed to EL. In this part of the task, annotators use a specialized search
engine to review the contents of the reference KB, looking for entries whose
primary subjects match the referents of the entity clusters they just
created. When a matching node is found, its ID is associated with all
mentions in the entity cluster, thus establishing a link. If no matching
entry in the KB is found, but annotators feel the source document includes
sufficient information to identify the entity referent, the entity cluster is
marked as NIL.
If annotators do not feel that the source document includes sufficient
information to identify an entity referent (and, thereby, link it to a node
in the KB), the entity cluster is marked as NIL-Unknown. Annotators are
allowed to use online searching to assist in determining the KB link/NIL
status.

Additional information is added to NIL entity mention clusters from
non-English source documents (Chinese and Spanish). This includes an English
translation of the name and, if available, English summaries of information
about the entity from the source document and/or links to online
English-language references about the entity. This information is used by
English annotators in the downstream task of creating cross-document,
cross-lingual NIL entity clusters from the within-document, monolingual
clusters. As a result, EDL annotation on non-English source documents is
performed by native speakers of the target language who are also fluent
English speakers. Note that English language fluency is also necessary for
effectively searching the KB, as most of its entries are only in English.

Following a first pass of EDL annotation, senior annotators conduct quality
control on annotated entity mentions to correct errors and identify areas of
difficulty for improving guidelines and annotator training. Annotators
performing quality control check the text extents of selected mentions, the
coreferenced clusters of entity mentions, the mention and entity type labels
that were applied, and KB links. Some NIL entities are also checked through
new searches in the KB.

3.2 Cross-document NIL-entity Coreference

Following completion of the annotation and quality control processes
described above over all source documents, senior annotators conduct
cross-document coreference for all of the within-document entity clusters
marked as NIL.
For this task, clusters are split up by entity type and then annotators use
sorting and searching techniques to identify clusters that might require
further collapsing. For example, clusters that include mentions with strings
or substrings matching those in other clusters are reviewed. As mentioned
earlier, cross-document NIL-entity coreference is conducted by senior-level
English annotators who, for cross-lingual versions of the task, use provided
English translations and online references to collapse English, Chinese, and
Spanish entity mention clusters. In the vast majority of cases, the
additional information provided by non-English annotators is sufficient for
collapsing cross-lingual clusters. However, Spanish and Chinese annotators
are available to further disambiguate non-English entity cluster referents if
needed.

4. Using the Data

4.1 Offset calculation

The beg and end offset values in Column 4 of the two tabular files in ./data
indicate character offsets that identify text extents in the source
documents. Offset counting starts from the initial opening angle bracket of
the first element in the document (<doc> in DF sources), which is usually the
initial character (character 0) of the source. Note as well that character
counting includes newlines and all markup characters - that is, the offsets
are based on treating the source document file as "raw text", with all its
markup included.

Note that although strings included in the annotation files generally match
source documents, a few characters are normalized in order to enhance
readability: newlines are converted to spaces, except where the preceding
character was a hyphen ("-"), in which case newlines were removed; and
multiple spaces are converted to a single space.

5. Acknowledgments

This material is based on research sponsored by Air Force Research Laboratory
and Defense Advanced Research Projects Agency under agreement number
FA8750-13-2-0045. The U.S.
Government is authorized to reproduce and distribute reprints for
Governmental purposes notwithstanding any copyright notation thereon. The
views and conclusions contained herein are those of the authors and should
not be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of Air Force Research Laboratory
and Defense Advanced Research Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:

Heng Ji (RPI)
Hoa Dang (NIST)
Boyan Onyshkevych (DARPA)

6. References

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, & Stephanie M.
Strassel. 2016. Overview of Linguistic Resources for the TAC KBP 2016
Evaluations: Methodologies and Results. TAC KBP 2016 Workshop: National
Institute of Standards and Technology, Gaithersburg, MD, November 14-15.

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, & Stephanie M.
Strassel. 2017. Overview of Linguistic Resources for the TAC KBP 2017
Evaluations: Methodologies and Results. TAC KBP 2017 Workshop: National
Institute of Standards and Technology, Gaithersburg, MD, November 13-14.

7. Copyright Information

(c) 2018 Trustees of the University of Pennsylvania

8. Contact Information

For further information about this data release, contact the following
project staff at LDC:

Jeremy Getman, Project Manager
Stephanie Strassel, PI

--------------------------------------------------------------------------
README created by Joseph Carlough on March 26, 2018
  updated by Jeremy Getman on May 10, 2018
  updated by Jeremy Getman on November 18, 2019