TAC KBP Cold Start
              Comprehensive Evaluation Data 2012-2017

           Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel


1. Overview

This package contains evaluation data produced in support of the TAC KBP 
Cold Start evaluation track conducted from 2012 to 2017.

Text Analysis Conference (TAC) is a series of workshops organized by the 
National Institute of Standards and Technology (NIST). TAC was developed 
to encourage research in natural language processing (NLP) and related 
applications by providing a large test collection, common evaluation 
procedures, and a forum for researchers to share their results. Through 
its various evaluations, the Knowledge Base Population (KBP) track of 
TAC encourages the development of systems that can match entities 
mentioned in natural texts with those appearing in a knowledge base and 
extract novel information about entities from a document collection and 
add it to a new or existing knowledge base. 

Cold Start is designed to evaluate a system's ability to construct a new 
knowledge base (KB) from the information provided in a text collection, 
combining technologies developed via other TAC KBP evaluation tracks. 
Like the Slot Filling track (SF), Cold Start involves mining information 
about entities from text, and can be viewed as an Information Extraction 
(IE) or Question Answering (QA). As in the Entity Discovery & Linking 
track (EDL), Cold Start systems must also find all entities mentioned in 
the text. Ideally, Cold Start KBs include every person, organization, 
and geo-political entity mentioned in the text collection as well as all 
of the targeted relations between them. To facilitate the evaluation of 
these KBs, LDC annotators create sets of queries, human-generated 
responses to the queries, and assessments of both human and system 
responses. More information about Cold Start and other TAC KBP 
evaluations can be found on the NIST TAC website, 
http://www.nist.gov/tac/ 

This package contains all evaluation data developed in support of TAC 
KBP Cold Start during the six years the track was conducted, from
2012 to 2017. This includes queries, the manual runs produced by LDC 
annotators, and the final assessment results for each evaluation year. 
Source collections for the 2012, 2014, and 2015 evaluations are also 
included. The source collections used in the 2016 and 2017 evaluations 
were not specific to Cold Start and are available as  LDC2019T12 TAC KBP
Evaluation Source Corpora 2016-2017, although the 2016 pilot source 
colletion is included here. The archived 2013 Cold Start source 
collection is available but you must contact NIST to request access: 
http://www.nist.gov/tac/data/index.html 

The data included in this package were originally released by LDC to TAC 
KBP coordinators and performers under the following ecorpora catalog IDs 
and titles: 

LDC2012E104: TAC 2012 KBP Cold Start Evaluation Corpus v1.3
LDC2012E105: TAC 2012 KBP Cold Start Queries V1.1
LDC2012E116: TAC 2012 KBP Cold Start Assessment Results
LDC2013E101: TAC 2013 KBP English Cold Start Evaluation Assessment 
             Results
LDC2013E39:  TAC 2012 KBP Cold Start Automated Queries Assessment 
             Results
LDC2013E87:  TAC 2013 KBP English Cold Start Evaluation Queries and 
             Annotations V1.1
LDC2014E73:  TAC 2014 KBP English Cold Start Evaluation Queries and 
             Annotations V1.1
LDC2014E82:  TAC 2014 KBP English Cold Start Evaluation Assessment 
             Results V2.1
LDC2014R42:  TAC 2014 KBP English Cold Start Evaluation Source Corpus
LDC2015E48:  TAC KBP English Cold Start Collected Evaluation Data Sets 
             2012-2014
LDC2015E72:  TAC KBP 2015 English Cold Start Entity Discovery 
             Sample Data
LDC2015E77:  TAC KBP 2015 English Cold Start Evaluation Source Corpus 
             V2.0
LDC2015E80:  TAC KBP 2015 English Cold Start Evaluation Queries and 
             Manual Run
LDC2015E81:  TAC KBP 2015 English Cold Start Entity Discovery Evaluation 
             Gold Standard Entity Mentions V2.0
LDC2015E100: TAC KBP 2015 English Cold Start Evaluation Assessment 
             Results V3.1
LDC2016E41:  TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot 
             Training Data V1.1
LDC2016E42:  TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot 
             Source Corpus
LDC2016E44:  TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot 
             Queries and Manual Run
LDC2016E52:  TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot 
             Assessment Results V1.1
LDC2016E69:  TAC KBP 2016 Cold Start Evaluation Queries and Manual Run 
             V1.1
LDC2016E106: TAC KBP 2016 Cold Start Evaluation Assessment Results V3.0
LDC2017E04:  TAC KBP Cold Start Comprehensive Evaluation Data 2012-2016
LDC2017E34:  TAC KBP 2017 Cold Start Evaluation Queries and Manual Run 
             V1.2
LDC2017E56:  TAC KBP 2017 Cold Start Evaluation Assessment Results V3.0

Summary of data included in this package:
+------+------------------+---------+-------------+-----------------+
| Year | Source Documents | Queries | Assessments | Manual Responses
+------+------------------+---------+-------------+-----------------+
| 2012 |            26469 |     385 |        5015 |             979 |
| 2013 |               0* |     326 |        6745 |            1595 |
| 2014 |            50192 |     247 |        7258 |            1386 |
| 2015 |            49124 |    2539 |       30654 |            2218 |
| 2016 |               0* |    4636 |       26234 |            6756 |
| 2017 |               0* |    1392 |       26802 |            3495 |
+------+------------------+---------+-------------+-----------------+
* see above regarding 2013, 2016, and 2017 Cold Start source collections


2. Contents

./README.txt

  This file.

./data/{2012,2013,2014,2015,2016,2017}/contents.txt

  The data in this package are organized by the evaluation year in order 
  to clarify dependencies, highlight occasional differences in formats 
  from one year to another, and to increase readability in 
  documentation. The contents.txt file within each year's root directory 
  provides a list of the contents for all subdirectories as well as 
  details about file formats and contents.

./docs/guidelines/{2012,2013,2014,2015,2016,2017}/*

  The guidelines used by annotators in developing the respective year's
  Cold Start queries, annotations, and assessments.

./docs/task_descriptions/*

  Task Descriptions for the respective 2012-2017 Cold Start evaluation 
  tracks, written by evaluation track coordinators. 

./dtd/cold_start_queries_2012.dtd

  DTD for:
  
  ./data/2012/tac_kbp_2012_cold_start_evaluation_queries.xml
  ./data/2012/tac_kbp_2012_cold_start_automated_queries.xml

./dtd/cold_start_queries_2013.dtd

  DTD for:
  
  ./data/2013/tac_kbp_2013_cold_start_evaluation_queries.xml
  ./data/2013/tac_kbp_2013_cold_start_1-hop_queries.xml

./dtd/cold_start_queries_2014-2015.dtd

  DTD for:
  
  ./data/2014/tac_kbp_2014_cold_start_evaluation_queries.xml
  ./data/2015/tac_kbp_2015_cold_start_evaluation_queries.xml
  ./data/2015/tac_kbp_2015_cold_start_evaluation_queries_v2.1.xml
  ./data/2015/tac_kbp_2015_cold_start_slot_filling_evaluation_queries_v2.xml
  
./dtd/cold_start_queries_2016.dtd

  DTD for:
  
  ./data/2016/eval/tac_kbp_2016_cold_start_evaluation_queries.xml

./dtd/spanish-english_cold_start_queries.dtd

  DTD for:
  
  ./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_queries_cssf.xml
  ./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_queries_ldc.xml
  ./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_validated_queries.xml
  ./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_training_validated_queries.xml

./dtd/cold_start_queries_2017.dtd

  DTD for:
  
  ./data/tac_kbp_2017_cold_start_evaluation_queries.xml

./tools/2012/*

  Tools for 2012 cold start, as provided to LDC by evaluation track
  coordinators, with no further testing. See ./ResolveQueries.pl 
  for more information.

./tools/2013/*

  Tools for 2013 cold start, as provided by evaluation track
  coordinators, with no further testing. See
  ./TAC_2013_KBP_Cold_Start_Example_Documents/Cold_Start_Sample_Collection_2.0/README.txt
  for more information.

./tools/2014/*

  Tools for 2014 cold start, as provided by evaluation track
  coordinators, with no further testing. See
  ./README-Scoring.md.txt 
  for more information.

Note: To request 2015, 2016, or 2017 Cold Start tools, contact NIST
http://www.nist.gov/tac/data/index.html


3. Annotation tasks

Cold Start data development primarily involves three annotation 
tasks: query development, manual run annotation, and assessment. 
Entity Discovery was an additional task conducted only in 2015. 
Each of these subtasks is explained below.

3.1 Query Development

In Cold Start query development, annotators create sets of queries, with 
each set defined by a shared Entry Point Entity (EPE). The EPE in a Cold 
Start query is the first entity initiating a chain of relations. For 
example, in the query "Find all parent organizations of organizations at 
which 'Jane Doe' has been an employee", the EPE would be "Jane Doe". 
Ideally, EPEs allow for multiple queries, some of which can generate 
multiple responses from the source collection (though not too many) and 
others that allow for the utilization of under-represented TAC KBP slots 
(the official set of valid attributes pertaining to entities). 

In order to find promising EPEs, query developers generally begin by 
conducting searches through the corpus, focusing on key words related 
to the set of TAC KBP slots. For example, annotators might search for 
"arrested" or "charged" to find entities related to arrest or conviction 
events. Once an initial 'seed' relation is found, query developers 
search elsewhere in the corpus for other mentions of the related 
entities. Whichever entity seems the most promising is then chosen as 
the EPE and annotators extract 2-5 other mentions of it from different
source documents. When possible, confusable name strings such as aliases 
or misspellings are selected to add difficulty to the queries. 
Throughout the process of query development, annotators also attempt to 
balance query entity types (PER, GPE, ORG, FAC, or LOC), response types
(entity or string), and document genre (formal or informal).

3.2 Manual Run Development

Having created a set of queries that share an EPE, annotators proceed to 
generating the 'manual run', the set of all human-produced responses to 
the queries that can be found in the corpus. In this task annotators 
again search the corpus for mentions of the EPE participating in the 
specified TAC KBP relations, using online searching as well to research 
the entities and guide keyword searches. 

In order to be valid, responses must include justification - the minimum 
extents of provenance supporting the validity of a response. Valid 
justification strings must clearly identify all three elements of a 
relation (i.e. the subject entity, the predicate slot, and the object 
filler) with minimal extraneous text. In 2013, justification was 
modified to allow for up to two discontiguous strings selected from as 
many separate documents, up from one string in 2012. In 2014, 
justification was again altered to allow for up to four justification 
strings. This facilitated a greater potential for inferred relations 
that would be difficult to justify with just a single document. 

Note that, for Cold Start 2012-2015, the query and manual run 
development tasks were conducted concurrently, such that annotators 
could switch back and forth between finding queries and finding as many 
valid responses for them as the corpus had to offer. This approach was 
taken simply to increase efficiency as it requires annotators to only 
have to research query entities once. In 2016-2017, the query and manual
run development tasks were conducted separately, in an effort to
increase the number of responses found during the manual run.

Following the initial round of query and manual run development, a 
quality control pass is conducted by senior annotators to check extents 
for EPE mentions and responses and to ensure that responses have 
adequate justification in the source document and are not at variance 
with the guidelines in any way. Any responses that are not clearly 
correct or incorrect are flagged for further review by Lead Annotators 
and possibly managers. 

3.3 Assessment

In assessment, annotators assess and coreference anonymized responses 
returned from both the manual run and from systems. Fillers are marked 
as correct if they are found to be both compatible with the slot 
descriptions and supported in the provided justification string(s) 
and/or its surrounding content. Fillers are assessed as wrong if they do 
not meet both of the conditions for correctness, or as inexact if overly 
insufficient or extraneous text was selected for an otherwise correct 
response. 

Justification receives a separate assessment from the response, being 
marked as correct if it succinctly and completely supports the relation, 
wrong if it does not support the relation at all (or if the 
corresponding filler is marked wrong), inexact-short if part but not all 
of the information necessary to support the relation is provided, or 
inexact-long if it contains all information necessary to support the 
relation but also a great deal of extraneous text. Starting in 2014, 
responses with justification comprising more than 600 characters in 
total were automatically ignored and removed from the pool of responses 
for assessment. 

After first passes of assessment are completed, quality control is 
performed on the data by senior annotators. During quality control, the 
extent of each annotated filler and justification are checked for 
correctness, entity equivalence classes were checked for accuracy, and 
potentially problematic assessments are either corrected or flagged for 
additional review. 

3.4 Entity Discovery
Within 2015 Cold Start, an additional evaluation track, Entity Discovery 
(ED), was conducted to provide another metric for measuring systems' 
ability to find and extract all valid entity mentions, an obvious 
preliminary to successfully completing full Cold Start. The data 
development tasks conducted in support of ED were essentially those 
conducted in support of Entity Discovery and Linking, another TAC KBP 
track, though with some slight modifications. 

Source documents for 2015 Entity Discovery are a subset of those 
included in the full 2015 Cold Start source collection. This subset was 
selected based on features indicated by the Cold Start queries, which 
were in development at the time ED source documents were selected. These 
features include mention of ambiguous entites (those that had aliases or 
shared a name with other entities) and entities that were referenced in 
multiple documents across the corpus. The two genres of source documents 
in the full collection (newswire and discussion forum) are also roughly 
equally represented in the ED subset. 

Once the set of source documents is selected, annotators exhaustively 
extract and cluster valid entity mentions from each one. Given a single 
document, annotators developing the ED gold standard select text extents 
to indicate valid entity mentions. Every time the first mention of a new 
entity is selected, annotators also create a new entity cluster, a 
"bucket" into which all subsequent mentions of the entity are collected 
and to which an entity type label is applied. Thus, within-document 
coreference of entities is performed concurrently with mention 
selection. As documents are completed, annotators performing quality 
control make sure that the extent of each selected namestring is correct 
and that each entity is coreferenced correctly. 

Following completion of ED over all documents in the collection, senior 
annotators conduct cross-document coreference for all of the 
within-document entity clusters. For this task, clusters are split up by 
entity type and then sorting and searching techniques are used to 
identify clusters that might require further collapsing. For example, 
clusters that include mentions with strings or substrings that match 
those in other clusters are reviewed. 


4. Source Documents 

The source data contained in this release comprises all documents from 
which queries were drawn for 2012, 2014 and 2015. The source data was 
drawn from existing LDC holdings, with no additional validation. An 
overall scan of character content in the source collections indicates 
some relatively small quantities of various problems, especially in the 
web and discussion forum data, including language mismatch (characters 
from Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding 
errors(some documents have apparently undergone "double encoding" into 
UTF-8, and others may have been "noisy" to begin with, or may have gone 
through an improper encoding conversion, yielding occurrences of the 
Unicode "replacement character" (U+FFFD) throughout the corpus); the web 
collection also has characters whose Unicode code points lie outside the 
"Basic Multilanguage Plane" (BMP), i.e. above U+FFFF. 

All source documents originally released as XML have been converted to 
text files for this release. This change was made primarily because the 
documents were used as text files during data development but also 
because some fail XML parsing. All documents that have filenames 
beginning with "eng-NG" are Web Document data (WB) and some of these 
fail XML parsing (see below for details). All files that start with 
"bolt-" are Discussion Forum threads (DF) and have the structure 
described below. All other files are Newswire data (NW) and have the 
newswire markup pattern detailed below. 

Note as well that some source documents are duplicated across a few of 
the separated source_documents directories, indicating that some queries 
from different data sets originated from the same source documents. As 
it is acceptable for source to be reused for Entity Linking queries, 
this duplication is intentional and expected. 

The subsections below go into more detail regarding the markup and other 
properties of the three source data types: 

4.1. Newswire Data

Newswire data use the following markup framework:

  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present), 
and the TEXT content may or may not include "<P> ... </P>" tags 
(depending on whether or not the "doc_type_label" is "story"). 

Some NW files contain a single double-escaped ampersand.

All the newswire files, if converted back to XML files are parseable.

4.2 Multi-Post Discussion Forum Data

Multi-Post Discussion Forum files (MPDFs) are derived from Discussion 
Forum threads. They consist of a continuous run of posts from a thread
but they are only approximately 800 words in length (excluding
metadata and text within <quote> elements). When taken from a short 
thread, a MPDF may comprise the entire thread. However, when taken from 
longer threads, a MPDF is a truncated version of its source, though it 
will always start with the preliminary post. 

361 of the 40,186 MPDF files have a total of 974 various forms of 
double-escapes; like '& amp;#x202a;', 'Obama& amp;rsquo;s', etc., as 
well as things like 'http://some.url/query?a=525119& amp;f=19">', which 
isn't really a double-escape, but rather something else that resembles a 
double-escape. 

The MPDF files use the following markup framework, in which there may 
also be arbitrarily deep nesting of quote elements, and other elements 
may be present (e.g. "<a...>...</a>" anchor tags): 

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post ...>
  ...
  <quote ...>
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

All the discussion forum files, if converted back to XML files are 
parseable. 

4.3  Web Document Data

"Web" files use the following markup framework:

  <DOC>
  <DOCID> {doc_id_string} </DOCID>
  <DOCTYPE> ... </DOCTYPE>
  <DATETIME> ... </DATETIME>
  <BODY>
  <HEADLINE>
  ...
  </HEADLINE>
  <TEXT>
  <POST>
  <POSTER> ... </POSTER>
  <POSTDATE> ... </POSTDATE>
  ...
  </POST>
  </TEXT>
  </BODY>
  </DOC>

Other kinds of tags may be present ("<QUOTE ...>", "<A ...>", etc).


5. Using the Data

5.1 Text normalization and offset calculation

Text normalization of queries consisting of a 1-for-1 substitution of 
newline (0x0A) and tab (0x09) characters with space (0x20) characters 
was performed on the document text input to the response field. 

The values of the 'beg=' and 'end=' XML attributes in the more recent 
queries.xml files indicate character offsets to identify text extents in 
the source. Offset counting starts from the initial opening angle 
bracket of the <DOC> element (<doc> in DF sources), which is usually, 
but not always, the initial character (character 0) of the source. Note 
as well that character counting includes newlines and all markup 
characters - that is, the offsets are based on treating the source 
document file as "raw text", with all its markup included. 

Note that although strings included in the annotation files (queries and 
gold standard mentions) generally match source documents, a few 
characters are normalized in order to enhance readability: Conversion of 
newlines to spaces, except where preceding characters were hyphens 
("-"), in which case newlines were removed, and conversion of multiple 
spaces to a single space. 

5.2 Proper ingesting of XML queries

While the character offsets are calculated based on treating the source 
document as "raw text", the "name" strings being referenced by the 
queries sometimes contain XML metacharacters, and these had to be 
"re-escaped" for proper inclusion in the queries.xml file. For example, 
an actual name like "AT&T" may show up a source document file as 
"AT&amp;T" (because the source document was originally formatted as XML 
data). But since the source doc is being treated here as raw text, this 
name string is treated in queries.xml as having 7 characters (i.e., the 
character offsets, when provided, will point to a string of length 7). 

However, the "name" element itself, as presented in the queries.xml 
file, will be even longer - "AT&amp;amp;T" - because the queries.xml 
file is intended to be handled by an XML parser, which will return 
"AT&amp;T" when this "name" element is extracted. Using the queries.xml 
data without XML parsing would yield a mismatch between the "name" value 
and the corresponding string in the source data. 


6. Acknowledgemnts

This material is based on research sponsored by Air Force Research 
Laboratory and Defense Advance Research Projects Agency under agreement 
number FA8750-13-2-0045. The U.S. Government is authoized to reporoduce 
and distribute reprints for Governmental purposes notwithstanding any 
copyright notation thereon. The views and conclusions contained herein 
are those of the authors and should not be interpreted as necessarily 
representing the official policies or endorsements, either expressed or 
implied, of Air Force Research Laboratory and Defense Advanced Research 
Projects Agency or the U.S. Government. 

The authors acknowledge the following contributors to this data set:
Dana Fore (LDC)
Dave Graff (LDC)
James Mayfield (JHU)
Hoa Dang (NIST)
Boyan Onyshkevych (DARPA)


7. References

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, & Stephanie M.
Strassel. 2017
Overview of Linguistic Resources for the TAC KBP 2017 Evaluations: 
Methodologies and Results 
TAC KBP 2017 Workshop: National Institute of Standards and Technology, 
Gaithersburg, MD, November 13-14 

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, & Stephanie
M. Strassel. 2016
Overview of Linguistic Resources for the TAC KBP 2016 Evaluations: 
Methodologies and Results 
TAC KBP 2016 Workshop: National Institute of Standards and Technology, 
Gaithersburg, MD, November 14-15 

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, 
Stephanie M. Strassel. 2015
Overview of Linguistic Resources for the TAC KBP 2015 Evaluations: 
Methodologies and Results
TAC KBP Workshop 2015: National Institute of Standards and Technology, 
Gaithersburg, MD, November 16-17

Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014
Overview of Linguistic Resources for the TAC KBP 2014 Evaluations: 
Planning, Execution, and Results
TAC KBP 2014 Workshop: National Institute of Standards and Technology, 
Gaithersburg, Maryland, November 17-18

Joe Ellis, Jeremy Getman, Justin Mott, Xuansong Li, Kira Griffitt, 
Stephanie M. Strassel, Jonathan Wright. 2013
Linguistic Resources for 2013 Knowledge Base Population Evaluations
TAC KBP 2013 Workshop: National Institute of Standards and Technology,
Gaithersburg, MD, November 18-19

Joe Ellis, Xuansong Li, Kira Griffitt, Stephanie M. Strassel,
Jonathan Wright. 2012 
Linguistic Resources for 2012 Knowledge Base Population Evaluations
TAC KBP 2012 Workshop: National Institute of Standards and Technology,
Gaithersburg, MD, November 5-6


8. Copyright Information

(c) 2018 Trustees of the University of Pennsylvania


9. Contact Information

For further information about this data release, or the TAC KBP
project, contact the following project staff at LDC:

    Joe Ellis, Project Manager           <joellis@ldc.upenn.edu>
    Jeremy Getman, Lead Annotator        <jgetman@ldc.upenn.edu>
    Stephanie Strassel, PI               <strassel@ldc.upenn.edu>

-----------------------------------------------------------------------------
README created by Dana Fore on February 24, 2016
       updated by Dana Fore on March 22, 2016
       updated by Dana Fore on April 4, 2016
       updated by Jeremy Getman on April 4, 2016
       updated by Neil Kuster on September 22, 2016
       updated by Joe Ellis on December 21, 2016 
       updated by Jeremy Getman on December 20, 2017
       updated by Jeremy Getman on March 22, 2018
       updated by Jeremy Getman on May 18, 2018
       updated by Jeremy Getman on May 21, 2019