TAC KBP Cold Start
Comprehensive Evaluation Data 2012-2017
Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel
1. Overview
This package contains evaluation data produced in support of the TAC KBP
Cold Start evaluation track conducted from 2012 to 2017.
The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed
to encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through
its various evaluations, the Knowledge Base Population (KBP) track of
TAC encourages the development of systems that can match entities
mentioned in natural texts with those appearing in a knowledge base and
extract novel information about entities from a document collection and
add it to a new or existing knowledge base.
Cold Start is designed to evaluate a system's ability to construct a new
knowledge base (KB) from the information provided in a text collection,
combining technologies developed via other TAC KBP evaluation tracks.
Like the Slot Filling track (SF), Cold Start involves mining information
about entities from text, and can be viewed as an Information Extraction
(IE) or Question Answering (QA) task. As in the Entity Discovery & Linking
track (EDL), Cold Start systems must also find all entities mentioned in
the text. Ideally, Cold Start KBs include every person, organization,
and geo-political entity mentioned in the text collection as well as all
of the targeted relations between them. To facilitate the evaluation of
these KBs, LDC annotators create sets of queries, human-generated
responses to the queries, and assessments of both human and system
responses. More information about Cold Start and other TAC KBP
evaluations can be found on the NIST TAC website,
http://www.nist.gov/tac/
This package contains all evaluation data developed in support of TAC
KBP Cold Start during the six years the track was conducted, from
2012 to 2017. This includes queries, the manual runs produced by LDC
annotators, and the final assessment results for each evaluation year.
Source collections for the 2012, 2014, and 2015 evaluations are also
included. The source collections used in the 2016 and 2017 evaluations
were not specific to Cold Start and are available as LDC2019T12 TAC KBP
Evaluation Source Corpora 2016-2017, although the 2016 pilot source
collection is included here. The archived 2013 Cold Start source
collection is available by request from NIST:
http://www.nist.gov/tac/data/index.html
The data included in this package were originally released by LDC to TAC
KBP coordinators and performers under the following ecorpora catalog IDs
and titles:
LDC2012E104: TAC 2012 KBP Cold Start Evaluation Corpus v1.3
LDC2012E105: TAC 2012 KBP Cold Start Queries V1.1
LDC2012E116: TAC 2012 KBP Cold Start Assessment Results
LDC2013E101: TAC 2013 KBP English Cold Start Evaluation Assessment
Results
LDC2013E39: TAC 2012 KBP Cold Start Automated Queries Assessment
Results
LDC2013E87: TAC 2013 KBP English Cold Start Evaluation Queries and
Annotations V1.1
LDC2014E73: TAC 2014 KBP English Cold Start Evaluation Queries and
Annotations V1.1
LDC2014E82: TAC 2014 KBP English Cold Start Evaluation Assessment
Results V2.1
LDC2014R42: TAC 2014 KBP English Cold Start Evaluation Source Corpus
LDC2015E48: TAC KBP English Cold Start Collected Evaluation Data Sets
2012-2014
LDC2015E72: TAC KBP 2015 English Cold Start Entity Discovery
Sample Data
LDC2015E77: TAC KBP 2015 English Cold Start Evaluation Source Corpus
V2.0
LDC2015E80: TAC KBP 2015 English Cold Start Evaluation Queries and
Manual Run
LDC2015E81: TAC KBP 2015 English Cold Start Entity Discovery Evaluation
Gold Standard Entity Mentions V2.0
LDC2015E100: TAC KBP 2015 English Cold Start Evaluation Assessment
Results V3.1
LDC2016E41: TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot
Training Data V1.1
LDC2016E42: TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot
Source Corpus
LDC2016E44: TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot
Queries and Manual Run
LDC2016E52: TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot
Assessment Results V1.1
LDC2016E69: TAC KBP 2016 Cold Start Evaluation Queries and Manual Run
V1.1
LDC2016E106: TAC KBP 2016 Cold Start Evaluation Assessment Results V3.0
LDC2017E04: TAC KBP Cold Start Comprehensive Evaluation Data 2012-2016
LDC2017E34: TAC KBP 2017 Cold Start Evaluation Queries and Manual Run
V1.2
LDC2017E56: TAC KBP 2017 Cold Start Evaluation Assessment Results V3.0
Summary of data included in this package:
+------+------------------+---------+-------------+------------------+
| Year | Source Documents | Queries | Assessments | Manual Responses |
+------+------------------+---------+-------------+------------------+
| 2012 |            26469 |     385 |        5015 |              979 |
| 2013 |               0* |     326 |        6745 |             1595 |
| 2014 |            50192 |     247 |        7258 |             1386 |
| 2015 |            49124 |    2539 |       30654 |             2218 |
| 2016 |               0* |    4636 |       26234 |             6756 |
| 2017 |               0* |    1392 |       26802 |             3495 |
+------+------------------+---------+-------------+------------------+
* see above regarding 2013, 2016, and 2017 Cold Start source collections
2. Contents
./README.txt
This file.
./data/{2012,2013,2014,2015,2016,2017}/contents.txt
The data in this package are organized by the evaluation year in order
to clarify dependencies, to highlight occasional differences in formats
from one year to another, and to increase the readability of the
documentation. The contents.txt file within each year's root directory
provides a list of the contents for all subdirectories as well as
details about file formats and contents.
./docs/guidelines/{2012,2013,2014,2015,2016,2017}/*
The guidelines used by annotators in developing the respective year's
Cold Start queries, annotations, and assessments.
./docs/task_descriptions/*
Task Descriptions for the respective 2012-2017 Cold Start evaluation
tracks, written by evaluation track coordinators.
./dtd/cold_start_queries_2012.dtd
DTD for:
./data/2012/tac_kbp_2012_cold_start_evaluation_queries.xml
./data/2012/tac_kbp_2012_cold_start_automated_queries.xml
./dtd/cold_start_queries_2013.dtd
DTD for:
./data/2013/tac_kbp_2013_cold_start_evaluation_queries.xml
./data/2013/tac_kbp_2013_cold_start_1-hop_queries.xml
./dtd/cold_start_queries_2014-2015.dtd
DTD for:
./data/2014/tac_kbp_2014_cold_start_evaluation_queries.xml
./data/2015/tac_kbp_2015_cold_start_evaluation_queries.xml
./data/2015/tac_kbp_2015_cold_start_evaluation_queries_v2.1.xml
./data/2015/tac_kbp_2015_cold_start_slot_filling_evaluation_queries_v2.xml
./dtd/cold_start_queries_2016.dtd
DTD for:
./data/2016/eval/tac_kbp_2016_cold_start_evaluation_queries.xml
./dtd/spanish-english_cold_start_queries.dtd
DTD for:
./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_queries_cssf.xml
./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_queries_ldc.xml
./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_validated_queries.xml
./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_training_validated_queries.xml
./dtd/cold_start_queries_2017.dtd
DTD for:
./data/2017/tac_kbp_2017_cold_start_evaluation_queries.xml
./tools/2012/*
Tools for 2012 Cold Start, as provided to LDC by evaluation track
coordinators, with no further testing. See ./ResolveQueries.pl
for more information.
./tools/2013/*
Tools for 2013 Cold Start, as provided by evaluation track
coordinators, with no further testing. See
./TAC_2013_KBP_Cold_Start_Example_Documents/Cold_Start_Sample_Collection_2.0/README.txt
for more information.
./tools/2014/*
Tools for 2014 Cold Start, as provided by evaluation track
coordinators, with no further testing. See
./README-Scoring.md.txt
for more information.
Note: To request 2015, 2016, or 2017 Cold Start tools, contact NIST:
http://www.nist.gov/tac/data/index.html
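As a convenience, the query XML files listed above under ./data can be
checked against their corresponding DTDs in ./dtd before processing.
Below is a minimal sketch of such a check using Python and the lxml
library (neither of which is part of this package); the file paths are
examples taken from the listing above, and any of the query/DTD pairs
can be substituted:

  # Sketch: validate a Cold Start queries file against its DTD. lxml is
  # assumed to be installed; paths are examples from the listing above.
  from lxml import etree

  DTD_PATH = "./dtd/cold_start_queries_2014-2015.dtd"
  XML_PATH = "./data/2015/tac_kbp_2015_cold_start_evaluation_queries.xml"

  def validate(xml_path, dtd_path):
      """Parse the XML file and report whether it conforms to the DTD."""
      with open(dtd_path, "rb") as f:
          dtd = etree.DTD(f)
      root = etree.parse(xml_path).getroot()
      if not dtd.validate(root):
          # error_log lists each violation with line numbers
          for error in dtd.error_log.filter_from_errors():
              print(error)
          return False
      return True

  if __name__ == "__main__":
      print("valid" if validate(XML_PATH, DTD_PATH) else "invalid")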
3. Annotation tasks
Cold Start data development primarily involves three annotation
tasks: query development, manual run annotation, and assessment.
Entity Discovery was an additional task conducted only in 2015.
Each of these tasks is explained below.
3.1 Query Development
In Cold Start query development, annotators create sets of queries, with
each set defined by a shared Entry Point Entity (EPE). The EPE in a Cold
Start query is the first entity initiating a chain of relations. For
example, in the query "Find all parent organizations of organizations at
which 'Jane Doe' has been an employee", the EPE would be "Jane Doe".
Ideally, EPEs allow for multiple queries, some of which can generate
multiple responses from the source collection (though not too many) and
others that allow for the utilization of under-represented TAC KBP slots
(the official set of valid attributes pertaining to entities).
In order to find promising EPEs, query developers generally begin by
conducting searches through the corpus, focusing on key words related
to the set of TAC KBP slots. For example, annotators might search for
"arrested" or "charged" to find entities related to arrest or conviction
events. Once an initial 'seed' relation is found, query developers
search elsewhere in the corpus for other mentions of the related
entities. Whichever entity seems the most promising is then chosen as
the EPE and annotators extract 2-5 other mentions of it from different
source documents. When possible, confusable name strings such as aliases
or misspellings are selected to add difficulty to the queries.
Throughout the process of query development, annotators also attempt to
balance query entity types (PER, GPE, ORG, FAC, or LOC), response types
(entity or string), and document genre (formal or informal).
3.2 Manual Run Development
Having created a set of queries that share an EPE, annotators proceed to
generate the 'manual run', the set of all human-produced responses to
the queries that can be found in the corpus. In this task, annotators
again search the corpus for mentions of the EPE participating in the
specified TAC KBP relations, using online searching as well to research
the entities and guide keyword searches.
In order to be valid, responses must include justification - the minimum
extents of provenance supporting the validity of a response. Valid
justification strings must clearly identify all three elements of a
relation (i.e. the subject entity, the predicate slot, and the object
filler) with minimal extraneous text. In 2013, justification was
modified to allow for up to two discontiguous strings selected from as
many separate documents, up from one string in 2012. In 2014,
justification was again altered to allow for up to four justification
strings. This facilitated a greater potential for inferred relations
that would be difficult to justify with just a single document.
Note that, for Cold Start 2012-2015, the query and manual run
development tasks were conducted concurrently, such that annotators
could switch back and forth between finding queries and finding as many
valid responses for them as the corpus had to offer. This approach was
taken simply to increase efficiency, as it required annotators to
research query entities only once. In 2016-2017, the query and manual
run development tasks were conducted separately, in an effort to
increase the number of responses found during the manual run.
Following the initial round of query and manual run development, a
quality control pass is conducted by senior annotators to check extents
for EPE mentions and responses and to ensure that responses have
adequate justification in the source document and are not at variance
with the guidelines in any way. Any responses that are not clearly
correct or incorrect are flagged for further review by Lead Annotators
and possibly managers.
3.3 Assessment
In assessment, annotators assess and coreference anonymized responses
returned from both the manual run and from systems. Fillers are marked
as correct if they are found to be both compatible with the slot
descriptions and supported in the provided justification string(s)
and/or its surrounding content. Fillers are assessed as wrong if they do
not meet both of the conditions for correctness, or as inexact if
insufficient or extraneous text was selected for an otherwise correct
response.
Justification receives a separate assessment from the response, being
marked as correct if it succinctly and completely supports the relation,
wrong if it does not support the relation at all (or if the
corresponding filler is marked wrong), inexact-short if part but not all
of the information necessary to support the relation is provided, or
inexact-long if it contains all information necessary to support the
relation but also a great deal of extraneous text. Starting in 2014,
responses with justification comprising more than 600 characters in
total were automatically ignored and removed from the pool of responses
for assessment.
After first passes of assessment are completed, quality control is
performed on the data by senior annotators. During quality control, the
extents of each annotated filler and justification are checked for
correctness, entity equivalence classes are checked for accuracy, and
potentially problematic assessments are either corrected or flagged for
additional review.
3.4 Entity Discovery
Within 2015 Cold Start, an additional evaluation track, Entity Discovery
(ED), was conducted to provide another metric for measuring systems'
ability to find and extract all valid entity mentions, an obvious
preliminary to successfully completing full Cold Start. The data
development tasks conducted in support of ED were essentially those
conducted in support of Entity Discovery and Linking, another TAC KBP
track, though with some slight modifications.
Source documents for 2015 Entity Discovery are a subset of those
included in the full 2015 Cold Start source collection. This subset was
selected based on features indicated by the Cold Start queries, which
were in development at the time ED source documents were selected. These
features include mention of ambiguous entities (those that had aliases or
shared a name with other entities) and entities that were referenced in
multiple documents across the corpus. The two genres of source documents
in the full collection (newswire and discussion forum) are also roughly
equally represented in the ED subset.
Once the set of source documents is selected, annotators exhaustively
extract and cluster valid entity mentions from each one. Given a single
document, annotators developing the ED gold standard select text extents
to indicate valid entity mentions. Every time the first mention of a new
entity is selected, annotators also create a new entity cluster, a
"bucket" into which all subsequent mentions of the entity are collected
and to which an entity type label is applied. Thus, within-document
coreference of entities is performed concurrently with mention
selection. As documents are completed, annotators performing quality
control make sure that the extent of each selected namestring is correct
and that each entity is coreferenced correctly.
Following completion of ED over all documents in the collection, senior
annotators conduct cross-document coreference for all of the
within-document entity clusters. For this task, clusters are split up by
entity type and then sorting and searching techniques are used to
identify clusters that might require further collapsing. For example,
clusters that include mentions with strings or substrings that match
those in other clusters are reviewed.
4. Source Documents
The source data contained in this release comprises all documents from
which queries were drawn for 2012, 2014 and 2015. The source data was
drawn from existing LDC holdings, with no additional validation. An
overall scan of character content in the source collections indicates
some relatively small quantities of various problems, especially in the
web and discussion forum data, including language mismatch (characters
from Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding
errors (some documents have apparently undergone "double encoding" into
UTF-8, and others may have been "noisy" to begin with, or may have gone
through an improper encoding conversion, yielding occurrences of the
Unicode "replacement character" (U+FFFD) throughout the corpus); the web
collection also has characters whose Unicode code points lie outside the
"Basic Multilanguage Plane" (BMP), i.e. above U+FFFF.
All source documents originally released as XML have been converted to
text files for this release. This change was made primarily because the
documents were used as text files during data development but also
because some fail XML parsing. All documents that have filenames
beginning with "eng-NG" are Web Document data (WB) and some of these
fail XML parsing (see below for details). All files that start with
"bolt-" are Discussion Forum threads (DF) and have the structure
described below. All other files are Newswire data (NW) and have the
newswire markup pattern detailed below.
Note as well that some source documents are duplicated across a few of
the separated source_documents directories, indicating that some queries
from different data sets originated from the same source documents. As
it is acceptable for source to be reused for Entity Linking queries,
this duplication is intentional and expected.
The subsections below go into more detail regarding the markup and other
properties of the three source data types:
4.1 Newswire Data
Newswire data use the following markup framework:
  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present),
and the TEXT content may or may not include "<P> ... </P>" tags
(depending on whether or not the "doc_type_label" is "story"). Some NW
files contain a single double-escaped ampersand. All the newswire files,
if converted back to XML, are parseable.

4.2 Multi-Post Discussion Forum Data

Multi-Post Discussion Forum files (MPDFs) are derived from Discussion
Forum threads. They consist of a continuous run of posts from a thread
but are only approximately 800 words in length (excluding metadata and
text within <quote> elements). When taken from a short thread, an MPDF
may comprise the entire thread. However, when taken from a longer
thread, an MPDF is a truncated version of its source, though it will
always start with the thread's initial post.

361 of the 40,186 MPDF files have a total of 974 various forms of
double-escapes, like '&amp;#x202a;' and 'Obama&amp;rsquo;s', as well as
things like 'http://some.url/query?a=525119&amp;f=19">', which is not
really a double-escape, but rather something else that resembles one.

The MPDF files use the following markup framework, in which there may
also be arbitrarily deep nesting of quote elements, and other elements
may be present (e.g. "<a href=...> ... </a>" anchor tags):

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post author="{author_string}" datetime="{datetime_string}" id="{post_id_string}">
  ...
  <quote orig_author="{author_string}">
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

All the discussion forum files, if converted back to XML, are parseable.

4.3 Web Document Data

"Web" files use the following markup framework:

  <DOC>
  <DOCID> {doc_id_string} </DOCID>
  <DOCTYPE> ... </DOCTYPE>
  <DATETIME> ... </DATETIME>
  <BODY>
  <HEADLINE>
  ...
  </HEADLINE>
  <TEXT>
  <POST>
  <POSTER> ... </POSTER>
  <POSTDATE> ... </POSTDATE>
  ...
  </POST>
  </TEXT>
  </BODY>
  </DOC>

Other kinds of tags may be present ("<QUOTE ...>", "<A ...>", etc.).

5. Using the Data

5.1 Text normalization and offset calculation

Text normalization of queries, consisting of a 1-for-1 substitution of
newline (0x0A) and tab (0x09) characters with space (0x20) characters,
was performed on the document text input to the response field.

The values of the 'beg=' and 'end=' XML attributes in the more recent
queries.xml files indicate character offsets that identify text extents
in the source documents. Offset counting starts from the initial opening
angle bracket of the <DOC> element (<doc> in DF sources), which is
usually, but not always, the initial character (character 0) of the
source. Note as well that character counting includes newlines and all
markup characters - that is, the offsets are based on treating the
source document file as "raw text", with all its markup included.

Note that although strings included in the annotation files (queries and
gold standard mentions) generally match source documents, a few
characters are normalized in order to enhance readability: newlines are
converted to spaces, except where the preceding character is a hyphen
("-"), in which case the newline is removed, and multiple spaces are
converted to a single space.

5.2 Proper ingesting of XML queries

While the character offsets are calculated by treating the source
documents as "raw text", the "name" strings being referenced by the
queries sometimes contain XML metacharacters, and these had to be
"re-escaped" for proper inclusion in the queries.xml files. For example,
an actual name like "AT&T" may show up in a source document file as
"AT&amp;T" (because the source document was originally formatted as XML
data). But since the source document is treated here as raw text, this
name string is treated in queries.xml as having 8 characters (i.e., the
character offsets, when provided, will point to a string of length 8).
However, the "name" element itself, as presented in the queries.xml
file, will be even longer - "AT&amp;amp;T" - because the queries.xml
file is intended to be handled by an XML parser, which will return
"AT&amp;T" when this "name" element is extracted. Using the queries.xml
data without XML parsing would yield a mismatch between the "name" value
and the corresponding string in the source data.
6. Acknowledgments

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and
conclusions contained herein are those of the authors and should not be
interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency or the U.S.
Government.

The authors acknowledge the following contributors to this data set:

  Dana Fore (LDC)
  Dave Graff (LDC)
  James Mayfield (JHU)
  Hoa Dang (NIST)
  Boyan Onyshkevych (DARPA)

7. References

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, & Stephanie M.
Strassel. 2017. Overview of Linguistic Resources for the TAC KBP 2017
Evaluations: Methodologies and Results. TAC KBP 2017 Workshop: National
Institute of Standards and Technology, Gaithersburg, MD, November 13-14.

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, & Stephanie
M. Strassel. 2016. Overview of Linguistic Resources for the TAC KBP 2016
Evaluations: Methodologies and Results. TAC KBP 2016 Workshop: National
Institute of Standards and Technology, Gaithersburg, MD, November 14-15.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies,
& Stephanie M. Strassel. 2015. Overview of Linguistic Resources for the
TAC KBP 2015 Evaluations: Methodologies and Results. TAC KBP 2015
Workshop: National Institute of Standards and Technology, Gaithersburg,
MD, November 16-17.

Joe Ellis, Jeremy Getman, & Stephanie M. Strassel. 2014. Overview of
Linguistic Resources for the TAC KBP 2014 Evaluations: Planning,
Execution, and Results. TAC KBP 2014 Workshop: National Institute of
Standards and Technology, Gaithersburg, MD, November 17-18.

Joe Ellis, Jeremy Getman, Justin Mott, Xuansong Li, Kira Griffitt,
Stephanie M. Strassel, & Jonathan Wright. 2013. Linguistic Resources for
2013 Knowledge Base Population Evaluations. TAC KBP 2013 Workshop:
National Institute of Standards and Technology, Gaithersburg, MD,
November 18-19.

Joe Ellis, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, & Jonathan
Wright. 2012. Linguistic Resources for 2012 Knowledge Base Population
Evaluations. TAC KBP 2012 Workshop: National Institute of Standards and
Technology, Gaithersburg, MD, November 5-6.

8. Copyright Information

(c) 2018 Trustees of the University of Pennsylvania

9. Contact Information

For further information about this data release, or the TAC KBP project,
contact the following project staff at LDC:

  Joe Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

-----------------------------------------------------------------------------
README created by Dana Fore on February 24, 2016
  updated by Dana Fore on March 22, 2016
  updated by Dana Fore on April 4, 2016
  updated by Jeremy Getman on April 4, 2016
  updated by Neil Kuster on September 22, 2016
  updated by Joe Ellis on December 21, 2016
  updated by Jeremy Getman on December 20, 2017
  updated by Jeremy Getman on March 22, 2018
  updated by Jeremy Getman on May 18, 2018
  updated by Jeremy Getman on May 21, 2019