TAC KBP Cold Start Comprehensive Evaluation Data 2012-2017

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains evaluation data produced in support of the TAC KBP Cold Start evaluation track conducted from 2012 to 2017.

The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base, extract novel information about entities from a document collection, and add it to a new or existing knowledge base.

Cold Start is designed to evaluate a system's ability to construct a new knowledge base (KB) from the information provided in a text collection, combining technologies developed via other TAC KBP evaluation tracks. Like the Slot Filling track (SF), Cold Start involves mining information about entities from text, and can be viewed as an Information Extraction (IE) or Question Answering (QA) task. As in the Entity Discovery & Linking track (EDL), Cold Start systems must also find all entities mentioned in the text. Ideally, Cold Start KBs include every person, organization, and geo-political entity mentioned in the text collection as well as all of the targeted relations between them. To facilitate the evaluation of these KBs, LDC annotators create sets of queries, human-generated responses to the queries, and assessments of both human and system responses. More information about Cold Start and other TAC KBP evaluations can be found on the NIST TAC website, http://www.nist.gov/tac/

This package contains all evaluation data developed in support of TAC KBP Cold Start during the six years the track was conducted, from 2012 to 2017. This includes queries, the manual runs produced by LDC annotators, and the final assessment results for each evaluation year. Source collections for the 2012, 2014, and 2015 evaluations are also included. The source collections used in the 2016 and 2017 evaluations were not specific to Cold Start and are available as LDC2019T12 TAC KBP Evaluation Source Corpora 2016-2017, although the 2016 pilot source collection is included here.
The archived 2013 Cold Start source collection is available but you must contact NIST to request access: http://www.nist.gov/tac/data/index.html

The data included in this package were originally released by LDC to TAC KBP coordinators and performers under the following ecorpora catalog IDs and titles:

  LDC2012E104: TAC 2012 KBP Cold Start Evaluation Corpus v1.3
  LDC2012E105: TAC 2012 KBP Cold Start Queries V1.1
  LDC2012E116: TAC 2012 KBP Cold Start Assessment Results
  LDC2013E101: TAC 2013 KBP English Cold Start Evaluation Assessment Results
  LDC2013E39:  TAC 2012 KBP Cold Start Automated Queries Assessment Results
  LDC2013E87:  TAC 2013 KBP English Cold Start Evaluation Queries and Annotations V1.1
  LDC2014E73:  TAC 2014 KBP English Cold Start Evaluation Queries and Annotations V1.1
  LDC2014E82:  TAC 2014 KBP English Cold Start Evaluation Assessment Results V2.1
  LDC2014R42:  TAC 2014 KBP English Cold Start Evaluation Source Corpus
  LDC2015E48:  TAC KBP English Cold Start Collected Evaluation Data Sets 2012-2014
  LDC2015E72:  TAC KBP 2015 English Cold Start Entity Discovery Sample Data
  LDC2015E77:  TAC KBP 2015 English Cold Start Evaluation Source Corpus V2.0
  LDC2015E80:  TAC KBP 2015 English Cold Start Evaluation Queries and Manual Run
  LDC2015E81:  TAC KBP 2015 English Cold Start Entity Discovery Evaluation Gold Standard Entity Mentions V2.0
  LDC2015E100: TAC KBP 2015 English Cold Start Evaluation Assessment Results V3.1
  LDC2016E41:  TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot Training Data V1.1
  LDC2016E42:  TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot Source Corpus
  LDC2016E44:  TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot Queries and Manual Run
  LDC2016E52:  TAC KBP 2016 Bilingual Spanish-English Cold Start Pilot Assessment Results V1.1
  LDC2016E69:  TAC KBP 2016 Cold Start Evaluation Queries and Manual Run V1.1
  LDC2016E106: TAC KBP 2016 Cold Start Evaluation Assessment Results V3.0
  LDC2017E04:  TAC KBP Cold Start Comprehensive Evaluation Data 2012-2016
  LDC2017E34:  TAC KBP 2017 Cold Start Evaluation Queries and Manual Run V1.2
  LDC2017E56:  TAC KBP 2017 Cold Start Evaluation Assessment Results V3.0

Summary of data included in this package:

+------+------------------+---------+-------------+------------------+
| Year | Source Documents | Queries | Assessments | Manual Responses |
+------+------------------+---------+-------------+------------------+
| 2012 |            26469 |     385 |        5015 |              979 |
| 2013 |               0* |     326 |        6745 |             1595 |
| 2014 |            50192 |     247 |        7258 |             1386 |
| 2015 |            49124 |    2539 |       30654 |             2218 |
| 2016 |               0* |    4636 |       26234 |             6756 |
| 2017 |               0* |    1392 |       26802 |             3495 |
+------+------------------+---------+-------------+------------------+
* see above regarding the 2013, 2016, and 2017 Cold Start source collections

2. Contents

./README.txt

  This file.

./data/{2012,2013,2014,2015,2016,2017}/contents.txt

  The data in this package are organized by evaluation year in order to clarify dependencies, highlight occasional differences in formats from one year to another, and increase readability in documentation. The contents.txt file within each year's root directory provides a list of the contents of all subdirectories as well as details about file formats and contents.

./docs/guidelines/{2012,2013,2014,2015,2016,2017}/*

  The guidelines used by annotators in developing the respective year's Cold Start queries, annotations, and assessments.

./docs/task_descriptions/*

  Task descriptions for the respective 2012-2017 Cold Start evaluation tracks, written by evaluation track coordinators.
./dtd/cold_start_queries_2012.dtd

  DTD for:
    ./data/2012/tac_kbp_2012_cold_start_evaluation_queries.xml
    ./data/2012/tac_kbp_2012_cold_start_automated_queries.xml

./dtd/cold_start_queries_2013.dtd

  DTD for:
    ./data/2013/tac_kbp_2013_cold_start_evaluation_queries.xml
    ./data/2013/tac_kbp_2013_cold_start_1-hop_queries.xml

./dtd/cold_start_queries_2014-2015.dtd

  DTD for:
    ./data/2014/tac_kbp_2014_cold_start_evaluation_queries.xml
    ./data/2015/tac_kbp_2015_cold_start_evaluation_queries.xml
    ./data/2015/tac_kbp_2015_cold_start_evaluation_queries_v2.1.xml
    ./data/2015/tac_kbp_2015_cold_start_slot_filling_evaluation_queries_v2.xml

./dtd/cold_start_queries_2016.dtd

  DTD for:
    ./data/2016/eval/tac_kbp_2016_cold_start_evaluation_queries.xml

./dtd/spanish-english_cold_start_queries.dtd

  DTD for:
    ./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_queries_cssf.xml
    ./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_queries_ldc.xml
    ./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_pilot_evaluation_validated_queries.xml
    ./data/2016/pilot/tac_kbp_2016_bilingual_spanish-english_cold_start_training_validated_queries.xml

./dtd/cold_start_queries_2017.dtd

  DTD for:
    ./data/tac_kbp_2017_cold_start_evaluation_queries.xml

./tools/2012/*

  Tools for 2012 Cold Start, as provided to LDC by evaluation track coordinators, with no further testing. See ./ResolveQueries.pl for more information.

./tools/2013/*

  Tools for 2013 Cold Start, as provided by evaluation track coordinators, with no further testing. See ./TAC_2013_KBP_Cold_Start_Example_Documents/Cold_Start_Sample_Collection_2.0/README.txt for more information.

./tools/2014/*

  Tools for 2014 Cold Start, as provided by evaluation track coordinators, with no further testing. See ./README-Scoring.md.txt for more information.

Note: To request 2015, 2016, or 2017 Cold Start tools, contact NIST: http://www.nist.gov/tac/data/index.html

3. Annotation tasks

Cold Start data development primarily involves three annotation tasks: query development, manual run annotation, and assessment. Entity Discovery was an additional task conducted only in 2015. Each of these tasks is explained below.

3.1 Query Development

In Cold Start query development, annotators create sets of queries, with each set defined by a shared Entry Point Entity (EPE). The EPE in a Cold Start query is the first entity, initiating a chain of relations. For example, in the query "Find all parent organizations of organizations at which 'Jane Doe' has been an employee", the EPE would be "Jane Doe". Ideally, EPEs allow for multiple queries, some of which can generate multiple responses from the source collection (though not too many) and others that allow for the utilization of under-represented TAC KBP slots (the official set of valid attributes pertaining to entities).

In order to find promising EPEs, query developers generally begin by conducting searches through the corpus, focusing on keywords related to the set of TAC KBP slots. For example, annotators might search for "arrested" or "charged" to find entities related to arrest or conviction events. Once an initial 'seed' relation is found, query developers search elsewhere in the corpus for other mentions of the related entities. Whichever entity seems the most promising is then chosen as the EPE, and annotators extract 2-5 other mentions of it from different source documents. When possible, confusable name strings such as aliases or misspellings are selected to add difficulty to the queries.
Throughout the process of query development, annotators also attempt to balance query entity types (PER, GPE, ORG, FAC, or LOC), response types (entity or string), and document genre (formal or informal).

3.2 Manual Run Development

Having created a set of queries that share an EPE, annotators proceed to generate the 'manual run', the set of all human-produced responses to the queries that can be found in the corpus. In this task annotators again search the corpus for mentions of the EPE participating in the specified TAC KBP relations, using online searching as well to research the entities and guide keyword searches.

In order to be valid, responses must include justification - the minimum extents of provenance supporting the validity of a response. Valid justification strings must clearly identify all three elements of a relation (i.e., the subject entity, the predicate slot, and the object filler) with minimal extraneous text. In 2013, justification was modified to allow for up to two discontiguous strings selected from as many separate documents, up from one string in 2012. In 2014, justification was again altered to allow for up to four justification strings. This facilitated a greater potential for inferred relations that would be difficult to justify with just a single document.

Note that, for Cold Start 2012-2015, the query and manual run development tasks were conducted concurrently, such that annotators could switch back and forth between finding queries and finding as many valid responses to them as the corpus had to offer. This approach was taken simply to increase efficiency, as it requires annotators to research query entities only once. In 2016-2017, the query and manual run development tasks were conducted separately, in an effort to increase the number of responses found during the manual run.

Following the initial round of query and manual run development, a quality control pass is conducted by senior annotators to check extents for EPE mentions and responses and to ensure that responses have adequate justification in the source document and are not at variance with the guidelines in any way. Any responses that are not clearly correct or incorrect are flagged for further review by lead annotators and possibly managers.

3.3 Assessment

In assessment, annotators assess and coreference anonymized responses returned from both the manual run and from systems. Fillers are marked as correct if they are found to be both compatible with the slot descriptions and supported in the provided justification string(s) and/or their surrounding content. Fillers are assessed as wrong if they do not meet both of the conditions for correctness, or as inexact if insufficient or extraneous text was selected for an otherwise correct response.

Justification receives a separate assessment from the response, being marked as correct if it succinctly and completely supports the relation; wrong if it does not support the relation at all (or if the corresponding filler is marked wrong); inexact-short if part but not all of the information necessary to support the relation is provided; or inexact-long if it contains all information necessary to support the relation but also a great deal of extraneous text. Starting in 2014, responses with justification comprising more than 600 characters in total were automatically ignored and removed from the pool of responses for assessment.

After first passes of assessment are completed, quality control is performed on the data by senior annotators.
During quality control, the extent of each annotated filler and justification is checked for correctness, entity equivalence classes are checked for accuracy, and potentially problematic assessments are either corrected or flagged for additional review.

3.4 Entity Discovery

Within 2015 Cold Start, an additional evaluation track, Entity Discovery (ED), was conducted to provide another metric for measuring systems' ability to find and extract all valid entity mentions, an obvious preliminary to successfully completing full Cold Start. The data development tasks conducted in support of ED were essentially those conducted in support of Entity Discovery and Linking, another TAC KBP track, though with some slight modifications.

Source documents for 2015 Entity Discovery are a subset of those included in the full 2015 Cold Start source collection. This subset was selected based on features indicated by the Cold Start queries, which were in development at the time ED source documents were selected. These features include mention of ambiguous entities (those that had aliases or shared a name with other entities) and entities that were referenced in multiple documents across the corpus. The two genres of source documents in the full collection (newswire and discussion forum) are also roughly equally represented in the ED subset.

Once the set of source documents is selected, annotators exhaustively extract and cluster valid entity mentions from each one. Given a single document, annotators developing the ED gold standard select text extents to indicate valid entity mentions. Every time the first mention of a new entity is selected, annotators also create a new entity cluster, a "bucket" into which all subsequent mentions of the entity are collected and to which an entity type label is applied. Thus, within-document coreference of entities is performed concurrently with mention selection. As documents are completed, annotators performing quality control make sure that the extent of each selected namestring is correct and that each entity is coreferenced correctly.

Following completion of ED over all documents in the collection, senior annotators conduct cross-document coreference for all of the within-document entity clusters. For this task, clusters are split up by entity type, and then sorting and searching techniques are used to identify clusters that might require further collapsing. For example, clusters that include mentions with strings or substrings that match those in other clusters are reviewed.

4. Source Documents

The source data contained in this release comprises all documents from which queries were drawn for 2012, 2014, and 2015. The source data was drawn from existing LDC holdings, with no additional validation.

An overall scan of character content in the source collections indicates relatively small quantities of various problems, especially in the web and discussion forum data. These include language mismatch (characters from Chinese, Korean, Japanese, Arabic, Russian, etc.) and encoding errors (some documents have apparently undergone "double encoding" into UTF-8, and others may have been "noisy" to begin with, or may have gone through an improper encoding conversion, yielding occurrences of the Unicode "replacement character" (U+FFFD) throughout the corpus). The web collection also has characters whose Unicode code points lie outside the "Basic Multilingual Plane" (BMP), i.e. above U+FFFF.
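For reference, the character-level issues described above can be located with a simple scan of the text files. The following is a minimal, illustrative Python sketch (not a tool distributed with this package; the directory pattern is a placeholder for whichever source_documents directory is being checked) that flags replacement characters and code points above the BMP:

  # Illustrative sketch: count Unicode replacement characters (U+FFFD) and
  # characters outside the Basic Multilingual Plane (code points > U+FFFF)
  # in a set of source documents. The glob pattern below is a placeholder.
  # Note that errors="replace" also surfaces any undecodable bytes as U+FFFD.
  import glob

  for path in glob.glob("./data/2015/source_documents/*"):
      with open(path, encoding="utf-8", errors="replace") as f:
          text = f.read()
      n_fffd = text.count("\ufffd")
      n_non_bmp = sum(1 for ch in text if ord(ch) > 0xFFFF)
      if n_fffd or n_non_bmp:
          print(f"{path}\tU+FFFD: {n_fffd}\tnon-BMP: {n_non_bmp}")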
All source documents originally released as XML have been converted to text files for this release. This change was made primarily because the documents were used as text files during data development, but also because some fail XML parsing. All documents that have filenames beginning with "eng-NG" are Web Document data (WB), and some of these fail XML parsing (see below for details). All files that start with "bolt-" are Discussion Forum threads (DF) and have the structure described below. All other files are Newswire data (NW) and have the newswire markup pattern detailed below.

Note as well that some source documents are duplicated across a few of the separated source_documents directories, indicating that some queries from different data sets originated from the same source documents. As it is acceptable for source to be reused for Entity Linking queries, this duplication is intentional and expected.

The subsections below go into more detail regarding the markup and other properties of the three source data types.

4.1 Newswire Data

Newswire data use the following markup framework:

  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present), and the TEXT content may or may not include "<P> ... </P>" tags (depending on whether or not the "doc_type_label" is "story").
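For illustration only (this is not part of the released tools), a minimal Python sketch of pulling the optional HEADLINE and DATELINE content and the TEXT content out of a newswire file, treating the file as raw text with the markup framework shown above:

  # Illustrative sketch: extract HEADLINE, DATELINE, and TEXT content from a
  # newswire source file treated as raw text, assuming the markup framework
  # shown above.
  import re

  def parse_newswire(path):
      with open(path, encoding="utf-8") as f:
          raw = f.read()
      fields = {}
      for tag in ("HEADLINE", "DATELINE", "TEXT"):
          m = re.search(rf"<{tag}>(.*?)</{tag}>", raw, re.DOTALL)
          if m:  # HEADLINE and DATELINE are optional
              fields[tag] = m.group(1).strip()
      # TEXT may or may not contain <P> ... </P> tags; drop them if present
      if "TEXT" in fields:
          fields["TEXT"] = re.sub(r"</?P>", "", fields["TEXT"]).strip()
      return fields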
" tags (depending on whether or not the "doc_type_label" is "story"). Some NW files contain a single double-escaped ampersand. All the newswire files, if converted back to XML files are parseable. 4.2 Multi-Post Discussion Forum Data Multi-Post Discussion Forum files (MPDFs) are derived from Discussion Forum threads. They consist of a continuous run of posts from a thread but they are only approximately 800 words in length (excluding metadata and text within elements). When taken from a short thread, a MPDF may comprise the entire thread. However, when taken from longer threads, a MPDF is a truncated version of its source, though it will always start with the preliminary post. 361 of the 40,186 MPDF files have a total of 974 various forms of double-escapes; like '& amp;#x202a;', 'Obama& amp;rsquo;s', etc., as well as things like 'http://some.url/query?a=525119& amp;f=19">', which isn't really a double-escape, but rather something else that resembles a double-escape. The MPDF files use the following markup framework, in which there may also be arbitrarily deep nesting of quote elements, and other elements may be present (e.g. "..." anchor tags): ... ... ... ... ... All the discussion forum files, if converted back to XML files are parseable. 4.3 Web Document Data "Web" files use the following markup framework: {doc_id_string} ... ... ... ... ... ... Other kinds of tags may be present ("", "", etc). 5. Using the Data 5.1 Text normalization and offset calculation Text normalization of queries consisting of a 1-for-1 substitution of newline (0x0A) and tab (0x09) characters with space (0x20) characters was performed on the document text input to the response field. The values of the 'beg=' and 'end=' XML attributes in the more recent queries.xml files indicate character offsets to identify text extents in the source. Offset counting starts from the initial opening angle bracket of the element ( in DF sources), which is usually, but not always, the initial character (character 0) of the source. Note as well that character counting includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included. Note that although strings included in the annotation files (queries and gold standard mentions) generally match source documents, a few characters are normalized in order to enhance readability: Conversion of newlines to spaces, except where preceding characters were hyphens ("-"), in which case newlines were removed, and conversion of multiple spaces to a single space. 5.2 Proper ingesting of XML queries While the character offsets are calculated based on treating the source document as "raw text", the "name" strings being referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file. For example, an actual name like "AT&T" may show up a source document file as "AT&T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 7 characters (i.e., the character offsets, when provided, will point to a string of length 7). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&T" when this "name" element is extracted. 
6. Acknowledgements

This material is based on research sponsored by Air Force Research Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:

  Dana Fore (LDC)
  Dave Graff (LDC)
  James Mayfield (JHU)
  Hoa Dang (NIST)
  Boyan Onyshkevych (DARPA)

7. References

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, & Stephanie M. Strassel. 2017. Overview of Linguistic Resources for the TAC KBP 2017 Evaluations: Methodologies and Results. TAC KBP 2017 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 13-14.

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, & Stephanie M. Strassel. 2016. Overview of Linguistic Resources for the TAC KBP 2016 Evaluations: Methodologies and Results. TAC KBP 2016 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 14-15.

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies, & Stephanie M. Strassel. 2015. Overview of Linguistic Resources for the TAC KBP 2015 Evaluations: Methodologies and Results. TAC KBP 2015 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 16-17.

Joe Ellis, Jeremy Getman, & Stephanie M. Strassel. 2014. Overview of Linguistic Resources for the TAC KBP 2014 Evaluations: Planning, Execution, and Results. TAC KBP 2014 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 17-18.

Joe Ellis, Jeremy Getman, Justin Mott, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, & Jonathan Wright. 2013. Linguistic Resources for 2013 Knowledge Base Population Evaluations. TAC KBP 2013 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 18-19.

Joe Ellis, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, & Jonathan Wright. 2012. Linguistic Resources for 2012 Knowledge Base Population Evaluations. TAC KBP 2012 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 5-6.

8. Copyright Information

(c) 2018 Trustees of the University of Pennsylvania

9. Contact Information

For further information about this data release, or the TAC KBP project, contact the following project staff at LDC:

  Joe Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

-----------------------------------------------------------------------------
README created by Dana Fore on February 24, 2016
  updated by Dana Fore on March 22, 2016
  updated by Dana Fore on April 4, 2016
  updated by Jeremy Getman on April 4, 2016
  updated by Neil Kuster on September 22, 2016
  updated by Joe Ellis on December 21, 2016
  updated by Jeremy Getman on December 20, 2017
  updated by Jeremy Getman on March 22, 2018
  updated by Jeremy Getman on May 18, 2018
  updated by Jeremy Getman on May 21, 2019