TAC KBP English Temporal Slot Filling Comprehensive Training and
Evaluation Data Sets 2011 and 2013

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support
of the TAC KBP English Temporal Slot Filling tasks in 2011 and 2013.

The Text Analysis Conference (TAC) is a series of workshops organized
by the National Institute of Standards and Technology (NIST). TAC was
developed to encourage research in natural language processing (NLP)
and related applications by providing a large test collection, common
evaluation procedures, and a forum for researchers to share their
results. Through its various evaluations, the Knowledge Base
Population (KBP) track of TAC encourages the development of systems
that can match entities mentioned in natural texts with those
appearing in a knowledge base and extract novel information about
entities from a document collection and add it to a new or existing
knowledge base.

The Temporal Slot Filling (TSF) task builds upon the technology
developed for regular Slot Filling (SF), which involves mining
information about entities from text. SF can be viewed as more
traditional Information Extraction or, alternatively, as a Question
Answering (QA) task in which the questions are static but the targets
change. In completing the SF task, participating systems and LDC
annotators searched a corpus for information on certain attributes
(slots) of person (PER) and organization (ORG) entities and returned
any valid responses (slot fillers) that were not redundant with those
in an existing knowledge base (KB). The purpose of the TSF task is to
identify and capture temporal information in text that indicates when
a given relation between an SF query entity and a filler held true.
For more information about Temporal Slot Filling, refer to the 2013
track home page (2013 was the last year in which the Temporal Slot
Filling evaluation was conducted as of the time this package was
created) at http://www.nist.gov/tac.

This package contains all evaluation and training data developed in
support of TAC KBP Temporal Slot Filling during 2011 and 2013, the two
years in which a TSF evaluation was run. This includes queries, the
manual runs produced by LDC annotators, and the final rounds of
assessment results for the Temporal Slot Filling evaluations held in
2011 and 2013. The corresponding source document collections for this
release are included in LDC2018T03: TAC KBP Comprehensive English
Source Corpora 2009-2014. The corresponding Knowledge Base (KB) for
much of the data - a 2008 snapshot of Wikipedia - can be obtained via
LDC2014T16: TAC KBP Reference Knowledge Base.
The data included in this package were originally released by LDC to
TAC KBP coordinators and performers under the following ecorpora
catalog IDs and titles:

  LDC2011E49: TAC 2011 KBP English Training Temporal Slot Filling
              Annotation V1.1
  LDC2011E85: TAC 2011 KBP English Evaluation Diagnostic Temporal Slot
              Filling Queries V1.1
  LDC2012E38: TAC 2011 KBP English Evaluation Temporal Slot Filling
              Annotation
  LDC2013E82: TAC 2013 KBP English Temporal Slot Filling Training
              Queries and Annotations
  LDC2013E86: TAC 2013 KBP English Temporal Slot Filling Evaluation
              Queries and Annotations V1.1
  LDC2013E99: TAC 2013 KBP English Temporal Slot Filling Evaluation
              Assessment Results V1.1
  LDC2015E50: TAC KBP English Temporal Slot Filling – Collected
              Training and Evaluation Data Sets 2011 and 2013

Summaries of data included in this package (for more details see
./data/{2011,2013}/contents.txt):

Query Data:
  +------+------------+-----+-----+-------+
  | year | set        | PER | ORG | total |
  +------+------------+-----+-----+-------+
  | 2011 | training   |  40 |  10 |    50 |
  | 2011 | evaluation |  80 |  20 |   100 |
  | 2013 | training   |   6 |   1 |     7 |
  | 2013 | evaluation | 232 |  39 |   271 |
  +------+------------+-----+-----+-------+

Manual Response Data:
  +------+------------+------------------+
  | year | set        | manual responses |
  +------+------------+------------------+
  | 2011 | training   |            1,258 |
  | 2011 | evaluation |            1,413 |
  | 2013 | training   |               16 |
  | 2013 | evaluation |            1,519 |
  +------+------------+------------------+

Assessment Data (2013):
  +----------+-----------+
  | assessed | assessed  |
  | files    | responses |
  +----------+-----------+
  |      273 |     2,035 |
  +----------+-----------+

2. Contents

./docs/README.txt
  This file.

./data/{2011,2013}/contents.txt
  The data in this package are organized by the year of original
  release in order to clarify dependencies, highlight occasional
  differences in formats from one year to another, and to increase
  readability in documentation. The contents.txt file within each
  year's root directory provides a list of the contents of all
  subdirectories as well as specific details about file formats and
  contents.

./docs/all_files.md5
  Paths (relative to the root of the corpus) and md5 checksums for all
  files included in the package (a verification sketch follows at the
  end of this section).

./docs/guidelines/{2011,2013}/*.pdf
  The guidelines used by annotators in developing the temporal slot
  filling queries, manual responses, and assessment data contained in
  this corpus.

./docs/task_descriptions/KBP2011_TaskDefinition.pdf
  Task description for 2011 covering all of the TAC KBP tracks,
  written by evaluation track coordinators. Note that this document
  also describes tasks not relevant to this specific package.

./docs/task_descriptions/KBP2013_TaskDefinition_EnglishSlotFilling_1.1.pdf
  Task description for both the 2013 English Regular and Temporal Slot
  Filling evaluation tracks, written by track coordinators.

./dtd/kbpslotfill_temp2011.dtd
  The dtd against which to validate these files (see the validation
  sketch at the end of this section):
    ./data/2011/training/queries.xml
    ./data/2011/eval/queries.xml

./dtd/kbpslotfill_tempnew2011.dtd
  The dtd against which to validate this file:
    ./data/2011/training/new_queries.xml

./tools/scorers/KBP2013_English_TSF_slot-list.txt
  Temporal SF slot list file to be used with the 2013 scorer.

./tools/scorers/SFScore2013.java
  Scorer for temporal slot filling files for 2013, as provided to LDC
  by evaluation track coordinators, with no further testing.

./tools/scorers/TSFScore2.java
  Scorer for temporal slot filling files for 2011, as provided to LDC
  by evaluation track coordinators, with no further testing.

./tools/validators/check_kbp_{2011,2013}_slot-filling.pl
  Validators for temporal slot filling files for the respective years,
  as provided to LDC by evaluation track coordinators, with no further
  testing.
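Both the checksum listing and the DTDs support a quick sanity check of
a local copy of the package. The following is a minimal Python sketch,
not a tool shipped with the package: it assumes the md5 listing uses
the conventional checksum-then-path layout, that the script is run
from the corpus root, and that the third-party lxml library is
available for DTD validation.

    import hashlib
    from lxml import etree  # third-party; used for DTD validation

    # Verify md5 checksums, assuming one "checksum  relative/path"
    # pair per line in ./docs/all_files.md5.
    with open("docs/all_files.md5", encoding="utf-8") as listing:
        for line in listing:
            expected, path = line.split(None, 1)
            path = path.strip()
            with open(path, "rb") as f:
                if hashlib.md5(f.read()).hexdigest() != expected:
                    print("MISMATCH:", path)

    # Validate the 2011 query files against their DTD.
    with open("dtd/kbpslotfill_temp2011.dtd", "rb") as f:
        dtd = etree.DTD(f)
    for xml_path in ("data/2011/training/queries.xml",
                     "data/2011/eval/queries.xml"):
        ok = dtd.validate(etree.parse(xml_path).getroot())
        print(xml_path, "valid" if ok else dtd.error_log)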
3. Annotation tasks - Query Development, Manual Run Development, Slot
Mapping, and Assessment

Temporal Slot Filling (TSF) builds upon annotations typically
developed for regular Slot Filling (SF) by adding temporal data. In
SF, the values of specified attributes (or slots) are extracted for a
given entity from large collections of natural language texts.
Examples of slots include age, birthplace, and spouse for persons, or
founder, top members, and website for organizations. The TSF task
grounded a subset of these extracted values temporally by finding
dates when the slot fillers were valid. The tasks conducted by LDC
annotators in support of the TSF track consisted of a combined query
development-manual run development subtask as well as assessment. Each
of these subtasks is explained below.

3.1 Query and Manual Run Development

TSF query development-manual run development consisted of identifying
and capturing temporal information that indicated the period of time
when a given relation between an SF query entity and slot filler held
true. TSF query entities were extracted, and manual run annotations
were made on temporalized slot fillers (relations that have some
temporal aspect to them), for 13 KBP slots across sets of KBP source
documents selected for richness of temporalized slot fillers.

In 2011, entity selection for TSF followed a different process from
that used for regular Slot Filling, due to the sharing of
temporally-labeled data between TAC KBP and the Machine Reading
program. Rather than first selecting identifiable entities and then
annotating slot fillers and temporal information for those fillers, a
reverse selection process was used in which annotation preceded entity
selection. The query development-manual run development process began
by performing keyword searches on the source data to identify
documents containing KBP (entity-slot) relations. The document set
that resulted from this keyword search was then reduced to a subset
using a high keyword-frequency threshold. Next, this document subset
was screened for the presence of temporalized KBP relations. The
resulting set of documents was then exhaustively annotated for KBP
relations and their associated temporal information. In a
post-annotation screening process, annotators selected identifiable
entities in the annotation pool that were part of at least one
temporalized KBP relation; these entities then served as training
queries for the TSF manual run.

In selecting the 2011 TSF evaluation queries, an additional screening
process was applied to ensure productivity of fillers and temporal
information. Evaluation query selection relied on the same
post-annotation screening process to select identifiable candidate
entities annotated in at least one temporalized KBP relation.
Annotators then performed a time-limited search in the KBP source data
for these candidate entities, to determine how frequently they
occurred in temporalized KBP relations. Entities were then selected
from the candidate evaluation query entity set, with preference given
to entities that occurred more frequently in temporalized KBP
relations and that occurred in a greater variety of temporalized KBP
relations.
This entity selection process produced a set of TSF evaluation queries
on which time-limited search and cross-document annotation of
temporalized slot fillers could be carried out during the evaluation
annotation task.

The 2011 evaluation manual run was developed by combining two sets of
annotations. The first was the set of within-document annotations
created in the initial temporal slot filling evaluation query
selection and annotation process. The second was a set of
time-limited, cross-document annotations, created by searching for
temporal slot filling information for the evaluation entities across
the entire KBP corpus within a two-hour time limit. This process
produced a larger set of temporal slot filling evaluation annotation
data than the intra-document annotation process that had been used to
produce the temporal slot filling training annotation data.

In contrast to the 2011 task, in which queries consisted of entities
alone, each 2013 input query was a binary relation between an entity
and one slot filler. This allowed systems to focus on the temporal
aspect of the task and ignore the slot filling extraction component.
Also, for 2013 queries, annotators were able to select for more
interesting temporal information, such as indicators of beginnings and
endings. For each 2013 TSF query, annotators were given up to two
hours to search the corpus and locate all valid fillers.

Following the initial round of query development and manual run
annotation, a quality control pass was conducted to flag any fillers
that did not have adequate justification in the source document, or
that might be at variance with the current guidelines. These flagged
fillers were then adjudicated by senior annotators.

3.2 Assessment

Assessment of TSF responses was divided into two tasks: assessment of
slot fillers and assessment of the temporal information connected to
those fillers. The procedure used for assessing temporal slot fillers
mirrored the process used for regular Slot Filling assessment. After
slot fillers were returned for the query set from both the human
manual run and from systems, annotators assessed and coreferenced the
responses.

Fillers were marked as correct if they were found to be both
compatible with the slot descriptions and supported in the provided
justification string(s) and/or its surrounding content. Fillers were
assessed as wrong if they did not meet both of the conditions for
correctness, or as inexact if insufficient or extraneous text had been
selected for an otherwise correct response. Justification was assessed
as correct if it succinctly and completely supported the relation,
wrong if it did not support the relation at all (or if the
corresponding filler was marked wrong), inexact-short if part but not
all of the information necessary to support the relation was provided,
or inexact-long if it contained all information necessary to support
the relation but also a great deal of extraneous text.

After the first passes of assessment were completed, quality control
was performed on the data by senior annotators. This quality control
pass ensured that the extent of each annotated filler and
justification was correct, checked that entities assessed as correct
were coreferenced in the appropriate equivalence class, and flagged
potentially problematic assessments for additional review.
After filler assessment was complete for the temporal data set, LDC
compared the resulting list of documents containing correct,
system-generated slot fillers with those annotated by humans during
TSF. The purpose of this comparison was to identify all documents
marked only by systems as containing temporal information for a given
entity/slot-filler combination. Once these documents were identified,
they were reviewed and annotated whenever temporal information
relating to the specific entity-filler combination was present.

4. Using the Data

As noted in the overview, the corresponding source document
collections for this release are included in LDC2018T03: TAC KBP
Comprehensive English Source Corpora 2009-2014. Also, the
corresponding Knowledge Base (KB) for much of the data - a 2008
snapshot of Wikipedia - can be obtained via LDC2014T16: TAC KBP
Reference Knowledge Base.

4.1 Text normalization and offset calculation

Text normalization of queries, consisting of a 1-for-1 substitution of
newline (0x0A) and tab (0x09) characters with space (0x20) characters,
was performed on the document text input to the response field.

The values of the beg and end XML elements in the later queries.xml
files indicate character offsets that identify text extents in the
source. Offset counting starts from the initial opening angle bracket
of the document's root element, which is usually the initial character
(character 0) of the source. Note as well that character counting
includes newlines and all markup characters - that is, the offsets are
based on treating the source document file as "raw text", with all of
its markup included.

Note that although strings included in the annotation files (queries
and gold standard mentions) generally match source documents, a few
characters are normalized in order to enhance readability: newlines
are converted to spaces, except where the preceding character was a
hyphen ("-"), in which case the newline was removed; and runs of
multiple spaces are collapsed to a single space.

4.2 Proper ingesting of XML queries

While the character offsets are calculated based on treating the
source document as "raw text", the "name" strings being referenced by
the queries sometimes contain XML metacharacters, and these had to be
"re-escaped" for proper inclusion in the queries.xml file. For
example, an actual name like "AT&T" may show up in a source document
file as "AT&amp;T" (because the source document was originally
formatted as XML data). But since the source doc is being treated here
as raw text, this name string is treated in queries.xml as having 8
characters (i.e., the character offsets, when provided, will point to
a string of length 8). However, the "name" element itself, as
presented in the queries.xml file, will be even longer -
"AT&amp;amp;T" - because the queries.xml file is intended to be
handled by an XML parser, which will return "AT&amp;T" when this
"name" element is extracted. Using the queries.xml data without XML
parsing would yield a mismatch between the "name" value and the
corresponding string in the source data.
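To make the offset and escaping conventions above concrete, here is a
minimal Python sketch. The file paths and the beg/end values are
hypothetical placeholders, the offsets are assumed to be inclusive on
both ends, and normalize() is an illustrative implementation of the
section 4.1 readability normalization, not a tool shipped with this
package.

    import re
    import xml.etree.ElementTree as ET

    def normalize(s):
        # Section 4.1 normalization: drop a newline that follows a
        # hyphen, turn remaining newlines into spaces, and collapse
        # runs of spaces to a single space.
        s = s.replace("-\n", "-")
        s = s.replace("\n", " ")
        return re.sub(" {2,}", " ", s)

    # Read the source document as raw text, markup and newlines
    # included, so that the beg/end offsets line up. (Hypothetical
    # path; substitute a file from LDC2018T03.)
    with open("source_doc.xml", encoding="utf-8") as f:
        raw = f.read()

    beg, end = 1042, 1049  # hypothetical offsets for "AT&amp;T"
    span = raw[beg:end + 1]  # beg/end assumed inclusive

    # An XML parser un-escapes entities once: a name element stored
    # as "AT&amp;amp;T" in queries.xml parses back to "AT&amp;T",
    # which is exactly what the raw text of the source contains.
    # Reading queries.xml without an XML parser would instead yield
    # the doubly escaped form and a mismatch against the source.
    tree = ET.parse("queries.xml")
    name = tree.find(".//name").text

    assert name == normalize(span)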
5. Acknowledgments

This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized
to reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon.

The views and conclusions contained herein are those of the authors
and should not be interpreted as necessarily representing the official
policies or endorsements, either expressed or implied, of Air Force
Research Laboratory and Defense Advanced Research Projects Agency or
the U.S. Government.

The authors acknowledge the following contributors to this data set:
  Dave Graff (LDC)
  Robert Parker (LDC)
  Neil Kuster (LDC)
  Heng Ji (RPI)
  Ralph Grishman (NYU)
  Mihai Surdeanu (UA)
  Hoa Dang (NIST)
  Boyan Onyshkevych (DARPA)

6. References

Joe Ellis, Jeremy Getman, Justin Mott, Xuansong Li, Kira Griffitt,
Stephanie M. Strassel, Jonathan Wright. 2013. Linguistic Resources for
2013 Knowledge Base Population Evaluations. TAC KBP 2013 Workshop:
National Institute of Standards and Technology, Gaithersburg, MD,
November 18-19.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-workshop2013-linguistic-resources-kbp-eval.pdf

Xuansong Li, Joe Ellis, Kira Griffitt, Stephanie Strassel, Robert
Parker, Jonathan Wright. 2011. Linguistic Resources for 2011 Knowledge
Base Population Evaluation. TAC 2011: Proceedings of the Fourth Text
Analysis Conference, Gaithersburg, Maryland, November 14-15.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tac2011-linguistic-resources-kbp.pdf

7. Copyright Information

(c) 2016 Trustees of the University of Pennsylvania

8. Contact Information

For further information about this data release, or the TAC KBP
project, contact the following project staff at LDC:

  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

-----------------------------------------------------------------------------
README created by Neil Kuster on March 31, 2016
       updated by Neil Kuster on April 5, 2016
       updated by Neil Kuster on September 19, 2016