TAC KBP English Sentiment Slot Filling Comprehensive Training and Evaluation Data 2013-2014

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of the TAC KBP Sentiment Slot Filling tracks in 2013 and 2014.

The Text Analysis Conference (TAC) is a series of workshops organized by the National Institute of Standards and Technology (NIST). TAC was developed to encourage research in natural language processing (NLP) and related applications by providing a large test collection, common evaluation procedures, and a forum for researchers to share their results. Through its various evaluations, the Knowledge Base Population (KBP) track of TAC encourages the development of systems that can match entities mentioned in natural texts with those appearing in a knowledge base, extract novel information about entities from a document collection, and add it to a new or existing knowledge base.

Sentiment Slot Filling (SSF) is intended to supplement the data generated by the Entity Linking, Slot Filling, and Cold Start tracks with information about opinions held by KBP-valid entities (persons, organizations, and geo-political entities) toward other KBP-valid entities. As with the regular Slot Filling (SF) track, SSF involves mining information about entities from text. However, SSF seeks to evaluate the quality of detectors for scoped and attributed positive and negative sentiment. For more information about SSF, please refer to the Sentiment Slot Filling section of NIST's 2014 TAC KBP website (2014 was the last year in which a Sentiment Slot Filling evaluation was conducted, as of the time this package was created) at http://www.nist.gov/tac.

This package contains all evaluation and training data developed in support of TAC KBP Sentiment Slot Filling in 2013 and 2014. This includes queries, "manual runs" (human-produced responses to the queries), and assessment results for both human- and system-produced responses to the queries (some of which were dually assessed). Note that the corresponding source document collections for this release are included in LDC2018T03 (TAC KBP Comprehensive English Source Corpora 2009-2014), and the corresponding Knowledge Base (KB) for much of the data - a 2008 snapshot of Wikipedia - can be obtained via LDC2014T16 (TAC KBP Reference Knowledge Base).
The data included in this package were originally released by LDC to TAC KBP coordinators and participants under the following ecorpora catalog IDs and titles:

  LDC2013E78:  TAC 2013 KBP English Sentiment Slot Filling Training Queries and Annotations V1.1
  LDC2013E89:  TAC 2013 KBP English Sentiment Slot Filling Evaluation Queries and Annotations V1.1
  LDC2013E100: TAC 2013 KBP English Sentiment Slot Filling Evaluation Assessment Results V1.1
  LDC2014E72:  TAC 2014 KBP English Sentiment Slot Filling Evaluation Queries and Annotations V1.1
  LDC2014E85:  TAC 2014 KBP English Sentiment Slot Filling Evaluation Assessment Results
  LDC2015E47:  TAC KBP English Sentiment Slot Filling - Comprehensive Training and Evaluation Data 2013-2014

Summary of data included in this package (for more details see ./data/{2013,2014}/contents.txt):

  Queries:
  +------+------------+-----+-----+-----+-------+
  | year | set        | PER | ORG | GPE | total |
  +------+------------+-----+-----+-----+-------+
  | 2013 | evaluation |  54 |  53 |  53 |   160 |
  | 2013 | training   |  55 |  54 |  54 |   163 |
  | 2014 | evaluation | 134 | 133 | 133 |   400 |
  +------+------------+-----+-----+-----+-------+

  Manual Responses:
  +------+------------+------------------+
  | year | set        | manual responses |
  +------+------------+------------------+
  | 2013 | evaluation |              977 |
  | 2013 | training   |              986 |
  | 2014 | evaluation |              594 |
  +------+------------+------------------+

  Assessment Data:
  +------+------------+-------------+-------------+
  | year | set        | total       | total dual  |
  |      |            | assessments | assessments |
  +------+------------+-------------+-------------+
  | 2013 | evaluation |        5160 |        1145 |
  | 2014 | evaluation |        6383 |           0 |
  +------+------------+-------------+-------------+

2. Contents

./docs/README.txt
  This file.

./data/{2013,2014}/contents.txt
  The data in this package are organized by the year of original release in order to clarify dependencies, highlight occasional differences in formats from one year to another, and increase readability in documentation. The contents.txt file within each year's root directory provides a list of the contents for all subdirectories as well as details about file formats and contents.

./docs/all_files.md5
  Paths (relative to the root of the corpus) and md5 checksums for all files included in the package.

./docs/guidelines/2013/TAC_KBP_2013_Assessment_Guidelines_V1.4.pdf
./docs/guidelines/2013/TAC_KBP_2013_Sentiment_Slot_Filling_Guidelines_V1.0.pdf
  The guidelines used by annotators in developing the 2013 Sentiment Slot Filling queries, gold standard data, and assessments contained in this corpus.

./docs/guidelines/2014/TAC_KBP_2014_Assessment_Guidelines_V1.0.pdf
./docs/guidelines/2014/TAC_KBP_2014_Sentiment_SF_Query_Development_Guidelines_V1.0.pdf
./docs/guidelines/2014/TAC_KBP_2014_Sentiment_Slot_Filling_Guidelines_V1.2.pdf
  The guidelines used by annotators in developing the 2014 Sentiment Slot Filling queries, gold standard data, and assessments contained in this corpus.

./docs/task_descriptions/KBP2013_SentimentSlotFillingTaskDescription_1.1.pdf
  Task Description for the 2013 Sentiment Slot Filling evaluation track, written by track coordinators.

./docs/task_descriptions/KBP2014_SentimentTaskDescription_v1.1.pdf
  Task Description for the 2014 Sentiment Slot Filling evaluation track, written by track coordinators.
./dtd/ssf_queries_2013.dtd
  The DTD for:
    tac_kbp_2013_english_sentiment_slot_filling_evaluation_queries.xml
    tac_kbp_2013_english_sentiment_slot_filling_training_queries.xml

./dtd/ssf_queries_2014.dtd
  The DTD for:
    tac_kbp_2014_english_sentiment_slot_filling_evaluation_queries.xml

./tools/*
  All items in the tools directory were provided to LDC by evaluation track coordinators and are included here as-is, with no additional modifications or testing.

./tools/check_kbp_sentiment-slot-filling.pl
  Validator for 2013 sentiment slot filling submission files.

./tools/check_kbp2014_sentiment-slot-filling.pl
  Validator for 2014 sentiment slot filling submission files.

./tools/2013_SFScore.java
  Scorer for 2013 sentiment slot filling submission files.

./tools/SFScore.java
  Scorer for 2014 sentiment slot filling submission files.

./tools/KBP2013_Sentiment_SF_slot-list
  Necessary input for the 2013 scorer (see scorer for more details).

./tools/KBP2014_Sentiment_SF_slot-list
  Necessary input for the 2014 scorer (see scorer for more details).

3. Annotation Tasks

The tasks conducted by LDC annotators in support of the Sentiment Slot Filling (SSF) track included query development, manual run development, and assessment of system- and human-produced responses to queries.

3.1 Query Development

SSF queries include a query entity and a sentiment slot that indicates both query polarity (whether the sentiment is positive or negative) and directionality (whether the query entity is the holder or the target of the sentiment).

Entity mentions, which are the basis of all SSF queries, are partly selected based on their level of non-confusability. A candidate query entity mention is considered non-confusable if it is "canonical", meaning that it is not an alias and includes more than just a first or last name. Entity mentions are also rejected if they include objectionable content.

Productivity is also heavily weighted in the selection of queries; candidates must usually have at least two responses (a.k.a. "slot fillers") in the source corpus. However, SSF query developers could also focus on selecting queries capable of generating edge-case or interesting fillers (and often both). For example, consider the following two sentences:

  "I think Michael Vick should have been executed for that", said Carlson.

  Carlson said he hated Michael Vick.

Correctly extracting a response from the first of the two statements is more challenging due to the inference needed to derive the sentiment from the stated desired action.

Following initial query development passes, a quality control pass was conducted to flag any fillers that did not have adequate justification in the source document, or that might be at variance with the guidelines in any way. These flagged fillers were then adjudicated by senior annotators, who updated, removed, or replaced them as appropriate.

3.2 Manual Run Development

In developing the "manual runs", or human-produced sets of responses to SSF queries, annotators had up to two hours per query to search the corpus and locate all valid fillers. In 2013, responses could be drawn from any document in the KB source corpus. However, in 2014, answers were only drawn from the same source documents as the queries.

Justification - the minimum extents of provenance supporting the validity of a slot filler - is an adjunct annotation employed to pinpoint the sources of assertions and, thereby, reduce the effort required for assessment.
Valid justification strings should clearly identify all three elements of a relation (i.e., the subject entity, the predicate slot, and the object filler) and the relation between them, with minimal extraneous text. In 2013, justification allowed for up to two discontiguous strings, which could be selected from separate documents. In 2014, the justification format was again altered to allow for up to four justification strings. This facilitated a greater potential for inferred relations that would be difficult to justify with only one or two text extents.

Following the initial round of annotation for manual runs, a quality control pass was conducted to flag any fillers that did not have adequate justification in the source document, or that might be at variance with the guidelines in any way. These flagged fillers were then adjudicated by senior annotators, who updated or removed them as appropriate.

3.3 Assessment

In assessment, annotators judge and coreference slot filler responses returned for the query set from both the human manual run and from systems. Fillers are marked correct if they are found to be both compatible with the slot descriptions and supported by the provided justification string(s) and/or surrounding content. Fillers are assessed as wrong if they do not meet both of the conditions for correctness. Additionally, fillers are assessed as inexact if insufficient or extraneous text was returned for an otherwise correct response.

Justification is assessed as correct if it succinctly and completely supports the relation, and wrong if it does not support the relation at all (or if the corresponding filler is marked wrong). Justification can also be assessed as inexact-short (if part but not all of the information necessary to support the relation is provided) or inexact-long (if it contains all of the information necessary to support the relation but also a great deal of extraneous text). Starting in 2014, responses with justification comprising more than 600 characters in total were automatically marked as ignored and not reviewed during assessment.

In 2013, dual assessment was performed on the responses to 40 of the SSF evaluation queries. For each of the four Sentiment Slot Filling slots, 10 queries were randomly selected for dual assessment of responses.

After first passes of assessment were completed, quality control was performed on the data by senior annotators. This quality control ensured that the extents of each annotated filler and justification were correct and that entities assessed as correct were coreferenced into the appropriate equivalence class.

4. Using the Data

As mentioned in the introduction, note that the corresponding source document collections for this release are included in LDC2018T03 (TAC KBP Comprehensive English Source Corpora 2009-2014). Also, the corresponding Knowledge Base (KB) for much of the data - a 2008 snapshot of Wikipedia - can be obtained via LDC2014T16 (TAC KBP Reference Knowledge Base).

4.1 Offset calculation

For queries, text normalization consisting of a 1-for-1 substitution of newline (0x0A) and tab (0x09) characters with space (0x20) characters was performed on the document text input to the response field.

The values of the beg and end XML elements in the queries.xml files indicate character offsets that identify text extents in the source documents. Offset counting starts from the initial character (character 0) of the source document and includes newlines and all markup characters - that is, the offsets are based on treating the source document file as "raw text", with all its markup included.
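For illustration only, the short Python sketch below (not part of the distributed tools) resolves a query's beg and end values against a raw source document, following the convention described above. The <docid> element, the query "id" attribute, the ".xml" file layout, and the treatment of end offsets as inclusive are assumptions made for the sketch; consult the DTDs in ./dtd/ and the task descriptions for the authoritative structure of the queries.xml files.

  # A minimal sketch, assuming Python 3. The <docid> element, the query "id"
  # attribute, the file layout, and inclusive end offsets are assumptions;
  # the DTDs in ./dtd/ define the authoritative structure of queries.xml.
  import os
  import xml.etree.ElementTree as ET

  def extract_extents(queries_xml, source_dir):
      # An XML parser un-escapes character entities, so <name> is returned
      # in its natural form (see section 4.2 below).
      root = ET.parse(queries_xml).getroot()
      for query in root.findall('query'):
          name = query.findtext('name')
          docid = query.findtext('docid')           # assumed element name
          beg = int(query.findtext('beg'))
          end = int(query.findtext('end'))

          # Offsets count from character 0 of the source file with all
          # markup included, so the file is read as raw text, not parsed.
          path = os.path.join(source_dir, docid + '.xml')  # assumed layout
          with open(path, encoding='utf-8') as f:
              raw = f.read()

          # end is treated as inclusive here (assumption).
          extent = raw[beg:end + 1]

          # The 1-for-1 normalization described above: newlines and tabs
          # become spaces.
          normalized = extent.replace('\n', ' ').replace('\t', ' ')
          print(query.get('id'), name, normalized)

Note that, as section 4.2 below explains, the raw extent extracted this way may still contain XML escapes (e.g., "&amp;") even though an XML parser returns the un-escaped form of the "name" element.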
4.2 Proper ingesting of XML queries

While the character offsets are calculated based on treating the source document as "raw text", the "name" strings being referenced by the queries sometimes contain XML metacharacters, and these had to be "re-escaped" for proper inclusion in the queries.xml file.

For example, an actual name like "AT&T" may show up in a source document file as "AT&amp;T" (because the source document was originally formatted as XML data). But since the source doc is being treated here as raw text, this name string is treated in queries.xml as having 8 characters (i.e., the character offsets, when provided, will point to a string of length 8). However, the "name" element itself, as presented in the queries.xml file, will be even longer - "AT&amp;amp;T" - because the queries.xml file is intended to be handled by an XML parser, which will return "AT&amp;T" when this "name" element is extracted. Using the queries.xml data without XML parsing would yield a mismatch between the "name" value and the corresponding string in the source data.

5. Acknowledgments

This material is based on research sponsored by the Air Force Research Laboratory and the Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Air Force Research Laboratory, the Defense Advanced Research Projects Agency, or the U.S. Government.

The authors acknowledge the following contributors to this data set:
  Dana Fore (LDC)
  Dave Graff (LDC)
  Margaret Mitchell (Microsoft)
  Hoa Dang (NIST)
  Claire Cardie (Cornell)
  Boyan Onyshkevych (DARPA)

6. References

Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014. Overview of Linguistic Resources for the TAC KBP 2014 Evaluations: Planning, Execution, and Results. TAC KBP 2014 Workshop: National Institute of Standards and Technology, Gaithersburg, Maryland, November 17-18.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-2014-overview.pdf

Joe Ellis, Jeremy Getman, Justin Mott, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, Jonathan Wright. 2013. Linguistic Resources for 2013 Knowledge Base Population Evaluations. TAC KBP 2013 Workshop: National Institute of Standards and Technology, Gaithersburg, MD, November 18-19.
www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-workshop2013-linguistic-resources-kbp-eval.pdf

7. Copyright Information

(c) 2021 Trustees of the University of Pennsylvania

8. Contact Information

For further information about this data release, contact the following project staff at LDC:

  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

-----------------------------------------------------------------------------
README created by Dana Fore on January 7, 2016
  updated by Dana Fore on January 29, 2016
  updated by Dana Fore on April 4, 2016
  updated by Neil Kuster on September 14, 2016
  updated by Joe Ellis on October 5, 2016
  updated by Jeremy Getman on October 5, 2016
  updated by Joe Ellis on October 6, 2016
  updated by Joe Ellis on February 17, 2017