TAC KBP English Event Nugget Detection and Coreference Comprehensive Training
and Evaluation Data 2014-2015

Authors: Ann Bies, Joe Ellis, Jeremy Getman, Zhiyi Song, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of the
TAC KBP English Event Nugget Detection and Coreference tasks in 2014 and 2015.

The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base, extract novel
information about entities from a document collection, and add it to a new or
existing knowledge base.

The goal of the Event Nugget (EN) track is to evaluate system performance on
the detection and coreference of sets of attributes referencing events in
unstructured text. These event nuggets consist of a mention of the event from
the text plus labels indicating event type, subtype, and realis. More
information about the Event Nugget track and other TAC KBP evaluations can be
found on the NIST TAC website, http://www.nist.gov/tac/.

This package contains all evaluation and training data developed in support of
the TAC KBP Event Nugget evaluations during the first two years in which the
task was conducted, 2014-2015. This includes gold standard event nugget
annotations in multiple formats, coreference information for the nuggets, and
tokenization of the source documents, as well as the source documents
themselves.

The data included in this package were originally released to TAC KBP as:

  LDC2014E121: DEFT Event Nugget Evaluation Training Data
  LDC2015E03:  DEFT 2014 Event Nugget Evaluation Source Data
  LDC2015E69:  DEFT 2014 Event Nugget Evaluation Annotation Data
  LDC2015E73:  TAC KBP 2015 Event Nugget Training Data Annotation V2
  LDC2015E94:  TAC KBP 2015 Event Nugget and Event Coreference Linking
               Evaluation Source Corpus
  LDC2015R26:  TAC KBP 2015 Event Nugget and Event Coreference Linking
  LDC2016E36:  TAC KBP English Event Nugget Detection and Coreference
               Comprehensive Training and Evaluation Data 2014-2015

Summary of data included in this package:

  +------+------------------+---------+
  | Year | Source Documents | Nuggets |
  +------+------------------+---------+
  | 2014 |              351 |   10719 |
  | 2015 |              360 |   12976 |
  +------+------------------+---------+

2. Contents

./docs/README.txt
  This file.

./data/{2014,2015}/contents.txt
  The data in this package are organized by the year of original release in
  order to clarify dependencies, highlight occasional differences in formats
  from one year to another, and to increase readability in documentation. The
  contents.txt file within each year's root directory lists the contents of
  all subdirectories and details the file formats and contents.

./docs/all_files.md5
  Paths (relative to the root of the corpus) and md5 checksums for all files
  in the package (see the verification sketch at the end of this section).

./docs/2014/TAC-KBP-Event-Nugget-Detection-Annotation-Guidelines-v1.7.pdf
  The guidelines used by annotators in 2014 in developing the gold standard
  Event Nugget data contained in this corpus.

./docs/2014/Event_Nugget_Detection_Evaluation-v8.1.pdf
  Task description for the 2014 Event Nugget evaluation track, written by
  track coordinators.

./docs/2014/Event-Nugget-Detection-scoring-v17.pdf
  Scoring description for the 2014 Event Nugget evaluation track, written by
  track coordinators.

./docs/2015/DEFT_RICH_ERE_Annotation_Guidelines_English_Events_V2.9.pdf
  The guidelines used by annotators in 2015 in developing the gold standard
  Event Nugget data contained in this corpus.

./docs/2015/Event_Mention_Detection_and_Coreference-2015-v1.1.pdf
  Task description for the 2015 Event Nugget evaluation track, written by
  track coordinators.

./docs/2015/Event-Mention-Detection-scoring-v27.pdf
  Scoring description for the 2015 Event Nugget evaluation track, written by
  track coordinators.

./dtd/tackbp_event_hoppers.1.0.dtd
  DTD for event_hopper XML files.

./dtd/tackbp_event_nuggets.1.0.dtd
  DTD for event_nugget XML files.

./tools/2014/*
  Scripts and tools prepared by CMU for the 2014 Event Nugget evaluation. The
  README.md included here describes the contents of the directory and
  documents each script/tool.

./tools/2015/*
  Scripts and tools prepared by CMU for the 2015 Event Nugget evaluation. The
  README.md included here describes the contents of the directory and
  documents each script/tool.
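
For users who want to confirm that the package is complete and uncorrupted,
the checksums in ./docs/all_files.md5 can be recomputed and compared, as in
the Python sketch below. The sketch is illustrative only and assumes the
conventional md5sum-style format of one "<checksum>  <relative path>" pair per
line; adjust the parsing if the file is laid out differently.

  # Illustrative integrity check against ./docs/all_files.md5 (assumes
  # md5sum-style lines: "<checksum>  <path relative to the corpus root>").
  import hashlib
  import sys
  from pathlib import Path

  def verify(corpus_root):
      root = Path(corpus_root)
      ok = True
      for line in (root / "docs" / "all_files.md5").read_text().splitlines():
          if not line.strip():
              continue
          expected, relpath = line.split(None, 1)
          digest = hashlib.md5((root / relpath.strip()).read_bytes()).hexdigest()
          if digest != expected:
              print("MISMATCH:", relpath.strip())
              ok = False
      return ok

  if __name__ == "__main__":
      sys.exit(0 if verify(sys.argv[1]) else 1)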

3. Event Nugget Annotation

The Event Nugget (EN) track began in 2014 as a pilot task within DARPA's Deep
Exploration and Filtering of Text (DEFT) program. In its first year, EN
coordinators at CMU adapted the event annotation guidelines from LDC's Light
Entities, Relations, and Events (Light ERE) task, in which annotators extract,
label, and coreference mentions of entities, relations, and events in
unstructured texts. In 2014, the extracted reference to the event in the
source text was primarily a single word (a verb, noun, adjective, or adverb),
but the definition also allowed for continuous and discontinuous multi-word
phrases. The 2014 event inventory included 33 types, and coreference was not
performed on the events.

For the 2015 EN evaluations, event 'triggers' - the textual extents indicating
a reference to a valid event - were redefined as the smallest contiguous
extent of text that most saliently expresses the occurrence of an event.
Additionally, annotators of the 2015 data were allowed to 'double tag' event
triggers that indicated more than one event, usually a sign of inferred
events.

Event coreference was also added to EN in 2015. Again drawing on Rich ERE, EN
adopted the notion of 'event hoppers', a more inclusive, less strict notion of
event coreference than previous approaches. Event mentions are added to an
event hopper when they "feel" coreferential to an annotator, even if they do
not meet a strict event identity requirement. Event nuggets can be placed in
the same event hopper even if they differ in temporal or trigger granularity,
if their arguments are non-coreferential or conflicting, or even, with some
exceptions, if they differ in their realis attributes (an indication of
whether an event actually occurred or not).

Gold standard EN data were developed by first having two annotators perform
independent first passes on each document, followed by an adjudication pass
conducted by a senior annotator to resolve disagreements. Following
adjudication of all documents, a corpus-wide quality control pass was also
performed.

In 2015, Event Nugget and Coreference annotation were performed as a single
task: annotators marked each event nugget and then immediately decided whether
it should be coreferenced with any existing event hopper.

Four primary checks are performed during corpus-wide quality control. First,
annotators manually scan event triggers to review event type and subtype
values. Second, they ensure that event mentions with a GENERIC realis label
are not in the same hopper as event mentions with OTHER or ACTUAL realis
labels. Third, all event hoppers are scanned to make sure that event mentions
in the same hopper have the same type and subtype values (with some
exceptions; see the guidelines for details). Finally, annotators look for and,
where necessary, correct any other anomalies. (An illustrative sketch of the
hopper-level checks appears at the end of this section.)

Following the annotation tasks, the resulting data were processed in several
different ways, using toolkits provided by the evaluation coordinators, in
order to serve different evaluations within EN. Specifically, data produced by
the Event Nugget annotation task in 2015 were used for the Event Nugget
Detection evaluation, the Event Nugget Detection and Coreference evaluation,
and the Event Nugget Coreference evaluation.
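
As a rough illustration of the hopper-level checks described above, the short
Python sketch below flags hoppers that mix a GENERIC realis mention with
ACTUAL or OTHER mentions, or that mix type/subtype values. It is not one of
the CMU tools distributed under ./tools/, and the element and attribute names
it assumes (hopper, event_mention, type, subtype, realis) should be verified
against ./dtd/tackbp_event_hoppers.1.0.dtd before use.

  # Illustrative hopper consistency check; element/attribute names are
  # assumptions about the event_hopper XML format (see the package DTDs).
  import sys
  import xml.etree.ElementTree as ET

  def check_hoppers(path):
      root = ET.parse(path).getroot()
      for hopper in root.iter("hopper"):
          mentions = hopper.findall(".//event_mention")
          realis = {(m.get("realis") or "").lower() for m in mentions}
          types = {(m.get("type"), m.get("subtype")) for m in mentions}
          if "generic" in realis and len(realis) > 1:
              print(path, hopper.get("id"), "mixes GENERIC with other realis")
          if len(types) > 1:
              # Some mixed-type hoppers are legitimate; see the guidelines.
              print(path, hopper.get("id"), "mixes type/subtype:", types)

  if __name__ == "__main__":
      for xml_file in sys.argv[1:]:
          check_hoppers(xml_file)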

4. Source Documents

4.1 Newswire Data

Newswire data use the following markup framework:

  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present), and
the TEXT content may or may not include "<P> ... </P>" tags (depending on
whether or not the "doc_type_label" is "story"). All the newswire files are
parseable as XML but are treated as plain text for annotation.

4.2 Multi-Post Discussion Forum Data

Multi-Post Discussion Forum files (MPDFs) are derived from English discussion
forum threads. They consist of a continuous run of posts from a thread and are
only approximately 800 words in length (excluding metadata and text within
<quote> elements). When taken from a short thread, an MPDF may comprise the
entire thread; when taken from a longer thread, an MPDF is a truncated version
of its source, though it always starts with the thread's preliminary (first)
post.

The MPDF files use the following markup framework, in which there may also be
arbitrarily deep nesting of quote elements, and other elements may be present
(e.g. "<a ...> ... </a>" anchor tags):

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post ...>
  ...
  <quote ...>
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

All the MPDF files are parseable as XML but are treated as plain text in
annotation.

5. Using the Data

5.1 Offset Calculation

All annotation XML files (file names "*.event_nuggets.xml") represent
stand-off annotation of source files (file names "*.txt") and use offsets to
refer to text extents. The event_mention XML elements all have attributes or
contain sub-elements that use character offsets to identify text extents in
the source. The offset gives the start character of the text extent; offset
counting starts from the initial character, character 0, of the source
document (.txt file) and includes newlines as well as all characters
comprising XML-like tags in the source data.
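
To make the offset convention concrete, the sketch below pulls the text extent
for each annotated trigger out of a source .txt file. It is illustrative only:
the trigger element and its offset/length attribute names are assumptions
about the annotation XML (check the DTDs under ./dtd/); the essential point is
simply that offsets are 0-based character positions into the raw .txt file,
counting newlines and tag characters.

  # Illustrative stand-off lookup; element/attribute names are assumptions.
  import xml.etree.ElementTree as ET

  def extent_text(source_txt_path, offset, length):
      # Offsets are 0-based character positions into the raw .txt file,
      # counting newlines and any XML-like tag characters as-is.
      with open(source_txt_path, encoding="utf-8", newline="") as f:
          text = f.read()
      return text[offset:offset + length]

  def triggers(nugget_xml_path):
      root = ET.parse(nugget_xml_path).getroot()
      for mention in root.iter("event_mention"):
          trig = mention.find("trigger")              # assumed sub-element
          yield (int(trig.get("offset")),             # assumed attribute
                 int(trig.get("length")),             # assumed attribute
                 trig.text)

Because the XML parser resolves the annotation file's second level of escaping
(see section 5.2 below), the string returned by extent_text() for a given
offset and length should match the parsed trigger text exactly.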

5.2 Proper Ingesting of XML

Because each source text document is extracted verbatim from source XML files,
certain characters in its content (ampersands, angle brackets, etc.) are
escaped according to the XML specification. The offsets of text extents are
based on treating this escaped text as-is (e.g. "&amp;" in a cmp.txt file is
counted as five characters). Whenever any such string of "raw" text is
included in an *.event_nuggets.xml file (as the text extent to which an
annotation is applied), a second level of escaping has been applied, so that
XML parsing of the annotation file will produce a string that exactly matches
the source text. For example, a reference to the corporation "AT&T" will
appear in the .txt file as "AT&amp;T". Event Nugget annotation on this string
would cite a length of 8 characters (not 4), and the string is stored in the
annotation XML file as "AT&amp;amp;T" - when the XML file is parsed as
intended, this returns "AT&amp;T", matching the .txt content.

6. Acknowledgements

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under agreement
number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and
distribute reprints for Governmental purposes notwithstanding any copyright
notation thereon. The views and conclusions contained herein are those of the
authors and should not be interpreted as necessarily representing the official
policies or endorsements, either expressed or implied, of the Air Force
Research Laboratory, the Defense Advanced Research Projects Agency, or the
U.S. Government.

The authors acknowledge the following contributors to this data set:

  Dave Graff (LDC)
  Xiaoyi Ma (LDC)
  Justin Mott (LDC)
  Tom Reise (LDC)
  Hoa Dang (NIST)
  Eduard Hovy (CMU)
  Teruko Mitamura (CMU)
  Boyan Onyshkevych (DARPA)

7. Copyright Information

(c) 2020 Trustees of the University of Pennsylvania

8. References

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies,
Stephanie Strassel. 2015. Overview of Linguistic Resources for the TAC KBP
2015 Evaluations: Methodologies and Results. TAC KBP 2015 Workshop: National
Institute of Standards and Technology, Gaithersburg, Maryland, November 16-17.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp2015_overview.pdf

9. Contact Information

  Stephanie Strassel    PI
  Jonathan Wright       Technical oversight
  Zhiyi Song            ERE annotation project manager
  Ann Bies              ERE annotation coordinator
  Jeremy Getman         TAC KBP lead annotator

--------------------------------------------------------------------------
README created by Jeremy Getman on December 11, 2015
  updated by Jeremy Getman on December 14, 2015
  updated by Jeremy Getman on January 8, 2016
  updated by Jeremy Getman on January 12, 2016
  updated by Dana Fore on February 5, 2016
  updated by Jeremy Getman on April 7, 2016
  updated by Joe Ellis on June 15, 2016
  updated by Joe Ellis on September 21, 2016
  updated by Joe Ellis on November 28, 2016