TAC KBP English Event Nugget Detection and Coreference
Comprehensive Training and Evaluation Data 2014-2015
Authors: Ann Bies, Joe Ellis, Jeremy Getman, Zhiyi Song,
Stephanie Strassel
1. Overview
This package contains training and evaluation data produced in support of
the TAC KBP English Event Nugget Detection and Coreference tasks in 2014
and 2015.
The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned
in natural texts with those appearing in a knowledge base, extract novel
information about entities from a document collection, and add it to a
new or existing knowledge base.
The goal of the Event Nugget track (EN) is to evaluate system performance on
the detection and coreference of sets of attributes referencing events in
unstructured text. These Event Nuggets consist of a mention of the event from
the text and labels to indicate event type, subtype, and realis. More
information about the Event Nugget track and other TAC KBP evaluations can be
found on the NIST TAC website, http://www.nist.gov/tac/.
This package contains all evaluation and training data developed in
support of the TAC KBP Event Nugget evaluations during the first two
years in which the task was conducted, from 2014-2015. This includes
gold standard event nugget annotations in multiple formats, coreference
information for the nuggets, and tokenization of the source documents,
as well as the source documents themselves.
The data included in this package were originally released to TAC KBP as:
LDC2014E121: DEFT Event Nugget Evaluation Training Data
LDC2015E03: DEFT 2014 Event Nugget Evaluation Source Data
LDC2015E69: DEFT 2014 Event Nugget Evaluation Annotation Data
LDC2015E73: TAC KBP 2015 Event Nugget Training Data Annotation V2
LDC2015E94: TAC KBP 2015 Event Nugget and Event Coreference Linking
Evaluation Source Corpus
LDC2015R26: TAC KBP 2015 Event Nugget and Event Coreference Linking
LDC2016E36: TAC KBP English Event Nugget Detection and Coreference
Comprehensive Training and Evaluation Data 2014-2015
Summary of data included in this package:
+------+------------------+---------+
| Year | Source Documents | Nuggets |
+------+------------------+---------+
| 2014 | 351 | 10719 |
| 2015 | 360 | 12976 |
+------+------------------+---------+
2. Contents
./docs/README.txt
This file
./data/{2014,2015}/contents.txt
The data in this package are organized by the year of original release
in order to clarify dependencies, highlight occasional differences in
formats from one year to another, and increase the readability of the
documentation. The contents.txt file within each year's root directory
provides a list of the contents for all subdirectories as well as
details about file formats and contents.
./docs/all_files.md5
Paths (relative to the root of the corpus) and md5 checksums for all files
in the package.
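The checksums can be verified programmatically before using the package.
The following is a minimal sketch (Python), assuming each line of
all_files.md5 follows the usual md5sum layout of "checksum  path";
inspect the file to confirm the column order before relying on it.

  # Verify package integrity against docs/all_files.md5 (a sketch; the
  # "checksum  path" layout per line is an assumption).
  import hashlib
  import os

  def verify(md5_list="docs/all_files.md5", root="."):
      for line in open(md5_list, encoding="utf-8"):
          if not line.strip():
              continue
          checksum, rel_path = line.split(None, 1)
          rel_path = rel_path.strip()
          with open(os.path.join(root, rel_path), "rb") as f:
              if hashlib.md5(f.read()).hexdigest() != checksum:
                  print("MISMATCH:", rel_path)

  verify()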
./docs/2014/TAC-KBP-Event-Nugget-Detection-Annotation-Guidelines-v1.7.pdf
The guidelines used by annotators in 2014 in developing the gold
standard Event Nugget data contained in this corpus.
./docs/2014/Event_Nugget_Detection_Evaluation-v8.1.pdf
Task Description for the 2014 Event Nugget evaluation track, written
by track coordinators.
./docs/2014/Event-Nugget-Detection-scoring-v17.pdf
Scoring Description for the 2014 Event Nugget evaluation track, written
by track coordinators.
./docs/2015/DEFT_RICH_ERE_Annotation_Guidelines_English_Events_V2.9.pdf
The guidelines used by annotators in 2015 in developing the gold
standard Event Nugget data contained in this corpus.
./docs/2015/Event_Mention_Detection_and_Coreference-2015-v1.1.pdf
Task Description for the 2015 Event Nugget evaluation track, written
by track coordinators.
./docs/2015/Event-Mention-Detection-scoring-v27.pdf
Scoring Description for the 2015 Event Nugget evaluation track, written
by track coordinators.
./dtd/tackbp_event_hoppers.1.0.dtd
DTD for event_hopper XML files.
./dtd/tackbp_event_nuggets.1.0.dtd
DTD for event_nugget XML files.
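The DTDs can be used to validate the annotation files with a standard
XML toolkit. The following is a minimal sketch using lxml; the
annotation file path shown is hypothetical.

  # Validate an annotation file against the packaged DTD (a sketch;
  # the annotation file path below is hypothetical).
  from lxml import etree

  dtd = etree.DTD(open("dtd/tackbp_event_hoppers.1.0.dtd", "rb"))
  tree = etree.parse("path/to/example.event_hoppers.xml")
  if not dtd.validate(tree):
      for error in dtd.error_log:
          print(error)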
./tools/2014/*
This directory contains the scripts and tools prepared by CMU for the
2014 Event Nugget evaluation. The file README.md included here describes
the content in this directory and provides documentation for each
script/tool.
./tools/2015/*
This directory contains the scripts and tools prepared by CMU for the
2015 Event Nugget evaluation. The file README.md included here describes
the content in this directory and provides documentation for each
script/tool.
3. Event Nugget Annotation
The Event Nugget track (EN) began in 2014 as a pilot under DARPA's Deep
Exploration and Filtering of Text (DEFT) program. In its first year, EN
coordinators at CMU adapted the event annotation guidelines of LDC's
Light Entities, Relations and Events (Light ERE) annotation task, in
which annotators extract, label, and coreference mentions of entities,
relations, and events in unstructured texts.
In 2014, the extracted reference to the event from the source text was
primarily a single word (a verb, noun, adjective, or adverb), but the
definition also allowed for continuous and discontinuous multi-word
phrases.
Additionally, the 2014 event inventory included 33 types and coreference
was not performed on the events.
For the 2015 EN evaluations, the event 'trigger' - the textual extent
indicating a reference to a valid event - was redefined as the smallest
contiguous extent of text that most saliently expresses the occurrence
of an event. Additionally, annotators for the 2015 data were allowed to
'double tag' an event trigger if it indicated more than one event,
usually an indication of inferred events.
Event coreference was also added to EN in 2015. Again drawing on Rich ERE,
EN adopted the notion of 'event hoppers', a more inclusive, less strict
notion of event coreference as compared to previous approaches. Event
mentions are added to an event hopper when they "feel" coreferential to an
annotator, even if they do not meet a strict event identity requirement.
Event nuggets can be placed into the same event hoppers even if they differ
in temporal or trigger granularity, their arguments are non-coreferential or
conflicting, or even, with some exceptions, if they differ in their realis
attributes (an indication of whether an event has actually occurred or not).
Gold standard EN data were developed by first having two annotators
perform independent first passes on each document, followed by an
adjudication pass conducted by a senior annotator to resolve
disagreements. Following adjudication of all documents, a corpus-wide
quality control pass was also performed. In 2015, Event Nugget and
coreference annotation were performed as a single task: annotators
marked each event nugget and then immediately decided whether it should
be coreferenced with any existing event hoppers.
Four primary checks were performed during corpus-wide quality control.
First, annotators manually scanned event triggers to review event type
and subtype values. Second, they ensured that event mentions with a
GENERIC realis label were not in the same hopper as event mentions with
OTHER or ACTUAL realis labels. Third, all event hoppers were scanned to
make sure that event mentions in the same hopper had the same type and
subtype values (with some exceptions; see the guidelines for details).
Finally, annotators looked for and, where necessary, corrected any other
anomalies.
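A check of this kind is straightforward to reproduce over the hopper
XML. The following is a minimal sketch of the GENERIC-vs-ACTUAL/OTHER
realis check; the element and attribute names used here (hopper,
event_mention, realis) are assumptions, so consult
./dtd/tackbp_event_hoppers.1.0.dtd for the authoritative structure.

  # Flag hoppers mixing GENERIC with ACTUAL/OTHER realis labels (a
  # sketch; element/attribute names are assumptions, see ./dtd/).
  import xml.etree.ElementTree as ET

  def check_realis(hopper_file):
      tree = ET.parse(hopper_file)
      for hopper in tree.iter("hopper"):
          labels = {em.get("realis", "").upper()
                    for em in hopper.iter("event_mention")}
          if "GENERIC" in labels and labels & {"ACTUAL", "OTHER"}:
              print("realis conflict in hopper", hopper.get("id"))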
Following the annotation tasks, the resulting data were processed in a
few different ways, using toolkits provided by the evaluation
coordinators, in order to serve different evaluations within EN.
Specifically, data produced by the Event Nugget annotation task in 2015
were used for the Event Nugget Detection evaluation, the Event Nugget
Detection and Coreference evaluation, and the Event Nugget Coreference
evaluation.
4. Source Documents
4.1 Newswire Data
Newswire data use the following markup framework:
  <DOC id="{doc_id}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present),
and the TEXT content may or may not include "<P> ... </P>" tags
(depending on whether or not the "doc_type_label" is "story"). All the
newswire files are parseable as XML but are treated as plain text for
annotation.

4.2 Multi-Post Discussion Forum Data

Multi-Post Discussion Forum files (MPDFs) are derived from English
Discussion Forum threads. They consist of a continuous run of posts from
a thread but are only approximately 800 words in length (excluding
metadata and text within <quote> elements). When taken from a short
thread, an MPDF may comprise the entire thread. However, when taken from
a longer thread, an MPDF is a truncated version of its source, though it
will always start with the thread's initial post.

The MPDF files use the following markup framework, in which there may
also be arbitrarily deep nesting of quote elements, and other elements
may be present (e.g. "<a>...</a>" anchor tags):

  <doc id="{doc_id}">
  <headline>
  ...
  </headline>
  <post author="{author_id}" datetime="{datetime_string}" id="{post_id}">
  ...
  <quote orig_author="{original_author_id}">
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

All the MPDF files are parseable as XML but are treated as plain text
for annotation.

5. Using the Data

5.1 Offset Calculation

All annotation XML files (file names "*.event_nuggets.xml") represent
stand-off annotation of source files (file names "*.txt") and use
offsets to refer to the text extents. The event_mention XML elements all
have attributes or contain sub-elements which use character offsets to
identify text extents in the source. The offset gives the start
character of the text extent; offset counting starts from the initial
character, character 0, of the source document (.txt file) and includes
newlines as well as all characters comprising XML-like tags in the
source data.

5.2 Proper ingesting of XML

Because each source text document is extracted verbatim from source XML
files, certain characters in its content (ampersands, angle brackets,
etc.) are escaped according to the XML specification. The offsets of
text extents are based on treating this escaped text as-is (e.g. "&amp;"
in a .txt file is counted as five characters). Whenever any such string
of "raw" text is included in an *.event_nuggets.xml file (as the text
extent to which an annotation is applied), a second level of escaping
has been applied, so that XML parsing of the XML file will produce a
string that exactly matches the source text. For example, a reference to
the corporation "AT&T" will appear in the .txt file as "AT&amp;T". Event
Nugget annotation on this string would cite a length of 8 characters
(not 4), and the string is stored in the XML file as "AT&amp;amp;T" -
when the XML file is parsed as intended, this will return "AT&amp;T" to
match the .txt content.
6. Acknowledgements

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and
conclusions contained herein are those of the authors and should not be
interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of the Air Force Research
Laboratory and Defense Advanced Research Projects Agency or the U.S.
Government.

The authors acknowledge the following contributors to this data set:
  Dave Graff (LDC)
  Xiaoyi Ma (LDC)
  Justin Mott (LDC)
  Tom Reise (LDC)
  Hoa Dang (NIST)
  Eduard Hovy (CMU)
  Teruko Mitamura (CMU)
  Boyan Onyshkevych (DARPA)

7. Copyright Information

(c) 2020 Trustees of the University of Pennsylvania

8. References

Joe Ellis, Jeremy Getman, Dana Fore, Neil Kuster, Zhiyi Song, Ann Bies,
Stephanie Strassel. 2015. Overview of Linguistic Resources for the TAC
KBP 2015 Evaluations: Methodologies and Results. TAC KBP 2015 Workshop:
National Institute of Standards and Technology, Gaithersburg, Maryland,
November 16-17.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp2015_overview.pdf

9. Contact Information

  Stephanie Strassel - PI
  Jonathan Wright - Technical oversight
  Zhiyi Song - ERE annotation project manager
  Ann Bies - ERE annotation coordinator
  Jeremy Getman - TAC KBP lead annotator

--------------------------------------------------------------------------
README created by Jeremy Getman on December 11, 2015
  updated by Jeremy Getman on December 14, 2015
  updated by Jeremy Getman on January 8, 2016
  updated by Jeremy Getman on January 12, 2016
  updated by Dana Fore on February 5, 2016
  updated by Jeremy Getman on April 7, 2016
  updated by Joe Ellis on June 15, 2016
  updated by Joe Ellis on September 21, 2016
  updated by Joe Ellis on November 28, 2016