TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data
2016-2017

Authors: Jennifer Tracey, Michael Arrigo, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of
the TAC KBP Belief and Sentiment (BeSt) evaluation track in 2016 and 2017.

Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed
to encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through
its various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned
in natural texts with those appearing in a knowledge base and extract
novel information about entities from a document collection and add it to
a new or existing knowledge base.

The goal of the BeSt track is to provide information about beliefs and
sentiments held by entities toward other entities, as well as toward
events and relations. Given a document collection and a gold standard or
machine-predicted set of labeled entities, relations, and events, a BeSt
system is required to automatically label belief and sentiment about each
possible target (entity, relation, or event), as well as to identify the
entity that holds the belief or sentiment.

More information about the TAC KBP Belief and Sentiment track and other
TAC evaluations can be found on the NIST TAC website:
http://www.nist.gov/tac/.

Additional information about the BeSt evaluations and annotation can be
found in the following paper:

  Jennifer Tracey, Owen Rambow, Michael Arrigo, Claire Cardie, Adam
  Dalton, Hoa Trang Dang, Mona Diab, Bonnie Dorr, Louise Guthrie,
  Magdalena Markowska, Smaranda Muresan, Vinodkumar Prabhakaran, Samira
  Shaikh, Tomek Strzalkowski, Janyce Wiebe. (2022). BeSt: The Belief and
  Sentiment Corpus. In Proceedings of the 13th Edition of the Language
  Resources and Evaluation Conference, Marseille, June 20-25.
  https://www.ldc.upenn.edu/sites/default/files/lrec2022-best-belief-and-sentiment.pdf

This package contains all of the source documents, gold standard entity,
relation, and event (ERE) annotation, and belief and sentiment annotation
used in the 2016 and 2017 BeSt evaluations. The data included in this
package were originally released by LDC to TAC KBP coordinators and
performers under the following ecorpora catalog IDs and titles:

  LDC2016E27:  DEFT English Belief and Sentiment Annotation
  LDC2016E61:  DEFT Chinese Belief and Sentiment Annotation
  LDC2016E62:  DEFT Spanish Belief and Sentiment Annotation
  LDC2016E114: TAC KBP 2016 Belief and Sentiment Evaluation Gold Standard
               Annotation
  LDC2017E80:  TAC KBP 2017 Belief and Sentiment Evaluation Gold Standard
               Annotation

Summary of data included in this package:

  +----------+------+---------------+------------------+
  | Dataset  | Docs | Belief Labels | Sentiment Labels |
  +----------+------+---------------+------------------+
  | training |  505 |         41513 |            80945 |
  | 2016     |  494 |         46160 |            61863 |
  | 2017     |  500 |         54412 |            65753 |
  +----------+------+---------------+------------------+

Note that sentiment labels include the label "none", indicating no
sentiment toward the target entity, relation, or event.

2. Contents

./README.txt
  This file.
./data/{training,2016,2017}
  Files associated with the training data set, the 2016 evaluation, and
  the 2017 evaluation. Under each data set partition, files are arranged
  by language, with subdirectories for source data, ERE annotation, and
  BeSt annotation:

    {cmn,eng,spa}/source
    {cmn,eng,spa}/ere
    {cmn,eng,spa}/annotation

  Note that in the training dataset, some long source documents were
  split into multiple shorter sections for annotation. In such cases, the
  source document appears in the source directory as a single file, but
  the corresponding annotation appears as two separate files with
  character offset ranges added to the source document filename. For
  example, the source document

    SPA_DF_001258_20141021_F0000009Y.xml

  has two corresponding annotation files:

    SPA_DF_001258_20141021_F0000009Y_0-5507.best.xml
    SPA_DF_001258_20141021_F0000009Y_5509-6376.best.xml

  where SPA_DF_001258_20141021_F0000009Y_0-5507.best.xml contains
  annotations on the portion of the source document from character offset
  0 to 5507, and SPA_DF_001258_20141021_F0000009Y_5509-6376.best.xml
  contains annotations on the portion of the document from character
  offset 5509 to 6376.

./docs/deft_anomaly_belief_sentiment_guidelines_v2.3.pdf
  The most up-to-date version of the BeSt annotation guidelines for
  annotating belief and sentiment.

./docs/ere_guidelines/
  The ERE guidelines used to produce the gold standard entities,
  relations, and events that serve as targets of belief and sentiment
  annotation.

./dtd/belief_sentiment.2.1.0.dtd
  Document Type Definition for BeSt annotation XML files.

./dtd/deft_rich_ere.1.2.dtd
  Document Type Definition for 2017 ERE annotation XML files.

./dtd/deft_rich_ere.1.1.dtd
  Document Type Definition for 2016 ERE annotation XML files.

./dtd/kbp_source_df.dtd
  Document Type Definition for discussion forum (DF) thread XML files.

./dtd/kbp_source_newswire.dtd
  Document Type Definition for all newswire (NW) XML files.

3.0 Annotation Task

Belief-sentiment annotation has two components: belief and sentiment.

Belief annotation marks the belief-holder's commitment to a belief in the
occurrence of an event (event-target), the participation of an entity in
an annotated event (entity-target), and/or the existence of a relation
(relation-target). There are four categories of belief annotation:

  Committed Belief (CB) -- the holder believes the proposition with
  certainty

  Non-committed Belief (NCB) -- the holder believes the proposition to be
  possibly, but not necessarily, true

  Reported Belief (ROB) -- the holder reports the belief as belonging to
  someone else, without specifying their own belief or lack of belief in
  the proposition

  Not Applicable (NA) -- the holder expresses some cognitive attitude
  other than belief toward the proposition, such as desire, intention, or
  obligation

In addition to the target and belief type, the holder of the belief is
explicitly indicated (and in the case of reported belief, a chain of
attribution is annotated), and the polarity of the belief is indicated
(positive polarity means belief, at the indicated level of commitment,
that the event/relation/entity-participation did occur, while negative
polarity means belief that it did not occur).

Sentiment is annotated with entities (independent of their role in an
event or relation), relations, and events as targets. Polarity indicates
positive or negative sentiment, and the holder (including chain of
attribution where relevant) is indicated as in belief annotation.
The sarcasm attribute signals whether the polarity of the belief or
sentiment was tagged as the opposite of what a literal reading of the
text (without context) would suggest.

The targets and holders of belief and/or sentiment are entity, relation,
and event mentions annotated in DEFT Rich ERE. Beliefs and sentiments
toward other targets are not annotated. Please see the annotation
guidelines included in the docs directory of this release for additional
details.

4.0 Data Profile and Formats

Summary of data included in this package by language, dataset and genre:

  +----------+----------------+------+---------------+------------------+
  | Language | Dataset/Genre  | Docs | Belief Labels | Sentiment Labels |
  +----------+----------------+------+---------------+------------------+
  | Chinese  | training/DF    |  200 |         13192 |            27982 |
  | Chinese  | 2016/DF        |   82 |          4579 |            10650 |
  | Chinese  | 2016/NW        |   79 |          7604 |             8330 |
  | Chinese  | 2017/DF        |   84 |          7168 |            13494 |
  | Chinese  | 2017/NW        |   83 |         11686 |            10267 |
  | English  | training/DF    |  209 |         13900 |            32605 |
  | English  | training/NW    |   37 |          5015 |             6059 |
  | English  | 2016/DF        |   84 |          6286 |            11762 |
  | English  | 2016/NW        |   81 |         15080 |            13717 |
  | English  | 2017/DF        |   84 |          7600 |            11402 |
  | English  | 2017/NW        |   83 |         12430 |            10968 |
  | Spanish  | training/DF    |   95 |          9406 |            14299 |
  | Spanish  | 2016/DF        |   84 |          4778 |             9213 |
  | Spanish  | 2016/NW        |   84 |          7833 |             8191 |
  | Spanish  | 2017/DF        |   83 |          6549 |            10268 |
  | Spanish  | 2017/NW        |   83 |          8979 |             9354 |
  +----------+----------------+------+---------------+------------------+

4.1 Source Data Formats

Source documents are in several different formats. Newswire documents are
newswire XML. Discussion forum data may be either plain text or XML.

Due to the length of many discussion forum threads, annotation of entire
threads for KBP was impractical. Therefore, LDC selected units we call
Continuous Multi-Posts (CMPs), which consist of a continuous run of posts
from a single thread. The length of a CMP is between 100 and 1000 words.
In the case of a short thread, this may include the entire thread; in the
case of longer threads, the CMP is a truncated version of the thread (and
more than one CMP may come from a single original thread).

Older CMPs are named with a hexadecimal string. These CMPs are present in
the source directories as cmp.txt files. Newer CMPs are named
<threadID>_<beg>-<end>, where "beg" and "end" are offsets for the
beginning and end of the document, respectively. For these documents, the
entire source thread is included as DF XML.

Note that each older-style CMP is an XML fragment. Because of the method
used to extract the text from the original discussion forum thread data,
each CMP file contains residual markup tags and/or character entity
references, but is NOT a full XML document (it is not expected to pass
XML validation), and so should be treated as raw text.
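Both the split training annotation filenames described in Section 2 and
the newer-style CMP names encode a character offset range as a trailing
"_<beg>-<end>". As an illustration only, the following Python sketch
recovers a newer-style CMP's text from its full source thread; the CMP
name is hypothetical, the thread is assumed to be available as
<threadID>.xml, and whether the end offset is inclusive should be
verified against the data:

  import re

  # Hypothetical newer-style CMP name of the form <threadID>_<beg>-<end>
  cmp_name = "ENG_DF_000183_20150407_F0000009A_120-980"
  thread_id, beg, end = re.match(r"(.+)_(\d+)-(\d+)$", cmp_name).groups()

  # Assumption: the full source thread is stored as <threadID>.xml
  with open(thread_id + ".xml", encoding="utf-8") as f:
      thread = f.read()

  # Offsets count every character of the raw file, tags and newlines
  # included; the end offset is treated as inclusive here (an assumption).
  cmp_text = thread[int(beg):int(end) + 1]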

4.1.1 Newswire XML

The following is a generalization of the newswire markup framework:

  <DOC id="{doc_id}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <AUTHOR>
  ...
  </AUTHOR>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>
where the HEADLINE, DATELINE and AUTHOR tags are optional (not always
present), and the TEXT content may or may not include "<P> ... </P>" tags
(depending on whether or not the "doc_type_label" is "story"). All the
newswire files are parseable as XML. See the relevant DTDs for exact
details of newswire markup. Text content within each markup region is a
valid tagging region for annotation. Note that English and Spanish NW
documents sometimes have Chinese author names within the <AUTHOR> tags.
These Chinese author names are tagged as PER names.
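Since the newswire files are valid XML, they can be read with a standard
XML parser. A minimal sketch (the filename is hypothetical; element names
follow the generalized markup above):

  import xml.etree.ElementTree as ET

  # Parse a newswire document (filename hypothetical); HEADLINE, DATELINE
  # and AUTHOR may be absent, and <P> tags only appear in "story" docs.
  root = ET.parse("ENG_NW_001278_20131206_F00011JDX.xml").getroot()
  headline = root.findtext("HEADLINE")           # None if no HEADLINE tag
  paragraphs = [p.text for p in root.iter("P")]

Keep in mind that ERE and BeSt offsets refer to the raw file contents,
tags and newlines included (see Section 4.2), so offset-based processing
should read the file as plain text rather than rely on parsed output.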
4.1.2 Discussion Forum Data

The following is a generalization of the DF thread markup framework, in
which there may also be arbitrarily deep nesting of quote elements, and
other elements may be present (e.g. "<a>...</a>" anchor tags):

  <doc id="{doc_id}">
  <headline>
  ...
  </headline>
  <post author="{author_string}" datetime="{datetime_string}" id="{post_id}">
  ...
  <quote orig_author="{author_string}">
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

As noted above, some of the older DF source data is present as cmp.txt
files and should be treated as plain text. The DTD for DF XML applies
only to the DF source files that are present as XML files. See the
relevant DTD for exact details of discussion forum markup. Text contents
within the <quote> elements are not valid tagging regions for annotation.
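Because quoted material is excluded from annotation, it can be useful to
compute the raw-text spans covered by quote elements, nesting included.
The following sketch does this with a simple tag scan over the raw file,
assuming well-formed (possibly nested) <quote> ... </quote> pairs; the
filename is hypothetical:

  import re

  QUOTE_TAG = re.compile(r"<(/?)quote(?:\s[^>]*)?>")

  def quote_spans(text):
      """Return (start, end) raw-text spans of outermost quote elements."""
      spans, stack = [], []
      for m in QUOTE_TAG.finditer(text):
          if m.group(1):                 # closing tag </quote>
              start = stack.pop()
              if not stack:              # outermost quote just closed
                  spans.append((start, m.end()))
          else:                          # opening tag <quote ...>
              stack.append(m.start())
      return spans

  with open("ENG_DF_000183_20150407_F0000009A.xml", encoding="utf-8") as f:
      print(quote_spans(f.read()))

Scanning the raw text (rather than a parsed tree) keeps the resulting
spans directly comparable to annotation offsets.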
4.2 Rich ERE and BeSt XML

All ERE and BeSt XML files (file names "*.rich_ere.xml", "*.best.xml")
represent stand-off annotation of source files and use offsets to refer
to text extents. The offset gives the start character of the text extent;
offset counting starts from the initial character, character 0, of the
source document and includes newlines as well as all characters
comprising XML-like tags in the source data.
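A minimal sketch of resolving stand-off offsets against a source
document; it assumes, per the Rich ERE DTDs, that entity mention elements
carry "offset" and "length" attributes and a mention_text child, and the
filenames are hypothetical:

  import xml.etree.ElementTree as ET

  # Read the source as raw text so that offsets line up: character 0 is
  # the first character of the file, and tags and newlines all count.
  with open("SPA_DF_001258_20141021_F0000009Y.xml", encoding="utf-8") as f:
      source = f.read()

  ere = ET.parse("SPA_DF_001258_20141021_F0000009Y.rich_ere.xml")
  for mention in ere.iter("entity_mention"):
      beg = int(mention.get("offset"))
      length = int(mention.get("length"))
      extent = source[beg:beg + length]
      # After XML parsing of the ERE file, mention_text should match the
      # raw source slice exactly (see the escaping note in Section 4.3).
      assert extent == mention.findtext("mention_text")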
4.3 Proper Ingesting of XML

Because each DF document is extracted verbatim from source XML files,
certain characters in its content (ampersands, angle brackets, etc.) are
escaped according to the XML specification. The offsets of text extents
are based on treating this escaped text as-is (e.g. "&amp;" in a cmp.txt
file is counted as five characters). Whenever any such string of "raw"
text is included in a .rich_ere.xml file (as the text extent to which an
annotation is applied), a second level of escaping has been applied, so
that XML parsing of the ERE XML file will produce a string that exactly
matches the source text.

For example, a reference to the corporation "AT&T" will appear in the CMP
as "AT&amp;T". ERE annotation on this string would cite a length of 8
characters (not 4), and the string is stored in the ERE XML file as
"AT&amp;amp;T" - when the ERE XML file is parsed as intended, this will
return "AT&amp;T" to match the CMP TXT content.
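The double escaping can be verified with a short round trip; a minimal
sketch using the Python standard library:

  from xml.sax.saxutils import escape, unescape

  raw_cmp = "AT&amp;T"           # the string as it appears in the cmp.txt source
  assert len(raw_cmp) == 8       # ERE offsets/lengths count the escaped text

  ere_stored = escape(raw_cmp)   # "AT&amp;amp;T" -- second level of escaping
  parsed = unescape(ere_stored)  # XML parsing undoes one level
  assert parsed == raw_cmp       # exactly matches the raw CMP content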
6.0 Acknowledgements

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and conclusions
contained herein are those of the authors and should not be interpreted
as necessarily representing the official policies or endorsements, either
expressed or implied, of the Air Force Research Laboratory and Defense
Advanced Research Projects Agency or the U.S. Government.

LDC also wishes to acknowledge the contributions of the following
individuals: Owen Rambow, Claire Cardie, Adam Dalton, Hoa Trang Dang,
Mona Diab, Bonnie Dorr, Louise Guthrie, Magdalena Markowska, Smaranda
Muresan, Vinodkumar Prabhakaran, Samira Shaikh, Tomek Strzalkowski.

7.0 Contacts

Stephanie Strassel - DEFT PI

8.0 Copyright

Portions © 2010 Agence France Presse, © 2013 New York Times, © 2009-2010
The Associated Press, © 2013 Xinhua News Agency, © 2023 Trustees of the
University of Pennsylvania