TAC KBP Belief and Sentiment - Comprehensive Training and Evaluation Data
2016-2017

Authors: Jennifer Tracey, Michael Arrigo, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of
the TAC KBP Belief and Sentiment (BeSt) evaluation track in 2016 and 2017.

Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed
to encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through
its various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned
in natural texts with those appearing in a knowledge base and extract
novel information about entities from a document collection and add it to
a new or existing knowledge base.

The goal of the BeSt track is to provide information about beliefs and
sentiments held by entities toward other entities, as well as toward
events and relations. Given a document collection and a gold standard or
machine-predicted set of labeled entities, relations, and events, a BeSt
system is required to automatically label belief and sentiment about each
possible target (entity, relation, or event), as well as to identify the
entity that holds the belief or sentiment.

More information about the TAC KBP Belief and Sentiment track and other
TAC evaluations can be found on the NIST TAC website:
http://www.nist.gov/tac/.

Additional information about the BeSt evaluations and annotation can be
found in the following paper:

  Jennifer Tracey, Owen Rambow, Michael Arrigo, Claire Cardie, Adam
  Dalton, Hoa Trang Dang, Mona Diab, Bonnie Dorr, Louise Guthrie,
  Magdalena Markowska, Smaranda Muresan, Vinodkumar Prabhakaran, Samira
  Shaikh, Tomek Strzalkowski, Janyce Wiebe. (2022). BeSt: The Belief and
  Sentiment Corpus. In Proceedings of the 13th Edition of the Language
  Resources and Evaluation Conference, Marseille, June 20-25.
  https://www.ldc.upenn.edu/sites/default/files/lrec2022-best-belief-and-sentiment.pdf

This package contains all of the source documents, gold standard entity,
relation, and event (ERE) annotation, and belief and sentiment annotation
used in the 2016 and 2017 BeSt evaluations. The data included in this
package were originally released by LDC to TAC KBP coordinators and
performers under the following ecorpora catalog IDs and titles:

  LDC2016E27:  DEFT English Belief and Sentiment Annotation
  LDC2016E61:  DEFT Chinese Belief and Sentiment Annotation
  LDC2016E62:  DEFT Spanish Belief and Sentiment Annotation
  LDC2016E114: TAC KBP 2016 Belief and Sentiment Evaluation Gold Standard
               Annotation
  LDC2017E80:  TAC KBP 2017 Belief and Sentiment Evaluation Gold Standard
               Annotation

Summary of data included in this package:

  +----------+------+---------------+------------------+
  | Dataset  | Docs | Belief Labels | Sentiment Labels |
  +----------+------+---------------+------------------+
  | training |  505 |         41513 |            80945 |
  | 2016     |  494 |         46160 |            61863 |
  | 2017     |  500 |         54412 |            65753 |
  +----------+------+---------------+------------------+

Note that sentiment labels include the label "none", indicating no
sentiment toward the target entity, relation, or event.

2. Contents

./README.txt
  This file.
./data/{training,2016,2017}
  Files associated with the training data set, the 2016 evaluation, and
  the 2017 evaluation. Under each data set partition, files are arranged
  by language, with subdirectories for source data, ERE annotation, and
  BeSt annotation:

    {cmn,eng,spa}/source
    {cmn,eng,spa}/ere
    {cmn,eng,spa}/annotation

  Note that in the training dataset, some long source documents were
  split into multiple shorter sections for annotation. In such cases, the
  source document appears in the source directory as a single file, but
  the corresponding annotation appears as two separate files with
  character offset ranges added to the source document filename. For
  example, the source document

    SPA_DF_001258_20141021_F0000009Y.xml

  has two corresponding annotation files:

    SPA_DF_001258_20141021_F0000009Y_0-5507.best.xml
    SPA_DF_001258_20141021_F0000009Y_5509-6376.best.xml

  where SPA_DF_001258_20141021_F0000009Y_0-5507.best.xml contains
  annotations on the portion of the source document from character offset
  0 to 5507, and SPA_DF_001258_20141021_F0000009Y_5509-6376.best.xml
  contains annotations on the portion of the document from character
  offset 5509 to 6376.

./docs/deft_anomaly_belief_sentiment_guidelines_v2.3.pdf
  The most up-to-date version of the BeSt annotation guidelines for
  annotating belief and sentiment.

./docs/ere_guidelines/
  The ERE guidelines used to produce the gold standard entities,
  relations, and events that serve as targets of belief and sentiment
  annotation.

./dtd/belief_sentiment.2.1.0.dtd
  Document Type Definition for BeSt annotation XML files.

./dtd/deft_rich_ere.1.2.dtd
  Document Type Definition for 2017 ERE annotation XML files.

./dtd/deft_rich_ere.1.1.dtd
  Document Type Definition for 2016 ERE annotation XML files.

./dtd/kbp_source_df.dtd
  Document Type Definition for discussion forum (DF) thread XML files.

./dtd/kbp_source_newswire.dtd
  Document Type Definition for all newswire (NW) XML files.

3.0 Annotation Task

Belief-sentiment annotation has two components: belief and sentiment.

Belief annotation marks the belief-holder's commitment to a belief in the
occurrence of an event (event-target), the participation of an entity in
an annotated event (entity-target), and/or the existence of a relation
(relation-target). There are four categories of belief annotation:

  Committed Belief (CB) -- the holder believes the proposition with
  certainty

  Non-committed Belief (NCB) -- the holder believes the proposition to be
  possibly, but not necessarily, true

  Reported Belief (ROB) -- the holder reports the belief as belonging to
  someone else, without specifying their own belief or lack of belief in
  the proposition

  Not Applicable (NA) -- the holder expresses some cognitive attitude
  other than belief toward the proposition, such as desire, intention, or
  obligation

In addition to the target and belief type, the holder of the belief is
explicitly indicated (and in the case of reported belief, a chain of
attribution is annotated), and the polarity of the belief is indicated
(positive polarity means belief, at the indicated level of commitment,
that the event/relation/entity-participation did occur, while negative
polarity means belief that it did not occur).

Sentiment is annotated with entities (independent of their role in an
event or relation), relations, and events as targets. Polarity indicates
positive or negative sentiment, and the holder (including chain of
attribution where relevant) is indicated as in belief annotation.
The sarcasm attribute signals whether the polarity of the belief or
sentiment was tagged as the opposite of what a literal reading of the
text (without context) would suggest.

The targets and holders of belief and/or sentiment are entity, relation,
and event mentions annotated in DEFT Rich ERE. Beliefs and sentiments
toward other targets are not annotated. Please see the annotation
guidelines included in the docs directory of this release for additional
details.

4.0 Data Profile and Formats

Summary of data included in this package by language, dataset and genre:

  +----------+----------------+------+---------------+------------------+
  | Language | Dataset/Genre  | Docs | Belief Labels | Sentiment Labels |
  +----------+----------------+------+---------------+------------------+
  | Chinese  | training/DF    |  200 |         13192 |            27982 |
  | Chinese  | 2016/DF        |   82 |          4579 |            10650 |
  | Chinese  | 2016/NW        |   79 |          7604 |             8330 |
  | Chinese  | 2017/DF        |   84 |          7168 |            13494 |
  | Chinese  | 2017/NW        |   83 |         11686 |            10267 |
  | English  | training/DF    |  209 |         13900 |            32605 |
  | English  | training/NW    |   37 |          5015 |             6059 |
  | English  | 2016/DF        |   84 |          6286 |            11762 |
  | English  | 2016/NW        |   81 |         15080 |            13717 |
  | English  | 2017/DF        |   84 |          7600 |            11402 |
  | English  | 2017/NW        |   83 |         12430 |            10968 |
  | Spanish  | training/DF    |   95 |          9406 |            14299 |
  | Spanish  | 2016/DF        |   84 |          4778 |             9213 |
  | Spanish  | 2016/NW        |   84 |          7833 |             8191 |
  | Spanish  | 2017/DF        |   83 |          6549 |            10268 |
  | Spanish  | 2017/NW        |   83 |          8979 |             9354 |
  +----------+----------------+------+---------------+------------------+

4.1 Source Data Formats

Source documents are in several different formats. Newswire documents are
newswire XML. Discussion forum data may be either plain text or XML.

Due to the length of many discussion forum threads, annotation of entire
threads for KBP was impractical. Therefore, LDC selected units we call
Continuous Multi-Posts (CMPs), which consist of a continuous run of posts
from a single thread. The length of a CMP is between 100 and 1000 words.
In the case of a short thread, this may include the entire thread; in the
case of longer threads, the CMP is a truncated version of the thread (and
more than one CMP may come from a single original thread).

Older CMPs are named with a hexadecimal string. These CMPs are present in
the source directories as cmp.txt files. Newer CMPs are named
<threadID>_<beg>-<end>, where "beg" and "end" are offsets for the
beginning and end of the document, respectively. For these documents, the
entire source thread is included as DF XML.

Note that each older-style CMP is an XML fragment. Because of the method
used to extract the text from the original discussion forum thread data,
each CMP file contains residual markup tags and/or character entity
references, but is NOT a full XML document (it is not expected to pass
XML validation), and so should be treated as raw text.
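Both the split training annotation filenames described in Section 2 and
the newer-style CMP names encode a character offset range as a trailing
"_<beg>-<end>". As an illustration only, the following Python sketch
recovers a newer-style CMP's text from its full source thread; the CMP
name is hypothetical, the thread is assumed to be available as
<threadID>.xml, and whether the end offset is inclusive should be
verified against the data:

  import re

  # Hypothetical newer-style CMP name of the form <threadID>_<beg>-<end>
  cmp_name = "ENG_DF_000183_20150407_F0000009A_120-980"
  thread_id, beg, end = re.match(r"(.+)_(\d+)-(\d+)$", cmp_name).groups()

  # Assumption: the full source thread is stored as <threadID>.xml
  with open(thread_id + ".xml", encoding="utf-8") as f:
      thread = f.read()

  # Offsets count every character of the raw file, tags and newlines
  # included; the end offset is treated as inclusive here (an assumption).
  cmp_text = thread[int(beg):int(end) + 1]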

4.1.1 Newswire XML

The following is a generalization of the newswire markup framework:

  <DOC id="{doc_id}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <AUTHOR>
  ...
  </AUTHOR>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>
where the HEADLINE, DATELINE and AUTHOR tags are optional (not always
present), and the TEXT content may or may not include "<P> ... </P>" tags
(depending on whether or not the "doc_type_label" is "story"). All the
newswire files are parseable as XML. See the relevant DTDs for exact
details of newswire markup. Text content within each markup region is a
valid tagging region for annotation. Note that English and Spanish NW
documents sometimes have Chinese author names within the <AUTHOR> tags.
These Chinese author names are tagged as PER names.
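Since the newswire files are valid XML, they can be read with a standard
XML parser. A minimal sketch (the filename is hypothetical; element names
follow the generalized markup above):

  import xml.etree.ElementTree as ET

  # Parse a newswire document (filename hypothetical); HEADLINE, DATELINE
  # and AUTHOR may be absent, and <P> tags only appear in "story" docs.
  root = ET.parse("ENG_NW_001278_20131206_F00011JDX.xml").getroot()
  headline = root.findtext("HEADLINE")           # None if no HEADLINE tag
  paragraphs = [p.text for p in root.iter("P")]

Keep in mind that ERE and BeSt offsets refer to the raw file contents,
tags and newlines included (see Section 4.2), so offset-based processing
should read the file as plain text rather than rely on parsed output.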
4.1.2 Discussion Forum Data

The following is a generalization of the DF thread markup framework, in
which there may also be arbitrarily deep nesting of quote elements, and
other elements may be present (e.g. "<a>...</a>" anchor tags):

  <doc id="{doc_id}">
  <headline>
  ...
  </headline>
  <post author="{author_string}" datetime="{datetime_string}" id="{post_id}">
  ...
  <quote orig_author="{author_string}">
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

As noted above, some of the older DF source data is present as cmp.txt
files and should be treated as plain text. The DTD for DF XML applies
only to the DF source files that are present as XML files. See the
relevant DTD for exact details of discussion forum markup. Text contents
within the <quote> elements are not valid tagging regions for annotation.
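Because quoted material is excluded from annotation, it can be useful to
compute the raw-text spans covered by quote elements, nesting included.
The following sketch does this with a simple tag scan over the raw file,
assuming well-formed (possibly nested) <quote> ... </quote> pairs; the
filename is hypothetical:

  import re

  QUOTE_TAG = re.compile(r"<(/?)quote(?:\s[^>]*)?>")

  def quote_spans(text):
      """Return (start, end) raw-text spans of outermost quote elements."""
      spans, stack = [], []
      for m in QUOTE_TAG.finditer(text):
          if m.group(1):                 # closing tag </quote>
              start = stack.pop()
              if not stack:              # outermost quote just closed
                  spans.append((start, m.end()))
          else:                          # opening tag <quote ...>
              stack.append(m.start())
      return spans

  with open("ENG_DF_000183_20150407_F0000009A.xml", encoding="utf-8") as f:
      print(quote_spans(f.read()))

Scanning the raw text (rather than a parsed tree) keeps the resulting
spans directly comparable to annotation offsets.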
4.2 Rich ERE and BeSt XML

All ERE and BeSt XML files (file names "*.rich_ere.xml", "*.best.xml")
represent stand-off annotation of source files and use offsets to refer
to text extents. The offset gives the start character of the text extent;
offset counting starts from the initial character, character 0, of the
source document and includes newlines as well as all characters
comprising XML-like tags in the source data.
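A minimal sketch of resolving stand-off offsets against a source
document; it assumes, per the Rich ERE DTDs, that entity mention elements
carry "offset" and "length" attributes and a mention_text child, and the
filenames are hypothetical:

  import xml.etree.ElementTree as ET

  # Read the source as raw text so that offsets line up: character 0 is
  # the first character of the file, and tags and newlines all count.
  with open("SPA_DF_001258_20141021_F0000009Y.xml", encoding="utf-8") as f:
      source = f.read()

  ere = ET.parse("SPA_DF_001258_20141021_F0000009Y.rich_ere.xml")
  for mention in ere.iter("entity_mention"):
      beg = int(mention.get("offset"))
      length = int(mention.get("length"))
      extent = source[beg:beg + length]
      # After XML parsing of the ERE file, mention_text should match the
      # raw source slice exactly (see the escaping note in Section 4.3).
      assert extent == mention.findtext("mention_text")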
4.3 Proper Ingesting of XML

Because each DF document is extracted verbatim from source XML files,
certain characters in its content (ampersands, angle brackets, etc.) are
escaped according to the XML specification. The offsets of text extents
are based on treating this escaped text as-is (e.g. "&amp;" in a cmp.txt
file is counted as five characters). Whenever any such string of "raw"
text is included in a .rich_ere.xml file (as the text extent to which an
annotation is applied), a second level of escaping has been applied, so
that XML parsing of the ERE XML file will produce a string that exactly
matches the source text.

For example, a reference to the corporation "AT&T" will appear in the CMP
as "AT&amp;T". ERE annotation on this string would cite a length of 8
characters (not 4), and the string is stored in the ERE XML file as
"AT&amp;amp;T" - when the ERE XML file is parsed as intended, this will
return "AT&amp;T" to match the CMP TXT content.
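The double escaping can be verified with a short round trip; a minimal
sketch using the Python standard library:

  from xml.sax.saxutils import escape, unescape

  raw_cmp = "AT&amp;T"           # the string as it appears in the cmp.txt source
  assert len(raw_cmp) == 8       # ERE offsets/lengths count the escaped text

  ere_stored = escape(raw_cmp)   # "AT&amp;amp;T" -- second level of escaping
  parsed = unescape(ere_stored)  # XML parsing undoes one level
  assert parsed == raw_cmp       # exactly matches the raw CMP content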
6.0 Acknowledgements

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and conclusions
contained herein are those of the authors and should not be interpreted
as necessarily representing the official policies or endorsements, either
expressed or implied, of the Air Force Research Laboratory and Defense
Advanced Research Projects Agency or the U.S. Government.

LDC also wishes to acknowledge the contributions of the following
individuals: Owen Rambow, Claire Cardie, Adam Dalton, Hoa Trang Dang,
Mona Diab, Bonnie Dorr, Louise Guthrie, Magdalena Markowska, Smaranda
Muresan, Vinodkumar Prabhakaran, Samira Shaikh, Tomek Strzalkowski.

7.0 Contacts

Stephanie Strassel - DEFT PI

8.0 Copyright

Portions © 2010 Agence France Presse, © 2013 New York Times, © 2009-2010
The Associated Press, © 2013 Xinhua News Agency, © 2023 Trustees of the
University of Pennsylvania