DEFT English Committed Belief Annotation

Authors: Jennifer Tracey, Michael Arrigo, Stephanie Strassel

Linguistic Data Consortium

1.0 Introduction

DARPA's Deep Exploration and Filtering of Text (DEFT) program aims to
address remaining capability gaps in state-of-the-art natural language
processing technologies related to inference, causal relationships and
anomaly detection. In support of DEFT, LDC provided source data and core
resources for system development, including annotation of "Committed
Belief," which marks the level of commitment displayed by the author to
the truth of the propositions expressed in the text.

This package contains the training data and evaluation set for the
December 2014 pilot evaluation for Committed Belief that was carried out
under the DEFT program. The evaluation is described in Prabhakaran et al.
(2015), which can be accessed here:
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/sem2015-new-dataset-belief-factuality.pdf

2.0 Directory Structure

README.txt - this file
data/
  source/
    test/  - directory containing source text files (*.cmp.txt) for the
             test data
    train/ - directory containing source text files (*.cmp.txt) for the
             training data
  annotation/
    test/  - directory containing annotation files (*.cb.xml) for the
             test data
    train/ - directory containing annotation files (*.cb.xml) for the
             training data
docs/
  DEFT_Committed_Belief_Guidelines_V5.2.pdf - annotation guidelines
  test_files.tab  - document providing mapping from annotation file names
                    to full thread names for files in the test set
  train_files.tab - document providing mapping from annotation file names
                    to full thread names for files in the training set
  dtd/
    deft_committed_belief.1.0.0.dtd - Document Type Definition for
                                      annotation xml files
    deft_committed_belief.1.0.0.rng - Relax NG schema describing the
                                      file format

3.0 Data Profile and Format

Partition  Files  Source_Doc_Words  Annotations
-----------------------------------------------------------------
test         126            100037        17679
train       1091            852686       144023

Committed Belief annotation files have a cb.xml extension and are in XML
format. For a full description of the structure, elements, and attributes
of the Committed Belief annotation files, please see the DTD in the docs
directory of this release.

3.1 Offset Calculation

All XML annotation files for Committed Belief represent standoff
annotation of source .txt files and use offsets to refer to the text
extents of the annotated proposition head words. The offset gives the
start character and length of the text extent. Adding the length to the
start character offset gives the end of the string (the position
immediately following its last character). Offset counting starts at
character 0, the initial character of the given *.cmp.txt source
document, and includes newlines, which are always rendered as a single
line-feed character (Unix-style).

3.2 Source Text Data

All source documents are English Discussion Forum data. Due to the length
of many discussion forum threads, annotation of entire threads for DEFT is
impractical. Therefore, LDC has selected units we call Continuous
Multi-Posts (CMPs), which consist of a continuous run of posts from a
single thread. The length of a CMP is between 100 and 1000 words. In the
case of a short thread, this may include the entire thread; in the case of
longer threads, the CMP is a truncated version of the thread (and it is
possible that more than one CMP comes from a single original thread). CMPs
can be mapped back to the original full thread using the *_files.tab
documents in the docs directory of this package.

Note that each CMP is an XML fragment. Because of the method used to
extract the text from the original discussion forum thread data, each CMP
file contains residual markup tags and/or character entity references, but
is NOT a full XML document (it is not expected to pass XML validation),
and so should be treated as raw text.
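
As an illustration of the offset convention in Section 3.1, the following
minimal Python sketch reads a source file as raw text and slices out an
extent. The helper names read_source and extent, the file name, and the
sample text are all hypothetical. The sketch also assumes that offsets
count Unicode characters of the UTF-8-decoded text, which this README
implies but does not state outright.

```python
import os
import tempfile


def read_source(path):
    """Read a *.cmp.txt file as raw text (never through an XML parser).

    newline="" turns off Python's newline translation, so every character
    in the file counts toward the offsets exactly as stored on disk.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def extent(text, offset, length):
    """Return the extent that starts at `offset` and spans `length` characters."""
    return text[offset:offset + length]


# Hypothetical sample standing in for a real *.cmp.txt file.
sample = "<post>\nI think R&amp;D spending rose.\n</post>\n"

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.cmp.txt")
    with open(path, "w", encoding="utf-8", newline="") as handle:
        handle.write(sample)
    raw = read_source(path)

# In real use the offset and length come from an annotation file; here the
# offset is looked up only to keep the sketch self-contained.
start = raw.index("R&amp;D")
print(start, repr(extent(raw, start, 7)))
```

Because the residual markup and entity references are left untouched, the
seven raw characters of "R&amp;D" all count toward the extent length.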

3.3 XML Annotation Data

Character offsets and lengths for text extents are calculated by treating
the corresponding source data file as "raw" text, without escaping XML
metacharacters. However, the XML format for annotations includes an
"annotation_text" element, which provides the data content of the text
extent associated with each annotation. Here, XML escaping is applied to
angle brackets and ampersand characters as needed, so that when the
annotation XML file is read using an XML parser, the original
"annotation_text" content is returned, matching what is found at the given
offset in the source text file.

For example, the (raw) source text for a given file might contain a string
like "R&amp;D" (because it is a fragment that was extracted as-is from a
larger XML stream). If the raw string "R&amp;D" were part of an annotated
text extent, it would appear in an "<annotation_text>" element of the
corresponding annotation XML file as "R&amp;amp;D", but the "length"
attribute would be 7 (corresponding to the original "R&amp;D" string in
the source text). Therefore, an XML parser is required to correctly ingest
the "annotation_text" elements from the *.cb.xml files in order for
offsets and lengths to be meaningful.

As it turns out, there is only one case of XML metacharacters in the
"annotation_text" elements in this release: the source file
0728e53a1d2e11134fa398ac3ce22220.cmp.txt contains a token with a
double-escaped character entity reference ("strike&amp;quot;"), and this
token appears in the corresponding cb.xml file as "strike&amp;amp;quot;".
When the cb.xml file is read as intended via an XML parser, the result
matches the source text at the given character offsets for this
annotation.

As a general rule, ALWAYS use an XML parser when handling files with the
".xml" file extension, and NEVER use an XML parser on files with the
".txt" extension.
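
The escaping behavior described in Section 3.3 can be checked with a short
Python sketch using the standard library XML parser. Only the
"annotation_text" element and the "length" attribute are named in this
README; the enclosing "annotation" element, the "offset" attribute name,
and both sample strings are assumptions made for illustration. Consult the
DTD for the actual structure.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotation fragment. The wrapper element and the "offset"
# attribute name are assumptions; see the DTD for the real schema.
CB_XML = (
    '<annotation offset="15" length="7">'
    "<annotation_text>R&amp;amp;D</annotation_text>"
    "</annotation>"
)

# Hypothetical raw source text, handled as plain text (not XML-parsed).
SOURCE = "<post>\nI think R&amp;D spending rose.\n</post>\n"

node = ET.fromstring(CB_XML)
offset = int(node.get("offset"))
length = int(node.get("length"))

# The XML parser undoes one level of escaping, which restores the raw
# string exactly as it appears in the source file.
parsed_text = node.findtext("annotation_text")
source_text = SOURCE[offset:offset + length]

print(repr(parsed_text), repr(source_text), parsed_text == source_text)
```

Reading the same element with plain string matching instead of a parser
would yield the 11-character escaped form, which would no longer agree
with the "length" attribute.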

4.0 Annotation Procedure

Committed belief annotation involves exhaustive annotation of each
document to identify all annotatable propositions and record the
speaker/writer's belief in the proposition. There are four categories of
belief annotation:

  Committed Belief (CB)     -- the speaker-writer believes the proposition
                               with certainty
  Non-committed Belief (NCB) -- the speaker-writer believes the
                               proposition to be possibly, but not
                               necessarily, true
  Reported Belief (ROB)     -- the speaker-writer reports the belief as
                               belonging to someone else, without
                               specifying their own belief or lack of
                               belief in the proposition
  Not Applicable (NA)       -- the speaker-writer expresses some cognitive
                               attitude other than belief toward the
                               proposition, such as desire, intention, or
                               obligation

Only the head of the proposition is marked, rather than the full text
extent of the proposition. The head is usually a verb, but may be some
other lexical class, as in the case of copular clauses, where the head of
the proposition is considered to be the head of the structure following
the copula (such as the head noun of a noun phrase). Please see the
annotation guidelines included in the docs directory of this release for
additional details.

5.0 Known Issues

The XML fragments selected for the *.cmp.txt files included a scattering
of improperly coded characters. In particular, 12 of these files contain
"valid" UTF-8 characters in the range U+0085 - U+0097, which are
non-displayable "control" characters in Unicode. These had originally been
CP1252 "smart punctuation" characters (ellipsis, quotes, bullet, dashes),
but instead of being replaced by their respective ASCII or Unicode
equivalents, their CP1252 code points were simply converted directly to
Unicode code points. Strings containing these characters have shown up in
the annotation_text elements in 3 of the annotation XML files.
We have left the characters as-is, because altering the text content would
have required recomputing all the annotation character offsets. Below is a
list of the "invisible" code points that appear in this release:

  U+0085  horizontal ellipsis (U+2026 or ... no single-byte ASCII equivalent)
  U+0091  left single quote   (U+2018 or ` backtick)
  U+0092  right single quote  (U+2019 or ' apostrophe) -- found in annotations
  U+0093  left double quote   (U+201c or " double quote)
  U+0094  right double quote  (U+201d or " double quote)
  U+0095  bullet              (U+2022 or . period)
  U+0096  en dash             (U+2013 or - hyphen)
  U+0097  em dash             (U+2014 or - hyphen) -- found in annotations

6.0 Acknowledgements

This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under agreement
number FA8750-13-2-0045. The U.S. Government is authorized to reproduce
and distribute reprints for Governmental purposes notwithstanding any
copyright notation thereon. The views and conclusions contained herein are
those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or
implied, of Air Force Research Laboratory and Defense Advanced Research
Projects Agency or the U.S. Government.

The authors would like to acknowledge the contributions of Owen Rambow to
this dataset.

7.0 Contact Information

If you have any questions about the data in this release, please contact
the following personnel at LDC.

  Jennifer Tracey    - DEFT Anomaly project manager
  Michael Arrigo     - DEFT Anomaly annotation coordinator
  Stephanie Strassel - DEFT project PI
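
Supplementary note on Section 5.0: for display purposes, the "invisible"
code points can be swapped for their intended Unicode punctuation without
disturbing any offsets, since each substitution is one character for one
character. The Python sketch below is only an illustration; the mapping is
copied from the list in Section 5.0, while the function name display_form
and the sample string are hypothetical.

```python
# Mapping taken from the code-point list in Section 5.0: each mis-converted
# control character and the Unicode punctuation it originally stood for.
C1_TO_PUNCTUATION = {
    0x0085: "\u2026",  # horizontal ellipsis
    0x0091: "\u2018",  # left single quote
    0x0092: "\u2019",  # right single quote
    0x0093: "\u201c",  # left double quote
    0x0094: "\u201d",  # right double quote
    0x0095: "\u2022",  # bullet
    0x0096: "\u2013",  # en dash
    0x0097: "\u2014",  # em dash
}


def display_form(text):
    """Return a copy of `text` with the control characters shown as punctuation.

    Every replacement is exactly one character long, so offsets and lengths
    computed against the original text stay valid for the returned copy.
    """
    return text.translate(C1_TO_PUNCTUATION)


# Hypothetical snippet containing two of the affected code points.
sample = "it\u0092s fine \u0097 really"
shown = display_form(sample)
print(shown)
print(len(sample) == len(shown))
```

The original text should still be used whenever source extents are
compared against parsed annotation_text values, since those elements
contain the unconverted characters.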