DEFT Spanish Committed Belief Annotation

Authors: Jennifer Tracey, Michael Arrigo, Stephanie Strassel

1.0 Introduction

DARPA's Deep Exploration and Filtering of Text (DEFT) program aims to address remaining capability gaps in state-of-the-art natural language processing technologies related to inference, causal relationships and anomaly detection. In support of DEFT, LDC provided source data and core resources for system development, including annotation of "Committed Belief," which marks the level of commitment displayed by the author to the truth of the propositions expressed in the text.

This package contains Committed Belief annotation for Spanish. This data, combined with the data released in the Spanish sample package (LDC2015E109), comprises the full set of planned basic Committed Belief annotation for Spanish.

2.0 Directory Structure

README.txt - this file
data/
    source/ - directory containing source text files (*.cmp.txt)
    annotation/ - directory containing annotation files (*.cb.xml)
docs/
    DEFT_Committed_Belief_Guidelines_Spanish_v1.3.pdf - annotation guidelines
    files.tab - document providing mapping from annotation file names to full thread names
dtd/
    deft_committed_belief.1.0.0.dtd - Document Type Definition for annotation xml files
    deft_committed_belief.1.0.0.rng - Relax NG schema describing the file format

3.0 Data Profile and Format

Genre              Files   Source_Doc_Tokens   Annotations
-----------------------------------------------------------------
discussion forum      87               67395         11073

Committed Belief annotation files have a .cb.xml extension and are in XML format. For a full description of the structure, elements, and attributes of the Committed Belief annotation files, please see the DTD in the dtd directory of this release.

3.1 Offset Calculation

All XML annotation files for Committed Belief represent standoff annotation of source .txt files and use offsets to refer to the text extents of the annotated proposition head words.
The offset gives the start character and length of the text extent. Adding the length to the start character offset gives the end of the extent (the offset just past its last character). Offset counting starts from the initial character (character 0) of the given *.cmp.txt source document and includes newlines, which are always rendered as a single line-feed character (Unix-style).

3.2 Source Text Data

All source documents are Spanish Discussion Forum data. Due to the length of many discussion forum threads, annotation of entire threads for DEFT is impractical. Therefore, LDC has selected units we call Continuous Multi-Posts (CMPs), which consist of a continuous run of posts from a single thread. The length of a CMP is between 100 and 1000 words. In the case of a short thread, this may include the entire thread; in the case of longer threads, the CMP is a truncated version of the thread (and more than one CMP may come from a single original thread). CMPs can be mapped back to the original full thread using the files.tab document in the docs directory of this package.

Note that each CMP is an XML fragment. Because of the method used to extract the text from the original discussion forum thread data, each CMP file contains residual markup tags and/or character entity references, but is NOT a full XML document (it is not expected to pass XML validation), and so should be treated as raw text.

3.3 XML Annotation Data

Character offsets and lengths for text extents are calculated based on treating the corresponding source data file as "raw" text, without escaping XML metacharacters.
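As a minimal sketch of the offset convention from section 3.1, the following Python reads a source file as raw text and slices out an extent from its start offset and length. The function names, the sample string, and the UTF-8 encoding are assumptions made for illustration; none of them is specified by this release.

```python
def read_source(path, encoding="utf-8"):
    """Read a *.cmp.txt file as raw text (the encoding is an assumption)."""
    # newline="" turns off newline translation, so every character,
    # including each line feed, keeps its original offset.
    with open(path, encoding=encoding, newline="") as handle:
        return handle.read()


def extract_extent(raw_text, start, length):
    """Return the extent that begins at 0-based character offset `start`."""
    end = start + length  # exclusive end: one past the last character
    if start < 0 or length < 0 or end > len(raw_text):
        raise ValueError(f"extent {start}+{length} falls outside the text")
    return raw_text[start:end]


# Hypothetical CMP fragment with residual markup and an entity reference.
sample = '<post author="ana">\nCreo que R&amp;D es importante.\n</post>\n'

# Markup and entities are counted character by character, never decoded.
print(extract_extent(sample, 20, 4))   # Creo
print(extract_extent(sample, 29, 7))   # R&amp;D
```

Slicing a Python str counts characters rather than bytes, which is consistent with the character-based offsets described in section 3.1.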
The XML format for annotations also includes an "annotation_text" element, which provides the data content of the text extent associated with each annotation. In that element, XML escaping is applied to angle brackets and ampersand characters as needed, so that when the annotation XML file is read using an XML parser, the original "annotation_text" content is returned, matching what is found at the given offset in the source text file.

For example, the (raw) source text for a given file might contain a string like "R&amp;D" (because it is a fragment that was extracted as-is from a larger XML stream). If the raw string "R&amp;D" were part of an annotated text extent, it would appear in an "<annotation_text>" element of the corresponding annotation XML file as "R&amp;amp;D", but the "length" attribute would be 7 (corresponding to the original "R&amp;D" string in the source text).

Therefore, an XML parser is required to correctly ingest the "annotation_text" elements from the *.cb.xml files in order for offsets and lengths to be meaningful. As a general rule, ALWAYS use an XML parser when handling files with the ".xml" file extension, and NEVER use an XML parser on files with the ".txt" extension.

4.0 Annotation Procedure

Committed belief annotation involves exhaustive annotation of each document to identify all annotatable propositions and record the speaker/writer's belief in the proposition. There are four categories of belief annotation:

Committed Belief (CB) -- the speaker-writer believes the proposition with certainty

Non-committed Belief (NCB) -- the speaker-writer believes the proposition to be possibly, but not necessarily, true

Reported Belief (ROB) -- the speaker-writer reports the belief as belonging to someone else, without specifying their own belief or lack of belief in the proposition

Not Applicable (NA) -- the speaker-writer expresses some cognitive attitude other than belief toward the proposition, such as desire, intention, or obligation
Only the head of the proposition is marked, rather than the full text extent of the proposition. The head is usually a verb, but may belong to some other lexical class, as in the case of copular clauses, where the head of the proposition is considered to be the head of the structure following the copula (such as the head noun of a noun phrase).

Please see the annotation guidelines included in the docs directory of this release for additional details.

5.0 Acknowledgements

This material is based on research sponsored by Air Force Research Laboratory and Defense Advanced Research Projects Agency under agreement number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and distribute reprints for Governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of Air Force Research Laboratory and Defense Advanced Research Projects Agency or the U.S. Government.

The authors would like to acknowledge the contributions of Owen Rambow to this dataset.

6.0 Contact Information

If you have any questions about the data in this release, please contact the following personnel at LDC.

Jennifer Tracey - DEFT Anomaly project manager
Michael Arrigo - DEFT Anomaly annotation coordinator
Stephanie Strassel - DEFT project PI