DEFT Spanish Committed Belief Annotation
Authors: Jennifer Tracey, Michael Arrigo, Stephanie Strassel


1.0 Introduction

DARPA's Deep Exploration and Filtering of Text (DEFT) program aims to
address remaining capability gaps in state-of-the-art natural language
processing technologies related to inference, causal relationships and
anomaly detection.  In support of DEFT, LDC provided source data and
core resources for system development, including annotation of
"Committed Belief," which marks the level of commitment displayed by
the author to the truth of the propositions expressed in the text.

This package contains Committed Belief annotation for Spanish. This data,
combined with the data released in the Spanish sample package (LDC2015E109),
comprises the full set of planned basic Committed Belief annotation for Spanish.

2.0 Directory Structure

README.txt     - this file

data/
  source/      - directory containing source text files (*.cmp.txt)
  annotation/  - directory containing annotation files (*.cb.xml)

docs/
  DEFT_Committed_Belief_Guidelines_Spanish_v1.3.pdf  - annotation guidelines
  files.tab - document providing mapping from annotation file names
              to full thread names
  
dtd/
  deft_committed_belief.1.0.0.dtd - Document Type Definition for
                                    annotation xml files
  deft_committed_belief.1.0.0.rng - RELAX NG schema describing the file format

3.0 Data Profile and Format

Genre              Files   Source_Doc_Tokens   Annotations
----------------------------------------------------------
discussion forum   87      67395               11073

Committed Belief annotation files have a cb.xml extension and are in
XML format. For a full description of the structure, elements, and
attributes of the Committed Belief annotation files, please see the
DTD in the dtd directory of this release.

3.1 Offset Calculation

All XML annotation files for Committed Belief represent standoff
annotation of the source *.cmp.txt files and use offsets to refer to
the text extents of the annotated proposition head words. Each extent
is given as a start character offset and a length. Adding the length
to the start offset gives the offset of the character immediately
after the end of the extent.  Offset counting starts at 0 with the
first character of the given *.cmp.txt file and includes newlines,
which are always rendered as a single line-feed character (unix-style).
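
The offset arithmetic above can be sketched in Python as follows. This
is a minimal illustration, not part of the release: the function names
are hypothetical, and the UTF-8 encoding passed to open() is an
assumption that is not stated in this README.

```python
def read_source(path):
    """Read a *.cmp.txt file as raw text.

    newline="" turns off newline translation so every character is
    counted exactly as stored. UTF-8 is an assumption, not a documented
    property of the release.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def extract_extent(source_text, offset, length):
    """Return the text extent that starts at `offset` and spans `length` characters."""
    end = offset + length  # offset of the character just past the extent
    if offset < 0 or length < 0 or end > len(source_text):
        raise ValueError("extent falls outside the source text")
    return source_text[offset:end]


# Made-up sample text; the newline at index 11 counts as one character.
sample = "Hola mundo.\nCreo que viene.\n"
head_word = extract_extent(sample, 12, 4)
```

With this made-up sample, the extent at offset 12 with length 4 is the
word "Creo" on the second line.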

3.2 Source Text Data

All source documents are Spanish Discussion Forum data.  Due to the
length of many discussion forum threads, annotation of entire threads
for DEFT is impractical.  Therefore, LDC has selected units we call
Continuous Multi-Posts (CMPs), which consist of a continuous run of
posts from a single thread.  The length of a CMP is between 100 and 1000
words.  In the case of a short thread, this may include the entire
thread; in the case of longer threads, the CMP is a truncated version
of the thread (and it is possible that there may be more than one CMP
that comes from a single original thread).  CMPs can be mapped back to
the original full thread using the files.tab document in the docs
directory of this package.
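
Because more than one CMP can come from the same thread, it can be
useful to group CMPs by their original thread. The sketch below
ASSUMES that files.tab has two tab-separated columns (annotation file
name, then full thread name) and no header row; the actual layout
should be checked against the file itself. The function name and the
sample rows are hypothetical.

```python
from collections import defaultdict


def group_cmps_by_thread(lines):
    """Map each full thread name to the list of file names drawn from it.

    Assumes two tab-separated columns per line: file name, thread name.
    This column layout is an assumption, not documented in this README.
    """
    groups = defaultdict(list)
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        file_name, thread_name = fields[0], fields[1]
        groups[thread_name].append(file_name)
    return dict(groups)


# Made-up rows for illustration only; real names will differ.
example_rows = [
    "cmp_0001.cb.xml\tthread_A\n",
    "cmp_0002.cb.xml\tthread_A\n",
    "cmp_0003.cb.xml\tthread_B\n",
]
threads = group_cmps_by_thread(example_rows)
```

Grouping this way makes it easy to keep all CMPs from one thread on the
same side of a train/test split.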

Note that each CMP is an XML fragment.  Because of the method used to
extract the text from the original discussion forum thread data, each
CMP file contains residual markup tags and/or character entity
references, but is NOT a full XML document (it is not expected to pass
XML validation), and so should be treated as raw text.

3.3 XML Annotation Data

Character offsets and lengths for text extents are calculated based on
treating the corresponding source data file as "raw" text, without
escaping XML metacharacters.  However, the XML format for annotations
includes an "annotation_text" element, which provides the data content
of the text extent associated with each annotation, and here, XML
escaping is applied to angle-brackets and ampersand characters as
needed, so that when the annotation xml file is read using an XML
parser, the original "annotation_text" content will be returned,
matching what is found at the given offset in the source text file.

For example, the (raw) source text for a given file might contain a
string like "R&amp;D" (because it is a fragment that was extracted
as-is from a larger XML stream).  If the raw string "R&amp;D" were
part of an annotated text extent, it would appear in an
"<annotation_text>" element of the corresponding annotation XML file
as "R&amp;amp;D", but the "length" attribute would be 7 (corresponding
to the original "R&amp;D" string in the source text). 

Therefore, an XML parser is required to correctly ingest the
"annotation_text" elements from the *.cb.xml files in order for
offsets and lengths to be meaningful.

As a general rule, ALWAYS use an XML parser when handling files with
the ".xml" file extension, and NEVER use an XML parser on files with
the ".txt" extension.
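
The "R&amp;D" case can be checked with a short Python sketch. Only the
"annotation_text" element and the "length" attribute are named in this
README; the enclosing "annotation" element and the "offset" attribute
name used below are hypothetical stand-ins, so consult the DTD for the
real structure. The source string is also made up.

```python
import xml.etree.ElementTree as ET

# Raw source text, read as plain text: "&amp;" is five literal characters.
SOURCE_TEXT = "Trabaja en R&amp;D desde 2010.\n"

# Hypothetical annotation fragment; the element and attribute names other
# than "annotation_text" and "length" are assumptions for illustration.
ANNOTATION_XML = (
    '<annotation offset="11" length="7">'
    "<annotation_text>R&amp;amp;D</annotation_text>"
    "</annotation>"
)


def parsed_and_raw(xml_string, source_text):
    """Return (annotation_text as parsed, slice of the raw source text)."""
    element = ET.fromstring(xml_string)
    offset = int(element.get("offset"))
    length = int(element.get("length"))
    parsed = element.findtext("annotation_text")  # parser undoes one level of escaping
    raw = source_text[offset:offset + length]
    return parsed, raw


parsed_text, raw_text = parsed_and_raw(ANNOTATION_XML, SOURCE_TEXT)
```

Here the XML parser returns the 7-character string "R&amp;D" for the
annotation text, which is identical to the slice taken from the raw
source text. Reading the annotation file without an XML parser would
instead yield the 11-character string "R&amp;amp;D", which does not
match the stated length.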


4.0 Annotation Procedure

Committed belief annotation involves exhaustive annotation of each
document to identify all annotatable propositions and record the
speaker-writer's belief in the proposition. There are four categories
of belief annotation:

Committed Belief (CB) -- the speaker-writer believes the proposition
with certainty

Non-committed Belief (NCB) -- the speaker-writer believes the
proposition to be possibly, but not necessarily, true

Reported Belief (ROB) -- the speaker-writer reports the belief as
belonging to someone else, without specifying their own belief or lack
of belief in the proposition

Not Applicable (NA) -- the speaker-writer expresses some cognitive
attitude other than belief toward the proposition, such as desire,
intention, or obligation.

Only the head of the proposition is marked, rather than the full text
extent of the proposition. The head is usually a verb, but may be some
other lexical class, as in the case of copular clauses, where the head
of the proposition is considered to be the head of the structure
following the copula (such as the head noun of a noun phrase).

Please see the annotation guidelines included in the docs directory of
this release for additional details.


5.0 Acknowledgements

This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized
to reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and
conclusions contained herein are those of the authors and should not
be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of Air Force Research
Laboratory and Defense Advanced Research Projects Agency or the
U.S. Government.

The authors would like to acknowledge the contributions of Owen Rambow
to this dataset.

6.0 Contact Information

If you have any questions about the data in this release, please
contact the following personnel at LDC.

Jennifer Tracey <garjen@ldc.upenn.edu>
                                        -DEFT Anomaly project manager
Michael Arrigo <marrigo@ldc.upenn.edu>
                                        -DEFT Anomaly annotation coordinator
Stephanie Strassel <strassel@ldc.upenn.edu>
                                        -DEFT project PI