Title: DEFT Chinese and English Light and Rich ERE Parallel Annotation
Authors: Song Chen, Justin Mott, Ann Bies, Stephanie Strassel
Catalog ID: LDC2026T04

1. Introduction

This package contains a set of Chinese source documents with Light and 
Rich ERE annotation, along with English translations also with Light and Rich
ERE annotation. This data was annotated for DARPA's Deep Exploration and
Filtering of Text (DEFT) program. Annotation on the Chinese source and
the English translation was performed independently. The data contained in
this package was previously distributed to the DEFT Program as LDC2014E114
and LDC2015E78.

The DEFT program aimed to address remaining capability gaps in
state-of-the-art natural language processing technologies related to
inference, causal relationships and anomaly detection (DARPA, 2012). ERE
annotation is a core resource created by LDC under DEFT to provide training
data for developing systems that detect and coreference entities, relations
and events. The task
evolved over the course of the program, from a fairly lightweight treatment of 
entities, relations and events similar to ACE (LDC, 2006; Aguilar et al., 2014) 
to a richer representation of phenomena of interest to the program (Song et al., 
2015).

This release contains 179 Chinese documents and English translations annotated
following the Light ERE annotation guidelines, of which 171 document pairs 
were also annotated following the Rich ERE annotation guidelines. Additional 
annotation following Rich ERE guidelines was added to existing Light ERE 
annotation for these 171 document pairs. For further information about data and
annotation of this package, refer to Mott et al. (2016).

Source documents are in TXT format, and the annotation is in XML format.
Please refer to section 4 for details.


2. Contents

./dtd/
  deft_light_ere.2.0.0.dtd -- DTD for Light ERE XML annotation files
  deft_rich_ere.1.1.dtd --  DTD for Rich ERE XML annotation files

./docs/
  README.txt (this file)
  chinese_light_ere_stats.tab
  english_light_ere_stats.tab
                --Light ERE annotation statistics by document
  english_Rich_ere_stats.tab
  chinese_Rich_ere_stats.tab
                --Rich ERE annotation statistics by document
  parallel_mps.tab
                --Mapping between Chinese file ID and English file ID

./docs/guidelines
  DEFT_LIGHT_ERE_Chinese_Annotation_Guidelines_Entities_V1.1.pdf
  DEFT_LIGHT_ERE_Chinese_Annotation_Guidelines_Events_V1.0.pdf
  DEFT_LIGHT_ERE_Chinese_Annotation_Guidelines_Relations_V1.1.pdf
                --Chinese Light ERE annotation guidelines

  DEFT_LIGHT_ERE_English_Annotation_Guidelines_Entities_V1.8.pdf
  DEFT_LIGHT_ERE_English_Annotation_Guidelines_Events_V1.6.pdf
  DEFT_LIGHT_ERE_English_Annotation_Guidelines_Relations_V1.4.pdf
                --English Light ERE annotation guidelines

  DEFT_RICH_ERE_Chinese_Annotation_Guidelines_ArgumentFiller_V3.pdf
  DEFT_RICH_ERE_Chinese_Annotation_Guidelines_Entities_V3.pdf
  DEFT_RICH_ERE_Chinese_Annotation_Guidelines_Events_v3.pdf
  DEFT_RICH_ERE_Chinese_Annotation_Guidelines_Relations_V3.pdf
                --Chinese Rich ERE annotation guidelines

  DEFT_RICH_ERE_English_Annotation_Guidelines_ArgumentFiller_V2.3.pdf
  DEFT_RICH_ERE_English_Annotation_Guidelines_Entities_V2.4.pdf
  DEFT_RICH_ERE_English_Annotation_Guidelines_Events_V3.0.pdf
  DEFT_RICH_ERE_English_Annotation_Guidelines_Relations_V4.5.pdf
                --English Rich ERE annotation guidelines

./data/cmn/source/
  This directory contains all of the source documents in TXT format
  used for Chinese ERE annotation.

./data/eng/translation/
  This directory contains all of the English translations of Chinese
  source documents used for English annotation.

./data/cmn/light_ere
  This directory contains the Chinese Light ERE annotation files.

./data/cmn/rich_ere
  This directory contains the Chinese Rich ERE annotation files.

./data/eng/light_ere
  This directory contains the English Light ERE annotation files.

./data/eng/rich_ere
  This directory contains the English Rich ERE annotation files.

  Note: The IDs for each annotation (entity, entity mention, relation,
  filler, event hopper, event mention) are unique to each document, not to
  the entire corpus. Fillers are entity-like annotations that function as
  relation or event arguments.


3. Data Profile and Format

Entity / Relation / Event annotation volumes

Lang  ERE    Files  Characters  Words    Entities (mentions)  Fillers  Relations  Hoppers (mentions)
----------------------------------------------------------------------------------------------------
CMN   Light  179    135,075     90,050   5,338 (13,113)       N/A      1,707      415 (508)
ENG   Light  179    N/A         107,493  3,789 (12,090)       N/A      1,289      319 (382)
CMN   Rich   171    127,458     84,972   5,974 (14,102)       607      1,946      1,138 (1,491)
ENG   Rich   171    N/A         101,191  5,873 (16,055)       906      2,092      2,285 (2,933)
----------------------------------------------------------------------------------------------------

ERE annotation files have a .light_ere.xml or .rich_ere.xml extension, and
are in XML format. Word counts for Chinese are based on 1 word=1.5 Chinese 
characters.
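
As a quick check of this convention, the Chinese word counts in the table
above can be reproduced from the character counts. The following is an
illustrative Python sketch only; the function name estimate_chinese_words is
an assumption made for this example and is not part of the release.

```python
def estimate_chinese_words(char_count: int) -> int:
    """Estimate a Chinese word count using 1 word = 1.5 characters."""
    return round(char_count / 1.5)


# Character counts taken from the CMN rows of the table above.
print(estimate_chinese_words(135075))  # prints 90050 (CMN Light)
print(estimate_chinese_words(127458))  # prints 84972 (CMN Rich)
```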

For a full description of the elements, attributes, and structure of the ERE
annotation files, please see the DTDs in the ./dtd/ directory of this release.


4. Using the Data

All source documents are in the Discussion Forum (DF) genre. To allow for
parallel annotation, material previously translated as part of DARPA's
Broad Operational Language Translation (BOLT) program was used here.
Translated text in BOLT was in a multi-post (MP) format; within each
DF thread multiple posts were selected. These posts were not
necessarily contiguous.

The file ./docs/parallel_mps.tab lists where the MPs were drawn from.
Note that an MP is an XML fragment rather than a full XML document; it
is intended to be used as raw text, and uses UNIX-style line
termination (line-feed only).

4.1 Offset Calculation

All ERE XML files (file names "*_ere.xml") represent stand-off annotation of
source files (file names "*.mp.txt") and use offsets to refer to the text
extents.

The entity_mention, relation_mention, and event_mention XML elements all
have attributes or contain sub-elements which use character offsets to
identify text extents in the source. The offset gives the start character
of the text extent; offset counting starts from the initial character,
character 0, of the source document (.mp.txt file) and includes newlines as
well as all characters comprising XML-like tags in the source data.

When the text extent being annotated contains any sort of whitespace,
including tabs, line feeds and/or carriage returns, the text presented in
the corresponding ERE XML annotation element has all strings of one or more
whitespace characters normalized to a single ASCII space (0x20).
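
The offset and whitespace conventions described in this section can be
sketched in Python as follows. This is a minimal illustration under stated
assumptions, not LDC tooling: it assumes that source files are UTF-8 encoded,
that offsets count Unicode characters, and that the regular expression class
\s is an acceptable stand-in for "any sort of whitespace". The function names
and the sample string are hypothetical.

```python
import re


def read_source(path: str) -> str:
    """Read a source file without newline translation.

    newline="" keeps every line-ending character in place, so character
    offsets computed on the raw file remain valid. UTF-8 is an assumption.
    """
    with open(path, encoding="utf-8", newline="") as f:
        return f.read()


def extract_extent(source: str, offset: int, length: int) -> str:
    """Return the raw text extent; offsets start at character 0."""
    return source[offset:offset + length]


def normalize_whitespace(text: str) -> str:
    """Collapse each run of whitespace into a single ASCII space (0x20)."""
    return re.sub(r"\s+", " ", text)


# Hypothetical source text: tags and newlines all count toward offsets.
sample = '<post author="a">\nNew\t York\n City</post>'
offset = sample.index("New")
raw = extract_extent(sample, offset, 15)

print(repr(raw))                  # raw extent, whitespace intact
print(normalize_whitespace(raw))  # form expected in the ERE XML text
```

When reading real files, read_source would be called on a source file and
the offset and length values would come from the ERE XML attributes.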

4.2 Proper Ingesting of XML

Character offsets and lengths for text extents in ERE XML are calculated
based on "raw" multi-post data, where original (XML-fragment)
meta-characters are escaped. For example, a reference to the corporation "AT&T"
will appear in MP as "AT&amp;T". ERE annotation on this string will cite a
length of 8 characters (not 4). This string is stored in the ERE XML file
as "AT&amp;amp;T" because of XML escaping, but returns to "AT&amp;T" when
the ERE XML file is read using an XML parser, as intended.

With regard to whitespace characters in annotated text extents, the ERE XML
offset and length are again based on the "raw" MP data and will reflect the
original quantity of whitespace characters. But in the text string provided
in the ERE XML annotation element, whitespace has been normalized, as
described in 4.1 above, and may be shorter.
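
The "AT&T" case can be reproduced with a standard XML parser. The sketch
below uses Python's xml.etree.ElementTree; the element name mention_text and
the sample sentence are assumptions made for illustration (consult the DTDs
for the actual element names).

```python
import xml.etree.ElementTree as ET

# Hypothetical raw MP text: the ampersand is already escaped in the source.
raw_mp = "I called AT&amp;T today."
offset = raw_mp.index("AT&amp;T")
length = len("AT&amp;T")  # 8 characters in the raw data, not 4

# Hypothetical ERE XML fragment: the raw string is escaped once more
# when it is stored in the annotation file.
ere_fragment = "<mention_text>AT&amp;amp;T</mention_text>"

# Parsing with an XML parser undoes exactly one level of escaping.
parsed_text = ET.fromstring(ere_fragment).text

print(length)                                             # prints 8
print(parsed_text == raw_mp[offset:offset + length])      # prints True
```

Reading the ERE XML file as plain text instead of parsing it would leave the
doubly escaped form in place, and the comparison with the raw source would
fail.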


5. Light ERE and Rich ERE Annotation

5.1 Data Selection

All source data and English translation in this release were drawn from
LDC2017T05 (BOLT Chinese Discussion Forum Parallel Training Data). Documents
were vetted for annotation suitability. Documents that had previously
received other types of annotation (Chinese Treebank, English Parallel
Chinese Treebank, Word Alignment) were prioritized. Documents containing
sensitive information or no taggable content were deemed not suitable for
annotation.

5.2 Annotation

LDC annotators performed exhaustive ERE annotation independently on the 
Chinese source and the English translation. Annotation consisted of tagging 
all mentions of a set of targeted entities, relations and events, as well 
as marking coreference for entities and events.

Light ERE annotation labeled entity mentions for the target set of entity
types. Light ERE also labeled the target set of relation and event types
between and among those entities. Please refer to the annotation guidelines
for the target set of entity, relation and event types. Multiple mentions 
of the same entity or event within a document were coreferenced manually by 
annotators. Relation coreference was an automated process and was not 
manually performed by annotators (see section 5.3 for how relation 
coreference is produced).

In contrast to Light ERE annotation, Rich ERE annotation primarily expanded
types and taggability in the Entities, Relations, and Events annotation
tasks and replaced strict Event Coreference with a more loosely defined
Event Hopper annotation (Song et al., 2015; Mott et al., 2016).

Rich ERE annotation for this data was performed on top of completed Light
ERE annotation. Rich ERE annotators first performed exhaustive tagging and
coreference of valid entities in a provided source document. Afterwards,
valid relations from the document were annotated and entity or filler values
supplied for the relation arguments. Lastly, valid event mentions and event
hoppers were annotated, entity or filler values were supplied for event
arguments, and hopper-style coreference of event mentions was added. Just
as with Light ERE annotation, relation coreference was an automated process and
was not manually performed by annotators (see section 5.3 for how relation
coreference is produced).

For more information on the Light and Rich ERE annotation processes, please
refer to the annotation guidelines in the ./docs/guidelines directory.

5.3 ERE Annotation Workflow

Each document was annotated for all ERE tasks in a first pass (1P) by one
annotator and then second-pass annotated (2P) by a senior annotator or team
leader.  For 1P, a single annotator completed all annotation (entities, relations
and events) for a file.  For 2P, a more experienced senior annotator reviewed
the first-pass annotations and corrected any errors they found. After 2P,
additional corpus-wide quality control (QC) checks were conducted on
completed 2P data by the team leader and select senior annotators. Refer to
section 5.4 for detailed QC procedures.

The full annotation process for ERE annotation is represented below:

              1P: entities
                  relations
                  events
                  |
                  V
              2P: entities
                  relations
                  events
                  |
                  V
              QC: entities
                  relations
                  events

Coreference of relations was done automatically. Relation mentions that meet
the following criteria were processed after annotation as coreferenced:

        -- They have the same type and subtype
        -- They have the same realis attribute
        -- If relations are asymmetric, relation1.arg1 == relation2.arg1
           and relation1.arg2 == relation2.arg2
        -- If relations are symmetric, either
             (relation1.arg1 == relation2.arg1
              and relation1.arg2 == relation2.arg2)
             or (relation1.arg1 == relation2.arg2
              and relation1.arg2 == relation2.arg1)
        -- The following three relation type-subtypes are symmetric:
                type            subtype
                personalsocial business
                personalsocial family
                personalsocial unspecified

(Relation mentions which have a filler as an argument were treated as
singletons, because fillers are not coreferenced.)
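
The criteria above can be sketched as a grouping procedure in Python. This
is an illustration of the stated rules, not the script LDC used. The class
and function names are hypothetical, comparing arguments by entity ID is an
assumption, and the "physical"/"locatednear" type values and the realis value
"true" in the usage example are illustrative only.

```python
from collections import defaultdict
from dataclasses import dataclass

# The three symmetric type-subtype pairs listed above.
SYMMETRIC = {
    ("personalsocial", "business"),
    ("personalsocial", "family"),
    ("personalsocial", "unspecified"),
}


@dataclass(frozen=True)
class RelationMention:
    mention_id: str
    rel_type: str
    subtype: str
    realis: str
    arg1: str  # assumed to be an entity ID
    arg2: str  # assumed to be an entity ID
    has_filler_arg: bool = False


def coref_key(m: RelationMention) -> tuple:
    """Build a key that is equal for mentions meeting the criteria."""
    args = (m.arg1, m.arg2)
    if (m.rel_type, m.subtype) in SYMMETRIC:
        # Argument order does not matter for symmetric relations.
        args = tuple(sorted(args))
    return (m.rel_type, m.subtype, m.realis, args)


def group_relation_mentions(mentions):
    """Return lists of mention IDs; each list is one coreferenced relation."""
    groups = defaultdict(list)
    singletons = []
    for m in mentions:
        if m.has_filler_arg:
            # Fillers are not coreferenced, so the mention stays alone.
            singletons.append([m.mention_id])
        else:
            groups[coref_key(m)].append(m.mention_id)
    return list(groups.values()) + singletons


mentions = [
    RelationMention("r1", "personalsocial", "family", "true", "e1", "e2"),
    RelationMention("r2", "personalsocial", "family", "true", "e2", "e1"),
    RelationMention("r3", "physical", "locatednear", "true", "e1", "e3"),
    RelationMention("r4", "physical", "locatednear", "true", "e3", "e1"),
    RelationMention("r5", "physical", "locatednear", "true", "e1", "f1", True),
]
print(group_relation_mentions(mentions))
```

In this example r1 and r2 fall into one group because the relation is
symmetric, while r3 and r4 remain separate because their argument order
differs, and r5 is a singleton because it has a filler argument.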

Sometimes the discussion forum documents contain quoted text either from an
external source or from the same document. The quoted text was annotated
if it contained taggable entities, relations or events.

5.4 Quality Control

After manual quality control on individual files, LDC also conducted a
corpus-wide scan of each language that included:

    -- Manual scan of all entity mentions for outliers (the same text
       strings have different typing)
    -- Manual scan of heads of all NOM (nominal) mentions to correct errors
       or misses
    -- Manual scan of all NAM (name) mentions having different entity type
       values in different parts of the corpus
    -- Manual scan of event triggers to review event type and subtype values
    -- Scan all time fillers to make sure that all time fillers are
       normalized
    -- Scan all relation arguments to make sure that only allowable entity
       types were annotated as arguments
    -- Scan all relations to make sure that there are no duplicate relation
       mentions (i.e. relation arguments that refer to the same entity mentions)
    -- Scan all event arguments to make sure that only allowable entity
       types were annotated as arguments
    -- Scan all event hoppers to make sure that event mentions in the same
       hoppers have the same type and subtype value (except for mentions of the
       contact and transaction types, which only need to agree on type level)

All identified outliers were then manually reviewed and corrected if needed.
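
As an illustration of the last check in the list above, the following Python
sketch tests whether the event mentions in one hopper agree on type and
subtype, with the type-level exception for contact and transaction. It is a
sketch of the stated rule, not LDC's QC script; the function name, the
representation of mentions as (type, subtype) pairs, and the subtype values
in the usage lines are assumptions.

```python
# Event types whose mentions only need to agree at the type level.
TYPE_LEVEL_ONLY = {"contact", "transaction"}


def hopper_is_consistent(mentions) -> bool:
    """mentions: list of (type, subtype) pairs for one event hopper."""
    if not mentions:
        return True  # nothing to compare
    types = {event_type for event_type, _ in mentions}
    if len(types) != 1:
        return False  # all mentions must share one event type
    (only_type,) = types
    if only_type in TYPE_LEVEL_ONLY:
        return True  # subtypes may differ for these types
    subtypes = {subtype for _, subtype in mentions}
    return len(subtypes) == 1


# Illustrative values only.
print(hopper_is_consistent([("conflict", "attack"), ("conflict", "attack")]))
print(hopper_is_consistent([("contact", "meet"), ("contact", "broadcast")]))
print(hopper_is_consistent([("conflict", "attack"), ("conflict", "demonstrate")]))
```

The first two calls print True and the third prints False, since only the
contact and transaction types tolerate differing subtypes.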

These manual QC checks were done in parallel with automatic validation
checks of the data during extraction and preparation of annotation files for
delivery.

In addition, some cross-lingual QC was performed at the conclusion of
Light ERE annotation. Pairs of files with substantially different
inventories of annotated items were flagged, reviewed by bilingual
annotators and corrected, if needed, by an annotator from the
appropriate team. No additional QC was performed across languages
following Rich ERE annotation.


6. Data Validation

For all text extent references, it was verified that the combination of
docid, offset, and length was a valid reference to a string identical to
the content of the XML text extent element.

 - Verified trigger text extent references valid
 - Verified arg text extent references valid
 - Verified entity mention text extent references valid
 - Verified filler text extent references valid
 - Verified that each ERE kit in the delivery included annotation
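
Users who want to repeat the text extent check on their own copy of the data
could proceed along the following lines. This Python sketch is not the
validation code used by LDC. The ExtentRef record, the function name and the
sample data are hypothetical, and the text field is assumed to hold the
string as returned by an XML parser, with whitespace already normalized as
described in section 4.1.

```python
import re
from typing import NamedTuple


class ExtentRef(NamedTuple):
    docid: str
    offset: int
    length: int
    text: str  # extent text as read from the ERE XML by an XML parser


def is_valid_reference(ref: ExtentRef, sources: dict) -> bool:
    """Check that docid, offset and length point at a matching string."""
    source = sources.get(ref.docid)
    if source is None or ref.offset < 0 or ref.length <= 0:
        return False
    end = ref.offset + ref.length
    if end > len(source):
        return False  # reference runs past the end of the document
    raw = source[ref.offset:end]
    # Compare after collapsing whitespace runs to single spaces.
    return re.sub(r"\s+", " ", raw) == ref.text


# Hypothetical raw source text keyed by document ID.
sources = {"doc1": "He met\n  AT&amp;T staff."}
raw_extent = "met\n  AT&amp;T"
good = ExtentRef("doc1", sources["doc1"].index("met"), len(raw_extent),
                 "met AT&amp;T")
bad = ExtentRef("doc1", 0, 6, "He saw")

print(is_valid_reference(good, sources))  # prints True
print(is_valid_reference(bad, sources))   # prints False
```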

Checks were also performed to identify and correct systematic errors that
occurred for certain event subtypes and argument types.

7. Acknowledgments

This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under agreement
number FA8750-13-2-0045. The U.S. Government is authorized to reproduce and
distribute reprints for Governmental purposes notwithstanding any copyright
notation thereon. The views and conclusions contained herein are those of
the authors and should not be interpreted as necessarily representing the
official policies or endorsements, either expressed or implied, of Air Force
Research Laboratory and Defense Advanced Research Projects Agency or the
U.S. Government.

The authors acknowledge the following contributors to this data set:

David Graff
Jonathan Wright (LDC)
Tom Riese


8. References

Jacqueline Aguilar, Charley Beller, Paul McNamee, Benjamin Van Durme,
Stephanie Strassel, Zhiyi Song, Joe Ellis.  Comparison of the Events and
Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards.
52nd Annual Meeting of the Association for Computational Linguistics,
Baltimore, 2nd Workshop on Events: Definition, Detection, Coreference,
and Representation. 2014.

DARPA. Broad Agency Announcement: Deep Exploration and Filtering
of Text (DEFT). Defense Advanced Research Projects Agency,
DARPA-BAA-12-47. 2012.

Justin Mott, Zhiyi Song, Ann Bies, Stephanie Strassel. Parallel
Chinese-English Entities, Relations and Events Corpora. LREC 2016:
10th Edition of the Language Resources and Evaluation Conference,
Portoroz, May 23-28. 2016.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe Ellis,
Jonathan Wright, Seth Kulick, Neville Ryant, Xiaoyi Ma. From Light to
Rich ERE: Annotation of Entities, Relations, and Events. Conference
of the North American Chapter of the Association for Computational Linguistics
- Human Language Technologies (NAACL HLT), 3rd Workshop on EVENTS:
Definition, Detection, Coreference, and Representation. 2015.

Christopher Walker, Stephanie Strassel, Julie Medero, Kazuaki Maeda. ACE
2005 Multilingual Training Corpus LDC2006T06. Web Download. Philadelphia:
Linguistic Data Consortium, 2006.


9. Contact Information

  Stephanie Strassel <strassel@ldc.upenn.edu> PI
  Jonathan Wright <jdwright@ldc.upenn.edu> Technical oversight
  Song Chen <zhiyi@ldc.upenn.edu> ERE annotation project manager


10. Copyright

© 2014-2015 Trustees of the University of Pennsylvania

-------------------

README Update Log
  Created: Song Chen, September 19, 2016
  Updated: Song Chen, February 17, 2017
  Updated: Song Chen, January 12, 2018
  Updated: Song Chen, June 27, 2019