Title: DEFT Chinese Light and Rich ERE Annotation
Authors: Song Chen, Justin Mott, Stephanie Strassel
Catalog ID: LDC2020T19

Linguistic Data Consortium
August 17, 2020

1. Introduction

This package contains the set of Chinese Light and Rich Entities,
Relations and Events (ERE) annotation for DARPA's Deep Exploration and
Filtering of Text (DEFT) program. The corpus consists of Chinese (CMN)
text data in the Discussion Forum genre annotated for entities, relations
and events by the Linguistic Data Consortium (LDC). This data was
previously distributed as e-corpora (LDC2014E113, LDC2015E105) to DEFT
performers.

The DEFT program aimed to address remaining capability gaps in
state-of-the-art natural language processing technologies related to
inference, causal relationships and anomaly detection (DARPA, 2012). ERE
annotation is a core resource created by LDC under DEFT to provide
training data for developing systems in detecting and coreferencing
entities, relations and events. The task evolved over the course of the
program, from a fairly lightweight treatment of entities, relations and
events similar to ACE (LDC, 2006; Aguilar et al., 2014) to a richer
representation of phenomena of interest to the program (Song et al.,
2015).

This release contains 157 Chinese documents annotated following the Light
ERE annotation guidelines, of which 149 documents were also annotated
following the Rich ERE annotation guidelines. Additional annotation
following Rich ERE guidelines was added to existing Light ERE annotation
for these 149 documents. Source documents are in TXT format, and the
annotation is in XML format. Please refer to section 4 for details.

2. Contents

./docs/README.txt
    This file

./dtd/
    deft_light_ere.2.0.0.dtd -- DTD for Light ERE XML annotation files
    deft_rich_ere.1.1.dtd    -- DTD for Rich ERE XML annotation files

./docs/
    chinese_light_ere_stats.tab -- Light ERE annotation statistics by document
    chinese_rich_ere_stats.tab  -- Rich ERE annotation statistics by document

./docs/guidelines
    DEFT_LIGHT_ERE_Chinese_Annotation_Guidelines_Entities_V1.1.pdf
    DEFT_LIGHT_ERE_Chinese_Annotation_Guidelines_Events_V1.0.pdf
    DEFT_LIGHT_ERE_Chinese_Annotation_Guidelines_Relations_V1.1.pdf
      -- Chinese Light ERE annotation guidelines

    DEFT_RICH_ERE_Chinese_Annotation_Guidelines_Entities_V3.pdf
    DEFT_RICH_ERE_Chinese_Annotation_Guidelines_Events_V3.pdf
    DEFT_RICH_ERE_Chinese_Annotation_Guidelines_Relations_V3.pdf
    DEFT_RICH_ERE_Chinese_Annotation_Guidelines_ArgumentFiller_V3.pdf
      -- Chinese Rich ERE annotation guidelines

./data/source/
    This directory contains all of the source documents in TXT format
    used for Chinese ERE annotation.

./data/light_ere
    This directory contains the Chinese Light ERE annotation files.

./data/rich_ere
    This directory contains the Chinese Rich ERE annotation files.

Note: The IDs for each annotation (entity, entity mention, relation,
filler, event hopper, event mention) are unique to each document, not to
the entire corpus. Fillers are entity-like annotations that function as
relation or event arguments.

3. Data Profile and Format

Entity / Relation / Event annotation volumes

ERE    Files  Characters  Entities(mentions)  Fillers  Relations  Event Hoppers(mentions)
-----------------------------------------------------------------------------------------
Light    157     164,038      6,444 (16,997)      N/A      2,401              817 (1,107)
Rich     149     134,745      6,924 (16,471)      792      2,298            1,736 (2,360)
-----------------------------------------------------------------------------------------

ERE annotation files have a .light_ere.xml or .rich_ere.xml extension,
and are in XML format.
For a full description of the elements, attributes, and structure of the
ERE annotation files, please see the DTDs in the ./dtd directory of this
release.

4. Using the Data

All source documents are in the Discussion Forum genre. Since the source
Discussion Forum threads are very long, the threads were further split
into continuous multi-post (CMP) units for ERE annotation. Note that a
CMP unit is an XML fragment rather than a full XML document; it is
intended to be used as raw text, and uses UNIX-style line termination
(line-feed only).

4.1 Offset Calculation

All ERE XML files (file names "*_ere.xml") represent stand-off annotation
of source files (file names "*.mp.txt") and use offsets to refer to the
text extents.

The entity_mention, relation_mention, and event_mention XML elements all
have attributes or contain sub-elements which use character offsets to
identify text extents in the source. The offset gives the start character
of the text extent; offset counting starts from the initial character,
character 0, of the source document (.mp.txt file) and includes newlines
as well as all characters comprising XML-like tags in the source data.

When the text extent being annotated contains any sort of whitespace,
including tab, line feed and/or carriage return, the text presented in
the corresponding ERE XML annotation element has all strings of one or
more whitespace characters normalized to a single ASCII space (0x20).

4.2 Proper ingesting of XML

Character offsets and lengths for text extents in ERE XML are calculated
based on "raw" multi-post (MP) data, where original (XML-fragment)
meta-characters are escaped. For example, a reference to the corporation
"AT&T" will appear in MP as "AT&amp;T". ERE annotation on this string
will cite a length of 8 characters (not 4). This string is stored in the
ERE XML file as "AT&amp;amp;T" because of XML escaping, but returns to
"AT&amp;T" when the ERE XML file is read using an XML parser, as
intended.
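
The conventions in 4.1 and the escaping example above can be checked with
a few lines of code. The following is a minimal Python sketch, not LDC
tooling. The entity_mention element and its offset and length values
follow the description in this README; the "mention_text" child element,
the helper names, and the sample strings are assumptions made for
illustration, so consult the DTD for the actual element names. The sketch
also assumes that offsets count Unicode code points and that "whitespace"
means whatever Python's \s matches.

```python
import re
import xml.etree.ElementTree as ET


def read_source(path):
    """Read a source file without newline translation, so offsets stay valid."""
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def normalize_ws(text):
    """Collapse each run of whitespace to one ASCII space (section 4.1)."""
    # Assumption: "whitespace" is whatever Python's \s matches.
    return re.sub(r"\s+", " ", text)


def extract_extent(raw, offset, length):
    """Return the raw extent; offsets are 0-based and count every character."""
    return raw[offset:offset + length]


def mention_matches(raw, mention):
    """Compare the raw extent with the parser-decoded annotation text."""
    offset = int(mention.get("offset"))
    length = int(mention.get("length"))
    # "mention_text" is an assumed element name; check the DTD.
    annotated = mention.findtext("mention_text")
    return normalize_ws(extract_extent(raw, offset, length)) == annotated


# Made-up sample data standing in for a source file and its ERE XML file.
RAW = '<post author="a">\nI work at AT&amp;T in New\nYork.\n</post>\n'
ERE_XML = """<entities>
  <entity_mention id="m-1" offset="{att}" length="8">
    <mention_text>AT&amp;amp;T</mention_text>
  </entity_mention>
  <entity_mention id="m-2" offset="{ny}" length="8">
    <mention_text>New York</mention_text>
  </entity_mention>
</entities>""".format(att=RAW.index("AT&amp;T"), ny=RAW.index("New"))

# The XML parser decodes "AT&amp;amp;T" once, matching the raw source.
for mention in ET.fromstring(ERE_XML).iter("entity_mention"):
    print(mention.get("id"), mention_matches(RAW, mention))
```

The source text is never passed through an XML parser here: it is read
as raw text (with newline="" so that line terminators are left
untranslated), and only the ERE XML file is parsed.
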

With regard to whitespace characters in annotated text extents, the ERE
XML offset and length are again based on the "raw" MP data, and will
reflect the original quantity of whitespace characters. But in the text
string provided in the ERE XML annotation element, whitespace has been
normalized, as described in 4.1 above, and may be shorter.

5. Light ERE and Rich ERE Annotation

ERE Annotation consists of tagging all mentions of a set of targeted
entities, relations and events, as well as marking coreference for
entities and events.

Light ERE annotation labels entity mentions for the target set of entity
types. Light ERE also labels the target set of relation and event types
between and among those entities. Please refer to the annotation
guidelines for the target set of entity, relation and event types.
Multiple mentions of the same entity or event within a document are
coreferenced manually by annotators. Relation coreference is an automated
process and is not manually performed by annotators (see section 5.2 for
how relation coreference is produced).

In contrast to Light ERE annotation, Rich ERE annotation primarily
expands types and taggability in the Entities, Relations, and Events
annotation tasks and replaces strict Event Coreference with a more
loosely defined Event Hopper annotation (Song et al., 2015).

Rich ERE annotation for this data was performed on top of completed Light
ERE annotation. Rich ERE annotators first performed exhaustive tagging
and coreference of valid entities in a provided source document.
Afterwards, valid relations from the document were annotated and entity
or filler values supplied for the relation arguments. Lastly, valid event
mentions and event hoppers were annotated, entity or filler values were
supplied for event arguments, and hopper-style coreference of event
mentions was added.
Just as with Light ERE annotation, relation coreference is an automated
process and is not manually performed by annotators (see section 5.2 for
how relation coreference is produced).

For more information on the Light and Rich ERE annotation processes,
please refer to the annotation guidelines in the ./docs/guidelines
directory.

5.1 Data selection

All source data in this release were drawn from LDC2016T05 (BOLT Chinese
Discussion Forums). Documents were vetted for annotation suitability.
Documents containing sensitive information or no taggable content were
deemed not suitable for annotation.

5.2 ERE Annotation Workflow

Each document is annotated for all ERE tasks in a first pass (1P) by one
annotator and then second-pass annotated (2P) by a senior annotator or
team leader. For 1P, a single annotator completes all tasks (entities,
relations and events) for a file. For 2P, a more experienced senior
annotator reviews the first-pass annotations and corrects any errors they
find. After 2P, additional corpus-wide quality control (QC) checks are
conducted on completed 2P data by the team leader and select senior
annotators. Refer to section 5.3 for detailed QC procedures.

The full annotation process for ERE annotation is represented below:

    1P:   entities   relations   events
                         |
                         V
    2P:   entities   relations   events
                         |
                         V
    QC:   entities   relations   events

Coreference of relations is done automatically.
Relation mentions that meet the following criteria are processed after
annotation as coreferenced:

  -- They have the same type and subtype
  -- They have the same realis attribute
  -- If relations are asymmetric,
       relation1.arg1 == relation2.arg1 and
       relation1.arg2 == relation2.arg2
  -- If relations are symmetric,
       relation1.arg1 == relation2.arg1 and
       relation1.arg2 == relation2.arg2
     or
       relation1.arg1 == relation2.arg2 and
       relation1.arg2 == relation2.arg1
  -- The following three relation type-subtypes are symmetric:
       type            subtype
       personalsocial  business
       personalsocial  family
       personalsocial  unspecified

(Relation mentions which have a filler as an argument are treated as
singletons, because fillers are not coreferenced.)

Sometimes the discussion forum documents contain quoted text either from
an external source or from the same document. The quoted text is
annotated if it contains taggable entities, relations or events.

5.3 Quality Control

After manual quality control on individual files, LDC also conducts a
corpus-wide scan of each language which includes:

  -- Manual scan of all entity mentions for outliers (the same text
     strings have different typing)
  -- Manual scan of heads of all NOM (nominal) mentions to correct errors
     or misses
  -- Manual scan of all NAM (name) mentions having different entity type
     values in different parts of the corpus
  -- Manual scan of event triggers to review event type and subtype
     values
  -- Scan all time fillers to make sure that all time fillers are
     normalized
  -- Scan all relation arguments to make sure that only allowable entity
     types were annotated as arguments
  -- Scan all relations to make sure that there are no duplicate relation
     mentions (i.e., relation mentions whose arguments refer to the same
     entity mentions)
  -- Scan all event arguments to make sure that only allowable entity
     types were annotated as arguments
  -- Scan all event hoppers to make sure that event mentions in the same
     hoppers have the same type and subtype value (except for mentions of
     the contact and transaction types, which only need to agree on type
     level)

All identified outliers were then manually reviewed and corrected if
needed. These manual QC checks were done in parallel with automatic
validation checks of the data during extraction and preparation of
annotation files for delivery.

6. Data Validation

For all text extent references, it was verified that the combination of
docid, offset, and length was a valid reference to a string identical to
the content of the XML text extent element.

  - Verified that trigger text extent references are valid
  - Verified that arg text extent references are valid
  - Verified that entity mention text extent references are valid
  - Verified that filler text extent references are valid
  - Verified that each ERE kit in the delivery included annotation

Checks were also performed to identify and correct systematic errors that
occurred for certain event subtypes and argument types.

7. Acknowledgments

This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under agreement
number FA8750-13-2-0045. The U.S. Government is authorized to reproduce
and distribute reprints for Governmental purposes notwithstanding any
copyright notation thereon. The views and conclusions contained herein
are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or
implied, of Air Force Research Laboratory and Defense Advanced Research
Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:

  Ann Bies (LDC)
  David Graff (LDC)
  Jonathan Wright (LDC)
  Tom Riese
  Xuansong Li (LDC)
  Stephen Grimes

8. References

Christopher Walker, Stephanie Strassel, Julie Medero, Kazuaki Maeda.
ACE 2005 Multilingual Training Corpus LDC2006T06. Web Download.
Philadelphia: Linguistic Data Consortium, 2006.

DARPA. Broad Agency Announcement: Deep Exploration and Filtering of Text
(DEFT). Defense Advanced Research Projects Agency, DARPA-BAA-12-47. 2012.

Jacqueline Aguilar, Charley Beller, Paul McNamee, Benjamin Van Durme,
Stephanie Strassel, Zhiyi Song, Joe Ellis. Comparison of the Events and
Relations Across ACE, ERE, TAC-KBP, and FrameNet Annotation Standards.
52nd Annual Meeting of the Association for Computational Linguistics,
Baltimore, 2nd Workshop on Events: Definition, Detection, Coreference,
and Representation, 2014.

Zhiyi Song, Ann Bies, Stephanie Strassel, Tom Riese, Justin Mott, Joe
Ellis, Jonathan Wright, Seth Kulick, Neville Ryant and Xiaoyi Ma. From
Light to Rich ERE: Annotation of Entities, Relations, and Events.
Conference of the North American Chapter of the Association for
Computational Linguistics - Human Language Technologies (NAACL HLT),
3rd Workshop on Events: Definition, Detection, Coreference, and
Representation, 2015.

9. Contact Information

  Stephanie Strassel   PI
  Jonathan Wright      Technical oversight
  Song Chen            ERE annotation project manager
  Justin Mott          ERE Chinese Lead Annotator

-------------------
README Update Log
  Created: Song Chen, September 19, 2016
  Updated: Song Chen, February 17, 2017
  Updated: Song Chen, January 12, 2018
  Updated: Song Chen, June 26, 2019