Corpus Title: AIDA Scenario 2 Practice Topic Annotation
LDC Catalog-ID: LDC2024T06
Authors: Jennifer Tracey, Stephanie Strassel, Jeremy Getman, Ann Bies, Kira Griffitt, David Graff, Chris Caruso

1. Introduction

This corpus was developed by the Linguistic Data Consortium for the DARPA AIDA Program and consists of annotations for 29 documents in the AIDA Scenario 2 Practice Topic Source Data corpus.

The AIDA (Active Interpretation of Disparate Alternatives) Program is designed to support development of technology that can assist in cultivating and maintaining understanding of events when there are conflicting accounts of what happened (e.g. who did what to whom and/or where and when events occurred). AIDA systems must extract entities, events, and relations from individual multimedia documents, aggregate that information across documents and languages, and produce multiple knowledge graph hypotheses that characterize the conflicting accounts that are present in the corpus (see https://www.darpa.mil/program/active-interpretation-of-disparate-alternatives for more information about the program).

Each phase of the AIDA program focused on a different scenario, or broad topic area. The scenario for Phase 2 was the socioeconomic and political Crisis in Venezuela since 2010. In addition, each scenario had a set of specific subtopics that were designated as either "practice topics" (released for use in system development) or "evaluation topics" (reserved for use in the AIDA program evaluations for each phase).

This corpus provides exhaustive annotation of events, relations, and entities for the subset of AIDA Phase 2 practice topic documents that were selected for annotation. The annotation of events, relations, and entities in AIDA Phase 2 was exhaustive for a set of manually selected document regions and ontology types, which vary by document.
For each document, annotators decided which regions contained information that was relevant to a particular topic in the scenario. Similarly, a set of ontology types was selected for each document based on which types would be needed to annotate the topic-relevant information in the selected regions. A file indicating which types and regions are annotated for each document is included in the docs directory of the release.

The Scenario 2 Practice Topics covered in this annotation are:

  T201 - 2014 Disease Outbreak in Venezuela
  T202 - 2017 Venezuelan Constituent Assembly Election
  T203 - Drone Explosions in Caracas

The source documents referenced by annotation files in this package appear in:

  LDC2023XXX: AIDA Scenario 2 Practice Topic Source Data

2. Directory Structure and Content Summary

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  data/      -- contains subdirectories of annotation by topic
    T201/    -- subdirectories containing annotation files (see content description below)
    T202/    -- subdirectories containing annotation files (see content description below)
    T203/    -- subdirectories containing annotation files (see content description below)
  docs/      -- contains documentation about the annotation (see content description below)

2.1 Content Summary

This release contains annotation for a total of 29 unique documents, with the following distribution across the three practice topics and languages:

  Topic ID   Language   Documents
  T201       ENG        2
  T202       ENG        3
  T203       ENG        4
  T201       RUS        4
  T202       RUS        0
  T203       RUS        6
  T201       SPA        3
  T202       SPA        2
  T203       SPA        5

3. Annotations

The annotation tagset and annotation guidelines can be found in the docs directory, and the formats of annotations are described in the AIDA_phase_2_table_field_descriptions_v1.tab file in the docs directory; the sections below provide descriptions of the content of each type of annotation file.
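
As a minimal sketch of reading one of the annotation tables in Python, the helper below assumes that the .tab files are tab-delimited text with a single header row. That layout is an assumption based on the file extension; the field descriptions file in the docs directory should be consulted for the actual columns. The function name read_tab_table and the sample column names (col_a, col_b) are hypothetical and are used only for illustration.

```python
import csv
import io


def read_tab_table(handle):
    """Read a tab-delimited table with a header row into a list of dicts.

    Assumes (not confirmed by the release documentation quoted here) that
    the first line holds the column headers and that fields are unquoted.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical two-row sample standing in for a real annotation table.
sample = "col_a\tcol_b\nx1\ty1\nx2\ty2\n"
rows = read_tab_table(io.StringIO(sample))
```

For a file on disk, the same function could be called on a handle opened with open(path, encoding="utf-8", newline=""), where path is a placeholder and the UTF-8 encoding is likewise an assumption.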

3.1 Mentions

A mention is a single reference in source data to a real-world entity or filler, event, or relation. A mention may occur in text, image, or video. A mention of an entity that takes part in an event or relation is called an argument.

There are three mentions tables for each topic: one for entities and fillers, one for relations, and one for events. These tables contain information about each annotated mention. Note the following:

- All mentions.tab files include subtype and subsubtype fields.
- All mentions.tab files include the root uid.
- Video mentions do not specify the signal type (picture or sound), so the mediamention_signaltype field in the evt, rel, and arg mentions tables is always set to EMPTY_NA.
- Video mentions of events and relations include start and end time stamps for the mentions (no keyframe id or bounding box coordinates); video mentions of entities include bounding box coordinates and keyframe id (no start and end time stamps).
- Arg mentions include a mention status of "base" or "informative". A "base" mention is the local entity/filler mention that occupies an arg slot in a relation or event mention. An "informative" mention is an entity mention that is not connected to an event/relation mention but was annotated as part of exhaustive annotation of entities by type for the selected regions.
- Relation and event mentions can have the attributes "hedged" and/or "not". A "hedged" relation or event is one which the source data asserts as *possibly* true (or possibly not true). A "not" relation or event is one which the source data asserts as not having occurred.

3.2 Slots

A slot is a pre-defined role in an event or relation that is filled by an argument (entity mention). There are two slots tables per topic, one for relations and one for events. Relation and event mentions in the mentions tables must be looked up in the slots tables to find the arguments and fillers involved in the relation/event.
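
The lookup from a relation or event mention to its arguments can be sketched in Python as follows. This is an illustration under stated assumptions, not part of the release: the column names (mention_id, slot, arg_id), the identifiers, and the sample rows are all hypothetical. The actual column headers are documented in AIDA_phase_2_table_field_descriptions_v1.tab.

```python
from collections import defaultdict


def arguments_by_mention(slot_rows, mention_key, slot_key, arg_key):
    """Group (slot, argument id) pairs under the mention they belong to.

    slot_rows is an iterable of dicts, one per row of a slots table.
    The three *_key parameters name the relevant columns, which are
    passed in because the real header names are not assumed here.
    """
    grouped = defaultdict(list)
    for row in slot_rows:
        grouped[row[mention_key]].append((row[slot_key], row[arg_key]))
    return dict(grouped)


# Hypothetical slots-table rows, for illustration only.
event_slots = [
    {"mention_id": "ev1", "slot": "role_A", "arg_id": "ent1"},
    {"mention_id": "ev1", "slot": "role_B", "arg_id": "ent2"},
    {"mention_id": "ev2", "slot": "role_A", "arg_id": "ent3"},
]
lookup = arguments_by_mention(event_slots, "mention_id", "slot", "arg_id")
```

Given an event mention id taken from a mentions table, lookup.get(mention_id, []) then returns the slot and argument pairs recorded for it, or an empty list if the mention has no rows in the slots table.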
Event mentions can occur as the arguments of other events, in addition to occurring as the arguments of relations.

3.3 KB Linking

A knowledge base (KB) is a static set of reference entities. Entity mentions are "linked" to entries in the KB as a method of indicating the real-world entity to which a mention refers. When an entity does not appear in the reference KB, it is instead assigned a NIL ID. When more than one mention is assigned the same NIL ID, this indicates that the mentions are coreferent (i.e. they refer to the same entity).

The KB linking tables in this release provide within-document coreference of events, relations, and entities. No linking to the reference knowledge base or cross-document NIL coreference is included. The KB IDs refer to AIDA Scenario 1 and 2 Reference Knowledge Base (LDC2023XXX).

NIL IDs are provided for each coreference cluster within a document. Clusters of the "same" entity, relation, or event in different documents will have different NIL IDs since the coreference annotation is within-document only.

4. Documentation

The following documents are present in the docs/ directory of this package:

AIDA_Type_Restricted_Event_Relation_Annotation_Guidelines_V1.0.pdf - current annotation guidelines for exhaustive annotation of event and relation mentions (including their arguments and attributes)

AIDA_Exhaustive_Entity-Filler_Guidelines_Text_V1.0.pdf - current guidelines for exhaustive annotation of entities and fillers in text

AIDA_Entity-Filler_Guidelines_Images_V1.1.pdf - current guidelines for exhaustive annotation of entities and fillers in images and video keyframes

AIDA_phase_2_table_field_descriptions_v1.tab - description of the structure of each type of annotation table.
This table includes information about column headers, content of each field, and format of the contents.

doc_lang_topic.tab - provides the root uid, language, and topic for each document with annotations present in this release

AIDA_Annotation_Ontology_Phase2_V1.1.xlsx - a copy of the annotation ontology for Phase 2

T201_T202_T203_topic_description_V1.pdf - descriptions of Phase 2 practice topics with queries and query IDs. Note that the queries are meant to draw annotators' attention to expected points of informational conflict within the topic, but annotation is exhaustive by type for (the selected regions of) each document; therefore, annotations will include all event, relation, and entity mentions of the selected types, regardless of salience to the topics or queries.

doc_regions_types_v5.tab - defines the types and document regions annotated for each document; one row per type per document region. Includes root uid, child uid, media type, type, subtype, subsubtype, and span. Span indicates the range of character offsets for text, and of time stamps for video, that were annotated for the given type. For images the span is 'ENTIRE_DOCUMENT_ELEMENT' since the entire image is in scope for annotation. For entity annotations on video, the span consists of the keyframe id.

5. Copyright Information

(c) 2023 Trustees of the University of Pennsylvania