Corpus Title: AIDA Scenario 1 Practice Topic Annotation
LDC Catalog-ID: LDC2024T02
Authors: Jennifer Tracey, Stephanie Strassel, Jeremy Getman, Ann Bies, Kira Griffitt, David Graff, Chris Caruso

1. Introduction

This corpus was developed by the Linguistic Data Consortium for the DARPA AIDA Program and consists of annotations for 212 documents in the AIDA Scenario 1 Practice Topic Source Data corpus.

The AIDA (Active Interpretations of Disparate Alternatives) Program is designed to support development of technology that can assist in cultivating and maintaining understanding of events when there are conflicting accounts of what happened (e.g. who did what to whom and/or where and when events occurred). AIDA systems must extract entities, events, and relations from individual multimedia documents, aggregate that information across documents and languages, and produce multiple knowledge graph hypotheses that characterize the conflicting accounts that are present in the corpus (see https://www.darpa.mil/program/active-interpretation-of-disparate-alternatives for more information about the program).

Each phase of the AIDA program focused on a different scenario, or broad topic area. The scenario for Phase 1 was political relations between Russia and Ukraine in the 2010s. In addition, each scenario had a set of specific subtopics that were designated as either "practice topics" (released for use in system development) or "evaluation topics" (reserved for use in the AIDA program evaluations for each phase). This corpus contains annotations for the set of practice topic documents designated for annotation for Phase 1.

The annotations in this package were developed in a three-stage annotation process designed to support the needs of the AIDA program for Scenario 1.

- Stage 1 consists of annotation of salient mentions of events and relations with local mentions of their arguments.
- Stage 2 consists of quality control on existing annotations, plus new annotations of informative mentions of arguments and annotation of any additional, salient event or relation mentions identified during quality control.
- Stage 3 consists of linking entity mentions to a KB, and performing cross-doc, cross-lingual clustering of NIL entity, event, and relation mentions. Although there are three linking tab files (one for each topic), KB linking and NIL clustering are cross-topic.

The annotations in this package cover the three Scenario 1 Practice Topics:

  R103 - Who Started the Shooting at Maidan?
  R105 - Ukrainian War Ceasefire Violations in Battle of Debaltseve (January-February 2015)
  R107 - Donetsk and Luhansk Referendum, aka Donbass Status Referendum (May 2014)

The source documents referenced by annotation files in this package appear in:

  LDC2023T11: AIDA Scenario 1 Practice Topic Source Data

2. Directory Structure and Content Summary

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  data/ -- contains subdirectories of annotation by topic
    R103/ -- subdirectories containing annotation files (see content description below)
    R105/ -- subdirectories containing annotation files (see content description below)
    R107/ -- subdirectories containing annotation files (see content description below)
  docs/ -- contains documentation about the annotation (see content description below)

2.1 Content Summary

This release contains annotation for a total of 212 unique documents, with the following distribution across the three practice topics and languages:

  Topic ID  Language  Documents
  R103      ENG       26
  R105      ENG       11
  R107      ENG       18
  R103      RUS       26
  R105      RUS       26
  R107      RUS       29
  R103      UKR       33
  R105      UKR       20
  R107      UKR       23

3.
Annotations

The annotation tagset and annotation guidelines can be found in the docs directory, and the formats of annotations are described in the AIDA_phase_1_table_field_descriptions_v3.tab file in the docs directory. The sections below provide descriptions of the content of each type of annotation file.

3.1 Mentions

A mention is a single reference in source data to a real-world entity or filler, event, or relation. A mention may occur in text, image, or video. A mention of an entity that takes part in an event or relation is called an argument.

There are three mentions tables for each topic: one for entities and fillers, one for relations, and one for events. These tables contain information about each annotated mention. Note that the KB linking information is contained in a separate linking.tab file (see below).

- Entity and filler mentions are in a file called TOPICID_arg_mentions.tab
- All mentions.tab files include subtype and subsubtype fields.
- All mentions.tab files include the root uid.
- Video mentions specify the signal type (picture or sound), and video and audio mentions include start and end time stamps for the mentions.
- Video "picture" mentions include keyframe id; images and video "picture" mentions include bounding box coordinates.
- Arg mentions include a mention status of "base" or "informative" indicating whether the entity/filler mention is the local mention that occupies an arg slot in a relation or event mention ("base") or whether it is an additional mention of an entity that is not local to the event/relation mention ("informative").
- Relation and event mentions can have the attributes "hedged" and/or "not". A "hedged" relation or event is one which the source data asserts as *possibly* true (or possibly not true). A "not" relation or event is one which the source data asserts as not having occurred.

3.2 Slots

A slot is a pre-defined role in an event or relation that is filled by an argument (entity mention).
There are two slots tables per topic, one for relations and one for events. Relation and event mentions in the mentions tables must be looked up in the slots tables to find the arguments and fillers involved in the relation/event.

- Slot type labels use the role labels from the AIDA annotation tag set, prefaced by indicators of the relation/event type and arg number. For example, the slot type "rel022arg02sponsor" refers to the arg 2 sponsor role in the relation that has index number ldc_rel_022 in the annotation tag set. To strip the slot_type to the bare role label, the first 11 characters can be removed, as this is a fixed-width preface.
- The argmention_ids in the slots table correspond to "base" mentions in the arg_mentions table. Note that events which serve as arguments of sponsorship relations appear in the event mentions table, not the arg mentions table.

3.3 KB Linking

A knowledge base (KB) is a static set of reference entities. Entity mentions are "linked" to entries in the KB as a method of indicating the real-world entity to which a mention refers. When an entity does not appear in the reference KB, it is instead assigned a NIL ID. When more than one mention is assigned the same NIL ID, this indicates that the mentions are coreferent (i.e. they refer to the same entity).

The KB linking tables provide a KB ID or NIL ID for each entity, relation, and event mention. The KB IDs refer to AIDA Scenario 1 and 2 Reference Knowledge Base (LDC2023T10). In the case where annotators could not disambiguate between two or more possible KB links, multiple IDs are presented, separated by a pipe ("|") symbol.

4. Prevailing Theories

A prevailing theory is a narrative about a particular topic that is prevalent in scenario-relevant source data. In the prevailing theories files, we provide a handful of natural language prevailing theories about "what happened" for each topic, and indicate which elements are required for each theory.
Practice topics include approximately 2 prevailing theories per topic. Note that prevailing theories are *NOT* intended to exhaustively cover the possible topic-level hypotheses that might emerge from the data.

Prevailing theories are in Excel files, one file per topic, with one prevailing theory per tab. Each element within a prevailing theory has either a KB ID or a PT clustering ID.

Each tab contains information at the top with the topic and natural language version of the theory. Below the natural language version is a matrix of elements that are required to fully support the theory, where each element is an event or relation with all its arguments. The first column assigns an ID number to each of the elements, which makes it easy to sort the rows and to tell which arguments go together under a particular relation or event. For each of the elements, one line represents the event or relation itself, and each argument is listed on a separate line under the event/relation.

There are two columns containing KB IDs:

- Column C (Event/Relation KB ID) contains the KB ID or clustering ID for the event or relation
- Column I (Item KE) contains the KB ID or clustering ID for the argument populating the given event or relation slot.

Entities and relations that do not appear in the AIDA Scenario 1 and 2 KB have PT clustering IDs formatted like PTE_R10#_### (for prevailing theory entities) or PTR_R10#_### (for prevailing theory relations). These IDs provide clustering information for the prevailing theories of the given topic. These are not NIL IDs, in that they do not correspond to any annotations in ./data, and only indicate which elements within a topic's prevailing theories are coreferent.

Events within the prevailing theories all have NIL IDs. These IDs may also be present in the kb_linking.tab files in ./data, meaning they may have corresponding mention-level annotations.
In addition to the KB IDs, each line has information about the type, subtype, and sub-subtype of each event/relation/argument as well as expected date, start date range, end date range, and attribute information where known.

5. Documentation

The following documents are present in the docs/ directory of this package:

  AIDA_Annotation_Guidelines_Quality_Control_and_Informative_Mentions_V1.0.pdf - annotation guidelines for "Stage 2" of the annotation process.

  AIDA_Annotation_Guidelines_Salient_Mentions_V1.0.pdf - annotation guidelines for "Stage 1" of the annotation process.

  AIDA_Phase_1_table_field_descriptions_v3.tab - description of the structure of each type of annotation table. This table includes information about column headers, content of each field, and format of the contents.

  doc_lang_topic.tab - provides the root uid, language, and topic for each document with annotations present in this release.

  LDC_AIDAAnnotationOntology_V8.xlsx - a copy of the annotation tag set, also referred to as the annotation ontology.

  media_list.tab - provides the child uid and media type for all non-text assets cited as provenance in the annotations present in this release.

  R103_R105_R107_topic_description_V2.pdf - descriptions of R103, R105, and R107 topics with queries and query IDs. Note that the queries are meant to draw annotators' attention to expected points of informational conflict within the topic, but salience to the topic is defined more broadly than simply providing the answer to one of the queries. See the annotation guidelines for instructions provided to annotators on determining salience.

  {R103,R105,R107}_prevailing_theories_final.xlsx - these three files contain prevailing theories for topics R103, R105, and R107 respectively.

6.
Known Issues

Duplicate arg mentions -- some arg mentions may be annotated more than once when they appear as arguments of more than one relation/event; that is, the same type, subtype, and sub-subtype may be applied to the same text extent (or video/image provenance) more than once. Note that duplicate arg mentions each have a unique argmention_id.

Seven orphaned argument mentions do not correspond to any relation or event mention:

  EMIC0011UQQ.000549
  EMIC0011UQQ.000683
  EMIC0011W6A.000251
  EMIC00120LX.000071
  EMIC00120LX.000084
  EMIC0015NQC.001417
  EMIC001K05A.001488

7. Copyright Information

(c) 2023 Trustees of the University of Pennsylvania
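
Usage sketch (not part of the release): the slot_type convention from Section 3.2 and the pipe-separated KB IDs from Section 3.3 could be handled along the following lines in Python. This is a minimal illustration under stated assumptions. The function names are hypothetical, and the ID strings in the example are placeholders, not real KB or NIL IDs. The table field descriptions file in docs/ gives the actual column names and ID formats.

```python
# Hypothetical helpers; names and example IDs are illustrative assumptions.

SLOT_PREFIX_WIDTH = 11  # fixed-width preface, e.g. "rel022arg02"


def split_slot_type(slot_type):
    """Split a slot_type into (fixed-width prefix, bare role label)."""
    if len(slot_type) <= SLOT_PREFIX_WIDTH:
        raise ValueError("slot_type too short: %r" % slot_type)
    return slot_type[:SLOT_PREFIX_WIDTH], slot_type[SLOT_PREFIX_WIDTH:]


def split_kb_ids(kb_id_field):
    """Split a linking field on '|' into a list of candidate IDs."""
    return [part.strip() for part in kb_id_field.split("|") if part.strip()]


if __name__ == "__main__":
    # Example slot type taken from Section 3.2.
    print(split_slot_type("rel022arg02sponsor"))
    # Placeholder IDs standing in for an ambiguous link.
    print(split_kb_ids("PLACEHOLDER_ID_1|PLACEHOLDER_ID_2"))
```

A field holding a single ID comes back as a one-element list, so downstream code can treat ambiguous and unambiguous links uniformly.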