Corpus Title:    AIDA Scenario 1 Practice Topic Annotation
LDC Catalog-ID:  LDC2024T02

Authors: Jennifer Tracey, Stephanie Strassel, Jeremy Getman, Ann Bies,
Kira Griffitt, David Graff, Chris Caruso

1. Introduction

This corpus was developed by the Linguistic Data Consortium for the DARPA
AIDA Program and consists of annotations for 212 documents in the AIDA
Scenario 1 Practice Topic Source Data corpus.

The AIDA (Active Interpretations of Disparate Alternatives) Program is
designed to support development of technology that can assist in
cultivating and maintaining understanding of events when there are
conflicting accounts of what happened (e.g. who did what to whom and/or
where and when events occurred).  AIDA systems must extract entities,
events, and relations from individual multimedia documents, aggregate that
information across documents and languages, and produce multiple knowledge
graph hypotheses that characterize the conflicting accounts that are
present in the corpus (see
https://www.darpa.mil/program/active-interpretation-of-disparate-alternatives
for more information about the program).

Each phase of the AIDA program focused on a different scenario, or
broad topic area. The scenario for Phase 1 was political relations
between Russia and Ukraine in the 2010s. In addition, each scenario
had a set of specific subtopics within the scenario that were
designated as either "practice topics" (released for use in system
development) or "evaluation topics" (reserved for use in the AIDA
program evaluations for each phase). This corpus contains annotations
for the set of practice topic documents designated for annotation for
Phase 1.

The annotations in this package were developed in a three-stage annotation
process designed to support the needs of the AIDA program for Scenario 1.

- Stage 1 consists of annotation of salient mentions of events and
  relations with local mentions of their arguments.

- Stage 2 consists of quality control on existing annotations, plus
  new annotations of informative mentions of arguments and annotation
  of any additional, salient event or relation mentions identified
  during quality control.

- Stage 3 consists of linking entity mentions to a KB, and performing
  cross-doc, cross-lingual clustering of NIL entity, event, and relation
  mentions. Although there are three linking tab files (one for each
  topic), KB linking and NIL clustering are cross-topic.

The annotations in this package cover the three Scenario 1 Practice Topics:
R103 - Who Started the Shooting at Maidan?
R105 - Ukrainian War Ceasefire Violations in Battle of Debaltseve
(January-February 2015)
R107 - Donetsk and Luhansk Referendum, aka Donbass Status Referendum
(May 2014)

The source documents referenced by annotation files in this package
appear in:
LDC2023T11: AIDA Scenario 1 Practice Topic Source Data 


2. Directory Structure and Content Summary

The directory structure and contents of the package are summarized
below -- paths shown are relative to the base (root) directory of the
package:

  data/                   -- contains subdirectories of annotation by topic
        R103/             -- subdirectories containing annotation files
                             (see content description below)
        R105/             -- subdirectories containing annotation files
                             (see content description below)
        R107/             -- subdirectories containing annotation files
                             (see content description below)
  docs/                   -- contains documentation about the annotation
                             (see content description below)

2.1 Content Summary

This release contains annotation for a total of 212 unique documents,
with the following distribution across the three practice topics and
languages:

Topic ID    Language   Documents
R103        ENG        26
R105        ENG        11
R107        ENG        18
R103        RUS        26
R105        RUS        26
R107        RUS        29
R103        UKR        33
R105        UKR        20
R107        UKR        23
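
As a quick consistency check, the per-topic and per-language totals can
be derived from the rows above. The following is a minimal sketch; the
names ROWS and totals are illustrative and not part of the release.

```python
from collections import Counter

# (topic, language, documents) rows copied from the table above.
ROWS = [
    ("R103", "ENG", 26), ("R105", "ENG", 11), ("R107", "ENG", 18),
    ("R103", "RUS", 26), ("R105", "RUS", 26), ("R107", "RUS", 29),
    ("R103", "UKR", 33), ("R105", "UKR", 20), ("R107", "UKR", 23),
]

def totals(rows, key_index):
    """Sum document counts grouped by the column at key_index."""
    counts = Counter()
    for row in rows:
        counts[row[key_index]] += row[2]
    return dict(counts)

by_topic = totals(ROWS, 0)       # keyed by topic ID
by_language = totals(ROWS, 1)    # keyed by language code
grand_total = sum(count for _, _, count in ROWS)
print(by_topic, by_language, grand_total)
```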


3. Annotations

The annotation tagset and annotation guidelines can be found in the
docs directory, and the formats of annotations are described in the
AIDA_phase_1_table_field_descriptions_v3.tab file in the docs
directory. The sections below provide descriptions of the content of
each type of annotation file.

3.1 Mentions

A mention is a single reference in source data to a real-world
entity or filler, event, or relation. A mention may occur in text,
image, or video. A mention of an entity that takes part in an event
or relation is called an argument.

There are three mentions tables for each topic: one for entities and
fillers, one for relations, and one for events. These tables contain
information about each annotated mention. Note that the KB linking
information is contained in a separate linking.tab file (see below).

- Entity and filler mentions are in a file called
  TOPICID_arg_mentions.tab

- All mentions.tab files include subtype and subsubtype fields.

- All mentions.tab files include the root uid.

- Video mentions specify the signal type (picture or sound), and
  video and audio mentions include start and end time stamps for the
  mentions.

- Video "picture" mentions include keyframe id; images and video
  "picture" mentions include bounding box coordinates.

- Arg mentions include a mention status of "base" or "informative"
  indicating whether the entity/filler mention is the local mention
  that occupies an arg slot in a relation or event mention ("base") or
  whether it is an additional mention of an entity that is not local
  to the event/relation mention ("informative").

- Relation and event mentions can have the attributes "hedged" and/or
  "not". A "hedged" relation or event is one which the source data
  asserts as *possibly* true (or possibly not true). A "not" relation
  or event is one which the source data asserts as not having occurred.

3.2 Slots

A slot is a pre-defined role in an event or relation that is filled
by an argument (entity mention).

There are two slots tables per topic, one for relations and one for
events. Relation and event mentions in the mentions tables must be
looked up in the slots tables to find the arguments and fillers
involved in the relation/event.

- Slot type labels use the role labels from the AIDA annotation
  tag set, prefaced by indicators of the relation/event type and arg
  number. For example, the slot type "rel022arg02sponsor" refers to the
  arg 2 sponsor role in the relation that has index number ldc_rel_022
  in the annotation tag set. To strip the slot_type to the bare role
  label, remove the first 11 characters, which form a fixed-width
  preface ("rel022arg02" in the example above).

- The argmention_ids in the slots table correspond to "base" mentions
  in the arg_mentions table. Note that events which serve as arguments
  of sponsorship relations appear in the event mentions table, not the
  arg mentions table.
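
The slot type convention described above can be unpacked as in the
following sketch. It assumes the 11-character preface always has the
shape of the "rel022arg02sponsor" example (three letters, three digits,
"arg", two digits); the function names are illustrative only, and the
field descriptions file in docs remains the authoritative reference.

```python
import re

# Assumed layout, inferred from the "rel022arg02sponsor" example:
# 3-letter kind + 3-digit index + "arg" + 2-digit arg number (11 chars),
# followed by the bare role label.
SLOT_TYPE = re.compile(r"^([a-z]{3})(\d{3})arg(\d{2})(.+)$")

def parse_slot_type(slot_type):
    """Split a slot_type string into (kind, index, arg_number, role)."""
    match = SLOT_TYPE.match(slot_type)
    if match is None:
        raise ValueError(f"unexpected slot_type: {slot_type!r}")
    kind, index, arg_number, role = match.groups()
    return kind, int(index), int(arg_number), role

def bare_role(slot_type):
    """Drop the fixed-width 11-character preface."""
    return slot_type[11:]

print(parse_slot_type("rel022arg02sponsor"))
print(bare_role("rel022arg02sponsor"))
```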

3.3 KB Linking

A knowledge base (KB) is a static set of reference entities. Entity
mentions are "linked" to entries in the KB as a method of indicating
the real-world entity to which a mention refers. When an entity does
not appear in the reference KB, it is instead assigned a NIL ID.
When more than one mention is assigned the same NIL ID, this
indicates that the mentions are coreferent (i.e. the same entity).

The KB linking tables provide a KB ID or NIL ID for each entity,
relation, and event mention. The KB IDs refer to the AIDA Scenario 1
and 2 Reference Knowledge Base (LDC2023T10).

In the case where annotators could not disambiguate between two or more
possible KB links, multiple IDs are presented, separated by a pipe ("|")
symbol.
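
A minimal sketch of handling the pipe-separated ID field and grouping
mentions by shared ID follows. The mention IDs and KB/NIL ID values in
the example are made up for illustration, and the (mention_id, field)
row shape is an assumption; consult the field descriptions file in docs
for the actual column layout.

```python
from collections import defaultdict

def split_kb_ids(field):
    """Return the candidate IDs in a pipe-separated linking field."""
    return [part.strip() for part in field.split("|") if part.strip()]

def cluster_by_id(rows):
    """Map each KB or NIL ID to the mention IDs linked to it.

    rows is an iterable of (mention_id, id_field) pairs. A mention with
    several candidate IDs is listed under each of them.
    """
    clusters = defaultdict(list)
    for mention_id, field in rows:
        for kb_id in split_kb_ids(field):
            clusters[kb_id].append(mention_id)
    return dict(clusters)

# Hypothetical rows: the IDs below do not come from the release.
rows = [("m1", "NIL0001"), ("m2", "NIL0001"), ("m3", "1001|1002")]
clusters = cluster_by_id(rows)
print(clusters)
```

Mentions that share an ID (here m1 and m2) are treated as coreferent,
while a mention with multiple candidates (m3) appears under each one.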


4. Prevailing Theories

A prevailing theory is a narrative about a particular topic that is
prevalent in scenario-relevant source data.

In the prevailing theories files, we provide a handful of natural
language prevailing theories about "what happened" for each topic, 
and indicate which elements are required for each theory. Practice
topics include approximately 2 prevailing theories per topic. Note
that prevailing theories are *NOT* intended to exhaustively cover the
possible topic-level hypotheses that might emerge from the data.

Prevailing theories are in Excel files, one file per topic, with one
prevailing theory per tab. Each element within a prevailing theory has
either a KB ID or a PT clustering ID.

Each tab contains information at the top with the topic and natural
language version of the theory. Below the natural language version is
a matrix of elements that are required to fully support the theory,
where each element is an event or relation with all its arguments. The
first column assigns an ID number to each of the elements, the purpose
of which is to make it easy to sort and tell which arguments go together
under a particular relation or event. For each of the elements, one line
represents the event or relation itself, and each argument is listed on
a separate line under the event/relation.

There are two columns containing KB IDs:

- Column C (Event/Relation KB ID) contains the KB ID or clustering ID
  for the event or relation

- Column I (Item KE) contains the KB ID or clustering ID for the 
  argument populating the given event or relation slot.

Entities and relations that do not appear in the AIDA Scenario 1 and 2
KB have PT clustering IDs formatted like PTE_R10#_### (for prevailing
theory entities) or PTR_R10#_### (for prevailing theory relations).
These IDs provide clustering information for the prevailing theories of
the given topic. These are not NIL IDs, in that they do not correspond
to any annotations in ./data, and only indicate which elements within a
topic's prevailing theories are coreferent.
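
PT clustering IDs can be told apart from other IDs with a simple pattern
match, sketched below. The sketch assumes that each "#" in the formats
above stands for a decimal digit; the example ID "PTE_R103_001" is
hypothetical and the function name is illustrative only.

```python
import re

# Assumed reading of PTE_R10#_### / PTR_R10#_###: "#" is a decimal digit.
PT_ID = re.compile(r"^PT([ER])_(R10\d)_(\d+)$")

def parse_pt_id(value):
    """Return (kind, topic, number) for a PT clustering ID, else None."""
    match = PT_ID.match(value)
    if match is None:
        return None
    kind = {"E": "entity", "R": "relation"}[match.group(1)]
    return kind, match.group(2), match.group(3)

print(parse_pt_id("PTE_R103_001"))   # hypothetical entity cluster ID
print(parse_pt_id("PTR_R107_012"))   # hypothetical relation cluster ID
print(parse_pt_id("something_else")) # not a PT clustering ID
```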

Events within the prevailing theories all have NIL IDs. These IDs
may also be present in the kb_linking.tab files in ./data, meaning
they may have corresponding mention-level annotations.

In addition to the KB IDs, each line has information about the type,
subtype, and sub-subtype of each event/relation/argument as well as
expected date, start date range, end date range, and attribute
information where known.


5. Documentation

The following documents are present in the docs/ directory of this package:

AIDA_Annotation_Guidelines_Quality_Control_and_Informative_Mentions_V1.0.pdf
- annotation guidelines for "Stage 2" of the annotation process.

AIDA_Annotation_Guidelines_Salient_Mentions_V1.0.pdf - annotation
guidelines for "Stage 1" of the annotation process.

AIDA_Phase_1_table_field_descriptions_v3.tab - description of the
structure of each type of annotation table. This table includes
information about column headers, content of each field, and format of
the contents.

doc_lang_topic.tab - provides the root uid, language, and topic for
each document with annotations present in this release.

LDC_AIDAAnnotationOntology_V8.xlsx - a copy of the annotation tag set,
also referred to as the annotation ontology.

media_list.tab - provides the child uid and media type for all
non-text assets cited as provenance in the annotations present in this
release.

R103_R105_R107_topic_description_V2.pdf - descriptions of R103, R105,
and R107 topics with queries and query IDs. Note that the queries are
meant to draw annotators' attention to expected points of informational
conflict within the topic, but salience to the topic is defined more
broadly than simply providing the answer to one of the queries. See
the annotation guidelines for instructions provided to annotators on
determining salience.

{R103,R105,R107}_prevailing_theories_final.xlsx - these three files
contain prevailing theories for topics R103, R105, and R107
respectively.


6. Known Issues

Duplicate arg mentions -- some arg mentions may be annotated more than
once when they appear as arguments of more than one relation/event;
that is, the same type, subtype, and sub-subtype may be applied to the
same text extent (or video/image provenance) more than once. Note that
duplicate arg mentions each have a unique argmention_id.

Seven orphaned argument mentions don't correspond to any relation or
event mention:

EMIC0011UQQ.000549
EMIC0011UQQ.000683
EMIC0011W6A.000251
EMIC00120LX.000071
EMIC00120LX.000084
EMIC0015NQC.001417
EMIC001K05A.001488
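
Users who want to detect or exclude such orphans in their own
processing could use a set difference, as in this sketch. It assumes the
caller has already collected the argmention_ids of "base" arg mentions
and the argmention_ids referenced in the slots tables; the function name
and the ID "EXAMPLE.000001" are made up for illustration.

```python
# The seven orphaned argmention_ids listed above.
KNOWN_ORPHANS = frozenset({
    "EMIC0011UQQ.000549",
    "EMIC0011UQQ.000683",
    "EMIC0011W6A.000251",
    "EMIC00120LX.000071",
    "EMIC00120LX.000084",
    "EMIC0015NQC.001417",
    "EMIC001K05A.001488",
})

def find_orphans(base_arg_ids, slot_arg_ids):
    """Return base arg mention IDs that never fill a slot, sorted."""
    return sorted(set(base_arg_ids) - set(slot_arg_ids))

# Illustrative data only: "EXAMPLE.000001" is not a real mention ID.
base_ids = ["EXAMPLE.000001", "EMIC0011UQQ.000549"]
slot_ids = ["EXAMPLE.000001"]
orphans = find_orphans(base_ids, slot_ids)
print(orphans)
print(all(orphan in KNOWN_ORPHANS for orphan in orphans))
```

"Informative" arg mentions should be left out of base_arg_ids, since
they are not expected to appear in the slots tables.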


7. Copyright Information

   (c) 2023 Trustees of the University of Pennsylvania