Corpus Title:    AIDA Scenario 2 Practice Topic Annotation
LDC Catalog-ID:  LDC2024T06

Authors: Jennifer Tracey, Stephanie Strassel, Jeremy Getman, Ann Bies,
Kira Griffitt, David Graff, Chris Caruso

1. Introduction

This corpus was developed by the Linguistic Data Consortium for the DARPA
AIDA Program and consists of annotations for 29 documents in the AIDA
Scenario 2 Practice Topic Source Data corpus.

The AIDA (Active Interpretations of Disparate Alternatives) Program is
designed to support development of technology that can assist in
cultivating and maintaining understanding of events when there are
conflicting accounts of what happened (e.g. who did what to whom and/or
where and when events occurred).  AIDA systems must extract entities,
events, and relations from individual multimedia documents, aggregate that
information across documents and languages, and produce multiple knowledge
graph hypotheses that characterize the conflicting accounts that are
present in the corpus (see
https://www.darpa.mil/program/active-interpretation-of-disparate-alternatives
for more information about the program).

Each phase of the AIDA program focused on a different scenario, or broad
topic area. The scenario for Phase 2 was the socioeconomic and political
Crisis in Venezuela since 2010. In addition, each scenario had a set of
specific subtopics within the scenario that were designated as either
"practice topics" (released for use in system development) or
"evaluation topics" (reserved for use in the AIDA program evaluations for
each phase).

This corpus provides exhaustive annotation of events, relations, and
entities for the subset of AIDA Phase 2 practice topic documents that
were selected for annotation. The annotation of events, relations, and
entities in AIDA Phase 2 was exhaustive for a set of manually selected
document regions and ontology types, which vary by document. For each
document, annotators decided which regions contained information that
was relevant to a particular topic in the scenario. Similarly, a set
of ontology types was selected for each document based on which types
would be needed to annotate the topic-relevant information in the
selected regions. A file indicating which types and regions are
annotated for each document is included in the docs directory of the
release.

The Scenario 2 Practice Topics covered in this annotation are:
T201 - 2014 Disease Outbreak in Venezuela
T202 - 2017 Venezuelan Constituent Assembly Election
T203 - Drone Explosions in Caracas

The source documents referenced by annotation files in this package appear in:
LDC2023XXX: AIDA Scenario 2 Practice Topic Source Data


2. Directory Structure and Content Summary

The directory structure and contents of the package are summarized
below -- paths shown are relative to the base (root) directory of the
package:

  data/                   -- contains subdirectories of annotation by topic
        T201/             -- subdirectories containing annotation files
                             (see content description below)
        T202/             -- subdirectories containing annotation files
                             (see content description below)
        T203/             -- subdirectories containing annotation files
                             (see content description below)
  docs/                   -- contains documentation about the annotation
                             (see content description below)

2.1 Content Summary

This release contains annotation for a total of 29 unique documents,
with the following distribution across the three practice topics and
languages:

Topic ID    Language   Documents
T201        ENG        2
T202        ENG        3
T203        ENG        4
T201        RUS        4
T202        RUS        0
T203        RUS        6
T201        SPA        3
T202        SPA        2
T203        SPA        5
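
The distribution above could be recomputed from docs/doc_lang_topic.tab.
The following is a minimal sketch, assuming that file is tab-separated
with a header row. The column names root_uid, lang, and topic and the
sample rows are hypothetical; check the actual header before use.

```python
import csv
import io
from collections import Counter


def count_docs_by_topic_lang(tab_file):
    """Count unique documents for each (topic, language) pair."""
    reader = csv.DictReader(tab_file, delimiter="\t")
    seen = set()
    counts = Counter()
    for row in reader:
        # Column names are assumptions; adjust them to the real header.
        key = (row["root_uid"], row["topic"], row["lang"])
        if key not in seen:
            seen.add(key)
            counts[(row["topic"], row["lang"])] += 1
    return counts


# Hypothetical rows standing in for docs/doc_lang_topic.tab.
sample = io.StringIO(
    "root_uid\tlang\ttopic\n"
    "DOC_A\tENG\tT201\n"
    "DOC_B\tENG\tT201\n"
    "DOC_C\tSPA\tT203\n"
)
counts = count_docs_by_topic_lang(sample)
for (topic, lang), n in sorted(counts.items()):
    print(topic, lang, n)
```

To run this against the release, pass an open file handle for
docs/doc_lang_topic.tab in place of the sample.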


3. Annotations

The annotation tagset and annotation guidelines can be found in the
docs directory, and the formats of annotations are described in the
AIDA_phase_2_table_field_descriptions_v1.tab file in the docs
directory; the sections below provide descriptions of the content of
each type of annotation file.

3.1 Mentions

A mention is a single reference in source data to a real-world
entity or filler, event, or relation. A mention may occur in text,
image, or video. A mention of an entity that takes part in an event
or relation is called an argument.

There are three mentions tables for each topic: one for entities and
fillers, one for relations, and one for events. These tables contain
information about each annotated mention. Note the following:

- All mentions.tab files include subtype and subsubtype fields.

- All mentions.tab files include the root uid.

- Video mentions do not specify the signal type (picture or sound),
  so the mediamention_signaltype field in the evt, rel, and arg
  mentions tables is always set to EMPTY_NA.

- Video mentions of events and relations include start and end time
  stamps for the mentions (no keyframe id or bounding box
  coordinates); video mentions of entities include bounding box
  coordinates and keyframe id (no start and end time stamps).

- Arg mentions include a mention status of "base" or "informative"
  indicating whether the entity/filler mention is the local mention
  that occupies an arg slot in a relation or event mention ("base") or
  whether it is an entity mention that is not connected to an
  event/relation mention but was annotated as part of exhaustive
  annotation of entities by type for the selected regions
  ("informative").

- Relation and event mentions can have the attributes "hedged" and/or
  "not". A "hedged" relation or event is one which the source data
  asserts as *possibly* true (or possibly not true). A "not" relation
  or event is one which the source data asserts as not having occurred.
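
A consumer that wants only plainly asserted events or relations would
need to filter on these attributes. The following is a minimal sketch,
assuming the attribute field holds a comma-separated string such as
"hedged,not" and some placeholder otherwise. That encoding and the
function names are assumptions; see
AIDA_phase_2_table_field_descriptions_v1.tab for the actual format.

```python
def parse_attributes(value):
    """Return the recognized attributes found in a raw field value."""
    # Assumption: attributes are comma-separated within one field.
    tokens = {t.strip().lower() for t in value.split(",")}
    return tokens & {"hedged", "not"}


def is_plainly_asserted(value):
    """True when the mention is neither hedged nor negated."""
    return not parse_attributes(value)


for raw in ["hedged", "hedged,not", "EMPTY_NA"]:
    print(raw, sorted(parse_attributes(raw)), is_plainly_asserted(raw))
```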

3.2 Slots

A slot is a pre-defined role in an event or relation that is filled
by an argument (entity mention).

There are two slots tables per topic, one for relations and one for
events. Relation and event mentions in the mentions tables must be
looked up in the slots tables to find the arguments and fillers
involved in the relation/event.

Event mentions can occur as the arguments of other events, in
addition to occurring as the arguments of relations.
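
The lookup from a mentions table into the matching slots table can be
sketched as follows. This is an illustration only: the column names
(eventmention_id, slot_type, arg_id), the IDs, and the type and slot
values are hypothetical placeholders, and the real names are given in
AIDA_phase_2_table_field_descriptions_v1.tab.

```python
import csv
import io
from collections import defaultdict


def load_slots(slots_file, mention_col, slot_col="slot_type", arg_col="arg_id"):
    """Map each event/relation mention ID to its (slot, argument) pairs."""
    slots = defaultdict(list)
    for row in csv.DictReader(slots_file, delimiter="\t"):
        slots[row[mention_col]].append((row[slot_col], row[arg_col]))
    return slots


# Hypothetical event mentions table and event slots table.
evt_mentions = io.StringIO(
    "eventmention_id\ttype\n"
    "EM1\tTYPE_A\n"
    "EM2\tTYPE_B\n"
)
evt_slots = io.StringIO(
    "eventmention_id\tslot_type\targ_id\n"
    "EM1\tSLOT_X\tARG10\n"
    "EM1\tSLOT_Y\tARG11\n"
)

slots = load_slots(evt_slots, mention_col="eventmention_id")
joined = {}
for row in csv.DictReader(evt_mentions, delimiter="\t"):
    # A mention with no rows in the slots table gets an empty list.
    joined[row["eventmention_id"]] = slots.get(row["eventmention_id"], [])

print(joined)
```

Because an argument may itself be an event mention, the arg_id values
would need to be resolved against both the entity/filler and the event
mentions tables.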

3.3 KB Linking

A knowledge base (KB) is a static set of reference entities. Entity
mentions are "linked" to entries in the KB as a method of indicating
the real-world entity to which a mention refers. When an entity does
not appear in the reference KB, it is instead assigned a NIL ID.
When more than one mention is assigned the same NIL ID, this
indicates that the mentions are coreferent (i.e. refer to the same
entity).

The KB linking tables in this release provide within-document
coreference of events, relations, and entities. No linking to the
reference knowledge base or cross-document NIL coreference is
included. The KB IDs refer to AIDA Scenario 1 and 2 Reference
Knowledge Base (LDC2023XXX).

NIL ids are provided for each coreference cluster within a
document. Clusters of the "same" entity, relation, or event in
different documents will have different NIL ids since the coreference
annotation is within-document only.
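
Within-document coreference clusters can be recovered by grouping
mentions on their NIL ID. The following is a minimal sketch, assuming a
tab-separated linking table with columns named root_uid, mention_id, and
kb_id. These names and the sample IDs are hypothetical. Grouping on the
document ID together with the NIL ID keeps clusters from different
documents apart.

```python
import csv
import io
from collections import defaultdict


def coref_clusters(linking_file):
    """Group mention IDs by (document ID, NIL ID)."""
    clusters = defaultdict(list)
    for row in csv.DictReader(linking_file, delimiter="\t"):
        # Column names are assumptions; adjust them to the real header.
        clusters[(row["root_uid"], row["kb_id"])].append(row["mention_id"])
    return dict(clusters)


# Hypothetical rows standing in for a KB linking table.
sample = io.StringIO(
    "root_uid\tmention_id\tkb_id\n"
    "DOC_A\tM1\tNIL001\n"
    "DOC_A\tM2\tNIL001\n"
    "DOC_B\tM3\tNIL002\n"
)
clusters = coref_clusters(sample)
for key, mentions in sorted(clusters.items()):
    print(key, mentions)
```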


4. Documentation

The following documents are present in the docs/ directory of this
package:

AIDA_Type_Restricted_Event_Relation_Annotation_Guidelines_V1.0.pdf -
current annotation guidelines for exhaustive annotation of event and
relation mentions (including their arguments and attributes)

AIDA_Exhaustive_Entity-Filler_Guidelines_Text_V1.0.pdf - current
guidelines for exhaustive annotation of entities and fillers in text

AIDA_Entity-Filler_Guidelines_Images_V1.1.pdf - current guidelines for
exhaustive annotation of entities and fillers in images and video
keyframes

AIDA_phase_2_table_field_descriptions_v1.tab - description of the
structure of each type of annotation table. This table includes
information about column headers, content of each field, and format of
the contents

doc_lang_topic.tab - provides the root uid, language, and topic for
each document with annotations present in this release

AIDA_Annotation_Ontology_Phase2_V1.1.xlsx - a copy of the annotation
ontology for Phase 2

T201_T202_T203_topic_description_V1.pdf - descriptions of Phase 2
practice topics with queries and query IDs. Note that the queries are
meant to draw annotators' attention to expected points of informational
conflict within the topic, but annotation is exhaustive by type for
(the selected regions of) each document; therefore, annotations will
include all event, relation, and entity mentions of the selected
types, regardless of salience to the topics or queries.

doc_regions_types_v5.tab - defines the types and document regions
annotated for each document; one row per type per document
region. Includes root uid, child uid, media type, type, subtype,
subsubtype, and span. Span indicates the range of character offsets
for text and time stamps for video that were annotated for the given
type. For images the span is 'ENTIRE_DOCUMENT_ELEMENT' since the
entire image is in scope for annotation. For entity annotations on
video, the span consists of the keyframe id.


5. Copyright Information

   (c) 2023 Trustees of the University of Pennsylvania