AIDA Scenario 3 Practice Topic Source Data and Annotation

Item Name: AIDA Scenario 3 Practice Topic Source Data and Annotation
Author(s): Jennifer Tracey, Stephanie Strassel, Jeremy Getman, Ann Bies, Kira Griffitt, David Graff, Christopher Caruso
LDC Catalog No.: LDC2025T02
ISLRN: 141-368-488-003-3
DOI: https://doi.org/10.35111/a9kv-ct74
Release Date: February 17, 2025
Member Year(s): 2025
DCMI Type(s): MovingImage, Software, StillImage, Text
Data Source(s): discussion forum, newswire, web collection, weblogs
Project(s): AIDA
Application(s): entity extraction, information extraction
Language(s): English, Russian, Spanish
Language ID(s): eng, rus, spa
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2025T02 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Tracey, Jennifer, et al. AIDA Scenario 3 Practice Topic Source Data and Annotation LDC2025T02. Web Download. Philadelphia: Linguistic Data Consortium, 2025.

Introduction

AIDA Scenario 3 Practice Topic Source Data and Annotation was developed by the Linguistic Data Consortium (LDC) and consists of English, Russian and Spanish web documents (text, video, image) and annotations.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed to develop a multi-hypothesis semantic engine to generate explicit alternative interpretations of events, situations and trends from a variety of unstructured sources. LDC supported AIDA by collecting, creating and annotating multimodal linguistic resources in multiple languages.

Each phase of the AIDA program centered on a specific scenario, or broad topic area, with related subtopics designated as either practice subtopics or evaluation subtopics. The Phase 3 scenario focused on the COVID-19 global pandemic. This corpus contains source documents and annotations for the Scenario 3 practice topics.

Data

Source documents were collected from the web by a combination of automatic and manual processes. HTML content was converted from its original form into XML. To the extent possible, all resources referenced by a given "root" HTML page (style sheets, JavaScript, images, media files, etc.) were stored as separate files of the given data type and assigned separate 9-character file IDs (the same form of ID used for the "root" HTML page).

The corpus contains 1417 root documents; 279 documents were annotated. Annotations include:

  • Event, relation and entity annotation (64 documents)
  • Claim frame annotation: claims (true or not) relating to the COVID-19 pandemic (203 documents)
  • Practice topic query claim frames: example claim frames intended to be used by systems as queries to extract similar claims from additional documents (30 documents)

Claim frame annotations were produced by LDC; University of Colorado Boulder; Johns Hopkins University; Language Technologies Institute, Carnegie Mellon University; and University of Illinois Urbana-Champaign.

Annotations are presented as tab-separated files.
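
Tab-separated files of this kind can usually be read with a standard CSV parser. The following is a minimal sketch in Python, assuming each annotation file begins with a header row; the column names and values in the sample are hypothetical placeholders, not the corpus's actual schema, which is described in the release documentation.

```python
import csv
import io


def read_tsv(handle):
    """Read tab-separated rows from an open text handle.

    Assumes the first row is a header; returns a list of dicts
    keyed by the header fields. Quote characters are treated as
    literal text rather than as field delimiters.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical sample content: these column names and values are
# illustrative only and are not taken from the corpus.
sample = "doc_id\tlabel\nABC123456\texample\n"
rows = read_tsv(io.StringIO(sample))
print(rows[0]["doc_id"], rows[0]["label"])
```

For a file on disk, the same function can be called with a handle opened as `open(path, encoding="utf-8", newline="")`, where `path` is whichever annotation file is being inspected.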

Sponsorship

This material is based upon work supported by Air Force Research Laboratory (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under Contract No. FA8750-18-C-0013.

Updates

None at this time.
