KAIROS Schema Learning Background Source Data

Item Name: KAIROS Schema Learning Background Source Data
Author(s): Jennifer Tracey, Song Chen, Christopher Caruso, Stephanie Strassel
LDC Catalog No.: LDC2026T02
ISLRN: 515-437-467-737-2
DOI: https://doi.org/10.35111/r50m-6x32
Release Date: February 16, 2026
Member Year(s): 2026
DCMI Type(s): Image, MovingImage, Software, Sound, StillImage, Text
Data Source(s): web collection
Project(s): KAIROS
Application(s): entity extraction, event detection, information extraction, knowledge representation
Language(s): English, Spanish
Language ID(s): eng, spa
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2026T02 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Tracey, Jennifer, et al. KAIROS Schema Learning Background Source Data LDC2026T02. Web Download. Philadelphia: Linguistic Data Consortium, 2026.

Introduction

KAIROS Schema Learning Background Source Data was developed by the Linguistic Data Consortium (LDC). It contains over 14,000 English and Spanish documents representing text, audio, video, image, and multimedia resources collected during the DARPA KAIROS program as supplemental background source data for the KAIROS Schema Learning Corpus (SLC). The complete set of SLC background source data comprised 16.2 million English, Russian and Spanish documents and more than 125,000 audio, video, image, or multimedia resources.

The purpose of the supplemental collection was to increase the amount of English and Spanish data with multimedia components for schema learning and to add domains not well represented in the existing Spanish data. The supplemental data in this release includes material from the business and logistics domains, instructional documents and multimedia news.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over Schemas) program aimed to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilized formal event representations in the form of schema libraries that specified the steps, preconditions and constraints for an open set of complex events; schemas were then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

The SLC and KAIROS Schema Learning Complex Event Annotation (LDC2025T07), which contains English and Spanish text, audio, video, and image material labeled for 93 real-world complex events, constitute the data used by KAIROS system developers for schema learning.

Data

In addition to the supplemental data contained in this publication, SLC background data included English, Russian and Spanish multimedia resources from pre-existing LDC datasets. A list of those datasets is contained in the documentation accompanying this release.

Source data was collected primarily from the web by LDC and is presented in various formats, including .gif, .jpg, .ltf, .mp4, .png, .psm, and .svg.

Software tools are also included in this release. The tools recreate original source data from the processed XML material.

  • ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)
  • ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives
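
As a rough illustration of what such a conversion involves, the following Python sketch rebuilds raw text from an LTF-style XML string by placing each segment's original text at its recorded character offset. It is not the ltf2rsd.perl tool included in the release. The element and attribute names (SEG, ORIGINAL_TEXT, start_char), the sample document, and the function name ltf_to_rsd are assumptions made for illustration; the documentation accompanying the release describes the actual format.

```python
import xml.etree.ElementTree as ET

# Hypothetical LTF-style sample; the real layout may differ.
SAMPLE = """<LCTL_TEXT>
  <DOC id="doc1">
    <TEXT>
      <SEG id="s0" start_char="0" end_char="11">
        <ORIGINAL_TEXT>Hello world.</ORIGINAL_TEXT>
      </SEG>
      <SEG id="s1" start_char="14" end_char="25">
        <ORIGINAL_TEXT>Second line.</ORIGINAL_TEXT>
      </SEG>
    </TEXT>
  </DOC>
</LCTL_TEXT>"""


def ltf_to_rsd(ltf_xml: str) -> str:
    """Rebuild raw text from an LTF-style XML string (assumed layout)."""
    root = ET.fromstring(ltf_xml)
    pieces = []
    pos = 0  # number of characters emitted so far
    for seg in root.iter("SEG"):
        start = int(seg.get("start_char"))
        text = seg.findtext("ORIGINAL_TEXT", default="")
        # Assumption: gaps between segments were line breaks in the source,
        # so pad with newlines until the segment's start offset is reached.
        if start > pos:
            pieces.append("\n" * (start - pos))
            pos = start
        pieces.append(text)
        pos += len(text)
    return "".join(pieces)


if __name__ == "__main__":
    print(ltf_to_rsd(SAMPLE))
```

Padding by offset rather than simply joining segments keeps character positions in the rebuilt text aligned with the offsets stored in the XML, which matters if annotations refer to those offsets.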

Sponsorship

KAIROS was sponsored by the Air Force Research Laboratory (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-19-S-0014.

Updates

No updates at this time.
