Corpus Title: KAIROS Phase 1 Evaluation Source Data, Annotation, Assessment
LDC Catalog-ID: LDC2026T07

Authors: Song Chen, Jennifer Tracey, Justin Mott, Ann Bies, Michael
         Arrigo, Christopher Caruso, David Graff, Stephanie Strassel


1.0 Introduction

The KAIROS Phase 1 Evaluation Source Data, Annotation, Assessment
corpus contains the English and Spanish source data (including text,
video and images), manual annotations, system output that was assessed
during the evaluation, and human assessment results from the Phase 1
evaluation of the DARPA KAIROS Program (DARPA, 2018). The
Knowledge-directed Artificial Intelligence Reasoning Over Schemas
(KAIROS) Program aimed to develop technology capable of understanding
and reasoning about complex real-world events in order to provide
actionable insights to end users. KAIROS systems utilize formal event
representations in the form of schema libraries that specify the
steps, preconditions and constraints for an open set of complex
events; schemas are then used in combination with event extraction to
characterize and make predictions about real-world events in a large
multilingual, multimedia corpus. Each KAIROS evaluation focused on a
real-world scenario and several real-world Complex Events (CEs) within
that scenario, along with the possibility of surprise CEs in different
but related scenarios.

The Phase 1 evaluation focused on the Improvised Explosive Device
(IED) bombing scenario with nine IED CEs, along with two surprise CEs
in the mass shooting scenario. The Phase 1 evaluation CE names and IDs
are as follows:

    - ce1005: Sydney Aeroplane Bomb Plot, Australia, 2017
    - ce1006: Stockholm Bombings, Sweden, 2010
    - ce1007: Manchester Arena Bombing, England, 2017
    - ce1008: Taxi Detonation, Canada, 2016
    - ce1009: Spokane Bombing Attempt, Washington, 2011
    - ce1010: Derry Bombing, Northern Ireland, 2019
    - ce1011: Bogotá Police Academy Car Bombing, Colombia, January 2019
    - ce1012: Kansas City Hospital Bombing, Missouri, 2020
    - ce1013: Attempted bombing in Moses Lake, Washington, 2018
    - ce1020: El Paso Walmart Shooting, Texas, 2019
    - ce1021: Orlando nightclub shooting, Florida, 2016

For the KAIROS Phase 1 Evaluation, systems combined schemas from a
schema library of complex events with knowledge elements (events,
relations, and their arguments) extracted from the input data set, to
create an instantiated system CE. Performers in the first of two
Technical Areas (TA1) created the schema libraries. Performers in the
second Technical Area (TA2) populated the schemas with extracted
knowledge elements. The resulting instantiated CE was the input to the
manual assessment process in an end-to-end evaluation of the TA2
system’s ability to detect complex events and to select and
instantiate the TA1 system’s schema.

For each CE, the source data for evaluation is an input data set
consisting of 10-15 documents. Source data includes both CE-relevant
documents and off-topic distractor documents in English and Spanish,
covering text, image and video sources. Manual annotation and
assessment of the CE-relevant documents for 10 of the CEs are included
in this package. The off-topic distractor documents were not manually
annotated, and ce1021 was not annotated or assessed as part of the
evaluation.

The annotation data includes gold standard reference annotations
created by trained annotators who labeled the scenario-relevant events
and relations in each document set using the pre-defined KAIROS
program ontology, resulting in a structured representation of the
temporally-ordered events, relations, and arguments necessary to fully
express the scenario-relevant events in each CE.

The graph data consists of a reference knowledge graph for each CE,
known as Graph G. Each Graph G contains the subset of manually labeled
events and relations that were selected by the KAIROS program to
constitute the ground truth reference CE for evaluation in the oracle
condition evaluation task, in which the systems were expected to match 
the Graph G with a given schema library, while bypassing the extraction
step. Some manually labeled events and relations were intentionally 
omitted from Graph G. The oracle condition evaluation task did not rely 
on manual assessment, but rather used Graph G as the ground truth for 
automatically scoring this evaluation task.

The assessment data includes both human assessment judgments and also
the system output that was manually assessed for the end-to-end
evaluation task. System output consisted of a TA1 CE schema populated
with TA2 events and relations, and their arguments. Trained assessors
reviewed the system output along with the reference annotation to
determine CE matching, knowledge element (KE) matching, and
correctness.


2.0 Directory Structure and Content Summary

2.1 Directory Structure

The directory structure and contents of the package are summarized
below. Paths shown are relative to the base (root) directory of the
package:

  ./data/source -- contains source data in subdirectories by data type
  ./data/annotation -- contains annotation for each CE in
        subdirectories by CE ID
  ./data/graph -- contains Graph G for each CE
  ./data/assessment/system_output/sdf -- contains system output that
        was manually assessed for each CE, in SDF JSON format
  ./data/assessment/system_output/tab -- contains system output that
        was manually assessed for each CE, in tab-delimited format
  ./data/assessment/assessment_result -- contains manual assessment
        results in tables by assessment task
  ./docs -- contains this README file and other documentation about
        the corpus
  ./docs/ce_profile -- contains narrative profiles of the incident
        event for each CE
  ./tools -- contains software for LTF data manipulation
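
As a quick sanity check after unpacking, the directory list above can
be compared against a local copy. The following is only a minimal
sketch: the function name "missing_dirs" is hypothetical, and the list
simply restates the paths given in this section.

```python
from pathlib import Path

# Directories named in section 2.1, relative to the package root.
EXPECTED_DIRS = [
    "data/source",
    "data/annotation",
    "data/graph",
    "data/assessment/system_output/sdf",
    "data/assessment/system_output/tab",
    "data/assessment/assessment_result",
    "docs",
    "docs/ce_profile",
    "tools",
]

def missing_dirs(root):
    """Return the expected directories that are absent under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]
```

An empty list returned for the package root suggests that the
top-level layout is intact.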

2.2 Source Data

A total of 139 root web pages were collected and processed, yielding
131 text data files, 1176 image files, and 27 video files present in
the corpus.

The "./data/source" directory has a separate subdirectory for each of
the following data types, and each directory contains data files of
the given type; the list shows the directory and file-extension
strings used for the data files of each type:

    gif -- contains "gif/*.gif.ldcc" (image data)
    jpg -- contains "jpg/*.jpg.ldcc" (image data)
    png -- contains "png/*.png.ldcc" (image data)
    svg -- contains "svg/*.svg.ldcc" (image data)
    mp4 -- contains "mp4/*.mp4.ldcc" (video data)

    ltf -- contains "ltf/*.ltf.xml" (segmented/tokenized text data)
    psm -- contains "psm/*.psm.xml" (companion to ltf.xml)

Data types in the first group consist of original source materials
presented in "ldcc wrapper" file format (see section 3.2 below).  The
latter group (ltf and psm) are created by LDC from source HTML data,
by way of an intermediate XML reduction of the original HTML content
for "root" web pages (see section 3.1 for a description of the
process, and section 3.3 for details on the LTF and PSM file formats).

The data files use 9-character file-IDs. For example:

        svg/JC002YBYQ.svg.ldcc

The "ldcc" file format is explained in more detail in section 3.2
below.
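
The naming convention above can be unpacked programmatically. The
sketch below is illustrative only; the function name
"split_source_path" is hypothetical, and it assumes that every data
file name has the form "<file-ID>.<type>.<suffix>" as in the example.

```python
from pathlib import PurePosixPath

def split_source_path(path):
    """Split e.g. 'svg/JC002YBYQ.svg.ldcc' into (file_id, data_type, is_ldcc)."""
    name = PurePosixPath(path).name
    parts = name.split(".")
    # File-IDs are 9 characters long, followed by the data type.
    if len(parts) < 2 or len(parts[0]) != 9:
        raise ValueError(f"unexpected source file name: {name}")
    file_id, data_type = parts[0], parts[1]
    # Files ending in ".ldcc" carry the wrapper header (section 3.2).
    return file_id, data_type, parts[-1] == "ldcc"
```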

2.3 Annotation

The ./data/annotation/ce10xx directories include the reference
annotation for each Complex Event (CE) in tab-delimited format in a
subdirectory named with the CE ID.

2.4 Graph G

Graph Gs are included in the ./data/graph/ce10xx directories. Graph G
is in JSON format. For convenience only, Graph G information is also
included in human-readable Excel (.xlsx) files.

2.5 Assessment

The ./data/assessment/assessment_result directory includes the manual
assessment results divided into separate tables for each assessment
task. Each table includes all assessment results for the task, across
all assessed system runs and CEs.

The ./data/assessment/system_output directory includes the anonymized
system output that was manually assessed for the Phase 1
evaluation. The system output is included in two formats: the original
SDF JSON format (./data/assessment/system_output/sdf) and a
tab-delimited format that renders the SDF JSON series human readable
(./data/assessment/system_output/tab).

2.6 Software Tools

The software tool in this release can be found in the ./tools/ltf2txt
directory.

A data file in ltf.xml format (for source data) can be conditioned to
recreate exactly the "raw source data" text stream (the rsd.txt file)
from which the LTF was created.  The tool described here can be used
to apply that conditioning to a directory containing ltf.xml data.
The script validates each output rsd.txt stream by comparing its MD5
checksum against the reference MD5 checksum of the original rsd.txt
file from which the LTF was created.  (This reference checksum is
stored as an attribute of the "DOC" element in the ltf.xml structure;
there is also an attribute that stores the character count of the
original rsd.txt file.)

The script includes its own user documentation; you can run "perldoc"
on it to view the documentation as a typical Unix man page, or simply
read the script content directly.  Running the script without any
command-line arguments causes it to display a one-line synopsis of its
usage and then exit.

   ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)
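
For readers who want to repeat the validation step outside the Perl
script, the check described above amounts to comparing an MD5 digest
and a character count. The sketch below is not the ltf2rsd.perl
implementation; the function name is hypothetical, and it assumes that
the reference checksum was computed over the UTF-8 bytes of rsd.txt.
The two reference values are the ones stored as attributes of the
"DOC" element.

```python
import hashlib

def rsd_matches_reference(rsd_text, ref_md5, ref_char_count):
    """Return True if rsd_text agrees with the reference MD5 and length."""
    # Assumption: the checksum covers the UTF-8 encoding of the text.
    digest = hashlib.md5(rsd_text.encode("utf-8")).hexdigest()
    return len(rsd_text) == ref_char_count and digest == ref_md5.lower()
```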


3.0 Source Data Preparation

Trained KAIROS annotators searched the web for suitable CE documents;
these documents were first harvested from various sources using an
automated system developed by LDC, and then processed to produce a
standardized format for use in downstream tasks.

3.1 Treatment of Original HTML Text Content

All harvested HTML content was initially converted from its original
form into a relatively uniform XML format; this stage of conversion
eliminated irrelevant content (menus, ads, headers, footers, etc.),
and placed the content of interest into a simplified, consistent
markup structure.

The "homogenized" XML format then served as input for the creation of
a reference "raw source data" (rsd) plain text form of the web page
content; at this stage, the text was also conditioned to normalize
white-space characters, and to apply transliteration and/or other
character normalization, as appropriate to the given language.

This processing creates the ltf.xml and psm.xml files for each
harvested "root" web page; these file formats are described in more
detail in ./docs/README_docs.txt.

3.2 Treatment of Non-HTML Data Types: "ldcc" File Format

To the fullest extent possible, all discrete resources referenced by a
given "root" HTML page (style sheets, javascript, images, video, audio
and other media files, etc.) are stored as separate files of the given
data type, and assigned separate 9-character file-IDs (the same form
of ID as is used for the "root" HTML page).

In order to present these attached resources in a stable and
consistent way, we developed a "wrapper" or "container" file format,
which presents the original data as-is, together with a specialized
header block prepended to the data.  The header block provides
metadata about the file contents, including the MD5 checksum (for
self-validation), the data type and byte count, url, and citations of
source-ID and parent (HTML) file-ID.

The LDCC header block always begins with a 16-byte ASCII signature, as
shown between double-quotes on the following line (where "\n"
represents the ASCII "newline" character 0x0A):

 "LDCc   \n1024   \n"

Note that the "1024" on the second line of the signature represents
the exact byte count of the LDCC header block.  (If/when this header
design needs to accommodate larger quantities of metadata, the header
byte count can be expanded as needed in increments of 1024 bytes. Such
expansion does not arise in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data
structure comprising the file-specific header content, expressed as a
set of "key: value" pairings in UTF-8 encoding.

The YAML string is padded at the end with space characters, such that
when the following 8-byte string is appended, the full header block
size is exactly 1024 bytes (or whatever size is stated in the initial
signature):

 "endLDCc\n"

In order to process the content of an LDCC header:

 - read the initial block of 1024 bytes from the *.ldcc data file
 - check that it begins with "LDCc   \n1024   \n" and ends with "endLDCc\n"
 - strip off those 16- and 8-byte portions
 - pass the remainder of the block to a YAML parser.

In order to access the original content of the data file, simply skip
or remove the initial 1024 bytes.
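
The steps above can be sketched in Python as follows. This is a
minimal illustration rather than an LDC-supplied tool: the function
name "split_ldcc" is hypothetical, and it handles only the 1024-byte
header size used in this release.

```python
SIGNATURE = b"LDCc   \n1024   \n"   # 16-byte opening signature
TRAILER = b"endLDCc\n"              # 8-byte closing string
HEADER_SIZE = 1024

def split_ldcc(blob):
    """Return (yaml_text, payload) for the full bytes of one *.ldcc file."""
    header, payload = blob[:HEADER_SIZE], blob[HEADER_SIZE:]
    if (len(header) < HEADER_SIZE
            or not header.startswith(SIGNATURE)
            or not header.endswith(TRAILER)):
        raise ValueError("not an LDCC wrapper with a 1024-byte header")
    # Strip the 16- and 8-byte portions, then the space padding.
    body = header[len(SIGNATURE):-len(TRAILER)]
    return body.decode("utf-8").rstrip(" "), payload
```

The returned YAML text can then be passed to any YAML parser (for
example "yaml.safe_load" from the third-party PyYAML package), and the
payload holds the original file content.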

3.3 Overview of XML Data Structures

3.3.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set
of tags needed to represent the structure of the relevant text as seen
by the human web-page reader.  When the text content of the XML file
is extracted to create the "rsd" format (which contains no markup at
all), the markup structure is preserved in a separate "primary source
markup" (psm.xml) file, which enumerates the structural tags in a
uniform way, and indicates, by means of character offsets into the
rsd.txt file, the spans of text contained within each structural
markup element.

For example, in a discussion-forum or web-log page, there would be a
division of content into the discrete "posts" that make up the given
thread, along with "quote" regions and paragraph breaks within each
post.  After the HTML has been reduced to uniform XML, and the tags
and text of the latter format have been separated, information about
each structural tag is kept in a psm.xml file, preserving the type of
each relevant structural element, along with its essential attributes
("post_author", "date_time", etc.), and the character offsets of the
text span comprising its content in the corresponding rsd.txt file.

3.3.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a
fully segmented and tokenized version of the text content for a given
web page.  Segments (sentences) and the tokens (words) are marked off
by XML tags (SEG and TOKEN), with "id" attributes (which are only
unique within a given XML file) and character offset attributes
relative to the corresponding rsd.txt file; TOKEN tags have additional
attributes to describe the nature of the given word token.
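
As an illustration, SEG and TOKEN elements can be read with a standard
XML parser. The sketch below rests on assumptions that should be
checked against the actual ltf.xml files: the offset attributes are
taken to be named "start_char" and "end_char" with an inclusive end
offset, and the sample document (ID, element nesting, and text) is
invented for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample; attribute names and nesting are assumptions.
SAMPLE_LTF = """<LCTL_TEXT>
<DOC id="EXAMPLE01">
<TEXT>
<SEG id="segment-0" start_char="0" end_char="10">
<TOKEN id="token-0-0" start_char="0" end_char="3">Hola</TOKEN>
<TOKEN id="token-0-1" start_char="5" end_char="9">mundo</TOKEN>
<TOKEN id="token-0-2" start_char="10" end_char="10">.</TOKEN>
</SEG>
</TEXT>
</DOC>
</LCTL_TEXT>"""

def iter_tokens(ltf_xml):
    """Yield (seg_id, token_id, start, end, text) for each TOKEN."""
    root = ET.fromstring(ltf_xml)
    for seg in root.iter("SEG"):
        for tok in seg.iter("TOKEN"):
            yield (seg.get("id"), tok.get("id"),
                   int(tok.get("start_char")), int(tok.get("end_char")),
                   tok.text)
```

Under these assumptions, each token's text equals the slice
rsd[start:end + 1] of the corresponding rsd.txt content.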

The segmentation is intended to partition each text file at sentence
boundaries, to the extent that these boundaries are marked explicitly
by suitable punctuation in the original source data.  To the extent
that sentence boundaries cannot be accurately detected (due to
variability or ambiguity in the source data), the segmentation process
will tend to err more often on the side of missing actual sentence
boundaries, and less often on the side of asserting false sentence 
breaks.

The tokenization is intended to separate punctuation content from word
content, and to segregate special categories of "words" that play
particular roles in web-based text (e.g. URLs, email addresses and
hashtags).  To the extent that word boundaries are not explicitly
marked in the source text, the LTF tokenization is intended to divide
the raw-text character stream into units that correspond to "words" in
the linguistic sense (i.e. basic units of lexical meaning).


4.0 Reference Annotation

See ./docs/README_annotation.txt for a description of the reference
annotation data, approach and results.

See ./docs/annotation_table_description.pdf for details about the
fields and content contained in each annotation table.


5.0 Manual Assessment and System Output

The assessment approach, system output and assessment results are
described in ./docs/README_assessment.txt. This document also contains
information on mapping between the assessment results tables and the
system output tables.

See ./docs/assessment_result_description.pdf for details about the
fields and content contained in each assessment results table.

See ./docs/system_output_description.pdf for details about the fields,
values and content of the system output tables.


6.0 Documentation Included in this Package

See ./docs/README_docs.txt for details about documentation files
included in this package.


7.0 References

DARPA. 2018. Knowledge-directed Artificial Intelligence Reasoning Over
Schemas (KAIROS). Defense Advanced Research Projects Agency, DARPA BAA
HR001119S0014.


8.0 Sponsorship

KAIROS was sponsored by the Air Force Research Laboratory (AFRL) and the
Defense Advanced Research Projects Agency (DARPA) under Contract
No. HR0011-19-S-0014.


9.0 Copyright

Portions © 2019 ABC, © 2008 ACTUALIDAD MEDIA GROUP, LLC, © 2011
Autonomous Nonprofit Organization “TV-Novosti,” © 2010, 2014,
2019-2020 BBC, © 2019 BelfastTelegraph.co.uk, © 2016, 2020 Cable News
Network. A Warner Media Company, © 2016 CBC/Radio-Canada, © 2010
Channel Four Television Corporation, © 2020 Chicago Tribune, © 2019
EDICIONES EL PAÍS SL, © 2019 France 24, © 2020 Google Inc., © 2010
Guardian News & Media Limited or its affiliated companies, © 2020
Heavy Inc., © 2020 Kansas City Public Radio, © 2020 KCTV5 News
(A Meredith Corporation Station), © 2011, 2018 KHQ, © 2019 Morgan
Murphy Media, © 2019-2020 NBC UNIVERSAL, © 2016, 2019 News Group
Newspapers Limited in England, © 2019 npr, © 2019 NYP Holdings, Inc.,
© 2018 NYWA, © 2011, 2020 Reuters, © 2018 Sinclair Broadcast Group,
Inc., © 2016 Sky UK, © 2017 Spanish Radio and Television Corporation,
© 2019 Special Broadcasting Service Corporation, © 2010 Standard.co.uk,
© 2020 The Associated Press, © 2019-2020 The City Paper Bogotá, © 2019
The Epoch Times, © 2019 THE IRISH TIMES, © 2017 The Lion of El Español
Publications SA, © 2019 Unidad Editorial Informacion General, SLU,
© 2020 Trustees of the University of Pennsylvania


10.0 Contacts

Ann Bies <bies@ldc.upenn.edu> - KAIROS Project Coordinator
Christopher Caruso <caruso@ldc.upenn.edu> - KAIROS Tech Lead
Brian Gainor <bgainor@ldc.upenn.edu> - KAIROS Tech Lead
Song Chen <zhiyi@ldc.upenn.edu> - KAIROS Project Manager


------
README created by Song Chen on November 28, 2023
       Updated by Ann Bies on November 28, 2023
       Updated by Ann Bies and Song Chen on April 4, 2025
       Updated by Ann Bies and Song Chen on April 9, 2025
       Updated by Ann Bies and Song Chen on April 10, 2025
       Updated by Song Chen on April 15, 2025
       Updated by Ann O'Brien on May 1, 2025
       Updated by Song Chen on May 15, 2025
       Updated by Ann Bies and Song Chen on May 16, 2025