Corpus Title: KAIROS Phase 1 Evaluation Source Data, Annotation, Assessment
LDC Catalog-ID: LDC2026T07
Authors: Song Chen, Jennifer Tracey, Justin Mott, Ann Bies, Michael Arrigo, Christopher Caruso, David Graff, Stephanie Strassel

1.0 Introduction

The KAIROS Phase 1 Evaluation Source Data, Annotation, Assessment corpus contains the English and Spanish source data (including text, video and images), manual annotations, system output that was assessed during the evaluation, and human assessment results from the Phase 1 evaluation of the DARPA KAIROS Program (DARPA, 2018).

The Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) Program aimed to develop technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilize formal event representations in the form of schema libraries that specify the steps, preconditions and constraints for an open set of complex events; schemas are then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

Each KAIROS evaluation focused on a real-world scenario and several real-world Complex Events (CEs) within that scenario, along with the possibility of surprise CEs in different but related scenarios. The Phase 1 evaluation focused on the Improvised Explosive Device (IED) bombing scenario with nine IED CEs, along with two surprise CEs in the mass shooting scenario.
The Phase 1 evaluation CE names and IDs are as follows:

- ce1005: Sydney Aeroplane Bomb Plot, Australia, 2017
- ce1006: Stockholm Bombings, Sweden, 2010
- ce1007: Manchester Arena Bombing, England, 2017
- ce1008: Taxi Detonation, Canada, 2016
- ce1009: Spokane Bombing Attempt, Washington, 2011
- ce1010: Derry Bombing, Northern Ireland, 2019
- ce1011: Bogotá Police Academy Car Bombing, Colombia, January 2019
- ce1012: Kansas City Hospital Bombing, Missouri, 2020
- ce1013: Attempted bombing in Moses Lake, Washington, 2018
- ce1020: El Paso Walmart Shooting, Texas, 2019
- ce1021: Orlando nightclub shooting, Florida, 2016

For the KAIROS Phase 1 Evaluation, systems combined schemas from a schema library of complex events with knowledge elements (events, relations, and their arguments) extracted from the input data set, to create an instantiated system CE. Performers in the first of two Technical Areas (TA1) created the schema libraries. Performers in the second Technical Area (TA2) populated the schemas with extracted knowledge elements. The resulting instantiated CE was the input to the manual assessment process in an end-to-end evaluation of the TA2 system's ability to detect complex events and to select and instantiate the TA1 system's schema.

For each CE, the source data for evaluation is an input data set consisting of 10-15 documents. Source data includes both CE-relevant documents and off-topic distractor documents in English and Spanish, covering text, image and video sources.

Manual annotation and assessment of the CE-relevant documents for 10 of the CEs are included in this package. The off-topic distractor documents were not manually annotated, and ce1021 was not annotated or assessed as part of the evaluation.
The annotation data includes gold standard reference annotations created by trained annotators who labeled the scenario-relevant events and relations in each document set using the pre-defined KAIROS program ontology, resulting in a structured representation of the temporally-ordered events, relations, and arguments necessary to fully express the scenario-relevant events in each CE.

The graph data consists of a reference knowledge graph for each CE, known as Graph G. Each Graph G contains the subset of manually labeled events and relations that were selected by the KAIROS program to constitute the ground truth reference CE for evaluation in the oracle condition evaluation task, in which the systems were expected to match the Graph G with a given schema library, while bypassing the extraction step. Some manually labeled events and relations were intentionally omitted from Graph G. The oracle condition evaluation task did not rely on manual assessment, but rather used Graph G as the ground truth for automatically scoring this evaluation task.

The assessment data includes both the human assessment judgments and the system output that was manually assessed for the end-to-end evaluation task. System output consisted of a TA1 CE schema populated with TA2 events and relations, and their arguments. Trained assessors reviewed the system output along with the reference annotation to determine CE matching, knowledge element (KE) matching, and correctness.

2.0 Directory Structure and Content Summary

2.1 Directory Structure

The directory structure and contents of the package are summarized below.
Paths shown are relative to the base (root) directory of the package:

  ./data/source -- source data in subdirectories by data type
  ./data/annotation -- contains annotation for each CE in subdirectories by CE ID
  ./data/graph -- contains Graph G for each CE
  ./data/assessment/system_output/sdf -- contains system output that was manually assessed for each CE in SDF JSON format
  ./data/assessment/system_output/tab -- contains system output that was manually assessed for each CE in tab-delimited format
  ./data/assessment/assessment_result -- contains manual assessment results in tables by assessment task
  ./docs -- contains this README file and other documentation about the corpus
  ./docs/ce_profile -- contains narrative profiles of the incident event for each CE
  ./tools -- contains software for LTF data manipulation

2.2 Source Data

A total of 139 root web pages were collected and processed, yielding 131 text data files, 1176 image files, and 27 video files present in the corpus.

The "./data/source" directory has a separate subdirectory for each of the following data types, and each directory contains data files of the given type; the list shows the directory and file-extension strings used for the data files of each type:

  gif -- contains "gif/*.gif.ldcc" (image data)
  jpg -- contains "jpg/*.jpg.ldcc" (image data)
  png -- contains "png/*.png.ldcc" (image data)
  svg -- contains "svg/*.svg.ldcc" (image data)
  mp4 -- contains "mp4/*.mp4.ldcc" (video data)

  ltf -- contains "ltf/*.ltf.xml" (segmented/tokenized text data)
  psm -- contains "psm/*.psm.xml" (companion to ltf.xml)

Data types in the first group consist of original source materials presented in "ldcc wrapper" file format (see section 3.2 below). The data types in the latter group (ltf and psm) are created by LDC from source HTML data, by way of an intermediate XML reduction of the original HTML content for "root" web pages (see section 3.1 for a description of the process, and section 3.3 for details on the LTF and PSM file formats).
The data files use 9-character file-IDs. For example:

  svg/JC002YBYQ.svg.ldcc

The "ldcc" file format is explained in more detail in section 3.2 below.

2.3 Annotation

The ./data/annotation/ce10xx directories include the reference annotation for each Complex Event (CE) in tab-delimited format in a subdirectory named with the CE ID.

2.4 Graph G

Graph Gs are included in the ./data/graph/ce10xx directories. Graph G is in JSON format. For convenience only, Graph G information is also included in human-readable Excel (.xlsx) files.

2.5 Assessment

The ./data/assessment/assessment_result directory includes the manual assessment results divided into separate tables for each assessment task. Each table includes all assessment results for the task, across all assessed system runs and CEs.

The ./data/assessment/system_output directory includes the anonymized system output that was manually assessed for the Phase 1 evaluation. The system output is included in two formats: the original SDF JSON format (./data/assessment/system_output/sdf) and a tab-delimited format that renders the SDF JSON series human readable (./data/assessment/system_output/tab).

2.6 Software Tools

The software tool in this release can be found in the ./tools/ltf2txt directory.

A data file in ltf.xml format (for source data) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tool described here can be used to apply that conditioning to a directory containing ltf.xml data.

The script validates each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)
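
The validation step described above can be sketched in Python as follows. This is only an illustration, not the logic of the released Perl script. The attribute names "raw_text_md5" and "raw_text_char_length" are assumptions (this README does not name the attributes) and should be checked against an actual ltf.xml file; the sketch also assumes that the checksum was computed over the UTF-8 encoding of the text and that the character count is in Unicode code points.

```python
import hashlib
import xml.etree.ElementTree as ET


def check_rsd_against_ltf(ltf_xml, rsd_text,
                          md5_attr="raw_text_md5",
                          length_attr="raw_text_char_length"):
    """Return True if rsd_text matches the reference values stored on DOC.

    md5_attr and length_attr are assumed attribute names, not taken
    from this README; pass the real names if they differ.
    """
    root = ET.fromstring(ltf_xml)
    # DOC may be the root element or nested below it.
    doc = root if root.tag == "DOC" else root.find(".//DOC")
    if doc is None:
        raise ValueError("no DOC element found")

    # Assumption: the reference MD5 was taken over UTF-8 encoded text.
    md5_ok = hashlib.md5(rsd_text.encode("utf-8")).hexdigest() == doc.get(md5_attr)
    # Assumption: the stored count is a number of characters, not bytes.
    length_ok = str(len(rsd_text)) == doc.get(length_attr)
    return md5_ok and length_ok
```

A False result means that the regenerated text differs from the original rsd.txt in at least one character, or that one of the assumptions above does not hold for the data.
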
The script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or you can simply read the script content directly. Also, running the script without any command-line arguments will cause it to display a one-line synopsis of its usage, and then exit.

  ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)

3.0 Source Data Preparation

Trained KAIROS annotators searched the web for suitable CE documents; these documents were first harvested from various sources using an automated system developed by LDC, and then processed to produce a standardized format for use in downstream tasks.

3.1 Treatment of Original HTML Text Content

All harvested HTML content was initially converted from its original form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure.

The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language.

This processing creates the ltf.xml and psm.xml files for each harvested "root" web page; these file formats are described in more detail in ./docs/README_docs.txt.

3.2 Treatment of Non-HTML Data Types: "ldcc" File Format

To the fullest extent possible, all discrete resources referenced by a given "root" HTML page (style sheets, javascript, images, video, audio and other media files, etc.) are stored as separate files of the given data type, and assigned separate 9-character file-IDs (the same form of ID as is used for the "root" HTML page).
In order to present these attached resources in a stable and consistent way, we developed a "wrapper" or "container" file format, which presents the original data as-is, together with a specialized header block prepended to the data. The header block provides metadata about the file contents, including the MD5 checksum (for self-validation), the data type and byte count, url, and citations of source-ID and parent (HTML) file-ID.

The LDCC header block always begins with a 16-byte ASCII signature, as shown between double-quotes on the following line (where "\n" represents the ASCII "newline" character 0x0A):

  "LDCc \n1024 \n"

Note that the "1024" on the second line of the signature represents the exact byte count of the LDCC header block. (If/when this header design needs to accommodate larger quantities of metadata, the header byte count can be expanded as needed in increments of 1024 bytes. Such expansion does not arise in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data structure comprising the file-specific header content, expressed as a set of "key: value" pairings in UTF-8 encoding. The YAML string is padded at the end with space characters, such that when the following 8-byte string is appended, the full header block size is exactly 1024 bytes (or whatever size is stated in the initial signature):

  "endLDCc\n"

In order to process the content of an LDCC header:

- read the initial block of 1024 bytes from the *.ldcc data file
- check that it begins with "LDCc \n1024 \n" and ends with "endLDCc\n"
- strip off those 16- and 8-byte portions
- pass the remainder of the block to a YAML parser.

In order to access the original content of the data file, simply skip or remove the initial 1024 bytes.
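
The steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not LDC's own code. It reads the header size from the second line of the signature rather than hard-coding 1024, and it tolerates any space padding inside the 16-byte signature. In place of a full YAML parser it reads flat "key: value" lines, which is a simplification; a real YAML library would be the more robust choice. The function name split_ldcc is hypothetical.

```python
SIGNATURE_LEN = 16        # the README states a 16-byte ASCII signature
TRAILER = b"endLDCc\n"    # 8-byte string that closes the header block


def split_ldcc(data):
    """Return (header_dict, payload_bytes) for the bytes of one *.ldcc file."""
    # The signature is two newline-terminated lines: a magic string
    # and the byte count of the whole header block.
    parts = data[:SIGNATURE_LEN].split(b"\n")
    if len(parts) != 3 or parts[2] != b"" or parts[0].strip() != b"LDCc":
        raise ValueError("missing LDCc signature")
    header_size = int(parts[1].strip())   # 1024 in this release

    block = data[:header_size]
    if len(block) < header_size or not block.endswith(TRAILER):
        raise ValueError("header block is truncated or lacks the endLDCc trailer")

    # Strip the signature and trailer, then drop the space padding.
    yaml_text = block[SIGNATURE_LEN:-len(TRAILER)].decode("utf-8").rstrip(" ")

    header = {}
    for line in yaml_text.splitlines():
        # Assumption: flat "key: value" lines only; lines without a
        # colon (such as a YAML document marker) are skipped.
        key, sep, value = line.partition(":")
        if sep:
            header[key.strip()] = value.strip()

    # Everything after the header block is the original file content.
    return header, data[header_size:]
```

For self-validation, the MD5 of the returned payload can presumably be compared with the checksum stored in the header; the key under which it is stored is not stated here and should be read from an actual file.
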
3.3 Overview of XML Data Structures

3.3.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way, and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element.

For example, in a discussion-forum or web-log page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post. After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.), and the character offsets of the text span comprising its content in the corresponding rsd.txt file.

3.3.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and the tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are only unique within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data.
To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags). To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning).

4.0 Reference Annotation

See ./docs/README_annotation.txt for a description of the reference annotation data, approach and results.

See ./docs/annotation_table_description.pdf for details about the fields and content contained in each annotation table.

5.0 Manual Assessment and System Output

The assessment approach, system output and assessment results are described in ./docs/README_assessment.txt. This document also contains information on mapping between the assessment results tables and the system output tables.

See ./docs/assessment_result_description.pdf for details about the fields and content contained in each assessment results table.

See ./docs/system_output_description.pdf for details about the fields, values and content of the system output tables.

6.0 Documentation Included in this Package

See ./docs/README_docs.txt for details about documentation files included in this package.

7.0 References

DARPA. 2018. Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS). Defense Advanced Research Projects Agency, DARPA BAA HR001119S0014.

8.0 Sponsorship

KAIROS was sponsored by the Air Force Research Laboratory (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-19-S-0014.

9.0 Copyright

Portions © 2019 ABC, © 2008 ACTUALIDAD MEDIA GROUP, LLC, © 2011 Autonomous Nonprofit Organization “TV-Novosti,” © 2010, 2014, 2019-2020 BBC, © 2019 BelfastTelegraph.co.uk, © 2016, 2020 Cable News Network. A Warner Media Company, © 2016 CBC/Radio-Canada, © 2010 Channel Four Television Corporation, © 2020 Chicago Tribune, © 2019 EDICIONES EL PAÍS SL, © 2019 France 24, © 2020 Google Inc., © 2010 Guardian News & Media Limited or its affiliated companies, © 2020 Heavy Inc., © 2020 Kansas City Public Radio, © 2020 KCTV5 News (A Meredith Corporation Station), © 2011, 2018 KHQ, © 2019 Morgan Murphy Media, © 2019-2020 NBC UNIVERSAL, © 2016, 2019 News Group Newspapers Limited in England, © 2019 npr, © 2019 NYP Holdings, Inc., © 2018 NYWA, © 2011, 2020 Reuters, © 2018 Sinclair Broadcast Group, Inc., © 2016 Sky UK, © 2017 Spanish Radio and Television Corporation, © 2019 Special Broadcasting Service Corporation, © 2010 Standard.co.uk, © 2020 The Associated Press, © 2019-2020 The City Paper Bogotá, © 2019 The Epoch Times, © 2019 THE IRISH TIMES, © 2017 The Lion of El Español Publications SA, © 2019 Unidad Editorial Informacion General, SLU, © 2020 Trustees of the University of Pennsylvania

10.0 Contacts

Ann Bies - KAIROS Project Coordinator
Christopher Caruso - KAIROS Tech Lead
Brian Gainor - KAIROS Tech Lead
Song Chen - KAIROS Project Manager

------

README created by Song Chen on November 28, 2023
Updated by Ann Bies on November 28, 2023
Updated by Ann Bies and Song Chen on April 4, 2025
Updated by Ann Bies and Song Chen on April 9, 2025
Updated by Ann Bies and Song Chen on April 10, 2025
Updated by Song Chen on April 15, 2025
Updated by Ann O'Brien on May 1, 2025
Updated by Song Chen on May 15, 2025
Updated by Ann Bies and Song Chen on May 16, 2025