Corpus Title: KAIROS Phase 2 Quizlet
LDC Catalog-ID: LDC2025T15
Authors: Song Chen, Ann Bies, Christopher Caruso, Jennifer Tracey, Stephanie Strassel

1.0 Introduction

The KAIROS Phase 2 Quizlet corpus contains the English and Spanish source data (including text, video, and images) and annotations used for pre-evaluation research and system development during Phase 2 of the DARPA KAIROS Program.

The Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) Program aims to develop technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilize formal event representations in the form of schema libraries that specify the steps, preconditions and constraints for an open set of complex events; schemas are then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

KAIROS quizlets are a series of narrowly defined tasks designed to explore specific evaluation objectives, enabling KAIROS system developers to exercise individual system components on a small data set prior to the full program evaluation. This release contains the complete set of quizlet data used in Phase 2A of KAIROS.

Phase 2A of the DARPA KAIROS Program focused on the Disease Outbreak (DO) scenario. All Phase 2 quizlets took place in Phase 2A; Phase 2B of the program did not utilize quizlets. There were five quizlets during Phase 2A (Quizlets 5-9), with each quizlet building upon the data and results from the previous quizlets. Four quizlets (6-9) have data in this package (see * below for more details).
The quizlets focus on five real-world Complex Events (CEs) within the Disease Outbreak (DO) scenario (the CEs included in each quizlet are detailed below):

  CE2002: Clostridium perfringens, Chipotle restaurant, Ohio, 2018
  CE2004: Salmonella from peanut butter, originated from Georgia peanut factory, 2008
  CE2011: 2011 E. coli linked to contact with livestock at fair, North Carolina
  CE2019: 2017 Botulism from nacho cheese sauce, California
  CE2039: 1976 Philadelphia Legionnaires' disease outbreak

Quizlets include CE-relevant documents in English and Spanish, covering text, image and video sources. Trained annotators create gold standard reference annotations by labeling the scenario-relevant events and relations that are necessary to fully characterize what happened in a given CE, resulting in a structured representation of the temporally ordered events, relations and arguments necessary to fully express the focus CEs. Quizlets may also include a reference knowledge graph, known as Graph G, which contains a subset of the manually labeled events and relations and constitutes the ground truth reference CE for evaluation. Some manually labeled events and relations are intentionally omitted from Graph G in order to evaluate performers' schema instantiation and event prediction capabilities.

*Quizlet 5 was intended to allow performers to focus on developing schema representations for hierarchy in event complexes and did not require any data or annotation, so there is no Quizlet 5 data in this release.

Quizlet 6 focused on creating and instantiating event complex schemas using a Wikidata-based ontology. The "DARPA Wikidata" or DWD was adopted by the Cross-Program Ontology Working Group for use as the program-wide ontology (using DWD Qnodes or Pnodes coupled with PropBank-style argument role sets). Data for Quizlet 6 includes a small set of English source documents and manual annotation for two CEs (CE2002, CE2004) in the DO scenario.
The manual annotation for this quizlet uses DWD Qnodes and includes the labeled events and relations (plus their arguments) needed to understand the CE as a whole. Labeled events are also temporally ordered by start order.

Quizlet 7 focused on the evaluation approach for predicted events, richer temporal attributes, and human readability. Data for Quizlet 7 includes the full set of English and Spanish source documents and updated annotation for the same two CEs (CE2002, CE2004) as for Quizlet 6, with the addition of annotation linking labeled argument entities to DWD as a knowledge base, the graph G that was generated for each CE, and information about how the data set is partitioned to support prediction evaluation. Starting with Quizlet 7, temporal ordering for events includes a second layer of ordering to allow for greater specificity in the start order. Quizlet 7 also includes source data, but not annotation, for a third CE (CE2039).

Quizlet 8 focused on predicted events and their arguments, temporal sequencing, and experiments with assessment. Data for Quizlet 8 includes a revised set of source documents for CE2004 (removing two overly rich documents from the set), and updated event/relation annotation that excludes events, relations, and arguments unique to the excluded source documents. CE2002 was not revised. Data also includes the annotation linking labeled argument entities to DWD as a knowledge base, the graph G that was generated for each CE, and information about how the data set is partitioned to support prediction evaluation. Quizlet 8 does not include any data for CE2039.

Quizlet 9 served as a dry run for the full Phase 2A evaluation, exercising the entire evaluation pipeline.
Data for Quizlet 9 starts with the unchanged data from Quizlet 8, and adds source documents, manual annotation, graph G, and information about how the data set is partitioned to support prediction evaluation for an additional two CEs (CE2011, CE2019) in the DO scenario. Manual annotation for this quizlet includes the annotation of events, relations, and their arguments, the two-layer temporal ordering of all labeled events, and entity linking to DWD for all labeled argument entities. See ./docs/annotation_approach.txt for more details.

2.0 Directory Structure and Content Summary

2.1 Directory Structure

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

./data/source -- source data in subdirectories by data type (see below for list of data types)
./data/annotation/quizlet_{6-9} -- contains annotation data for Quizlets
./data/graphG/quizlet_{7-9} -- contains graph G for Quizlets 7-9
./docs -- documentation for source data and annotations
./docs/ce_profile -- brief descriptions of the CEs that were annotated
./tools/ -- software for LTF data manipulation

See ./docs/docs_README.txt for details on the documentation included in this release.

The "./data/source" directory has a separate subdirectory for each of the following data types, and each directory contains one or more data files of the given type; the list below shows the directory and file-extension strings used for the data files of each type:

gif/ -- contains "*.gif.ldcc" (image data)
jpg/ -- contains "*.jpg.ldcc" (image data)
mp4/ -- contains "*.mp4.ldcc" (video data)
png/ -- contains "*.png.ldcc" (image data)
svg/ -- contains "*.svg.ldcc" (image data)
ltf/ -- contains "*.ltf.xml" (segmented/tokenized text data)
psm/ -- contains "*.psm.xml" (companion to ltf.xml)

Data types in the first group (image and video data) consist of original source materials presented in "ldcc wrapper" file format (see section 3.2 below).
The latter group (ltf and psm) are created by LDC from source HTML data, by way of an intermediate XML reduction of the original HTML content for "root" web pages (see section 3.1 for a description of the process, and section 4 for details on the LTF and PSM file formats).

The data files use 9-character file-IDs. For example:

  jpg/K0C03N6VP.jpg.ldcc

(The "ldcc" file format is explained in more detail in section 3.2 below.)

2.2 Source Data Content

A total of 66 root web pages were collected and processed, yielding 65 text data files, 890 image files, and 10 video files present in the corpus.

2.3 Annotation Content

The table below summarizes the total number of documents and amount of annotation included in the corpus:

  total_ce  - total complex events subject to data collection and annotation
  total_src - root documents collected and processed
  total_evt - number of events annotated
  total_rel - number of relations annotated
  total_arg - number of arguments annotated

  quizlet | total_ce | total_src | total_evt | total_rel | total_arg |
  --------|----------|-----------|-----------|-----------|-----------|
     6    |    2     |    10     |    325    |     77    |    1193   |
     7    |    3     |    40     |    431    |    104    |    1729   |
     8    |    2     |    29     |    336    |     97    |    1409   |
     9    |    4     |    54     |    577    |    154    |    2374   |

2.4 Graph G

The knowledge graph ("graph G") for each CE was generated from the manual annotation to include a selected subset of the annotated events, temporal attributes, and relations (plus their arguments) that were determined to be of interest to the evaluation. Graph G served as input to one of the tasks in the program evaluation, where performers were required to instantiate their schemas for the CE by using graph G to provide the group of annotated events comprising the CE.

Graph Gs for Quizlets 7-9 are included in this release in the ./data/graphG/quizlet_{7,8,9} directories. Quizlet 6 did not include a graph G, and so does not have a corresponding graphG directory. Several versions of graph G were included for Quizlets 7-8 to support performer experimentation with scoring during pre-evaluation quizlets.
Quizlet 9 includes only the "abridged" graph G, as would be used by performers in the evaluation.

- ce{20NN}_GraphG.json -- the version NIST used for scoring and assessment purposes
- ce{20NN}abridged_GraphG.json -- corresponds to the version performers used for the evaluation; events that appear only in hidden documents are removed, as hidden documents were not processed by performers
- ce{20NN}critical_GraphG.json -- events/relations not annotated as "critical" are removed

Graph G is in JSON format. However, for convenience only, Quizlet 7 graph G information is also included in human-readable Excel (.xlsx) files.

2.5 Software tools included in this release

Software tools in this release can be found in ./tools/ltf2txt/

A data file in ltf.xml format (for source data in each quizlet) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data. In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or you can simply view the script content directly to read the documentation. Running either script without any command-line arguments will cause it to display a one-line synopsis of its usage and then exit.
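As an illustration of the conditioning described in section 2.5, the following Python sketch rebuilds an rsd.txt stream from an ltf.xml file and checks it against the stored checksum. This is a sketch only, not a substitute for the ltf2rsd.perl script: the attribute and element names used here (raw_text_char_length, raw_text_md5, SEG, ORIGINAL_TEXT, start_char) are assumptions based on common LDC LTF layouts, and gap characters between segments are assumed to be newlines.

```python
import hashlib
import xml.etree.ElementTree as ET

def ltf_to_rsd(ltf_path):
    """Rebuild the raw-source-data text stream from an ltf.xml file and
    verify it against the MD5 checksum stored on the DOC element.

    Attribute/element names are assumptions (see lead-in); check them
    against the actual data before relying on this sketch."""
    doc = ET.parse(ltf_path).getroot().find(".//DOC")
    length = int(doc.get("raw_text_char_length"))
    # Start from a buffer of newlines (assumed gap character) and overlay
    # each segment's original text at its recorded character offset.
    buf = ["\n"] * length
    for seg in doc.iter("SEG"):
        start = int(seg.get("start_char"))
        text = seg.find("ORIGINAL_TEXT").text or ""
        buf[start:start + len(text)] = list(text)
    rsd = "".join(buf)
    ok = hashlib.md5(rsd.encode("utf-8")).hexdigest() == doc.get("raw_text_md5")
    return rsd, ok
```

A mismatch in the returned flag would indicate either a damaged file or a different gap-filling convention than the one assumed here.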
ltf2rsd.perl -- convert ltf.xml files to rsd.txt (raw-source-data)

3.0 Source Data Processing

The web documents selected by annotators during data scouting were first harvested from various sources using an automated system developed by LDC, and then processed to produce a standardized format for use in downstream tasks.

3.1 Treatment of original HTML text content

All harvested HTML content was initially converted from its original form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure. The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language. This processing creates the ltf.xml and psm.xml files for each harvested "root" web page; these file formats are described in more detail in section 4 below.

3.2 Treatment of non-HTML data types: "ldcc" file format

To the fullest extent possible, all discrete resources referenced by a given "root" HTML page (style sheets, javascript, images, video, audio and other media files, etc.) are stored as separate files of the given data type, and assigned separate 9-character file-IDs (the same form of ID as is used for the "root" HTML page). In order to present these attached resources in a stable and consistent way, we developed a "wrapper" or "container" file format, which presents the original data as-is, together with a specialized header block prepended to the data. The header block provides metadata about the file contents, including the MD5 checksum (for self-validation), the data type and byte count, the url, and citations of the source-ID and parent (HTML) file-ID.
The LDCC header block always begins with a 16-byte ASCII signature, as shown between double-quotes on the following line (where "\n" represents the ASCII "newline" character 0x0A, and each of the two lines is padded with spaces to 8 bytes):

  "LDCc   \n1024   \n"

Note that the "1024" on the second line of the signature represents the exact byte count of the LDCC header block. (If/when this header design needs to accommodate larger quantities of metadata, the header byte count can be expanded as needed in increments of 1024 bytes. Such expansion does not arise in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data structure comprising the file-specific header content, expressed as a set of "key: value" pairings in UTF-8 encoding. The YAML string is padded at the end with space characters, such that when the following 8-byte string is appended, the full header block size is exactly 1024 bytes (or whatever size is stated in the initial signature):

  "endLDCc\n"

In order to process the content of an LDCC header:

- read the initial block of 1024 bytes from the *.ldcc data file
- check that it begins with the 16-byte "LDCc" signature and ends with "endLDCc\n"
- strip off those 16- and 8-byte portions
- pass the remainder of the block to a YAML parser

In order to access the original content of the data file, simply skip or remove the initial 1024 bytes.

4.0 Overview of XML Data Structures

4.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way, and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element.
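The header-processing steps listed in section 3.2 can be sketched in Python. This is a minimal sketch: it treats the YAML header as flat "key: value" lines (a real header may warrant a full YAML parser such as PyYAML), and it reads the header size from the signature rather than hard-coding the space padding.

```python
def read_ldcc(path):
    """Split an *.ldcc file into (header metadata, original payload bytes).

    The YAML header is parsed as flat "key: value" lines for illustration
    only; nested YAML structures would need a real YAML parser."""
    with open(path, "rb") as f:
        data = f.read()
    sig_lines = data[:16].split(b"\n")
    if sig_lines[0].strip() != b"LDCc":
        raise ValueError("missing LDCc signature")
    hsize = int(sig_lines[1].decode("ascii"))    # header byte count, e.g. 1024
    if data[hsize - 8:hsize] != b"endLDCc\n":
        raise ValueError("malformed LDCC header")
    body = data[16:hsize - 8].decode("utf-8").rstrip()
    meta = {}
    for line in body.splitlines():
        key, _, value = line.partition(":")
        meta[key.strip()] = value.strip()
    return meta, data[hsize:]    # payload is everything after the header
```

To recover the original image or video content, the returned payload bytes can simply be written back out under the unwrapped file name.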
For example, in a discussion forum or web log page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post. After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.), and the character offsets of the text span comprising its content in the corresponding rsd.txt file.

4.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are unique only within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data. To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and (we hope) less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags).
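The SEG/TOKEN layout described in section 4.2 can be inspected with a few lines of Python. This is a sketch only; the offset attribute names (start_char, end_char) are assumptions based on common LDC LTF layouts and should be checked against the actual files.

```python
import xml.etree.ElementTree as ET

def segments_and_tokens(ltf_xml):
    """Return, for each SEG in an ltf.xml document, the segment id and its
    tokens as (text, start_char, end_char) tuples, where the offsets point
    into the corresponding rsd.txt file. Attribute names are assumed."""
    root = ET.fromstring(ltf_xml)
    result = []
    for seg in root.iter("SEG"):
        tokens = [(tok.text, int(tok.get("start_char")), int(tok.get("end_char")))
                  for tok in seg.iter("TOKEN")]
        result.append((seg.get("id"), tokens))
    return result
```

Because the offsets point into rsd.txt, a token string can be cross-checked by slicing the raw text (e.g. rsd[start:end + 1] if the end offset is inclusive, which should be verified against the data).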
To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning).

5.0 Known Issues

5.1 Incomplete mappings to DWD identifiers for CE2002 and CE2004

The available set of mappings from original / explicit ontology terms to DWD identifiers is incomplete. In particular:

- Though all entries in the "arguments" tables have "qnode_type_id" values, many lack "mapped_slot_type" values (there are 162 rows with "EMPTY_TBD" and 16 rows with "---").
- Only a couple of "events" table entries lack "qnode_type_id" values.
- The "relations" tables have values for "qnode_attribute_id", but not for "qnode_type_id"; also, as with the other tables, the "type", "subtype" and "subsubtype" values are all set to "EMPTY_NA" (in anticipation of using DWD identifiers instead).

5.2 Incomplete mappings to DWD identifiers for CE2011 and CE2019 annotation

Some argument slots have not been mapped to any DWD identifiers. Those slots are mapped to either EMPTY_TBD or EMPTY_REF.

5.3 Graph G for CE2004 in Quizlet 8

The "full" version of graph G for CE2004 in Quizlet 8 is based on the Quizlet 7 annotation, which means that ce2004full_GraphG.json includes annotation for the two documents that were removed from CE2004 for Quizlet 8. The "critical" version of graph G for CE2004 in Quizlet 8 (ce2004critical_GraphG.json) includes 165 events/relations, although there are 180 events/relations labeled as critical in the annotation.

6.0 References

DARPA. 2018. Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS). Defense Advanced Research Projects Agency, DARPA BAA HR001119S0014.

Elizabeth Spaulding, Kathryn Conger, Anatole Gershman, Rosario Uceda-Sosa, Susan Windisch Brown, James Pustejovsky, Peter Anick, and Martha Palmer. 2023.
The DARPA Wikidata Overlay: Wikidata as an ontology for natural language processing. In Proceedings of the 19th Joint ACL - ISO Workshop on Interoperable Semantic Annotation (ISA-19), pages 1-10, Nancy, France.

Ann Bies, Jennifer Tracey, Ann O'Brien, Song Chen, and Stephanie Strassel. 2024. Spanless Event Annotation for Corpus-Wide Complex Event Understanding. In LREC-COLING 2024: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Turin, May 20-24.

7.0 Copyright

© 2021 Marler Clark, Inc., © 2021 Minutos Editora, S.L., © 2002-2024 A Gray Local Media Station, © 2005-2024 OutBreak, Inc., © 2016 Copyright ElPeriodicodeMexico.com., © 2020, 2021 A&E Television Networks, LLC., © 2021 ABC News, © 2021 Cable News Network., © 2021 Chicago Tribune, © 2021 CNBC LLC., © 2021 Encyclopedia of Greater Philadelphia, © 2021 EnsembleIQ, © 2021 Entravision, © 2021 Inside Edition Inc. and CBS interactive Inc., © 2021 Insider Inc., © 2021 Marler Clark, Inc., © 2021 Metro Corp., © 2021 NBCUniversal Media, LLC, © 2021 NBCUniversal Media, LLC, © 2021 News. Policy. Trends. North Carolina., © 2021 npr, © 2021 PacBio, © 2021 Public Broadcasting Service (PBS), © 2021 Regents of the University of Minnesota, © 2021 Reuters, © 2021 Sinclair, Inc., © 2021 The Philadelphia Inquirer, LLC, © 2021 Univision Communications Inc., © 2021 Vimeo.com, Inc., © 2021 Vox Media, LLC, © 2021 Marler Clark, Inc., © 2021 Siegel Brill, P.A., © 2017 MERCADOTECNIA PUBLICIDAD MARKETING NOTICIAS, © 2021 | Make Food Safe, © 2021 The Chestnut Hill Local, © 2020 2021 EDITORA DEL CARIBE, © 2021 Capitol Broadcasting Company, Inc., © 2021 CBS Broadcasting Inc., © 2021 CIUDAD JUAREZ, CHIH.
MEX, © 2021 Google, © 2021, 2022 Trustees of the University of Pennsylvania

8.0 Contacts

Dana Delgado - KAIROS Project Manager
Christopher Caruso - KAIROS Tech Lead
Song Chen - KAIROS Project Manager
Ann Bies - KAIROS Project Coordinator

------
README created by Song Chen on October 28, 2024
  updated by Ann Bies on October 30, 2024
  updated by Ann Bies on December 11, 2024
  updated by Song Chen on December 12, 2024
  updated by Ann Bies on December 13, 2024
  updated by Dana Delgado on April 29, 2025