Corpus Title: KAIROS Schema Learning Corpus Background Source Data
LDC Catalog-ID: LDC2026T02
Authors: Jennifer Tracey, Song Chen, Christopher Caruso, Stephanie Strassel

1.0 Introduction

The KAIROS Schema Learning Corpus Background Source Data package contains Spanish and English source data newly collected during the KAIROS program as supplemental background data for the KAIROS Schema Learning Corpus (SLC). Tools for processing the data and related documentation are also included in this package.

The data was collected primarily to increase the quantity of Spanish and English data with multimedia components for the SLC, and to add domains not well represented in the existing Spanish corpora that make up the SLC. This supplemental background data includes a substantial quantity of data in the business/logistics domain as well as multimedia news data.

The SLC background data as a whole comprises over 16.2 million background documents, including more than 125,000 audio, video, image or multimedia documents. SLC background data includes Spanish, English and Russian corpora from the Linguistic Data Consortium catalog (see ./docs/background_corpora.tab for the list of corpora used as SLC background data), and the supplemental data contained in this corpus. The supplemental data focused particularly on resources for Spanish, including instructional documents (e.g., how-to articles), business and logistics domain documents, and multimedia data.

The SLC background data is one component of the Schema Learning Corpus (SLC), which was designed to support research into the structure of complex events in multilingual, multimedia data as part of the DARPA Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) Program. KAIROS aims to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users.
KAIROS systems utilize formal event representations in the form of schema libraries that specify the steps, preconditions and constraints for an open set of complex events; schemas are then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

The other component of the SLC is the KAIROS Schema Learning Corpus Complex Event Annotation corpus, available in a separate LDC release, which provides English and Spanish text, audio, video and image data labeled for 93 real-world Complex Events (CEs), such as riots or disease outbreaks, which consist of numerous subsidiary elements that may happen sequentially or simultaneously and may have many inter-dependencies. Taken together, the SLC Complex Event annotation and the background data, including the supplemental background data in this package, constitute the data used by KAIROS system developers for schema learning. For further information about the Schema Learning Corpus and its use in the KAIROS program, refer to Chen et al. (2024).
2.0 Directory Structure

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  ./data/   -- contains source data
  ./docs/   -- contains this README file and documentation for source data
  ./tools/  -- contains software for LTF data manipulation

The "./data" directory has a separate subdirectory for each of the following data types, and each directory contains one or more zip archives with data files of the given type; the list shows the archive-internal directory and file-extension strings used for the data files of each type:

  gif/*.gif.zip -- contains "gif/*.gif.ldcc" (image data)
  jpg/*.jpg.zip -- contains "jpg/*.jpg.ldcc" (image data)
  mp4/*.mp4.zip -- contains "mp4/*.mp4.ldcc" (video data)
  png/*.png.zip -- contains "png/*.png.ldcc" (image data)
  svg/*.svg.zip -- contains "svg/*.svg.ldcc" (image data)

  ltf/*.ltf.zip -- contains "ltf/*.ltf.xml" (segmented/tokenized text data)
  psm/*.psm.zip -- contains "psm/*.psm.xml" (companion to ltf.xml)

Data types in the first group consist of original source materials presented in "ldcc wrapper" file format (see section 4.2 below). The latter group (ltf and psm) are created by LDC from source HTML data, by way of an intermediate XML reduction of the original HTML content for "root" web pages (see section 4.1 for a description of the process, and section 5 for details on the LTF and PSM file formats).

The 6-character file-ID of each zip archive matches the first 6 characters of the 9-character file-IDs of the data files it contains. For example, the zip archive file ./data/gif/K0C03P.gif.zip contains:

  gif/K0C03P1BK.gif.ldcc
  gif/K0C03P1BN.gif.ldcc
  gif/K0C03P1BH.gif.ldcc

(The "ldcc" file format is explained in more detail in section 4.2 below.)
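As an illustration of this naming scheme, the following Python sketch maps a 9-character file-ID and data type to the zip archive and archive-internal member name that hold it. The helper names are hypothetical (not part of the release); only the standard library is assumed:

```python
import zipfile

def zip_path_for(file_id, data_type, data_root="data"):
    """Return the path of the zip archive holding the given asset.

    The archive name is the first 6 characters of the 9-character
    file-ID, e.g. ("K0C03P1BK", "gif") -> "data/gif/K0C03P.gif.zip".
    """
    return f"{data_root}/{data_type}/{file_id[:6]}.{data_type}.zip"

def member_name_for(file_id, data_type):
    """Return the archive-internal path of the data file.

    ltf/psm members are *.xml; all other types are ldcc-wrapped.
    """
    suffix = "xml" if data_type in ("ltf", "psm") else "ldcc"
    return f"{data_type}/{file_id}.{data_type}.{suffix}"

def read_asset(file_id, data_type, data_root="data"):
    """Read the raw bytes of one asset out of its zip archive."""
    with zipfile.ZipFile(zip_path_for(file_id, data_type, data_root)) as zf:
        return zf.read(member_name_for(file_id, data_type))
```

For example, read_asset("K0C03P1BK", "gif") would return the ldcc-wrapped bytes of that image from ./data/gif/K0C03P.gif.zip.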
3.0 Content Summary "#RtPgs" refers to the number of root HTML pages that were harvested; the other columns indicate the total number of data files of the various types extracted from those root pages (text, image, video). #RtPgs #Txts #Imgs #Vids 14324 12346 14896 165 4.0 Data Processing and Character Normalization The content has been harvested from various web sources using an automated system that is driven by manual scouting for relevant material. Some content may have been harvested manually, or by means of ad-hoc scripted methods for sources with unusual attributes. 4.1 Treatment of original HTML text content All harvested HTML content was initially converted from its original form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure. The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language. This processing creates the ltf.xml and psm.xml files for each harvested "root" web page; these file formats are described in more detail in section 5 below. 4.2 Treatment of non-HTML data types: "ldcc" file format To the fullest extent possible, all discrete resources referenced by a given "root" HTML page (style sheets, javascript, images, media files, etc.) are stored as separate files of the given data type, and assigned separate 9-character file-IDs (the same form of ID as is used for the "root" HTML page). 
In order to present these attached resources in a stable and consistent way, the LDC has developed a "wrapper" or "container" file format, which presents the original data as-is, together with a specialized header block prepended to the data. The header block provides metadata about the file contents, including the MD5 checksum (for self-validation), the data type and byte count, the url, and citations of source-ID and parent (HTML) file-ID.

The LDCC header block always begins with a 16-byte ASCII signature, as shown between double-quotes on the following line (where "\n" represents the ASCII "newline" character 0x0A):

  "LDCc \n1024 \n"

Note that the "1024" on the second line of the signature represents the exact byte count of the LDCC header block. (If/when this header design needs to accommodate larger quantities of metadata, the header byte count can be expanded as needed in increments of 1024 bytes. Such expansion does not arise in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data structure comprising the file-specific header content, expressed as a set of "key: value" pairings in UTF-8 encoding. The YAML string is padded at the end with space characters, such that when the following 8-byte string is appended, the full header block size is exactly 1024 bytes (or whatever size is stated in the initial signature):

  "endLDCc\n"

In order to process the content of an LDCC header:

  - read the initial block of 1024 bytes from the *.ldcc data file
  - check that it begins with "LDCc \n1024 \n" and ends with "endLDCc\n"
  - strip off those 16- and 8-byte portions
  - pass the remainder of the block to a YAML parser

In order to access the original content of the data file, simply skip or remove the initial 1024 bytes.
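The steps above can be sketched in Python as follows. This is a minimal illustration (the function name is hypothetical): rather than hard-coding the exact space padding inside the 16-byte signature, it parses the header byte count from the second signature field, and it returns the YAML text unparsed so it can be handed to any YAML parser (e.g. PyYAML's safe_load):

```python
def split_ldcc(raw: bytes):
    """Split an *.ldcc byte stream into (yaml_header_text, original_payload).

    The first 16 bytes are two newline-terminated fields: the literal
    "LDCc" tag, then the header block size in bytes (1024 in this release).
    """
    sig_lines = raw[:16].split(b"\n")
    if sig_lines[0].strip() != b"LDCc":
        raise ValueError("not an LDCC-wrapped file")
    header_size = int(sig_lines[1])            # "1024" in the present release
    header = raw[:header_size]
    if not header.endswith(b"endLDCc\n"):
        raise ValueError("malformed LDCC header trailer")
    # Strip the 16-byte signature and the 8-byte "endLDCc\n" trailer, then
    # drop the trailing space padding; the rest is a UTF-8 YAML string.
    yaml_text = header[16:-8].rstrip(b" ").decode("utf-8")
    # Everything after the header block is the original file content, as-is.
    return yaml_text, raw[header_size:]
```

The original content of the wrapped file is then the second element of the returned pair, equivalent to skipping the initial 1024 bytes.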
5.0 Overview of XML Data Structures

5.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element.

For example, in a discussion-forum or web-log page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post. After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.) and the character offsets of the text span comprising its content in the corresponding rsd.txt file.

5.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are only unique within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data.
To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and (we hope) less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags). To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning).

6.0 Software tools included in this release

6.1 ltf2txt

A data file in ltf.xml format (as described above) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data. In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or simply read the documentation by viewing the script content directly. Running either script without any command-line arguments will cause it to display a one-line synopsis of its usage and then exit.
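The Perl scripts remain the reference implementation; as a rough illustration of the same conditioning, the following Python sketch rebuilds the rsd text from an ltf.xml file and checks it against the stored checksum. The attribute and element names used here (raw_text_char_length, raw_text_md5, start_char, ORIGINAL_TEXT) and the assumption that inter-segment gaps are newline characters are common conventions for LDC LTF data, but should be verified against the DTDs in ./docs:

```python
import hashlib
import xml.etree.ElementTree as ET

def ltf_to_rsd(ltf_path):
    """Rebuild the rsd.txt character stream from an ltf.xml file and
    validate it against the reference MD5 stored on the DOC element."""
    root = ET.parse(ltf_path).getroot()
    doc = root if root.tag == "DOC" else root.find(".//DOC")
    n_chars = int(doc.get("raw_text_char_length"))
    # Assumption: positions not covered by any segment are newlines.
    chars = ["\n"] * n_chars
    for seg in doc.iter("SEG"):
        start = int(seg.get("start_char"))
        text = seg.findtext("ORIGINAL_TEXT")
        chars[start:start + len(text)] = list(text)
    rsd = "".join(chars)
    if hashlib.md5(rsd.encode("utf-8")).hexdigest() != doc.get("raw_text_md5"):
        raise ValueError("reconstructed text does not match reference MD5")
    return rsd
```

The final checksum comparison mirrors the validation step performed by the scripts described above.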
  ltf2rsd.perl    -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

7.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release) contains two tab-delimited table files (see sections 7.1 and 7.2 below for details), and DTD files for the "ltf" and "psm" xml file formats.

7.1 "parent_children.tab" -- relation of child assets to root HTML pages

In the following, the term "asset" refers to any single "primary" data file of any given type. Each asset has a distinct 9-character identifier. If two or more files appear with the same 9-character file-ID, this means that they represent different forms or derivations created from the same, single primary data file (e.g. this is how we mark corresponding LTF.xml and PSM.xml file pairs).

Source documents and related metadata are all managed with regard to a set of "root" HTML pages; therefore the table makes reference to the asset-IDs assigned to those root pages. However, the present release does not include the original HTML text streams, or any derived form of data corresponding to the full HTML content. As a result, the "root" asset-IDs cited in this table are not to be found among the inventory of data files presented in zip archives in the "./data" directory.

Each root asset is associated with one or more "child" assets (including images, media files, style sheets, text data presented as ltf.xml, etc.); each child asset gets its own distinct 9-character ID. The root-child relations are provided in the "parent_children.tab" table, and as part of the LDCC header content in the various "wrapped" data file formats (as listed in section 2.0). Each data file-ID in the set of zip archives is represented by the combination of child_uid and child_asset_type (columns 2 and 4).

The columns are tab-delimited and the initial line of the file provides the column labels as shown below:

  Col.#  Content
   1.    parent_uid (the parent UID associated with the doc URL)
   2.    child_uid
   3.    url
   4.    child_asset_type (e.g. ".jpg.ldcc")
   5.    language (automatically detected language or n/a)
   6.    rel_pos (relative position of the child asset within the root asset HTML code)
   7.    wrapped_md5 (md5 checksum of the .ldcc-wrapped asset file)
   8.    unwrapped_md5 (md5 checksum of the asset file without the ldcc wrapper)
   9.    download_date (download date of asset)
  10.    content_date (creation date of asset, or n/a)

Notes:

  - Because ltf and psm files have the same "child" uid and differ only in
    the file extension (.ltf.xml or .psm.xml), only the ltf files are listed
    in the parent_children.tab document.
  - The URL provided for each .ltf.xml entry in the table is the "full-page"
    URL for the root document associated with the "parent_uid" value. (For
    other types of child data -- images and media -- the "url" field contains
    the specific url for that specific piece of content.)
  - Some child_uids (for images or videos) may appear multiple times in the
    table, if they were found to occur identically in multiple root web pages.
  - The content_date is obtained for the parent document from the process
    that extracts the text (ltf) child asset. This date therefore appears
    only for ltf rows in the table, but can be considered to apply to the
    full parent document.

7.2 "background_corpora.tab"

This table lists the Catalog ID and title of all packages used as part of the SLC background data in addition to the source data contained in this release. The packages are published either in LDC's general catalog or as e-corpora. The table has the following column labels:

  Col.#  Content
   1.    Catalog_ID (the LDC catalog ID associated with the package)
   2.    Title (the title associated with the catalog ID)

8.0 References

Song Chen, Jennifer Tracey, Ann Bies, and Stephanie Strassel. 2024. Schema Learning Corpus: Data and Annotation Focused on Complex Events.
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14393–14399, Torino, Italia. ELRA and ICCL.

9.0 Copyright

©2020 Casos de Corrupción, ©2020 VOA, ©2020 Trustees of the University of Pennsylvania

10.0 Contacts

  Song Chen - KAIROS Project Manager
  Christopher Caruso - KAIROS Tech Lead

------
README created October 31, 2023
        updated May 14, 2024
        updated April 16, 2025
        updated May 13, 2025