Corpus Title: KAIROS Schema Learning Corpus Background Source Data
LDC Catalog-ID: LDC2026T02
Authors: Jennifer Tracey, Song Chen, Christopher Caruso, Stephanie Strassel

1.0 Introduction

The KAIROS Schema Learning Corpus Background Source Data package contains Spanish and English source data newly collected during the KAIROS program as supplemental background data for the KAIROS Schema Learning Corpus (SLC). Tools for processing the data and related documentation are also included in this package.

The data was collected primarily to increase the quantity of Spanish and English data with multimedia components for the SLC, and to add domains not well represented in the existing Spanish corpora that make up the SLC. This supplemental background data includes a substantial quantity of data in the business/logistics domain as well as multimedia news data.

The SLC background data as a whole comprises over 16.2 million background documents, including more than 125,000 audio, video, image or multimedia documents. SLC background data includes Spanish, English and Russian corpora from the Linguistic Data Consortium catalog (see ./docs/background_corpora.tab for the list of corpora used as SLC background data), and the supplemental data contained in this corpus. The supplemental data focused particularly on resources for Spanish, including instructional documents (e.g., how-to articles), business and logistics domain documents, and multimedia data.

The SLC background data is one component of the Schema Learning Corpus (SLC), which was designed to support research into the structure of complex events in multilingual, multimedia data as part of the DARPA Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) Program. KAIROS aims to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users.
KAIROS systems utilize formal event representations in the form of schema libraries that specify the steps, preconditions and constraints for an open set of complex events; schemas are then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

The other component of the SLC is the KAIROS Schema Learning Corpus Complex Event Annotation corpus, available in a separate LDC release, which provides English and Spanish text, audio, video and image data labeled for 93 real-world Complex Events (CEs), such as riots or disease outbreaks, which consist of numerous subsidiary elements that may happen sequentially or simultaneously and may have many inter-dependencies. Taken together, the SLC Complex Event annotation and the background data, including the supplemental background data in this package, constitute the data used by KAIROS system developers for schema learning. For further information about the Schema Learning Corpus and its use in the KAIROS program, refer to Chen et al. (2024).
2.0 Directory Structure

The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  ./data/   -- contains source data
  ./docs/   -- contains this README file and documentation for source data
  ./tools/  -- contains software for LTF data manipulation

The "./data" directory has a separate subdirectory for each of the following data types, and each directory contains one or more zip archives with data files of the given type; the list shows the archive-internal directory and file-extension strings used for the data files of each type:

  gif/*.gif.zip -- contains "gif/*.gif.ldcc" (image data)
  jpg/*.jpg.zip -- contains "jpg/*.jpg.ldcc" (image data)
  mp4/*.mp4.zip -- contains "mp4/*.mp4.ldcc" (video data)
  png/*.png.zip -- contains "png/*.png.ldcc" (image data)
  svg/*.svg.zip -- contains "svg/*.svg.ldcc" (image data)

  ltf/*.ltf.zip -- contains "ltf/*.ltf.xml" (segmented/tokenized text data)
  psm/*.psm.zip -- contains "psm/*.psm.xml" (companion to ltf.xml)

Data types in the first group consist of original source materials presented in "ldcc wrapper" file format (see section 4.2 below). The latter group (ltf and psm) are created by LDC from source HTML data, by way of an intermediate XML reduction of the original HTML content for "root" web pages (see section 4.1 for a description of the process, and section 5 for details on the LTF and PSM file formats).

The 6-character file-ID of each zip archive matches the first 6 characters of the 9-character file-IDs of the data files it contains. For example, the zip archive file ./data/gif/K0C03P.gif.zip contains:

  gif/K0C03P1BK.gif.ldcc
  gif/K0C03P1BN.gif.ldcc
  gif/K0C03P1BH.gif.ldcc

(The "ldcc" file format is explained in more detail in section 4.2 below.)
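As an illustration of this naming scheme, the following Python sketch maps a 9-character file-ID and data type to the zip archive and archive-internal member name that hold it. The helper names are hypothetical (not part of the release); only the standard library is assumed:

```python
import zipfile

def zip_path_for(file_id, data_type, data_root="data"):
    """Return the path of the zip archive holding the given asset.

    The archive name is the first 6 characters of the 9-character
    file-ID, e.g. ("K0C03P1BK", "gif") -> "data/gif/K0C03P.gif.zip".
    """
    return f"{data_root}/{data_type}/{file_id[:6]}.{data_type}.zip"

def member_name_for(file_id, data_type):
    """Return the archive-internal path of the data file.

    ltf/psm members are *.xml; all other types are ldcc-wrapped.
    """
    suffix = "xml" if data_type in ("ltf", "psm") else "ldcc"
    return f"{data_type}/{file_id}.{data_type}.{suffix}"

def read_asset(file_id, data_type, data_root="data"):
    """Read the raw bytes of one asset out of its zip archive."""
    with zipfile.ZipFile(zip_path_for(file_id, data_type, data_root)) as zf:
        return zf.read(member_name_for(file_id, data_type))
```

For example, read_asset("K0C03P1BK", "gif") would return the ldcc-wrapped bytes of that image from ./data/gif/K0C03P.gif.zip.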
3.0 Content Summary "#RtPgs" refers to the number of root HTML pages that were harvested; the other columns indicate the total number of data files of the various types extracted from those root pages (text, image, video). #RtPgs #Txts #Imgs #Vids 14324 12346 14896 165 4.0 Data Processing and Character Normalization The content has been harvested from various web sources using an automated system that is driven by manual scouting for relevant material. Some content may have been harvested manually, or by means of ad-hoc scripted methods for sources with unusual attributes. 4.1 Treatment of original HTML text content All harvested HTML content was initially converted from its original form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure. The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language. This processing creates the ltf.xml and psm.xml files for each harvested "root" web page; these file formats are described in more detail in section 5 below. 4.2 Treatment of non-HTML data types: "ldcc" file format To the fullest extent possible, all discrete resources referenced by a given "root" HTML page (style sheets, javascript, images, media files, etc.) are stored as separate files of the given data type, and assigned separate 9-character file-IDs (the same form of ID as is used for the "root" HTML page). 
In order to present these attached resources in a stable and consistent way, the LDC has developed a "wrapper" or "container" file format, which presents the original data as-is, together with a specialized header block prepended to the data. The header block provides metadata about the file contents, including the MD5 checksum (for self-validation), the data type and byte count, the url, and citations of source-ID and parent (HTML) file-ID.

The LDCC header block always begins with a 16-byte ASCII signature, as shown between double-quotes on the following line (where "\n" represents the ASCII "newline" character 0x0A):

  "LDCc \n1024 \n"

Note that the "1024" on the second line of the signature represents the exact byte count of the LDCC header block. (If/when this header design needs to accommodate larger quantities of metadata, the header byte count can be expanded as needed in increments of 1024 bytes. Such expansion does not arise in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data structure comprising the file-specific header content, expressed as a set of "key: value" pairings in UTF-8 encoding. The YAML string is padded at the end with space characters, such that when the following 8-byte string is appended, the full header block size is exactly 1024 bytes (or whatever size is stated in the initial signature):

  "endLDCc\n"

In order to process the content of an LDCC header:

  - read the initial block of 1024 bytes from the *.ldcc data file
  - check that it begins with "LDCc \n1024 \n" and ends with "endLDCc\n"
  - strip off those 16- and 8-byte portions
  - pass the remainder of the block to a YAML parser

In order to access the original content of the data file, simply skip or remove the initial 1024 bytes.
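The steps above can be sketched in Python as follows. This is a minimal illustration (the function name is hypothetical): rather than hard-coding the exact space padding inside the 16-byte signature, it parses the header byte count from the second signature field, and it returns the YAML text unparsed so it can be handed to any YAML parser (e.g. PyYAML's safe_load):

```python
def split_ldcc(raw: bytes):
    """Split an *.ldcc byte stream into (yaml_header_text, original_payload).

    The first 16 bytes are two newline-terminated fields: the literal
    "LDCc" tag, then the header block size in bytes (1024 in this release).
    """
    sig_lines = raw[:16].split(b"\n")
    if sig_lines[0].strip() != b"LDCc":
        raise ValueError("not an LDCC-wrapped file")
    header_size = int(sig_lines[1])            # "1024" in the present release
    header = raw[:header_size]
    if not header.endswith(b"endLDCc\n"):
        raise ValueError("malformed LDCC header trailer")
    # Strip the 16-byte signature and the 8-byte "endLDCc\n" trailer, then
    # drop the trailing space padding; the rest is a UTF-8 YAML string.
    yaml_text = header[16:-8].rstrip(b" ").decode("utf-8")
    # Everything after the header block is the original file content, as-is.
    return yaml_text, raw[header_size:]
```

The original content of the wrapped file is then the second element of the returned pair, equivalent to skipping the initial 1024 bytes.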
5.0 Overview of XML Data Structures

5.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element.

For example, in a discussion-forum or web-log page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post. After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.) and the character offsets of the text span comprising its content in the corresponding rsd.txt file.

5.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are only unique within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data.
To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and (we hope) less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags). To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning).

6.0 Software tools included in this release

6.1 ltf2txt

A data file in ltf.xml format (as described above) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data. In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or simply read the documentation by viewing the script content directly. Running either script without any command-line arguments will cause it to display a one-line synopsis of its usage and then exit.
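The Perl scripts remain the reference implementation; as a rough illustration of the same conditioning, the following Python sketch rebuilds the rsd text from an ltf.xml file and checks it against the stored checksum. The attribute and element names used here (raw_text_char_length, raw_text_md5, start_char, ORIGINAL_TEXT) and the assumption that inter-segment gaps are newline characters are common conventions for LDC LTF data, but should be verified against the DTDs in ./docs:

```python
import hashlib
import xml.etree.ElementTree as ET

def ltf_to_rsd(ltf_path):
    """Rebuild the rsd.txt character stream from an ltf.xml file and
    validate it against the reference MD5 stored on the DOC element."""
    root = ET.parse(ltf_path).getroot()
    doc = root if root.tag == "DOC" else root.find(".//DOC")
    n_chars = int(doc.get("raw_text_char_length"))
    # Assumption: positions not covered by any segment are newlines.
    chars = ["\n"] * n_chars
    for seg in doc.iter("SEG"):
        start = int(seg.get("start_char"))
        text = seg.findtext("ORIGINAL_TEXT")
        chars[start:start + len(text)] = list(text)
    rsd = "".join(chars)
    if hashlib.md5(rsd.encode("utf-8")).hexdigest() != doc.get("raw_text_md5"):
        raise ValueError("reconstructed text does not match reference MD5")
    return rsd
```

The final checksum comparison mirrors the validation step performed by the scripts described above.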
  ltf2rsd.perl    -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

7.0 Documentation included in this release

The ./docs folder (relative to the root directory of this release) contains two tab-delimited table files (see sections 7.1 and 7.2 below for details), and DTD files for the "ltf" and "psm" xml file formats.

7.1 "parent_children.tab" -- relation of child assets to root HTML pages

In the following, the term "asset" refers to any single "primary" data file of any given type. Each asset has a distinct 9-character identifier. If two or more files appear with the same 9-character file-ID, this means that they represent different forms or derivations created from the same, single primary data file (e.g. this is how we mark corresponding LTF.xml and PSM.xml file pairs).

Source documents and related metadata are all managed with regard to a set of "root" HTML pages; therefore the table makes reference to the asset-IDs assigned to those root pages. However, the present release does not include the original HTML text streams, or any derived form of data corresponding to the full HTML content. As a result, the "root" asset-IDs cited in this table are not to be found among the inventory of data files presented in zip archives in the "./data" directory.

Each root asset is associated with one or more "child" assets (including images, media files, style sheets, text data presented as ltf.xml, etc.); each child asset gets its own distinct 9-character ID. The root-child relations are provided in the "parent_children.tab" table, and as part of the LDCC header content in the various "wrapped" data file formats (as listed in section 2.0). Each data file-ID in the set of zip archives is represented by the combination of child_uid and child_asset_type (columns 2 and 4).

The columns are tab-delimited and the initial line of the file provides the column labels as shown below:

  Col.#  Content
   1.    parent_uid (the parent UID associated with the doc URL)
   2.    child_uid
   3.    url
   4.    child_asset_type (e.g. ".jpg.ldcc")
   5.    language (automatically detected language or n/a)
   6.    rel_pos (relative position of the child asset within the root asset HTML code)
   7.    wrapped_md5 (md5 checksum of the .ldcc-wrapped asset file)
   8.    unwrapped_md5 (md5 checksum of the asset file without the ldcc wrapper)
   9.    download_date (download date of asset)
  10.    content_date (creation date of asset, or n/a)

Notes:

  - Because ltf and psm files have the same "child" uid and differ only in
    the file extension (.ltf.xml or .psm.xml), only the ltf files are listed
    in the parent_children.tab document.
  - The URL provided for each .ltf.xml entry in the table is the "full-page"
    URL for the root document associated with the "parent_uid" value. (For
    other types of child data -- images and media -- the "url" field contains
    the specific url for that specific piece of content.)
  - Some child_uids (for images or videos) may appear multiple times in the
    table, if they were found to occur identically in multiple root web pages.
  - The content_date is obtained for the parent document from the process
    that extracts the text (ltf) child asset. This date therefore appears
    only for ltf rows in the table, but can be considered to apply to the
    full parent document.

7.2 "background_corpora.tab"

This table lists the Catalog ID and title of all packages used as part of the SLC background data in addition to the source data contained in this release. The packages are published either in LDC's general catalog or as e-corpora. The table has the following column labels:

  Col.#  Content
   1.    Catalog_ID (the LDC catalog ID associated with the package)
   2.    Title (the title associated with the catalog ID)

8.0 References

Song Chen, Jennifer Tracey, Ann Bies, and Stephanie Strassel. 2024. Schema Learning Corpus: Data and Annotation Focused on Complex Events.
In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 14393–14399, Torino, Italia. ELRA and ICCL.

9.0 Copyright

©2020 Casos de Corrupción, ©2020 VOA, ©2020 Trustees of the University of Pennsylvania

10.0 Contacts

  Song Chen - KAIROS Project Manager
  Christopher Caruso - KAIROS Tech Lead

------
README created October 31, 2023
        updated May 14, 2024
        updated April 16, 2025
        updated May 13, 2025