Corpus Title: KAIROS Schema Learning Corpus Complex Event Annotation
LDC Catalog-ID: LDC2025T07
Authors: Song Chen, Jennifer Tracey, Ann Bies, Christopher Caruso, Stephanie Strassel

1.0 Introduction

The KAIROS Schema Learning Corpus Complex Event Annotation release includes English and Spanish text, audio, video and image data labeled for 93 real-world Complex Events (CEs), like riots or disease outbreaks, that consist of numerous subsidiary elements that may happen sequentially or simultaneously, and which may have many inter-dependencies. The corpus includes event, relation and argument annotations for CE-relevant documents, with links to document provenance instantiating each step in the CE.

This release is one component of the Schema Learning Corpus (SLC), which was designed to support research into the structure of complex events in multilingual, multimedia data as part of the DARPA Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) Program. KAIROS aims to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilize formal event representations in the form of schema libraries that specify the steps, preconditions and constraints for an open set of complex events; schemas are then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

The other component of the SLC is the Background Data Corpus, available in a separate LDC release, which provides very large volumes of unlabeled English, Spanish and Russian data from diverse sources and modalities, covering a wide variety of CEs. Taken together, the SLC Complex Event Annotation Corpus and the Background Data Corpus constitute the data used by KAIROS system developers for schema learning. For further information about the Schema Learning Corpus and its use in the KAIROS program, refer to Chen (2024).

2.0 Directory Structure and Content Summary

This release contains source data and annotations for a total of 93 Complex Events. The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  ./data/source      -- source data in subdirectories by data type
  ./data/annotation  -- annotations
  ./docs/            -- documentation for source data and annotations
  ./docs/ce_profile  -- Complex Event descriptions
  ./tools/           -- software for LTF data manipulation

The "./data" directory has a separate subdirectory for each of the following data types, and each directory contains one or more zip archives with data files of the given type; the list shows the archive-internal directory and file-extension strings used for the data files of each type:

  bmp/*.bmp.zip -- contains "bmp/*.bmp.ldcc" (image data)
  gif/*.gif.zip -- contains "gif/*.gif.ldcc" (image data)
  jpg/*.jpg.zip -- contains "jpg/*.jpg.ldcc" (image data)
  mp4/*.mp4.zip -- contains "mp4/*.mp4.ldcc" (video data)
  mp3/*.mp3.zip -- contains "mp3/*.mp3.ldcc" (audio data)
  png/*.png.zip -- contains "png/*.png.ldcc" (image data)
  svg/*.svg.zip -- contains "svg/*.svg.ldcc" (image data)
  ltf/*.ltf.zip -- contains "ltf/*.ltf.xml" (segmented/tokenized text data)
  psm/*.psm.zip -- contains "psm/*.psm.xml" (companion to ltf.xml)

Data types in the first group (image, video, and audio data) consist of original source materials presented in "ldcc wrapper" file format (see section 4.2 below).
Data files in the latter group (ltf and psm) are created by LDC from source HTML data, by way of an intermediate XML reduction of the original HTML content for "root" web pages (see section 4.1 for a description of the process, and section 5 for details on the LTF and PSM file formats).

The 6-character file-ID of the zip archive matches the first 6 characters of the 9-character file-IDs of the data files it contains. For example:

  zip archive file ./data/svg/JC002Y.svg.zip contains: svg/JC002YBYQ.svg.ldcc

(The "ldcc" file format is explained in more detail in section 4.2 below.)

2.1 Source Data Summary

A total of 3,431 root web pages were collected and processed, yielding 1,919 text data files, 24,019 image files, 1,472 video files and 16 audio files present in the corpus.

2.2 Annotation Data Summary

The table below summarizes the amount of annotation included in the corpus:

  total_ce           - total complex events subject to data collection and annotation
  total_doc_src      - CE-relevant root documents collected and processed
  total_doc_provlink - root docs labeled for provenance linking
  total_doc_mention  - root docs labeled for events, relations, and schema linking

  Language | total_ce | total_doc_src | total_doc_provlink | total_doc_mention
  English  |       93 |         2,190 |                650 |               216
  Spanish  |       90 |         1,241 |                493 |               122
  Total    |       93 |         3,431 |              1,143 |               338

3.0 Annotation

3.1 Defining Complex Events

Prior to annotation, we defined 93 Complex Events, covering 12 domains:

  • Business workings
  • Civil unrest
  • Conflict or threat
  • Disaster
  • Government workings
  • Cyber or information
  • Illegal activities
  • Legal proceedings
  • Medical intervention
  • Movement or travel
  • New capability development
  • Social life

Each domain includes three or more CEs of varying granularity. For each CE we created a CE profile, using a standardized template that includes a natural language description of the CE along with a set of typical steps that comprise the event. These steps are described in natural language and include information about the expected event tag set types that might instantiate the step, along with information about the expected ordering of each step with respect to other steps. The steps defined for each CE are not intended to describe every possible variation in how things may play out for the CE; instead, they describe the typical way the complex event unfolds. Some steps may be optional, or ordered differently than described in the CE Profile, but the Profile provides a typical "script" for how this CE may appear in real world data. CE Profiles can be found in ./docs/ce_profile/.

3.2 Data Scouting

The CE profiles serve as a guide to data scouting and annotation. During data scouting, annotators consult the CE profile and search the web for documents that discuss that CE. Special attention was paid to documents that contain evidence for the specific steps involved in the CE, aiming for variety in terms of the data source, genre, modality and language of the documents for both the steps and for the CE as a whole. A subset of the scouted documents was then subject to annotation, favoring documents that provided the best balance of variety and step coverage for the CE. Please refer to the Data Scouting guidelines for additional information about the scouting procedure: ./docs/KAIROS_Data_Scouting_Guidelines_v1.0.pdf.

3.3 Provenance Linking Annotation

Provenance Linking is a lightweight approach to grounding the presence of CE steps in documents.
This approach was adopted to provide a first layer of annotation that emphasized the linking of events in documents to steps in a CE (using the CE profile as a stand-in for a schema), which is a primary focus of KAIROS research. During Provenance Linking annotation, annotators review each document subject to annotation for this CE and indicate which CE steps are present in the document, marking the document span (e.g. text character offsets or video start/end times) where the step is instantiated. CE steps may be instantiated across different documents, languages and modalities. For instance, in the CE "Provide And Distribute Disaster Relief", Step 1 may be instantiated in an English video document about a hurricane, while Step 2 could be instantiated in a Spanish text document about an earthquake.

Please refer to the Provenance Linking Annotation guidelines for additional information about the annotation procedure: ./docs/KAIROS_Provenance_Linking_Guidelines_V1.0.pdf.

Provenance linking annotation output appears in ./data/annotation/KAIROS_SLC_provlinking.tab, and the data format is described in ./docs/annotation_table_field_descriptions.tab.

3.4 Mention Annotation

Mention Annotation provides a more detailed and structured representation in the form of event and relation frames for the same documents that were previously annotated for provenance linking. All event and relation mentions relevant to the specified CE are labeled. Each frame consists of a type, subtype and sub-subtype from the KAIROS annotation tag set, using the official tags for Phase 1 of the program. Frames also include a document span for the event or relation trigger, and attributes to indicate things like negation. Entities that fill the argument roles for each event or relation are also labeled, with argument roles and types specified and argument spans indicated. Start and end times are labeled for each event or relation, along with a link to the specific CE step represented by the event or relation mention.

Please refer to the Mention Annotation guidelines for additional information about the annotation procedure: ./docs/KAIROS_Mention_AnnotationGuidelines_v1.0.pdf. The annotation tag set is documented in ./docs/KAIROS_Annotation_Tagset_Phase_1_V3.0.xlsx.

Mention annotation output appears in the following 6 tables under ./data/annotation, and the data formats are described in ./docs/annotation_table_field_descriptions.tab:

  KAIROS_SLC_arg_mentions.tab - contains event and relation argument mention annotation
  KAIROS_SLC_ce_linking.tab   - contains the linking between event/relation mentions and a Complex Event step
  KAIROS_SLC_evt_mentions.tab - contains event mention annotation
  KAIROS_SLC_evt_slots.tab    - contains event mention argument slots. Event mentions in the mentions tables must be looked up in the slots tables to find the arguments and fillers that are involved in the event.
  KAIROS_SLC_rel_mentions.tab - contains relation mention annotation
  KAIROS_SLC_rel_slots.tab    - contains relation mention argument slots. Relation mentions in the mentions tables must be looked up in the slots tables to find the arguments and fillers that are involved in the relation.
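As a rough illustration of how the mentions and slots tables fit together, the sketch below loads the event tables and pairs each event mention with its argument fillers. The column names used here (eventmention_id, argmention_id, slot_type) are placeholders, not the actual field names; consult ./docs/annotation_table_field_descriptions.tab for the real headers and adjust accordingly.

    import csv

    def read_tab(path):
        """Load a tab-delimited annotation table into a list of row dicts."""
        with open(path, encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f, delimiter="\t"))

    ann = "./data/annotation"
    evt_mentions = read_tab(f"{ann}/KAIROS_SLC_evt_mentions.tab")
    evt_slots    = read_tab(f"{ann}/KAIROS_SLC_evt_slots.tab")
    arg_mentions = read_tab(f"{ann}/KAIROS_SLC_arg_mentions.tab")

    # Index argument mentions by their ID (column name is a placeholder).
    args_by_id = {row["argmention_id"]: row for row in arg_mentions}

    # For each event mention, collect its slots and the argument fillers.
    for evt in evt_mentions:
        slots = [s for s in evt_slots
                 if s["eventmention_id"] == evt["eventmention_id"]]
        fillers = [(s["slot_type"], args_by_id.get(s["argmention_id"]))
                   for s in slots]
        # 'fillers' now pairs each role label with its argument mention row.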
4.0 Source Data Processing

The web documents selected by annotators during data scouting were first harvested from various sources using an automated system developed by LDC, and then processed to produce a standardized format for use in downstream tasks.

4.1 Treatment of original HTML text content

All harvested HTML content was initially converted from its original form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure. The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language. This processing creates the ltf.xml and psm.xml files for each harvested "root" web page; these file formats are described in more detail in section 5 below.

4.2 Treatment of non-HTML data types: "ldcc" file format

To the fullest extent possible, all discrete resources referenced by a given "root" HTML page (style sheets, javascript, images, video, audio and other media files, etc.) are stored as separate files of the given data type, and assigned separate 9-character file-IDs (the same form of ID as is used for the "root" HTML page). In order to present these attached resources in a stable and consistent way, we developed a "wrapper" or "container" file format, which presents the original data as-is, together with a specialized header block prepended to the data. The header block provides metadata about the file contents, including the MD5 checksum (for self-validation), the data type and byte count, url, and citations of source-ID and parent (HTML) file-ID.

The LDCC header block always begins with a 16-byte ASCII signature, as shown between double-quotes on the following line (where "\n" represents the ASCII "newline" character 0x0A):

  "LDCc \n1024 \n"

Note that the "1024" on the second line of the signature represents the exact byte count of the LDCC header block. (If/when this header design needs to accommodate larger quantities of metadata, the header byte count can be expanded as needed in increments of 1024 bytes. Such expansion does not arise in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data structure comprising the file-specific header content, expressed as a set of "key: value" pairings in UTF-8 encoding. The YAML string is padded at the end with space characters, such that when the following 8-byte string is appended, the full header block size is exactly 1024 bytes (or whatever size is stated in the initial signature):

  "endLDCc\n"

In order to process the content of an LDCC header:

  - read the initial block of 1024 bytes from the *.ldcc data file
  - check that it begins with "LDCc \n1024 \n" and ends with "endLDCc\n"
  - strip off those 16- and 8-byte portions
  - pass the remainder of the block to a YAML parser.

In order to access the original content of the data file, simply skip or remove the initial 1024 bytes.
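The following is a minimal sketch of that procedure in Python, assuming the PyYAML package is available. The zip path and member name reuse the example from section 2; the metadata key names are not enumerated exactly in this README, so inspect the parsed header to see the actual fields.

    import zipfile
    import yaml  # PyYAML

    def split_ldcc(blob):
        """Split an ldcc-wrapped byte stream into (metadata dict, original payload)."""
        header = blob[:1024]                      # header block size in this release
        if not header.startswith(b"LDCc") or not header.endswith(b"endLDCc\n"):
            raise ValueError("not an LDCC-wrapped file")
        # Drop the 16-byte signature and the trailing 8-byte "endLDCc\n" marker,
        # then hand the space-padded YAML metadata to the parser.
        meta = yaml.safe_load(header[16:-8].decode("utf-8"))
        return meta, blob[1024:]                  # payload = original file content

    # Example: read one wrapped image directly out of its zip archive.
    with zipfile.ZipFile("./data/svg/JC002Y.svg.zip") as zf:
        blob = zf.read("svg/JC002YBYQ.svg.ldcc")
    meta, payload = split_ldcc(blob)
    print(sorted(meta), len(payload), "bytes")    # list header keys; payload length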
5.0 Overview of XML Data Structures

5.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way, and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element.

For example, in a discussion forum or weblog page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post. After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.), and the character offsets of the text span comprising its content in the corresponding rsd.txt file.

5.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and the tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are only unique within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data. To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and (we hope) less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags). To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning).
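As a small illustration, the sketch below walks the SEG and TOKEN elements of one ltf.xml file using Python's standard library. The element names come from the description above; the offset attribute names (start_char, end_char) follow the convention of other LDC LTF releases and should be verified against the files in this package.

    import xml.etree.ElementTree as ET

    # Parse one LTF file that has been extracted from its ./data/ltf zip archive
    # (the filename here is illustrative).
    tree = ET.parse("K0C048H4C.ltf.xml")
    for seg in tree.iter("SEG"):
        for tok in seg.iter("TOKEN"):
            # start_char/end_char are character offsets into the corresponding
            # rsd.txt stream; the attribute names are assumed, not taken from
            # this README.
            start, end = tok.get("start_char"), tok.get("end_char")
            print(seg.get("id"), tok.get("id"), start, end, tok.text)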
6.0 Software tools included in this release

6.1 ltf2txt

A data file in ltf.xml format (as described above) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data. In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or you can simply view the script content directly by whatever means to read the documentation. Also, running either script without any command-line arguments will cause it to display a one-line synopsis of its usage, and then exit.

  ltf2rsd.perl    -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

7.0 Documentation included in this release

7.1 Complex Event Profiles

./docs/ce_profile - contains Complex Event Profiles for all 93 CEs.

CE profiles are named "complexevent[id#]_[name]_v[n].txt", where the "id#" is a 3-digit value (e.g. "002"), and "name" is a word or (underscore-conjoined) phrase that serves as the title for the Complex Event -- for example:

  complexevent005_Disease_Outbreak_v2.txt

7.2 Root Pages and Child Assets

./docs/parent_children.tab describes the relationship between child assets and root HTML pages.

In the following, the term "asset" refers to any single "primary" data file of any given type. Each asset has a distinct 9-character identifier. If two or more files appear with the same 9-character file-ID, this means that they represent different forms or derivations created from the same, single primary data file (e.g. this is how we mark corresponding LTF.xml and PSM.xml file pairs).

Data scouting, annotation and related metadata are all managed with regard to a set of "root" HTML pages (harvested by the LDC for a specified set of events); therefore the tables and annotations make reference to the asset-IDs assigned to those root pages. However, the present release does not include the original HTML text streams, or any derived form of data corresponding to the full HTML content. As a result, the "root" asset-IDs cited in tables and annotations are not to be found among the inventory of data files presented in zip archives in the "./data" directory.

Each root asset is associated with one or more "child" assets (including images, media files, style sheets, text data presented as ltf.xml, etc.); each child asset gets its own distinct 9-character ID. The root-child relations are provided in the "parent_children.tab" table, and as part of the LDCC header content in the various "wrapped" data file formats (as listed in section 2). Each data file-ID in the set of zip archives is represented by the combination of child_uid and child_asset_type (see the column list below).

The columns are tab-delimited and the initial line of the file provides the column labels as shown below:

  Col.# Content
  1. parent_uid (the parent UID associated with the doc URL)
  2. child_uid
  3. url
  4. child_asset_type (e.g. ".jpg.ldcc")
  5. rel_pos (relative position of the child asset within the root asset HTML code)
  6. wrapped_md5 (md5 checksum of the .ldcc-wrapped asset file)
  7. unwrapped_md5 (md5 checksum of the asset file without the ldcc wrapper)
  8. download_date (download date of asset)
  9. content_date (creation date of asset, or n/a)

Notes:

  - Because ltf and psm files have the same "child" uid and differ only in the file extension (.ltf.xml or .psm.xml), only the ltf files are listed in the parent_children.tab document.

  - The URL provided for each .ltf.xml entry in the table is the "full-page" URL for the root document associated with the "parent_uid" value. (For other types of child data -- images and media -- the "url" field contains the specific url for that specific piece of content.)

  - Because the harvesting of some root URLs yielded no text content (hence no ltf/psm data files), the table includes "placeholder" .ltf.xml entries for those parent_uids, in order to provide the full-page URL for every root. The "status_in_corpus" field for these entries is set to "n/a" (as opposed to "present").

  - Some child_uids (for images or videos) may appear multiple times in the table, if they were found to occur identically in multiple root web pages.

  - The content_date is obtained for the parent document from the process that extracts the text (ltf) child asset. This date therefore appears only for ltf rows in the table, but can be considered to apply to the full parent document.
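As a quick illustration, the sketch below reads parent_children.tab and, for one root page, locates the zip archive and archive member that hold each of its child assets, using the 6-character prefix convention from section 2. The parent_uid value is a made-up placeholder; the column labels are the header names listed above.

    import csv

    root_uid = "K0C00XXXX"   # hypothetical root asset-ID; substitute a real parent_uid
    with open("./docs/parent_children.tab", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["parent_uid"] != root_uid:
                continue
            child = row["child_uid"]
            ext = row["child_asset_type"]          # e.g. ".jpg.ldcc" or ".ltf.xml"
            dtype = ext.split(".")[1]              # "jpg", "ltf", ...
            # The zip archive name is the first 6 characters of the 9-character
            # child file-ID; the member sits under a directory named for the type.
            zip_path = f"./data/{dtype}/{child[:6]}.{dtype}.zip"
            member = f"{dtype}/{child}{ext}"
            print(child, zip_path, member)

Note that placeholder .ltf.xml entries (those with a "status_in_corpus" of "n/a") have no corresponding archive member.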
The "status_in_corpus" field for these entries is set to "n/a" (as opposed to "present"). - Some child_uids (for images or videos) may appear multiple times in the table, if they were found to occur identically in multiple root web pages. - The content_date is obtained for the parent document from the process that extracts the text (ltf) child asset. This date therefore appears only for ltf rows in the table, but can be considered to apply to the full parent document. 7.3 Document Profile ./docs/document_profile.tab provides information about the source data in the package, including the source UID, the CE ID the document was scouted for, the language of the document, and the annotation status of the document. Col.# Content 1. ce_id (Complex Event ID) 2. parent_uid (the parent UID associated with the doc URL) 3. language (the language that the URL is scouted for) 4. provlink (whether the document has been annotated for provenance linking) 5. mention (whether the document has been annotated for mention) 7.4 Data Scouting and Annotation The ./docs directory also includes guidelines for data scouting and annotation, along with a tab file describing all of the fields in the various annotation tables. ./docs/KAIROS_Data_Scouting_Guidelines_v1.0.pdf - guidelines for scouting source documents for Complex Events ./docs/KAIROS_Provenance_Linking_Guidelines_V1.0.pdf - guidelines for provenance linking annotation ./docs/KAIROS_Mention_AnnotationGuidelines_v1.0.pdf - guidelines for mention annotation ./docs/annotation_table_field_descriptions.tab - description of the structure of each type of annotation table. This table includes information about column headers, content of each field, and format of the contents. 7.5 Annotation Tag Set The ./docs directory also includes documentation about the annotation tagset used in the Schema Learing Corpus, which was also the official tagset for KAIROS Phase 1: ./docs/KAIROS_Annotation_Tagset_Phase_1_V3.0.xlsx - annotation tag set (ontology) for KAIROS Phase 1 This annotation tag set (also known as the annotation ontology) was used for the annotation of event, relation, and entity mentions. The tag set includes type, subtype, sub-subtype, attribute, and temporal start/end timestamp specifications. Please refer to section 3.4 Mention Annotation of this README and the Mention Annotation guidelines for additional information about the annotation procedure: ./docs/KAIROS_Mention_AnnotationGuidelines_v1.0.pdf. The tag set is included as an excel file, with the following five tabs: - events (the labels, output values, definitions, templates, arg labels, and arg constraints for events} - entities (the labels, output values, and definitions for entities) - relations (the labels, output values, definitions, templates, arg labels, and arg constraints for relations) - attributes (the labels, definitions, and output values for attributes of events, relations, and arguments) - temporal startend (the start and end type labels, output values, and definitions for temporal timestamp annotation of events and relations, along with the output format for times) The initial line of each tab provides the column labels as shown below. events: Col.# Content 1. AnnotIndexID (a unique ID for the tag in the format LDC_KAIROS_evt_NNN) 2. Type (the human-readable type label) 3. Output Value for Type (the output value for the type as it appears in the annotation tables) 4. Subtype (the human-readable subtype label) 5. 
  5. Output Value for Subtype (the output value for the subtype as it appears in the annotation tables)
  6. Sub-subtype (the human-readable sub-subtype label)
  7. Output Value for Sub-subtype (the output value for the sub-subtype as it appears in the annotation tables)
  8. Definition (natural language definition of the full tag)
  9. Template (human-readable templatic representation of the event and its arguments)
  10. arg1 label (human-readable role label for argument 1)
  11. Output value for arg1 (the output value for the arg1 role label as it appears in the annotation tables)
  12. arg1 type constraints (the list of entity types that may fill the arg1 role, including whether any event or relation could fill the role)
  13. arg2 label (human-readable role label for argument 2)
  14. Output value for arg2 (the output value for the arg2 role label as it appears in the annotation tables)
  15. arg2 type constraints (the list of entity types that may fill the arg2 role, including whether any event or relation could fill the role)
  16. arg3 label (human-readable role label for argument 3)
  17. Output value for arg3 (the output value for the arg3 role label as it appears in the annotation tables)
  18. arg3 type constraints (the list of entity types that may fill the arg3 role, including whether any event or relation could fill the role)
  19. arg4 label (human-readable role label for argument 4)
  20. Output value for arg4 (the output value for the arg4 role label as it appears in the annotation tables)
  21. arg4 type constraints (the list of entity types that may fill the arg4 role, including whether any event or relation could fill the role)
  22. arg5 label (human-readable role label for argument 5)
  23. Output value for arg5 (the output value for the arg5 role label as it appears in the annotation tables)
  24. arg5 type constraints (the list of entity types that may fill the arg5 role, including whether any event or relation could fill the role)
  25. arg6 label (human-readable role label for argument 6)
  26. Output value for arg6 (the output value for the arg6 role label as it appears in the annotation tables)
  27. arg6 type constraints (the list of entity types that may fill the arg6 role, including whether any event or relation could fill the role)

Notes:

  - The annotation for events and relations used a three-level annotation tag, which included a high-level type, a more specific subtype under each type, and a finer-grained sub-subtype under each subtype. The three levels together comprise the annotation tag for the event or relation.

  - A sub-subtype of "unspecified" indicates that none of the fine-grained sub-subtypes under the subtype is appropriate for the annotated event or relation in the context of the document. This may be either because the document context does not support a finer-grained reading (so the higher level subtype is the most specific reading in the document context), or it may be because the available fine-grained sub-subtypes are not applicable (in which case, the higher level subtype is the most specific tag available).

  - Each event has a defined set of argument roles, and only the defined roles are available for annotation. The maximum number of roles for an event in this tag set is six.

  - Argument constraints for each argument role list the entity types that may fill the argument role, using the entity type output values, along with whether an event (any event type) or relation (any relation type) may fill the argument role.

entities:

  Col.# Content
  1. AnnotIndexID (a unique ID for the tag in the format LDC_KAIROS_ent_NNN)
  2. Type (the human-readable type label)
  3. Output Value for Type (the output value for the type as it appears in the annotation tables)
  4. Definition (natural language definition of the tag)

Notes:

  - The annotation for entities used only a single type label for each entity.

relations:

  Col.# Content
  1. AnnotIndexID (a unique ID for the tag in the format LDC_KAIROS_rel_NNN)
  2. Type (the human-readable type label)
  3. Output Value for Type (the output value for the type as it appears in the annotation tables)
  4. Subtype (the human-readable subtype label)
  5. Output Value for Subtype (the output value for the subtype as it appears in the annotation tables)
  6. Sub-subtype (the human-readable sub-subtype label)
  7. Output Value for Sub-subtype (the output value for the sub-subtype as it appears in the annotation tables)
  8. Definition (natural language definition of the full tag)
  9. Template (human-readable templatic representation of the relation and its arguments)
  10. arg1 label (human-readable role label for argument 1)
  11. Output value for arg1 (the output value for the arg1 role label as it appears in the annotation tables)
  12. arg1 type constraints (the list of entity types that may fill the arg1 role, including whether any event or relation could fill the role)
  13. arg2 label (human-readable role label for argument 2)
  14. Output value for arg2 (the output value for the arg2 role label as it appears in the annotation tables)
  15. arg2 type constraints (the list of entity types that may fill the arg2 role, including whether any event or relation could fill the role)

Notes:

  - The annotation for events and relations used a three-level annotation tag, which included a high-level type, a more specific subtype under each type, and a finer-grained sub-subtype under each subtype. The three levels together comprise the annotation tag for the event or relation.

  - A sub-subtype of "unspecified" indicates that none of the fine-grained sub-subtypes under the subtype is appropriate for the annotated event or relation in the context of the document. This may be either because the document context does not support a finer-grained reading (so the higher level subtype is the most specific reading in the document context), or it may be because the available fine-grained sub-subtypes are not applicable (in which case, the higher level subtype is the most specific tag available).

  - Each relation has a set of two defined argument roles, and only the defined roles are available for annotation.

  - Argument constraints for each argument role list the entity types that may fill the argument role, using the entity type output values, along with whether an event (any event type) or relation (any relation type) may fill the argument role.

attributes:

  Col.# Content
  1. Attribute Label (the human-readable attribute label)
  2. Definition (natural language definition of the attribute)
  3. Output Value for Attribute (the output value for the attribute as it appears in the annotation tables)

Notes:

  - The rows in this tab are divided into sections for Event Attributes, Relation Attributes, and Argument Attributes for Arguments of Events, under the Attribute Label column.

temporal startend:

  Col.# Content
  1. Start/End Type Label (the human-readable type label)
  2. Output Value for Start/End Type (the output value for the type as it appears in the annotation tables)
  3. Definition (natural language definition of the temporal type)

Notes:

  - The rows in this tab include an additional section showing the Output Format for Start/End Times under the Start/End Type Label column.
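The tag set spreadsheet can be loaded tab by tab; a minimal sketch follows, assuming pandas (with openpyxl) is installed. The sheet and column names come from the lists above, but the exact header strings should be checked against the spreadsheet itself.

    import pandas as pd

    tagset = "./docs/KAIROS_Annotation_Tagset_Phase_1_V3.0.xlsx"
    events = pd.read_excel(tagset, sheet_name="events")

    # Collect the valid (type, subtype, sub-subtype) output-value combinations
    # for event tags, using the column labels documented above.
    valid_event_tags = set(
        zip(
            events["Output Value for Type"],
            events["Output Value for Subtype"],
            events["Output Value for Sub-subtype"],
        )
    )
    print(len(valid_event_tags), "event type/subtype/sub-subtype combinations")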
8.0 Known Issues

Three of the .ltf.xml data files -- K0C048H4C, K0C048H4E and K0C048H4F -- contain a few instances of the Unicode character "Zero Width Space" (ZWS, U+200B); the data processing failed to treat this character appropriately as white-space, and as a result, it shows up within both SEG and TOKEN elements in each file; in each case, U+200B is attached at the start or end of a word or punctuation token. These issues were discovered after annotation had begun, so the ZWS characters have been kept as-is in order to avoid disrupting the "start_char" and "end_char" offsets of annotations on these files. (Each ZWS counts as one character in the offset numbering.)

9.0 References

DARPA. Broad Agency Announcement: Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS). Defense Advanced Research Projects Agency, DARPA BAA HR001119S0014.

Song Chen, Jennifer Tracey, Ann Bies, Stephanie Strassel. Schema Learning Corpus: Data and Annotation Focused on Complex Events. LREC-COLING 2024: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Turin, May 20-24, 2024.

10.0 Sponsorship

KAIROS was sponsored by the Air Force Research Laboratory (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-19-S-0014.

11.0 Copyright

Portions © 2017 13.CL, © 2019 47abc, © 2020 ABC News Internet Ventures, © 2018-2020 A&E Television Networks, LLC, © 2017-2018 AL DÍANEWS Media, © 2017, 2019-2020 ALM Media Properties, LLC, © 2020 AlMomento.net, © 2020 American City Business Journals, © 2020 Anti-Defamation League, © 2019-2020 Autodesk, Inc., © 2014, 2020 Bloomberg L.P., © 2016-2017, 2019 BuzzFeed, Inc., © 2020 Cable News Network. A Warner Media Company, © 2016-2018 CBS Interactive Inc., © 2020 Charlotte Observer, © 2019 Chicago Tribune, © 2014, 2018 China Daily Information Co., © 2020 Cision US Inc., © 2020 Contxto, © 2013, 2019-2020 Corporation of Spanish Radio and Television, © 2020 Divorce Source, Inc., © 2004, 2006, 2007 GateHouse Media, LLC, © 2020 GlobeNewswire, Inc., © 2017 GOBankingRates, © 2015, 2019, 2020 Gray Television, Inc., © 2008 Griffin Communications, © 2020 Hearst Magazine Media, Inc., © 2011-2019 Impremedia Operating Company LLC, © 2017-2020 Insider Inc, © 2020 KPWHRI, © 2018, 2020 KQED Inc., © 2020 Kurdistan24, © 2020 Latin American Information Agency Prensa Latina, © 2017, 2019-2020 Listen Notes, Inc., © 2017-2018, 2020 Los Angeles Times, © 2016, 2018-2019 Microsoft, © 2020 MJH Life Sciences and Pharmacy Times, © 2016 MUNDOJURIDICO.INFO, © 2018, 2020 NBCUniversal Media, LLC, © 2017, 2019 News Group Newspapers Limited, © 2015, 2018-2020 Nexstar Inc., © 2016, 2019 NYP Holdings, Inc., © 2011, 2015, 2017, 2020 Patch Media, © 2019 Peoria Public Radio, © 2016, 2019-2020 Perfil.com, © 2016 Plan V, © 2020 Public Citizen, © 2014, 2019 Republica Media Group, © 2019 Reuters, © 2013-2014, 2018 RFE/RL, Inc., © 2013, 2020 Scientific American, A Division of Springer Nature America, Inc., © 2014-2015, 2017 StarMedia, © 2020 Tacoma News Tribune, © 2018 The Cumberland Times-News, © 2014, 2017-2018 The New York Times Company, © 2018-2019 The Philadelphia Inquirer, LLC, © 2019-2020 THE POINTS GUY, LLC, © 2020 The Regents of The University of California, © 2018 The Sacramento Bee, © 2017, 2019 The Texas Tribune, © 2014, 2017-2018 The Washington Post, © 2010, 2012, 2017 The World from PRX, © 2020 Tri-City Herald, © 2017, 2019-2020 Univision Communications Inc., © 2016 WVTF, © 2021 Trustees of the University of Pennsylvania

12.0 Contacts

  Dana Delgado - KAIROS Project Manager
  Song Chen - KAIROS Project Manager

------
README created May 7, 2024
  updated May 14, 2024
  updated June 6, 2024
  updated October 18, 2024