Corpus Title: KAIROS Schema Learning Corpus Complex Event Annotation
LDC Catalog-ID: LDC2025T07
Authors: Song Chen, Jennifer Tracey, Ann Bies, Christopher Caruso, Stephanie Strassel

1.0 Introduction

The KAIROS Schema Learning Corpus Complex Event Annotation release includes English and Spanish text, audio, video and image data labeled for 93 real-world Complex Events (CEs), like riots or disease outbreaks, that consist of numerous subsidiary elements that may happen sequentially or simultaneously, and which may have many inter-dependencies. The corpus includes event, relation and argument annotations for CE-relevant documents, with links to document provenance instantiating each step in the CE.

This release is one component of the Schema Learning Corpus (SLC), which was designed to support research into the structure of complex events in multilingual, multimedia data as part of the DARPA Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS) Program. KAIROS aims to build technology capable of understanding and reasoning about complex real-world events in order to provide actionable insights to end users. KAIROS systems utilize formal event representations in the form of schema libraries that specify the steps, preconditions and constraints for an open set of complex events; schemas are then used in combination with event extraction to characterize and make predictions about real-world events in a large multilingual, multimedia corpus.

The other component of the SLC is the Background Data Corpus, available in a separate LDC release, which provides very large volumes of unlabeled English, Spanish and Russian data from diverse sources and modalities, covering a wide variety of CEs. Taken together, the SLC Complex Event Annotation Corpus and the Background Data Corpus constitute the data used by KAIROS system developers for schema learning. For further information about the Schema Learning Corpus and its use in the KAIROS program, refer to Chen (2024).

2.0 Directory Structure and Content Summary

This release contains source data and annotations for a total of 93 Complex Events. The directory structure and contents of the package are summarized below -- paths shown are relative to the base (root) directory of the package:

  ./data/source      -- source data in subdirectories by data type
  ./data/annotation  -- annotations
  ./docs/            -- documentation for source data and annotations
  ./docs/ce_profile  -- Complex Event descriptions
  ./tools/           -- software for LTF data manipulation

The "./data" directory has a separate subdirectory for each of the following data types, and each directory contains one or more zip archives with data files of the given type; the list shows the archive-internal directory and file-extension strings used for the data files of each type:

  bmp/*.bmp.zip -- contains "bmp/*.bmp.ldcc" (image data)
  gif/*.gif.zip -- contains "gif/*.gif.ldcc" (image data)
  jpg/*.jpg.zip -- contains "jpg/*.jpg.ldcc" (image data)
  mp4/*.mp4.zip -- contains "mp4/*.mp4.ldcc" (video data)
  mp3/*.mp3.zip -- contains "mp3/*.mp3.ldcc" (audio data)
  png/*.png.zip -- contains "png/*.png.ldcc" (image data)
  svg/*.svg.zip -- contains "svg/*.svg.ldcc" (image data)
  ltf/*.ltf.zip -- contains "ltf/*.ltf.xml" (segmented/tokenized text data)
  psm/*.psm.zip -- contains "psm/*.psm.xml" (companion to ltf.xml)

Data types in the first group (image, video, and audio data) consist of original source materials presented in "ldcc wrapper" file format (see section 4.2 below).
Data files in the latter group (ltf and psm) are created by LDC from source HTML data, by way of an intermediate XML reduction of the original HTML content for "root" web pages (see section 4.1 for a description of the process, and section 5 for details on the LTF and PSM file formats).

The 6-character file-ID of the zip archive matches the first 6 characters of the 9-character file-IDs of the data files it contains. For example:

  zip archive file ./data/svg/JC002Y.svg.zip contains: svg/JC002YBYQ.svg.ldcc

(The "ldcc" file format is explained in more detail in section 4.2 below.)

2.1 Source Data Summary

A total of 3,431 root web pages were collected and processed, yielding 1,919 text data files, 24,019 image files, 1,472 video files and 16 audio files present in the corpus.

2.2 Annotation Data Summary

The table below summarizes the amount of annotation included in the corpus:

  total_ce           - total complex events subject to data collection and annotation
  total_doc_src      - CE-relevant root documents collected and processed
  total_doc_provlink - root docs labeled for provenance linking
  total_doc_mention  - root docs labeled for events, relations, and schema linking

  Language | total_ce | total_doc_src | total_doc_provlink | total_doc_mention
  English  |       93 |         2,190 |                650 |               216
  Spanish  |       90 |         1,241 |                493 |               122
  Total    |       93 |         3,431 |              1,143 |               338

3.0 Annotation

3.1 Defining Complex Events

Prior to annotation, we defined 93 Complex Events, covering 12 domains:

  • Business workings
  • Civil unrest
  • Conflict or threat
  • Disaster
  • Government workings
  • Cyber or information
  • Illegal activities
  • Legal proceedings
  • Medical intervention
  • Movement or travel
  • New capability development
  • Social life

Each domain includes three or more CEs of varying granularity. For each CE we created a CE profile, using a standardized template that includes a natural language description of the CE along with a set of typical steps that comprise the event. These steps are described in natural language and include information about the expected event tag set types that might instantiate the step, along with information about the expected ordering of each step with respect to other steps. The steps defined for each CE are not intended to describe every possible variation in how things may play out for the CE; instead, they describe the typical way the complex event unfolds. Some steps may be optional, or ordered differently than described in the CE Profile, but the Profile provides a typical "script" for how this CE may appear in real world data. CE Profiles can be found in ./docs/ce_profile/.

3.2 Data Scouting

The CE profiles serve as a guide to data scouting and annotation. During data scouting, annotators consult the CE profile and search the web for documents that discuss that CE. Special attention was paid to documents that contain evidence for the specific steps involved in the CE, aiming for variety in terms of the data source, genre, modality and language of the documents for both the steps and for the CE as a whole. A subset of the scouted documents was then subject to annotation, favoring documents that provided the best balance of variety and step coverage for the CE. Please refer to the Data Scouting guidelines for additional information about the scouting procedure: ./docs/KAIROS_Data_Scouting_Guidelines_v1.0.pdf.

3.3 Provenance Linking Annotation

Provenance Linking is a lightweight approach to grounding the presence of CE steps in documents.
This approach was adopted to provide a first layer of annotation that emphasized the linking of events in documents to steps in a CE (using the CE profile as a stand-in for a schema), which is a primary focus of KAIROS research. During Provenance Linking annotation, annotators review each document subject to annotation for this CE and indicate which CE steps are present in the document, marking the document span (e.g. text character offsets or video start/end times) where the step is instantiated. CE steps may be instantiated across different documents, languages and modalities. For instance, in the CE "Provide And Distribute Disaster Relief", Step 1 may be instantiated in an English video document about a hurricane, while Step 2 could be instantiated in a Spanish text document about an earthquake.

Please refer to the Provenance Linking Annotation guidelines for additional information about the annotation procedure: ./docs/KAIROS_Provenance_Linking_Guidelines_V1.0.pdf.

Provenance linking annotation output appears in ./data/annotation/KAIROS_SLC_provlinking.tab, and the data format is described in ./docs/annotation_table_field_descriptions.tab.

3.4 Mention Annotation

Mention Annotation provides a more detailed and structured representation in the form of event and relation frames for the same documents that were previously annotated for provenance linking. All event and relation mentions relevant to the specified CE are labeled. Each frame consists of a type, subtype and sub-subtype from the KAIROS annotation tag set, using the official tags for Phase 1 of the program. Frames also include a document span for the event or relation trigger, and attributes to indicate things like negation. Entities that fill the argument roles for each event or relation are also labeled, with argument roles and types specified and argument spans indicated. Start and end times are labeled for each event or relation, along with a link to the specific CE step represented by the event or relation mention.

Please refer to the Mention Annotation guidelines for additional information about the annotation procedure: ./docs/KAIROS_Mention_AnnotationGuidelines_v1.0.pdf. The annotation tag set is documented in ./docs/KAIROS_Annotation_Tagset_Phase_1_V3.0.xlsx.

Mention annotation output appears in the following 6 tables under ./data/annotation, and the data formats are described in ./docs/annotation_table_field_descriptions.tab:

  KAIROS_SLC_arg_mentions.tab - contains event and relation argument mention annotation
  KAIROS_SLC_ce_linking.tab   - contains the linking between event/relation mentions and a Complex Event step
  KAIROS_SLC_evt_mentions.tab - contains event mention annotation
  KAIROS_SLC_evt_slots.tab    - contains event mention argument slots. Event mentions in the mentions tables must be looked up in the slots tables to find the arguments and fillers that are involved in the event.
  KAIROS_SLC_rel_mentions.tab - contains relation mention annotation
  KAIROS_SLC_rel_slots.tab    - contains relation mention argument slots. Relation mentions in the mentions tables must be looked up in the slots tables to find the arguments and fillers that are involved in the relation.
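As a rough illustration of how the mentions and slots tables fit together, the sketch below loads the event tables and pairs each event mention with its argument fillers. The column names used here (eventmention_id, argmention_id, slot_type) are placeholders, not the actual field names; consult ./docs/annotation_table_field_descriptions.tab for the real headers and adjust accordingly.

    import csv

    def read_tab(path):
        """Load a tab-delimited annotation table into a list of row dicts."""
        with open(path, encoding="utf-8", newline="") as f:
            return list(csv.DictReader(f, delimiter="\t"))

    ann = "./data/annotation"
    evt_mentions = read_tab(f"{ann}/KAIROS_SLC_evt_mentions.tab")
    evt_slots    = read_tab(f"{ann}/KAIROS_SLC_evt_slots.tab")
    arg_mentions = read_tab(f"{ann}/KAIROS_SLC_arg_mentions.tab")

    # Index argument mentions by their ID (column name is a placeholder).
    args_by_id = {row["argmention_id"]: row for row in arg_mentions}

    # For each event mention, collect its slots and the argument fillers.
    for evt in evt_mentions:
        slots = [s for s in evt_slots
                 if s["eventmention_id"] == evt["eventmention_id"]]
        fillers = [(s["slot_type"], args_by_id.get(s["argmention_id"]))
                   for s in slots]
        # 'fillers' now pairs each role label with its argument mention row.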
4.0 Source Data Processing

The web documents selected by annotators during data scouting were first harvested from various sources using an automated system developed by LDC, and then processed to produce a standardized format for use in downstream tasks.

4.1 Treatment of original HTML text content

All harvested HTML content was initially converted from its original form into a relatively uniform XML format; this stage of conversion eliminated irrelevant content (menus, ads, headers, footers, etc.), and placed the content of interest into a simplified, consistent markup structure. The "homogenized" XML format then served as input for the creation of a reference "raw source data" (rsd) plain text form of the web page content; at this stage, the text was also conditioned to normalize white-space characters, and to apply transliteration and/or other character normalization, as appropriate to the given language. This processing creates the ltf.xml and psm.xml files for each harvested "root" web page; these file formats are described in more detail in section 5 below.

4.2 Treatment of non-HTML data types: "ldcc" file format

To the fullest extent possible, all discrete resources referenced by a given "root" HTML page (style sheets, javascript, images, video, audio and other media files, etc.) are stored as separate files of the given data type, and assigned separate 9-character file-IDs (the same form of ID as is used for the "root" HTML page). In order to present these attached resources in a stable and consistent way, we developed a "wrapper" or "container" file format, which presents the original data as-is, together with a specialized header block prepended to the data. The header block provides metadata about the file contents, including the MD5 checksum (for self-validation), the data type and byte count, url, and citations of source-ID and parent (HTML) file-ID.

The LDCC header block always begins with a 16-byte ASCII signature, as shown between double-quotes on the following line (where "\n" represents the ASCII "newline" character 0x0A):

  "LDCc \n1024 \n"

Note that the "1024" on the second line of the signature represents the exact byte count of the LDCC header block. (If/when this header design needs to accommodate larger quantities of metadata, the header byte count can be expanded as needed in increments of 1024 bytes. Such expansion does not arise in the present release.)

Immediately after the 16-byte signature, a YAML string presents a data structure comprising the file-specific header content, expressed as a set of "key: value" pairings in UTF-8 encoding. The YAML string is padded at the end with space characters, such that when the following 8-byte string is appended, the full header block size is exactly 1024 bytes (or whatever size is stated in the initial signature):

  "endLDCc\n"

In order to process the content of an LDCC header:

  - read the initial block of 1024 bytes from the *.ldcc data file
  - check that it begins with "LDCc \n1024 \n" and ends with "endLDCc\n"
  - strip off those 16- and 8-byte portions
  - pass the remainder of the block to a YAML parser.

In order to access the original content of the data file, simply skip or remove the initial 1024 bytes.
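The following is a minimal sketch of that procedure in Python, assuming the PyYAML package is available. The zip path and member name reuse the example from section 2; the metadata key names are not enumerated exactly in this README, so inspect the parsed header to see the actual fields.

    import zipfile
    import yaml  # PyYAML

    def split_ldcc(blob):
        """Split an ldcc-wrapped byte stream into (metadata dict, original payload)."""
        header = blob[:1024]                      # header block size in this release
        if not header.startswith(b"LDCc") or not header.endswith(b"endLDCc\n"):
            raise ValueError("not an LDCC-wrapped file")
        # Drop the 16-byte signature and the trailing 8-byte "endLDCc\n" marker,
        # then hand the space-padded YAML metadata to the parser.
        meta = yaml.safe_load(header[16:-8].decode("utf-8"))
        return meta, blob[1024:]                  # payload = original file content

    # Example: read one wrapped image directly out of its zip archive.
    with zipfile.ZipFile("./data/svg/JC002Y.svg.zip") as zf:
        blob = zf.read("svg/JC002YBYQ.svg.ldcc")
    meta, payload = split_ldcc(blob)
    print(sorted(meta), len(payload), "bytes")    # list header keys; payload length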
5.0 Overview of XML Data Structures

5.1 PSM.xml -- Primary Source Markup Data

The "homogenized" XML format described above preserves the minimum set of tags needed to represent the structure of the relevant text as seen by the human web-page reader. When the text content of the XML file is extracted to create the "rsd" format (which contains no markup at all), the markup structure is preserved in a separate "primary source markup" (psm.xml) file, which enumerates the structural tags in a uniform way, and indicates, by means of character offsets into the rsd.txt file, the spans of text contained within each structural markup element.

For example, in a discussion forum or weblog page, there would be a division of content into the discrete "posts" that make up the given thread, along with "quote" regions and paragraph breaks within each post. After the HTML has been reduced to uniform XML, and the tags and text of the latter format have been separated, information about each structural tag is kept in a psm.xml file, preserving the type of each relevant structural element, along with its essential attributes ("post_author", "date_time", etc.), and the character offsets of the text span comprising its content in the corresponding rsd.txt file.

5.2 LTF.xml -- Logical Text Format Data

The "ltf.xml" data format is derived from rsd.txt, and contains a fully segmented and tokenized version of the text content for a given web page. Segments (sentences) and the tokens (words) are marked off by XML tags (SEG and TOKEN), with "id" attributes (which are only unique within a given XML file) and character offset attributes relative to the corresponding rsd.txt file; TOKEN tags have additional attributes to describe the nature of the given word token.

The segmentation is intended to partition each text file at sentence boundaries, to the extent that these boundaries are marked explicitly by suitable punctuation in the original source data. To the extent that sentence boundaries cannot be accurately detected (due to variability or ambiguity in the source data), the segmentation process will tend to err more often on the side of missing actual sentence boundaries, and (we hope) less often on the side of asserting false sentence breaks.

The tokenization is intended to separate punctuation content from word content, and to segregate special categories of "words" that play particular roles in web-based text (e.g. URLs, email addresses and hashtags). To the extent that word boundaries are not explicitly marked in the source text, the LTF tokenization is intended to divide the raw-text character stream into units that correspond to "words" in the linguistic sense (i.e. basic units of lexical meaning).
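As a small illustration, the sketch below walks the SEG and TOKEN elements of one ltf.xml file using Python's standard library. The element names come from the description above; the offset attribute names (start_char, end_char) follow the convention of other LDC LTF releases and should be verified against the files in this package.

    import xml.etree.ElementTree as ET

    # Parse one LTF file that has been extracted from its ./data/ltf zip archive
    # (the filename here is illustrative).
    tree = ET.parse("K0C048H4C.ltf.xml")
    for seg in tree.iter("SEG"):
        for tok in seg.iter("TOKEN"):
            # start_char/end_char are character offsets into the corresponding
            # rsd.txt stream; the attribute names are assumed, not taken from
            # this README.
            start, end = tok.get("start_char"), tok.get("end_char")
            print(seg.get("id"), tok.get("id"), start, end, tok.text)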
6.0 Software tools included in this release

6.1 ltf2txt

A data file in ltf.xml format (as described above) can be conditioned to recreate exactly the "raw source data" text stream (the rsd.txt file) from which the LTF was created. The tools described here can be used to apply that conditioning, either to a directory or to a zip archive file containing ltf.xml data. In either case, the scripts validate each output rsd.txt stream by comparing its MD5 checksum against the reference MD5 checksum of the original rsd.txt file from which the LTF was created. (This reference checksum is stored as an attribute of the "DOC" element in the ltf.xml structure; there is also an attribute that stores the character count of the original rsd.txt file.)

Each script contains user documentation as part of the script content; you can run "perldoc" to view the documentation as a typical unix man page, or you can simply view the script content directly by whatever means to read the documentation. Also, running either script without any command-line arguments will cause it to display a one-line synopsis of its usage, and then exit.

  ltf2rsd.perl    -- convert ltf.xml files to rsd.txt (raw-source-data)
  ltfzip2rsd.perl -- extract and convert ltf.xml files from zip archives

7.0 Documentation included in this release

7.1 Complex Event Profiles

./docs/ce_profile - contains Complex Event Profiles for all 93 CEs.

CE profiles are named "complexevent[id#]_[name]_v[n].txt", where the "id#" is a 3-digit value (e.g. "002"), and "name" is a word or (underscore-conjoined) phrase that serves as the title for the Complex Event -- for example:

  complexevent005_Disease_Outbreak_v2.txt

7.2 Root Pages and Child Assets

./docs/parent_children.tab describes the relationship between child assets and root HTML pages.

In the following, the term "asset" refers to any single "primary" data file of any given type. Each asset has a distinct 9-character identifier. If two or more files appear with the same 9-character file-ID, this means that they represent different forms or derivations created from the same, single primary data file (e.g. this is how we mark corresponding LTF.xml and PSM.xml file pairs).

Data scouting, annotation and related metadata are all managed with regard to a set of "root" HTML pages (harvested by the LDC for a specified set of events); therefore the tables and annotations make reference to the asset-IDs assigned to those root pages. However, the present release does not include the original HTML text streams, or any derived form of data corresponding to the full HTML content. As a result, the "root" asset-IDs cited in tables and annotations are not to be found among the inventory of data files presented in zip archives in the "./data" directory.

Each root asset is associated with one or more "child" assets (including images, media files, style sheets, text data presented as ltf.xml, etc.); each child asset gets its own distinct 9-character ID. The root-child relations are provided in the "parent_children.tab" table, and as part of the LDCC header content in the various "wrapped" data file formats (as listed in section 2). Each data file-ID in the set of zip archives is represented by the combination of child_uid and child_asset_type (see the column list below).

The columns are tab-delimited and the initial line of the file provides the column labels as shown below:

  Col.# Content
  1. parent_uid (the parent UID associated with the doc URL)
  2. child_uid
  3. url
  4. child_asset_type (e.g. ".jpg.ldcc")
  5. rel_pos (relative position of the child asset within the root asset HTML code)
  6. wrapped_md5 (md5 checksum of the .ldcc-wrapped asset file)
  7. unwrapped_md5 (md5 checksum of the asset file without the ldcc wrapper)
  8. download_date (download date of asset)
  9. content_date (creation date of asset, or n/a)

Notes:

  - Because ltf and psm files have the same "child" uid and differ only in the file extension (.ltf.xml or .psm.xml), only the ltf files are listed in the parent_children.tab document.

  - The URL provided for each .ltf.xml entry in the table is the "full-page" URL for the root document associated with the "parent_uid" value. (For other types of child data -- images and media -- the "url" field contains the specific url for that specific piece of content.)

  - Because the harvesting of some root URLs yielded no text content (hence no ltf/psm data files), the table includes "placeholder" .ltf.xml entries for those parent_uids, in order to provide the full-page URL for every root. The "status_in_corpus" field for these entries is set to "n/a" (as opposed to "present").

  - Some child_uids (for images or videos) may appear multiple times in the table, if they were found to occur identically in multiple root web pages.

  - The content_date is obtained for the parent document from the process that extracts the text (ltf) child asset. This date therefore appears only for ltf rows in the table, but can be considered to apply to the full parent document.
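As a quick illustration, the sketch below reads parent_children.tab and, for one root page, locates the zip archive and archive member that hold each of its child assets, using the 6-character prefix convention from section 2. The parent_uid value is a made-up placeholder; the column labels are the header names listed above.

    import csv

    root_uid = "K0C00XXXX"   # hypothetical root asset-ID; substitute a real parent_uid
    with open("./docs/parent_children.tab", encoding="utf-8", newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["parent_uid"] != root_uid:
                continue
            child = row["child_uid"]
            ext = row["child_asset_type"]          # e.g. ".jpg.ldcc" or ".ltf.xml"
            dtype = ext.split(".")[1]              # "jpg", "ltf", ...
            # The zip archive name is the first 6 characters of the 9-character
            # child file-ID; the member sits under a directory named for the type.
            zip_path = f"./data/{dtype}/{child[:6]}.{dtype}.zip"
            member = f"{dtype}/{child}{ext}"
            print(child, zip_path, member)

Note that placeholder .ltf.xml entries (those with a "status_in_corpus" of "n/a") have no corresponding archive member.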
The "status_in_corpus" field for these entries is set to "n/a" (as opposed to "present"). - Some child_uids (for images or videos) may appear multiple times in the table, if they were found to occur identically in multiple root web pages. - The content_date is obtained for the parent document from the process that extracts the text (ltf) child asset. This date therefore appears only for ltf rows in the table, but can be considered to apply to the full parent document. 7.3 Document Profile ./docs/document_profile.tab provides information about the source data in the package, including the source UID, the CE ID the document was scouted for, the language of the document, and the annotation status of the document. Col.# Content 1. ce_id (Complex Event ID) 2. parent_uid (the parent UID associated with the doc URL) 3. language (the language that the URL is scouted for) 4. provlink (whether the document has been annotated for provenance linking) 5. mention (whether the document has been annotated for mention) 7.4 Data Scouting and Annotation The ./docs directory also includes guidelines for data scouting and annotation, along with a tab file describing all of the fields in the various annotation tables. ./docs/KAIROS_Data_Scouting_Guidelines_v1.0.pdf - guidelines for scouting source documents for Complex Events ./docs/KAIROS_Provenance_Linking_Guidelines_V1.0.pdf - guidelines for provenance linking annotation ./docs/KAIROS_Mention_AnnotationGuidelines_v1.0.pdf - guidelines for mention annotation ./docs/annotation_table_field_descriptions.tab - description of the structure of each type of annotation table. This table includes information about column headers, content of each field, and format of the contents. 7.5 Annotation Tag Set The ./docs directory also includes documentation about the annotation tagset used in the Schema Learing Corpus, which was also the official tagset for KAIROS Phase 1: ./docs/KAIROS_Annotation_Tagset_Phase_1_V3.0.xlsx - annotation tag set (ontology) for KAIROS Phase 1 This annotation tag set (also known as the annotation ontology) was used for the annotation of event, relation, and entity mentions. The tag set includes type, subtype, sub-subtype, attribute, and temporal start/end timestamp specifications. Please refer to section 3.4 Mention Annotation of this README and the Mention Annotation guidelines for additional information about the annotation procedure: ./docs/KAIROS_Mention_AnnotationGuidelines_v1.0.pdf. The tag set is included as an excel file, with the following five tabs: - events (the labels, output values, definitions, templates, arg labels, and arg constraints for events} - entities (the labels, output values, and definitions for entities) - relations (the labels, output values, definitions, templates, arg labels, and arg constraints for relations) - attributes (the labels, definitions, and output values for attributes of events, relations, and arguments) - temporal startend (the start and end type labels, output values, and definitions for temporal timestamp annotation of events and relations, along with the output format for times) The initial line of each tab provides the column labels as shown below. events: Col.# Content 1. AnnotIndexID (a unique ID for the tag in the format LDC_KAIROS_evt_NNN) 2. Type (the human-readable type label) 3. Output Value for Type (the output value for the type as it appears in the annotation tables) 4. Subtype (the human-readable subtype label) 5. 
  5. Output Value for Subtype (the output value for the subtype as it appears in the annotation tables)
  6. Sub-subtype (the human-readable sub-subtype label)
  7. Output Value for Sub-subtype (the output value for the sub-subtype as it appears in the annotation tables)
  8. Definition (natural language definition of the full tag)
  9. Template (human-readable templatic representation of the event and its arguments)
  10. arg1 label (human-readable role label for argument 1)
  11. Output value for arg1 (the output value for the arg1 role label as it appears in the annotation tables)
  12. arg1 type constraints (the list of entity types that may fill the arg1 role, including whether any event or relation could fill the role)
  13. arg2 label (human-readable role label for argument 2)
  14. Output value for arg2 (the output value for the arg2 role label as it appears in the annotation tables)
  15. arg2 type constraints (the list of entity types that may fill the arg2 role, including whether any event or relation could fill the role)
  16. arg3 label (human-readable role label for argument 3)
  17. Output value for arg3 (the output value for the arg3 role label as it appears in the annotation tables)
  18. arg3 type constraints (the list of entity types that may fill the arg3 role, including whether any event or relation could fill the role)
  19. arg4 label (human-readable role label for argument 4)
  20. Output value for arg4 (the output value for the arg4 role label as it appears in the annotation tables)
  21. arg4 type constraints (the list of entity types that may fill the arg4 role, including whether any event or relation could fill the role)
  22. arg5 label (human-readable role label for argument 5)
  23. Output value for arg5 (the output value for the arg5 role label as it appears in the annotation tables)
  24. arg5 type constraints (the list of entity types that may fill the arg5 role, including whether any event or relation could fill the role)
  25. arg6 label (human-readable role label for argument 6)
  26. Output value for arg6 (the output value for the arg6 role label as it appears in the annotation tables)
  27. arg6 type constraints (the list of entity types that may fill the arg6 role, including whether any event or relation could fill the role)

Notes:

  - The annotation for events and relations used a three-level annotation tag, which included a high-level type, a more specific subtype under each type, and a finer-grained sub-subtype under each subtype. The three levels together comprise the annotation tag for the event or relation.

  - A sub-subtype of "unspecified" indicates that none of the fine-grained sub-subtypes under the subtype is appropriate for the annotated event or relation in the context of the document. This may be either because the document context does not support a finer-grained reading (so the higher level subtype is the most specific reading in the document context), or it may be because the available fine-grained sub-subtypes are not applicable (in which case, the higher level subtype is the most specific tag available).

  - Each event has a defined set of argument roles, and only the defined roles are available for annotation. The maximum number of roles for an event in this tag set is six.

  - Argument constraints for each argument role list the entity types that may fill the argument role, using the entity type output values, along with whether an event (any event type) or relation (any relation type) may fill the argument role.

entities:

  Col.# Content
  1. AnnotIndexID (a unique ID for the tag in the format LDC_KAIROS_ent_NNN)
  2. Type (the human-readable type label)
  3. Output Value for Type (the output value for the type as it appears in the annotation tables)
  4. Definition (natural language definition of the tag)

Notes:

  - The annotation for entities used only a single type label for each entity.

relations:

  Col.# Content
  1. AnnotIndexID (a unique ID for the tag in the format LDC_KAIROS_rel_NNN)
  2. Type (the human-readable type label)
  3. Output Value for Type (the output value for the type as it appears in the annotation tables)
  4. Subtype (the human-readable subtype label)
  5. Output Value for Subtype (the output value for the subtype as it appears in the annotation tables)
  6. Sub-subtype (the human-readable sub-subtype label)
  7. Output Value for Sub-subtype (the output value for the sub-subtype as it appears in the annotation tables)
  8. Definition (natural language definition of the full tag)
  9. Template (human-readable templatic representation of the relation and its arguments)
  10. arg1 label (human-readable role label for argument 1)
  11. Output value for arg1 (the output value for the arg1 role label as it appears in the annotation tables)
  12. arg1 type constraints (the list of entity types that may fill the arg1 role, including whether any event or relation could fill the role)
  13. arg2 label (human-readable role label for argument 2)
  14. Output value for arg2 (the output value for the arg2 role label as it appears in the annotation tables)
  15. arg2 type constraints (the list of entity types that may fill the arg2 role, including whether any event or relation could fill the role)

Notes:

  - The annotation for events and relations used a three-level annotation tag, which included a high-level type, a more specific subtype under each type, and a finer-grained sub-subtype under each subtype. The three levels together comprise the annotation tag for the event or relation.

  - A sub-subtype of "unspecified" indicates that none of the fine-grained sub-subtypes under the subtype is appropriate for the annotated event or relation in the context of the document. This may be either because the document context does not support a finer-grained reading (so the higher level subtype is the most specific reading in the document context), or it may be because the available fine-grained sub-subtypes are not applicable (in which case, the higher level subtype is the most specific tag available).

  - Each relation has a set of two defined argument roles, and only the defined roles are available for annotation.

  - Argument constraints for each argument role list the entity types that may fill the argument role, using the entity type output values, along with whether an event (any event type) or relation (any relation type) may fill the argument role.

attributes:

  Col.# Content
  1. Attribute Label (the human-readable attribute label)
  2. Definition (natural language definition of the attribute)
  3. Output Value for Attribute (the output value for the attribute as it appears in the annotation tables)

Notes:

  - The rows in this tab are divided into sections for Event Attributes, Relation Attributes, and Argument Attributes for Arguments of Events, under the Attribute Label column.

temporal startend:

  Col.# Content
  1. Start/End Type Label (the human-readable type label)
  2. Output Value for Start/End Type (the output value for the type as it appears in the annotation tables)
  3. Definition (natural language definition of the temporal type)

Notes:

  - The rows in this tab include an additional section showing the Output Format for Start/End Times under the Start/End Type Label column.
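The tag set spreadsheet can be loaded tab by tab; a minimal sketch follows, assuming pandas (with openpyxl) is installed. The sheet and column names come from the lists above, but the exact header strings should be checked against the spreadsheet itself.

    import pandas as pd

    tagset = "./docs/KAIROS_Annotation_Tagset_Phase_1_V3.0.xlsx"
    events = pd.read_excel(tagset, sheet_name="events")

    # Collect the valid (type, subtype, sub-subtype) output-value combinations
    # for event tags, using the column labels documented above.
    valid_event_tags = set(
        zip(
            events["Output Value for Type"],
            events["Output Value for Subtype"],
            events["Output Value for Sub-subtype"],
        )
    )
    print(len(valid_event_tags), "event type/subtype/sub-subtype combinations")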
8.0 Known Issues

Three of the .ltf.xml data files -- K0C048H4C, K0C048H4E and K0C048H4F -- contain a few instances of the Unicode character "Zero Width Space" (ZWS, U+200B); the data processing failed to treat this character appropriately as white-space, and as a result, it shows up within both SEG and TOKEN elements in each file; in each case, U+200B is attached at the start or end of a word or punctuation token. These issues were discovered after annotation had begun, so the ZWS characters have been kept as-is in order to avoid disrupting the "start_char" and "end_char" offsets of annotations on these files. (Each ZWS counts as one character in the offset numbering.)

9.0 References

DARPA. Broad Agency Announcement: Knowledge-directed Artificial Intelligence Reasoning Over Schemas (KAIROS). Defense Advanced Research Projects Agency, DARPA BAA HR001119S0014.

Song Chen, Jennifer Tracey, Ann Bies, Stephanie Strassel. Schema Learning Corpus: Data and Annotation Focused on Complex Events. LREC-COLING 2024: The 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation. Turin, May 20-24, 2024.

10.0 Sponsorship

KAIROS was sponsored by the Air Force Research Laboratory (AFRL) and the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-19-S-0014.

11.0 Copyright

Portions © 2017 13.CL, © 2019 47abc, © 2020 ABC News Internet Ventures, © 2018-2020 A&E Television Networks, LLC, © 2017-2018 AL DÍANEWS Media, © 2017, 2019-2020 ALM Media Properties, LLC, © 2020 AlMomento.net, © 2020 American City Business Journals, © 2020 Anti-Defamation League, © 2019-2020 Autodesk, Inc., © 2014, 2020 Bloomberg L.P., © 2016-2017, 2019 BuzzFeed, Inc., © 2020 Cable News Network. A Warner Media Company, © 2016-2018 CBS Interactive Inc., © 2020 Charlotte Observer, © 2019 Chicago Tribune, © 2014, 2018 China Daily Information Co., © 2020 Cision US Inc., © 2020 Contxto, © 2013, 2019-2020 Corporation of Spanish Radio and Television, © 2020 Divorce Source, Inc., © 2004, 2006, 2007 GateHouse Media, LLC, © 2020 GlobeNewswire, Inc., © 2017 GOBankingRates, © 2015, 2019, 2020 Gray Television, Inc., © 2008 Griffin Communications, © 2020 Hearst Magazine Media, Inc., © 2011-2019 Impremedia Operating Company LLC, © 2017-2020 Insider Inc, © 2020 KPWHRI, © 2018, 2020 KQED Inc., © 2020 Kurdistan24, © 2020 Latin American Information Agency Prensa Latina, © 2017, 2019-2020 Listen Notes, Inc., © 2017-2018, 2020 Los Angeles Times, © 2016, 2018-2019 Microsoft, © 2020 MJH Life Sciences and Pharmacy Times, © 2016 MUNDOJURIDICO.INFO, © 2018, 2020 NBCUniversal Media, LLC, © 2017, 2019 News Group Newspapers Limited, © 2015, 2018-2020 Nexstar Inc., © 2016, 2019 NYP Holdings, Inc., © 2011, 2015, 2017, 2020 Patch Media, © 2019 Peoria Public Radio, © 2016, 2019-2020 Perfil.com, © 2016 Plan V, © 2020 Public Citizen, © 2014, 2019 Republica Media Group, © 2019 Reuters, © 2013-2014, 2018 RFE/RL, Inc., © 2013, 2020 Scientific American, A Division of Springer Nature America, Inc., © 2014-2015, 2017 StarMedia, © 2020 Tacoma News Tribune, © 2018 The Cumberland Times-News, © 2014, 2017-2018 The New York Times Company, © 2018-2019 The Philadelphia Inquirer, LLC, © 2019-2020 THE POINTS GUY, LLC, © 2020 The Regents of The University of California, © 2018 The Sacramento Bee, © 2017, 2019 The Texas Tribune, © 2014, 2017-2018 The Washington Post, © 2010, 2012, 2017 The World from PRX, © 2020 Tri-City Herald, © 2017, 2019-2020 Univision Communications Inc., © 2016 WVTF, © 2021 Trustees of the University of Pennsylvania

12.0 Contacts

  Dana Delgado - KAIROS Project Manager
  Song Chen - KAIROS Project Manager

------
README created May 7, 2024
  updated May 14, 2024
  updated June 6, 2024
  updated October 18, 2024