Documentation Included in this Release

1.0 Overview

The ./docs folder (relative to the root directory of this release) contains tab-delimited table files (see sections 2.0 and 3.0 below for details), DTD files for the "ltf" and "psm" xml file formats, narrative profiles for each CE, and the annotation and assessment guidelines.

2.0 "parent_children.tab" -- Relation of Child Assets to Root HTML Pages

In the following, the term "asset" refers to any single "primary" data file of any given type. Each asset has a distinct 9-character identifier. If two or more files appear with the same 9-character file-ID, this means that they represent different forms or derivations created from the same, single primary data file (e.g. this is how we mark corresponding LTF.xml and PSM.xml file pairs).

Data scouting, annotation and related metadata are all managed with regard to a set of "root" HTML pages (harvested by the LDC for a specified set of events); therefore the tables and annotations make reference to the asset-IDs assigned to those root pages. However, the present release does not include the original HTML text streams, or any derived form of data corresponding to the full HTML content. As a result, the "root" asset-IDs cited in tables and annotations are not to be found among the inventory of data files presented in the "./data" directory.

Each root asset is associated with one or more "child" assets (including images, media files, style sheets, text data presented as ltf.xml, etc.); each child asset gets its own distinct 9-character ID. The root-child relations are provided in the "parent_children.tab" table, and as part of the LDCC header content in the various "wrapped" data file formats (as listed in section 2). Each data file-ID is represented by the combination of child_uid and child_asset_type (columns 2 and 4).

The columns are tab-delimited and the initial line of the file provides the column labels as shown below:

  Col.#  Content
  1.     parent_uid (the parent UID associated with the doc URL)
  2.     child_uid (the UID of the child asset)
  3.     url (URL of the child asset)
  4.     child_asset_type (e.g. ".jpg.ldcc")
  5.     rel_pos (relative position of the child asset within the root asset HTML code)
  6.     wrapped_md5 (md5 checksum of the .ldcc-wrapped asset file)
  7.     unwrapped_md5 (md5 checksum of the asset file without the ldcc wrapper)
  8.     download_date (download date of asset)
  9.     content_date (creation date of asset, or n/a)

Notes:

- Because ltf and psm files have the same "child" uid and differ only in the file extension (.ltf.xml or .psm.xml), only the ltf files are listed in the parent_children.tab document.

- The URL provided for each .ltf.xml entry in the table is the "full-page" URL for the root document associated with the "parent_uid" value. (For other types of child data -- images and media -- the "url" field contains the specific url for that specific piece of content.)

- Some child_uids (for images or videos) may appear multiple times in the table if they were found to occur identically in multiple root web pages.

- The content_date is obtained for the parent document from the process that extracts the text (ltf) child asset. This date therefore appears only for ltf rows in the table, but can be considered to apply to the full parent document.

3.0 "document_profile.tab" -- Source Document Information for Each CE

Information about the source data in the package is provided in ./docs/document_profile.tab, including the source UID, the CE ID the document was scouted for, the language of the document, and whether the document is CE-relevant or a distractor document.

  Col.#  Content
  1.     ce_id (the ID of the CE, e.g. "ce2013")
  2.     ce_name (the name of the CE)
  3.     parent_uid (parent_uid of the document)
  4.     language (language of the document)
  5.     url (the URL of the "root" web page that corresponds to the parent_uid)
  6.     status (whether a document is on-topic or distractor)

4.0 Annotation and Assessment Guidelines

The ./docs directory also includes guidelines for manual annotation and assessment.

  ./docs/KAIROS_Phase1_Eval_AnnotationGuidelines_v1.11.pdf
      - guidelines for Phase 1 evaluation data annotation
  ./docs/KAIROSEvalAnnotationAssessmentV2.4.pdf
      - guidelines for Phase 1 manual assessment

5.0 Annotation Tag Set

The ./docs directory also includes documentation about the program tag set used in KAIROS Phase 1 for both manual annotation and system output:

  ./docs/KAIROS_Annotation_Tagset_Phase_1_V3.0.xlsx
      - annotation tag set (ontology) for KAIROS Phase 1

This annotation tag set (also known as the annotation ontology) was used for the annotation of events, relations, and their argument entities. The tag set includes type, subtype, sub-subtype, attribute, and temporal start/end timestamp specifications. Please refer to ./docs/README_annotation.txt and to the annotation guidelines (./docs/KAIROS_Phase1_Eval_AnnotationGuidelines_v1.11.pdf) for additional information about the annotation procedure.
The tag set is included as an Excel file, with the following five tabs:

- events (the labels, output values, definitions, templates, arg labels, and arg constraints for events)
- entities (the labels, output values, and definitions for entities)
- relations (the labels, output values, definitions, templates, arg labels, and arg constraints for relations)
- attributes (the labels, definitions, and output values for attributes of events, relations, and arguments)
- temporal startend (the start and end type labels, output values, and definitions for temporal timestamp annotation of events and relations, along with the output format for times)

The description of the column labels and fields for each annotation tag set tab can be found in ./docs/annotation_tagset_description.pdf

6.0 CE Profiles

The ./docs/ce_profile directory contains a brief narrative summary of the key events in each of the annotated CEs. Annotators relied on the CE profiles for an understanding of the overall CE narrative while creating the reference annotation. Because the CE profiles supported the annotation effort, there is no CE profile for ce1021, which was not annotated.

7.0 "annotation_table_description.pdf" -- Description of Annotation Tables

The description of the column labels and fields for each of the annotation tables can be found in ./docs/annotation_table_description.pdf

8.0 "assessment_result_description.pdf" -- Description of Assessment Result Tables

The description of the column labels and fields for each of the assessment tables can be found in ./docs/assessment_result_description.pdf

9.0 "system_output_description.pdf" -- Description of System Output in tab Format

The description of the column labels and fields for each of the system output tab files can be found in ./docs/system_output_description.pdf.
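
Example: reading "parent_children.tab"

As an illustration of the "parent_children.tab" layout described above, the following Python sketch groups the parent_uid values under each data file. It is not part of the release tooling. It assumes that the column labels in the header line are spelled exactly as in the column list, and that a data file name can be formed by concatenating child_uid and child_asset_type (the documentation only says the two are "combined"). The function name parents_by_data_file and the sample rows (UIDs, URLs, checksums, dates) are invented for illustration.

```python
import csv
import io
from collections import defaultdict


def parents_by_data_file(lines):
    """Map each data file name to the set of parent_uids that reference it.

    `lines` is any iterable of text lines, e.g. an open file handle
    for ./docs/parent_children.tab.
    """
    # QUOTE_NONE: treat quote characters in URLs as ordinary text.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    parents = defaultdict(set)
    for row in reader:
        # Assumption: file name = child_uid + child_asset_type,
        # e.g. "BBB000002" + ".jpg.ldcc".
        data_file = row["child_uid"] + row["child_asset_type"]
        parents[data_file].add(row["parent_uid"])
    return dict(parents)


# Invented sample rows, not taken from the release.
HEADER = ["parent_uid", "child_uid", "url", "child_asset_type", "rel_pos",
          "wrapped_md5", "unwrapped_md5", "download_date", "content_date"]
ROWS = [
    ["AAA000001", "BBB000001", "http://example.com/page1", ".ltf.xml",
     "1", "md5-a", "md5-b", "2020-01-01", "2019-12-31"],
    ["AAA000001", "BBB000002", "http://example.com/img.jpg", ".jpg.ldcc",
     "2", "md5-c", "md5-d", "2020-01-01", "n/a"],
    # The same image found identically in a second root page.
    ["AAA000002", "BBB000002", "http://example.com/img.jpg", ".jpg.ldcc",
     "5", "md5-c", "md5-d", "2020-01-02", "n/a"],
]
SAMPLE = "\n".join("\t".join(fields) for fields in [HEADER] + ROWS) + "\n"

if __name__ == "__main__":
    for name, parent_uids in sorted(parents_by_data_file(io.StringIO(SAMPLE)).items()):
        print(name, sorted(parent_uids))
```

With the real table, the same function could be called on an open file handle, for example open("docs/parent_children.tab", encoding="utf-8", newline=""). Because only ltf files are listed in the table, the name of a corresponding psm file would have to be derived by replacing the ".ltf.xml" extension with ".psm.xml".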