Assessment Data Included in this Release

1.0 Assessment Overview

Manual assessment of KAIROS system output was a key component of system evaluation, complementing automatic evaluation metrics based on comparison of the reference CE events against the ordered set of system events and relations. In the assessment task for the end-to-end evaluation, assessors reviewed populated system CEs to identify matches between system and reference events/relations and to determine whether extra events/relations and arguments produced by the system were plausible (even if not attested) for the CE.

2.0 Assessment Approach

There were two types of systems in KAIROS. TA1 systems built the schema libraries, while TA2 systems instantiated the schemas by extracting the entities, relations, and events from the input source data. In Phase 1, NIST designated particular system combinations for assessment, and LDC assessed the first two instantiated schemas per CE for each designated system combination. Within each system CE, events were prioritized for presentation to assessors based on system confidence ratings.

Manual assessment focused on the accuracy of instantiated system CEs, determined by comparison to human-produced reference annotation for the CE. Assessment results indicate how well-formed a system-produced CE is, and to what extent it matches the reference annotation CE. In comparing the system output to the human-produced reference annotation, assessors were instructed to be lenient but principled, and to give the system credit wherever possible as long as doing so did not contradict assessment guidelines.

Assessors only assessed system-produced CEs that were deemed sufficiently human-readable to be identifiable by assessors.
Sufficiently human-readable system CEs were those that contained sufficient participant information to identify events or relations included in the system CE and sufficient provenance to justify participants' inclusion in the events or relations as arguments. Empty items (those that contained no provenance and no populated arguments) and vacuous items (those that contained provenance but no populated arguments) were not assessed.

Assessment during the evaluation consisted of four subtasks: Complex Event (CE) Matching, Event/Relation (EP) Alignment, Knowledge Element (KE) Analysis, and Correctness. First, CE matching determined whether the instantiated system schema and the reference CE were representing the same complex event. If so, the system output proceeded to subsequent stages, starting with EP (event primitive) alignment, in which assessors determined, for each system event or relation, whether there was a matching event or relation in the reference annotation. For matching events/relations, there was then a KE analysis step, asking whether the system KE (knowledge element, or argument) had a matching argument in the reference. Finally, the correctness step examined any “extra” events, relations and arguments produced by the system that were deemed relevant to the CE but were not present in the reference annotation; for each item, assessors judged whether the item was actually attested in the input data set.

Two additional assessment tasks were completed as a post hoc add-on following the evaluation: EP Clustering and Empty EP Review. The EP Clustering task clustered EPs as coreferent when they were first assessed as being "extra-relevant" during EP Alignment and then also judged as correct during Correctness assessment. This task created equivalence classes (consisting of one or more system events/relations) for the relevant and correct EPs that were returned by the systems but that were not present in the reference annotation.
In the Empty EP Review task, EPs for which the system returned no provenance, but which were part of a system CE that was judged as matching during CE Matching, were reviewed to see whether they were relevant to the target CE.

Detailed assessment guidelines are included in ./docs/KAIROSEvalAnnotationAssessmentV2.4.pdf

3.0 Assessment Results

The ./data/assessment/assessment_result directory includes the manual assessment results divided into six separate tables, one for each assessment task. Each table includes all assessment results for the task, across all assessed system runs and CEs. The fields and content of the assessment results tables are described in the following file: ./docs/assessment_result_description.pdf

KAIROS_EVAL_ce_matching.tab – This table contains judgments for the CE Matching stage of assessment, in which system CEs are compared to human reference CEs to determine whether the system CE refers to the same Complex Event as the human reference CE.
KAIROS_EVAL_ep_alignment.tab – This table contains judgments for the EP Alignment stage of assessment, in which system EPs are compared to EPs in the human reference CE to determine whether the system EP refers to the same event or relation as an EP in the human reference CE.
KAIROS_EVAL_ke_analysis.tab – This table contains judgments for the KE Analysis stage of assessment, in which the individual KEs (arguments) that make up a system EP are compared to KEs in the matched human reference EP.
KAIROS_EVAL_correctness.tab – This table contains judgments for the Correctness stage of assessment, in which unmatched but relevant and informative EPs and KEs are judged for whether they are justified in the source data.
KAIROS_EVAL_ep_clustering.tab – This table contains coreference IDs for EPs that were marked as "extra-relevant" in EP Alignment and "yes" in Correctness assessment.
KAIROS_EVAL_empty_ep_review.tab – This table contains judgments for the Empty EP Review stage of assessment, in which EPs with no provenance that were returned as part of matched system CEs are reviewed to see whether they are relevant to the target CE.

4.0 System Output

The system output included in this package is based on the system CEs that were manually assessed for CE Matching (see ./data/assessment/assessment_result/KAIROS_EVAL_ce_matching.tab). The full system output for each system CE that was manually assessed for CE Matching is included in this package, including all events, relations, and arguments in the system output. The system output can be found in this directory: ./data/assessment/system_output

However, manual assessment of events, relations, and arguments for the Phase 1 evaluation was not exhaustive. First, the events, relations and arguments for system CEs that were not judged as matching a reference CE were not manually assessed. Second, the manual assessment of events, relations and arguments for system CEs that were judged as matching a reference CE was prioritized as discussed in section 2.0, so not all system EPs were manually assessed even within matching system CEs. This package does not include system output for system schema CEs that were not manually assessed.

System output in this package is in two formats: (1) the SDF JSON format submitted by performers in the evaluation (./data/assessment/system_output/sdf) and (2) a tab-delimited format that renders the SDF JSON human-readable (./data/assessment/system_output/tab). As part of the evaluation assessment process, LDC assigned a globally unique ID to all elements subject to manual assessment (system CEs, events, relations, arguments, etc.) and converted the SDF JSON format to the human-readable tab-delimited format. The tab-delimited format of the system output files served as input to the manual assessment process.
Each system output table file corresponds to a manual assessment task. The content of each element that was assessed can be found in the system output table file, and the assessment judgments for each assessed element can be found in the corresponding assessment results table file. The fields, values, and content of the system output table files are described in the following file: ./docs/system_output_description.pdf

The structure of the SDF JSON files can be found here: https://github.com/NextCenturyCorporation/kairos-pub/blob/master/data-format/kairos-v1.0.jsonld

Performer IDs and names were anonymized in all system output files and in all assessment results files for this package.

5.0 Correspondence between Assessment Results, System Output Tables, and System SDF JSON

The correspondence between the assessment results tables, the system output tables, and the original SDF JSON system output files depends on both the globally unique ID and the system-assigned unique name for each assessed element.

The relationship between the assessment results tables and the system output tables can be found in ./docs/assessment_result_description.pdf. LDC assigned a globally unique ID to each element that was assessed. The element’s unique ID is used in both the assessment results tables and the tab-delimited system output files. The “relationship to system tab” column in ./docs/assessment_result_description.pdf shows which system output table contains the element.

The relationship between the system output tables and the SDF JSON files can be found in ./docs/system_output_description.pdf. Each element in the tab-delimited system output files also has a unique name assigned by the system. The element’s name corresponds to a JSON key in the SDF JSON files. The mapping between the system name for the element and its JSON key is provided in the “relationship to SDF” column in ./docs/system_output_description.pdf.
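
As a rough illustration of the ID-based correspondence between an assessment results table and a system output table, the following Python sketch joins the two on a shared unique ID. The column names (ep_id, assessment, ep_name) and the sample rows are hypothetical placeholders, not the actual headers or data in this package; the real field names are documented in ./docs/assessment_result_description.pdf and ./docs/system_output_description.pdf.

```python
import csv
import io

# Hypothetical sample data: column names and values are illustrative only
# and do not reproduce the real table headers in this package.
ASSESSMENT_TSV = (
    "ep_id\tassessment\n"
    "EP0001\tmatch\n"
    "EP0002\textra-relevant\n"
)
SYSTEM_OUTPUT_TSV = (
    "ep_id\tep_name\n"
    "EP0001\thypothetical-name-1\n"
    "EP0002\thypothetical-name-2\n"
)


def read_tab(text):
    """Parse tab-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(
        io.StringIO(text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    return list(reader)


def join_on_id(assessment_rows, system_rows, id_column):
    """Attach each assessment row to the system output row sharing its ID."""
    system_by_id = {row[id_column]: row for row in system_rows}
    joined = []
    for row in assessment_rows:
        system_row = system_by_id.get(row[id_column])
        if system_row is not None:
            # Merge the two rows; assessment fields win on name clashes.
            joined.append({**system_row, **row})
    return joined


joined = join_on_id(
    read_tab(ASSESSMENT_TSV), read_tab(SYSTEM_OUTPUT_TSV), "ep_id"
)
for row in joined:
    print(row["ep_id"], row["ep_name"], row["assessment"])
```

In this sketch the system-assigned name carried along in each joined row is what would then be looked up in the SDF JSON, using the mapping given in the “relationship to SDF” column.
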
There is no direct mapping between the assessment result tables and the SDF JSON files. However, they may be linked by pivoting through the system output tables, as described above.

6.0 Known Issues

Some EPs and/or KEs could not be manually assessed for correctness due to ill-formed provenance in the system output (e.g., missing character offsets). These show up as EMPTY_TBD in the Correctness Assessment table. Similarly, in the Empty EP Review table, EMPTY_TBD is used when step names in the system output were unintelligible and therefore could not be manually judged for relevance.

Some non-ASCII characters are not displayed properly in the tab files but appear as expected in the JSON. For example, Bogotá shows up as Bogotà in some rows of the tab file, while the JSON contains the expected Bogotá.
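
Users computing statistics over the Correctness or Empty EP Review tables may want to set EMPTY_TBD rows aside first. The Python sketch below shows one way to do that. It is a minimal example under stated assumptions: the column names (element_id, judgment), the sample rows, and the "no" value are hypothetical, and only the EMPTY_TBD marker and the "yes" value come from the description above.

```python
import csv
import io

# Hypothetical excerpt of a Correctness table; the real headers will differ.
CORRECTNESS_TSV = (
    "element_id\tjudgment\n"
    "KE0001\tyes\n"
    "EP0002\tEMPTY_TBD\n"
    "KE0003\tno\n"
)


def split_unassessed(rows, judgment_column, marker="EMPTY_TBD"):
    """Separate judged rows from rows that could not be manually assessed."""
    judged = [row for row in rows if row[judgment_column] != marker]
    unassessed = [row for row in rows if row[judgment_column] == marker]
    return judged, unassessed


rows = list(
    csv.DictReader(
        io.StringIO(CORRECTNESS_TSV), delimiter="\t", quoting=csv.QUOTE_NONE
    )
)
judged, unassessed = split_unassessed(rows, "judgment")
print(len(judged), "judged;", len(unassessed), "could not be assessed")
```
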