Assessment Data Included in this Release

1.0 Assessment Overview

Manual assessment of KAIROS system output was a key component of
system evaluation, complementing automatic evaluation metrics based on
comparison of the reference CE events against the ordered set of
system events and relations. In the assessment task for the
end-to-end evaluation, assessors reviewed populated system CEs to
identify matches between system and reference events/relations and to
determine whether extra events/relations and arguments produced by the
system were plausible (even if not attested) for the CE.


2.0 Assessment Approach

There were two types of systems in KAIROS. TA1 systems built the schema
libraries while TA2 systems instantiated the schema by extracting the
entities, relations, and events from the input source data. In Phase 1, 
NIST designated particular system combinations for assessment, and LDC 
assessed the first two instantiated schemas per CE for each designated 
system combination. Within each system CE, events were prioritized for 
presentation to assessors based on system confidence ratings. Manual 
assessment focused on assessing the accuracy of instantiated system 
CEs, determined by comparison to human-produced reference annotation 
for the CE. Assessment results indicate how well-formed a 
system-produced CE is, and to what extent a system-produced CE matches 
the reference annotation CE. In comparing the system output to the 
human-produced reference annotation, assessors were instructed to be 
lenient but principled, and to give the system credit wherever 
possible as long as doing so did not contradict assessment guidelines.

Assessors assessed only those system-produced CEs that were deemed
sufficiently human-readable to be identifiable. Sufficiently
human-readable system CEs were those that
contained sufficient participant information to identify events or
relations included in the system CE and sufficient provenance to
justify participants' inclusion in the events or relations as
arguments. Empty items (those that contained no provenance and no
populated arguments) and vacuous items (those that contained
provenance but no populated arguments) were not assessed.
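
The empty/vacuous distinction above can be written as a small rule.
The following is a minimal Python sketch only. The function name and
its two inputs (a provenance flag and a count of populated arguments)
are illustrative assumptions and do not correspond to fields in this
release.

```python
def classify_item(has_provenance: bool, populated_args: int) -> str:
    """Classify a system item using the empty/vacuous definitions above.

    Both parameters are hypothetical simplifications of the system
    output; they are not column names from this package.
    """
    if populated_args > 0:
        # Not excluded by this rule. The item is still subject to the
        # human-readability requirement described above.
        return "other"
    # No populated arguments: vacuous if provenance exists, else empty.
    return "vacuous" if has_provenance else "empty"
```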

Assessment during the evaluation consisted of four subtasks: Complex
Event (CE) Matching, Event/Relation (EP) Alignment, Knowledge Element
(KE) Analysis, and Correctness. First, CE matching determined whether
the instantiated system schema and the reference CE were representing
the same complex event. If so, the system output proceeded to
subsequent stages, starting with EP (event primitive) alignment, in
which assessors determined, for each system event or relation, whether
there was a matching event or relation in the reference
annotation. For matching events/relations, there was then a KE analysis
step, asking whether the system KE (knowledge element, or argument)
had a matching argument in the reference. Finally, the correctness
step examined any “extra” events, relations and arguments produced by
the system that were deemed relevant to the CE but were not present in
the reference annotation; for each item, assessors judged whether the
item was actually attested in the input data set.

Two additional assessment tasks were completed as a post hoc add-on
following the evaluation: EP Clustering and Empty EP Review.

The EP Clustering task clustered EPs as coreferent when they were
first assessed as being "extra-relevant" during EP Alignment and then
also judged as correct during Correctness assessment. This task
created equivalence classes (consisting of one or more system
events/relations) for the relevant and correct EPs that were returned
by the systems but that were not present in the reference annotation.
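
The selection rule for EP Clustering can be sketched as follows. This
is an illustration under stated assumptions, not the procedure LDC
used: the two inputs are hypothetical dictionaries mapping an EP ID to
its judgment, and only the values "extra-relevant" and "yes" are taken
from this document.

```python
def clustering_candidates(alignment, correctness):
    """Return the IDs of EPs eligible for clustering, in sorted order.

    alignment and correctness are hypothetical dicts that map an EP ID
    to its EP Alignment and Correctness judgment, respectively.
    """
    return sorted(
        ep_id
        for ep_id, judgment in alignment.items()
        # Eligible only if extra-relevant AND judged correct ("yes").
        if judgment == "extra-relevant" and correctness.get(ep_id) == "yes"
    )
```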

In the Empty EP Review task, EPs for which the system returned no
provenance, but which were part of a system CE that was judged as
matching during CE Matching, were reviewed to determine whether they
were relevant to the target CE.

Detailed assessment guidelines are included in
./docs/KAIROSEvalAnnotationAssessmentV2.4.pdf


3.0 Assessment Results

The ./data/assessment/assessment_result directory includes the manual
assessment results divided into six separate tables, one for each
assessment task. Each table includes all assessment results for the
task, across all assessed system runs and CEs.

The fields and content of the assessment results tables are described
in the following file: ./docs/assessment_result_description.pdf

KAIROS_EVAL_ce_matching.tab – This table contains judgments for the
CE Matching stage of assessment, in which system CEs are compared to
human reference CEs to determine whether the system CE refers to the
same Complex Event as the human reference CE.

KAIROS_EVAL_ep_alignment.tab – This table contains judgments for the
EP Alignment stage of assessment, in which system EPs are compared to
EPs in the human reference CE to determine whether the system EP
refers to the same event or relation as an EP in the human reference
CE.

KAIROS_EVAL_ke_analysis.tab – This table contains judgments for the KE
Analysis stage of assessment, in which the individual KEs (arguments)
that make up a system EP are compared to KEs in the matched human
reference EP.

KAIROS_EVAL_correctness.tab – This table contains judgments for the
Correctness stage of assessment, in which unmatched but relevant and
informative EPs and KEs are judged for whether they are justified in
the source data.

KAIROS_EVAL_ep_clustering.tab – This table contains coreference IDs
for EPs that were marked as "extra-relevant" in EP Alignment and "yes"
in Correctness assessment.

KAIROS_EVAL_empty_ep_review.tab – This table contains judgments for
the Empty EP Review stage of assessment, in which EPs with no
provenance that were returned as part of matched system CEs are
reviewed to see if they are relevant to the target CE.
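
The six tables listed above can be read with a few lines of Python.
This is a sketch that assumes each .tab file is UTF-8, tab-delimited,
and begins with a header row; confirm the actual layout against
./docs/assessment_result_description.pdf. The names RESULT_DIR,
TABLE_FILES, and load_table are illustrative.

```python
import csv
from pathlib import Path

RESULT_DIR = Path("./data/assessment/assessment_result")

# File names as listed in this section, keyed by assessment task.
TABLE_FILES = {
    "ce_matching": "KAIROS_EVAL_ce_matching.tab",
    "ep_alignment": "KAIROS_EVAL_ep_alignment.tab",
    "ke_analysis": "KAIROS_EVAL_ke_analysis.tab",
    "correctness": "KAIROS_EVAL_correctness.tab",
    "ep_clustering": "KAIROS_EVAL_ep_clustering.tab",
    "empty_ep_review": "KAIROS_EVAL_empty_ep_review.tab",
}


def load_table(path):
    """Read one tab-delimited file into a list of dicts.

    Assumes the first row holds the column names (not confirmed here).
    Quote characters are treated as ordinary text.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        return list(reader)
```

For example, load_table(RESULT_DIR / TABLE_FILES["ce_matching"])
would return one dictionary per row of the CE Matching table.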


4.0 System Output

The system output included in this package is based on the system CEs
that were manually assessed for CE Matching (see
./data/assessment/assessment_result/KAIROS_EVAL_ce_matching.tab). The
full system output for each system CE that was manually assessed for
CE Matching is included in this package, including all events,
relations, and arguments in the system output. The system output can
be found in this directory: ./data/assessment/system_output

However, manual assessment of events, relations, and arguments for the
Phase 1 evaluation was not exhaustive. First, the events, relations
and arguments for system CEs that were not judged as matching a
reference CE were not manually assessed. Second, the manual assessment
of events, relations and arguments for system CEs that were judged as
matching a reference CE was prioritized as discussed in section 2.0,
so not all system EPs were manually assessed even within matching
system CEs.

This package does not include system output for system schema CEs that
were not manually assessed.

System output in this package is in two formats: (1) the SDF JSON
format submitted by performers in the evaluation
(./data/assessment/system_output/sdf) and (2) a tab-delimited format
that renders the SDF JSON human-readable
(./data/assessment/system_output/tab).

LDC assigned a globally unique ID to all elements subject to manual
assessment (system CEs, events, relations, arguments, etc.) as part of
the evaluation assessment process, and also converted the SDF JSON
format to the human-readable tab-delimited format.

The tab-delimited format of the system output files served as input to
the manual assessment process. Each system output table file
corresponds to a manual assessment task. The content of each element
that was assessed can be found in the system output table file, and
the assessment judgments for each assessed element can be found in the
corresponding assessment results table file.

The fields, values, and content of the system output table files are
described in the following file: ./docs/system_output_description.pdf

The structure of the SDF JSON files can be found here:
https://github.com/NextCenturyCorporation/kairos-pub/blob/master/data-format/kairos-v1.0.jsonld

Performer IDs and names were anonymized in all system output files and
in all assessment results files for this package.


5.0 Correspondence between Assessment Results, System Output Tables, and System SDF JSON

The correspondence between the assessment results tables, the system
output tables, and the original SDF JSON system output files depends
on both the globally unique ID and the system-assigned unique name for
each assessed element.

The relationship between the assessment results tables and the system
output tables can be found in
./docs/assessment_result_description.pdf. LDC assigned a globally
unique ID to each element that was assessed. The element’s unique ID
is used in both the assessment results tables and the tab-delimited
system output files. The “relationship to system tab” column in
./docs/assessment_result_description.pdf shows which system output
table contains the element.

The relationship between the system output tables and the SDF JSON
files can be found in ./docs/system_output_description.pdf. Each
element in the tab-delimited system output files also has a unique
name assigned by the system. The element’s name corresponds to a JSON
key in the SDF JSON files. The mapping between the system name for the
element and its JSON key is provided in the “relationship to SDF”
column in ./docs/system_output_description.pdf.

There is no direct mapping between the assessment result tables and
the SDF JSON files. However, they may be linked by pivoting through
the system output tables, as described above.
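
The pivot described above can be sketched in Python as a two-step
lookup. The column names "element_id" and "system_name" are
hypothetical placeholders; the actual field names are documented in
./docs/assessment_result_description.pdf and
./docs/system_output_description.pdf. Rows are assumed to be
dictionaries keyed by column name.

```python
def link_to_system_names(assessment_rows, system_rows,
                         id_field="element_id", name_field="system_name"):
    """Attach the system-assigned name to each assessment row.

    Rows are joined on the LDC-assigned unique ID. id_field and
    name_field are hypothetical column names used for illustration.
    """
    # Index the system output table by the globally unique ID.
    name_by_id = {row[id_field]: row[name_field] for row in system_rows}
    linked = []
    for row in assessment_rows:
        merged = dict(row)
        # None marks an ID with no entry in the system output table.
        merged[name_field] = name_by_id.get(row[id_field])
        linked.append(merged)
    return linked
```

The resulting system name can then be looked up as a key in the
corresponding SDF JSON file, following the "relationship to SDF"
column.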


6.0 Known Issues

Some EPs and/or KEs could not be manually assessed for correctness due
to ill-formed provenance in the system output (e.g., missing character
offsets). These show up as EMPTY_TBD in the Correctness Assessment
table. Similarly, in the Empty EP Review table, EMPTY_TBD is used when 
step names in the system output were unintelligible and therefore could 
not be manually judged for relevance.
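
When analyzing these tables, it may be useful to set the EMPTY_TBD
rows aside. The following Python sketch assumes rows have been loaded
as dictionaries keyed by column name; because the column that holds
the marker is not named here, it checks every field. The function name
is illustrative.

```python
def split_unassessable(rows, marker="EMPTY_TBD"):
    """Split rows into (assessed, unassessable) lists.

    A row counts as unassessable if any of its fields equals the
    marker exactly.
    """
    assessed, unassessable = [], []
    for row in rows:
        target = unassessable if marker in row.values() else assessed
        target.append(row)
    return assessed, unassessable
```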

Some of the non-ASCII characters were not displayed properly in the tab
files but appear as expected in the JSON. For example, Bogotá shows up
as BogotÃ in some rows of the tab file, while the JSON contains the
expected Bogotá.
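
Affected rows can be flagged for comparison against the JSON with a
simple heuristic. This Python sketch assumes rows are dictionaries
keyed by column name and treats any field containing the character Ã
(U+00C3), as in the example above, as suspect. Legitimate text can
also contain that character, so flagged rows should be checked by hand
rather than repaired automatically. The function name is illustrative.

```python
def flag_suspect_rows(rows, suspect_char="\u00c3"):
    """Return the indices of rows with a field containing suspect_char."""
    return [
        index
        for index, row in enumerate(rows)
        # Treat missing (None) fields as empty strings.
        if any(suspect_char in (value or "") for value in row.values())
    ]
```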