Corpus Title: AIDA Scenario 1 and 2 Reference Knowledge Base
LDC-Catalog-ID: LDC2023T10
Authors: Jennifer Tracey, Stephanie Strassel, Jeremy Getman, Ann Bies,
         Kira Griffitt, David Graff, Chris Caruso

1. Overview

This corpus was developed by the Linguistic Data Consortium for the DARPA
AIDA Program and contains the full AIDA Knowledge Base (KB) used for all
AIDA entity linking annotation for Scenario 1 (Russia-Ukraine Relations)
and Scenario 2 (Crisis in Venezuela). The KB content was drawn from
GeoNames, the CIA World Leaders List and the CIA World Factbook and was
supplemented with manually-created KB entries developed specifically for
AIDA data.

The bulk of the entries in the KB in this release (those from GEO, WLL,
and APB) are also present in the LORELEI KB (Catalog ID: LDC2020T10).
Beyond those, the AIDA KB also contains additional entities that have been
added specifically for their relevance to the AIDA Phase 1 and Phase 2
scenarios. The AIDA KB does not contain the LORELEI-specific entries that
are present in the LORELEI KB.

The AIDA (Active Interpretations of Disparate Alternatives) Program is
designed to support development of technology that can assist in
cultivating and maintaining understanding of events when there are
conflicting accounts of what happened (e.g. who did what to whom and/or
where and when events occurred). AIDA systems must extract entities,
events, and relations from individual multimedia documents, aggregate that
information across documents and languages, and produce multiple knowledge
graph hypotheses that characterize the conflicting accounts that are
present in the corpus (see
https://www.darpa.mil/program/active-interpretation-of-disparate-alternatives
for more information about the program).

Each phase of the AIDA program focused on a different scenario, or broad
topic area. The scenario for Phase 1 was political relations between
Russia and Ukraine in the 2010s.
The scenario for Phase 2 was the socioeconomic and political crisis in
Venezuela since 2010. In addition, each scenario had a set of specific
subtopics that were designated as either "practice topics" (released for
use in system development) or "evaluation topics" (reserved for use in the
AIDA program evaluations for each phase).

To support creation of corpus-wide hypotheses, cross-document coreference
was a requirement. Procedurally, coreference was achieved by linking
individual entity instances to the entities in a knowledge base. The
knowledge base in this release acted as the AIDA program-wide reference
entity knowledge base for Scenarios 1 and 2, and was constructed by LDC to
include salient entities identified during topic development as well as a
large number of general domain entities drawn from the LORELEI reference
Knowledge Base.

2. Knowledge Base Contents

LDC developed an English knowledge base to enable entity linking in AIDA
for 13 entity types:

  GPE (Geo-Political Entity)
  LOC (Location)
  PER (Person)
  ORG (Organization)
  FAC (Facility)
  MHI (Medical/Health Issue)
  WEA (Weapon)
  SID (Side)
  COM (Commodity)
  CRM (Crime)
  LAW (Law)
  VEH (Vehicle)
  BAL (Ballot)

There are four inputs to the KB, as follows:

A. The GPE/LOC Non-Augmented KB input (GEO) contains GPE and LOC entities
from GeoNames (http://www.geonames.org/).

B. The PER Non-Augmented KB input (WLL) contains PER entities from the CIA
World Leaders List (https://www.cia.gov/resources/world-leaders/).

C. The ORG Non-Augmented KB input (APB) contains ORG entities from
Appendix B of the CIA World Factbook (https://www.cia.gov/the-world-factbook/).

D. The Manually Augmented KB inputs consist of additional topic-, region-
and/or scenario-relevant entities that do not appear in the non-augmented
KBs. These additional entities were manually created by LDC during AIDA
Phase 1 and Phase 2. The additional entities created during Phase 1 were
used for the Phase 1 data.
The additional entities created during Phase 2 were added to those from
Phase 1, and the full set was used for the Phase 2 data. The origin codes
indicating these manually augmented sets are as follows:

  AIDA_AUG_PHASE1
  AIDA_AUG_PHASE2

This KB includes a total of 10215753 entities (with a total of 12993790
associated alternate names, and a total of 10186 associated member
states).

The table below gives a count of entities in the current knowledge base by
type assignment:

   6088355 LOC
   4121568 GPE
      5454 PER
       311 ORG
        43 FAC
        39 MHI
        38 WEA
        24 SID
        14 COM
         5 CRM
         4 LAW
         3 VEH
         2 BAL

The table below gives a count of entities in the knowledge base by input
origin:

  10209918 GEO
      5287 WLL
       242 APB
       184 AIDA_AUG_PHASE1
       229 AIDA_AUG_PHASE2

2.1 Knowledge Base Specification

The specification for the KB is in
docs/LORELEI_EDL_KB_description_v2.5.xlsx

3. File Format

The KB consists of 3 tab-delimited files in the /data directory, which are
linked via the entityid in each entry.

  entities.tab        -- the entity entries in the KB (including their
                         entityid)
  alternate_names.tab -- alternate names associated (by entityid) with
                         entities in the KB
  member_states.tab   -- member states associated (by entityid) with ORG
                         entities from APB that have countries as members
                         of the organization

4. Directory Structure

  ./README.txt   this file
  ./data/        contains 3 KB tab files
  ./docs/        contains LORELEI_EDL_KB_description_v2.5.xlsx -- the KB
                 specification

5. KNOWN ISSUES

5.1 Extra columns in member_states.tab

The KB specification provided here (LORELEI_EDL_KB_description_v2.5.xlsx)
describes the "member_states.tab" file as having three columns, but the
"data/member_states.tab" file in this release has five columns. One
additional column is labeled "note" and is a non-mandatory field (it is
empty in most rows of the table). The other is labeled
"member_state_entity_id" and is empty for all rows of the table in this
release.
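
As an illustration of the file layout in Section 3 and the column issue
described in 5.1, the following is a minimal Python sketch and is not part
of the release. It reads a tab-delimited KB file by column label rather
than by position, so the two extra columns in member_states.tab do not
affect it. It assumes that each file begins with a header row and that the
linking column is labeled "entityid"; both assumptions should be checked
against the KB specification before use. The function names are
hypothetical.

```python
import csv
from collections import defaultdict


def read_tab(path):
    """Yield one dict per data row of a tab-delimited file.

    Assumption: the first row of the file holds the column labels.
    """
    with open(path, encoding="utf-8", newline="") as f:
        # QUOTE_NONE treats quote characters inside names as literal text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        for row in reader:
            yield row


def group_by_entity(path, key="entityid"):
    """Collect the rows of one file into lists keyed by entity ID.

    Assumption: the linking column is labeled "entityid".
    """
    grouped = defaultdict(list)
    for row in read_tab(path):
        grouped[row[key]].append(row)
    return grouped
```

For example, group_by_entity("data/member_states.tab") would return the
member-state rows for each ORG entity, with any "note" value available
under row["note"]. Because entities.tab holds over ten million rows, it is
probably better to iterate over it with read_tab() than to load it into
memory in full.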

5.2 Duplicate entries in entities.tab

Among the set of augmented AIDA entries (entries with origin
AIDA_AUG_PHASE2), there are a small number of known duplicate entries in
entities.tab, where "duplicate" means multiple KB IDs associated with the
same real-world entity. The known duplicates are (separated by '|'):

  20000161|80000527
  80000429|80000591|30005166

There are also an unknown number of ambiguous and/or duplicate entries
across the set of entities from GeoNames (entries with origin GEO). This is
a known issue with GeoNames, and not specific to this release.

6. Acknowledgements

This material is based upon work supported by Air Force Research
Laboratory (AFRL) and the Defense Advanced Research Projects Agency
(DARPA) under Contract No. FA8750-18-C-0013.

7. Copyright Information

(c) 2023 Trustees of the University of Pennsylvania

8. Contacts

Stephanie Strassel - AIDA PI