README FILE FOR LDC CATALOG ID: LDC2019TXX

TITLE: LORELEI EDL Knowledge Base

AUTHORS: Jennifer Tracey, Stephanie Strassel, Jonathan Wright, Ann Bies, Neil Kuster, Jeremy Getman, Mike Ciul

1. Introduction

This corpus provides the full LORELEI EDL Knowledge Base (KB) used for all LORELEI Representative Language and Incident Language Pack entity linking annotation.

The LORELEI (Low Resource Languages for Emergent Incidents) Program is concerned with building Human Language Technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs for over 2 dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative Languages (RLs) are selected to provide broad typological coverage, while Incident Languages (ILs) are selected to evaluate system performance on a language whose identity is disclosed at the start of the evaluation, and for which no training data has been provided.

The evaluation protocol is based on a scenario in which some unforeseen event (the "incident") triggers a need for humanitarian and logistical support in a region where the predominant language (the "incident language") is one that has received little or no attention as yet in NLP research. The objective for evaluation participants is to provide NLP solutions, including information extraction and machine translation, based only on limited resources and with very little time for development.

For more information about LORELEI language resources, see
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/lrec2016-lorelei-language-packs.pdf.

The KB content is drawn from existing resources (GeoNames, CIA World Leaders List, CIA Appendix B), and has been supplemented with several sets of manually created KB entries (referred to as Augmented KB entries) created specifically for the LORELEI data. These include additional entities not present in the base sources, selected to include persons and organizations relevant to the LORELEI incident and domain focus on natural disasters and humanitarian assistance, and more specifically relevant to the regions and incidents covered in the LORELEI language packs.

2. Knowledge Base Contents

LDC has developed an English knowledge base to support the entity detection and linking (EDL) task in LORELEI for four entity types: geo-political entities (GPE), locations, including facilities (LOC), persons (PER) and organizations (ORG).

There are four inputs to the KB, each designated by a unique "origin" code in the KB, as follows:

  1. The GPE/LOC Non-Augmented KB input (GEO) contains GPE and LOC entities from a 2015 snapshot of GeoNames (http://www.geonames.org/).

  2. The PER Non-Augmented KB input (WLL) contains PER entities from the CIA World Leaders List dated May 2015 (https://www.cia.gov/library/publications/world-leaders-1/).

  3. The ORG Non-Augmented KB input (APB) contains ORG entities from Appendix B of the CIA World Factbook downloaded in 2015 (https://www.cia.gov/library/publications/resources/the-worldfactbook/appendix/appendix-b.html).

  4. The Manually Augmented KB inputs consist of additional incident-, region- and/or domain-relevant PER and ORG entities that do not appear in the non-augmented KBs. These additional entities were manually created by LDC for each of the RLs and ILs, and added to the KB in sets over the course of the LORELEI program.
     The origin codes indicating these manually augmented sets are as follows:

       AUG
       AUG_RL_2018
       AUG_RL_2018_2
       AUG_RL_2018_3
       AUG_RL_2018_4
       AUG_RL_2018_5
       AUG_IL4_2018
       AUG_SHARED_2018
       AUG_IL9_2018
       AUG_IL10_2018
       AUG_IL11_2019
       AUG_IL12_2019

These inputs are merged into a single master KB which can be used with all LORELEI RL and IL EDL data.

Manually Augmented KB inputs were created in stages, so different languages had slightly different sets of entities available for linking. Below is the breakdown for each language of the KB origin values used, including the Manually Augmented KB inputs:

  year  lang         lang type  origins
  1     Amharic      RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3, AUG_RL_2018_4
  1     Arabic       RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3
  1     Chinese      RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3
  1     Farsi        RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3, AUG_RL_2018_4
  1     Hungarian    RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3, AUG_RL_2018_4
  1     Russian      RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3, AUG_RL_2018_4, AUG_RL_2018_5
  1     Spanish      RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3
  1     Vietnamese   RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3, AUG_RL_2018_4
  2     Akan         RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2
  2     Bengali      RL         GEO, APB, WLL, AUG_RL_2018
  2     English      RL         GEO, APB, WLL, AUG_RL_2018
  2     Hindi        RL         GEO, APB, WLL, AUG_RL_2018
  2     Indonesian   RL         GEO, APB, WLL, AUG_RL_2018
  2     Swahili      RL         GEO, APB, WLL, AUG_RL_2018
  2     Tagalog      RL         GEO, APB, WLL, AUG_RL_2018
  2     Tamil        RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2, AUG_RL_2018_3
  2     Thai         RL         GEO, APB, WLL, AUG_RL_2018
  2     Wolof        RL         GEO, APB, WLL, AUG_RL_2018, AUG_RL_2018_2
  2     Zulu         RL         GEO, APB, WLL, AUG_RL_2018
  2     Tigrinya     IL         GEO, APB, WLL, AUG
  2     Oromo        IL         GEO, APB, WLL, AUG
  3     Kinyarwanda  IL         GEO, APB, WLL, AUG_IL9_2018, AUG_SHARED_2018
  3     Sinhala      IL         GEO, APB, WLL, AUG_IL10_2018, AUG_SHARED_2018
  4     Odia         IL         GEO, APB, WLL, AUG_IL11_2019
  4     Ilocano      IL         GEO, APB, WLL, AUG_IL12_2019
  4     Ukrainian    RL         GEO, APB, WLL, AUG_IL4_2018

This KB includes a total of 10216832 entities (with a total of 12994440 associated alternate names, and a total of 10186 associated member states).

The table below gives a count of entities in the knowledge base by type assignment:

   6088354  LOC
   4121564  GPE
      5945  PER
       969  ORG

The table below gives a count of entities in the knowledge base by input origin:

  10209918  GEO
      5287  WLL
       242  APB
       197  AUG
       180  AUG_IL11_2019
       179  AUG_IL12_2019
       175  AUG_IL10_2018
       175  AUG_IL9_2018
       152  AUG_RL_2018
        89  AUG_IL4_2018
        80  AUG_RL_2018_3
        77  AUG_RL_2018_4
        37  AUG_RL_2018_2
        26  AUG_SHARED_2018
        18  AUG_RL_2018_5

2.1 Knowledge Base Specification

The specification for the KB is in docs/LORELEI_EDL_KB_description_v2.5.xlsx

The specification provides detailed information about the contents of each field and table in the KB.

3. File Format

The KB consists of 3 tab-delimited files in the /data directory, which are linked via the entityid in each entry.

  entities.tab -- the entity entries in the KB (including their entityid)

  alternate_names.tab -- alternate names associated (by entityid) with entities in the KB

  member_states.tab -- member states associated (by entityid) with ORG entities in the KB (only for ORGs that have countries as members of the organization)

4. Directory Structure

  ./docs/README.txt   this file
  ./data/             contains the 3 KB tab files described above
  ./docs/             contains LORELEI_EDL_KB_description_v2.6.xlsx -- the KB specification

5. Known Issues

5.1 Un-normalized character data in alternate_names.tab

Users of the "alternate_names.tab" content should be aware that some strings in this table contain "presentation form" Arabic characters and the "tatweel" character. In general, presentation form characters should be converted to their corresponding characters in the main Unicode Arabic block (U+0600 - U+06FF), and the tatweel should be deleted or ignored.

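The normalization described in 5.1 can be sketched as follows. This is a minimal illustration, not part of the release: it assumes Python 3 and its standard unicodedata module, and normalize_arabic is a hypothetical helper name. It applies NFKC compatibility normalization only to characters in the two Arabic Presentation Forms blocks (U+FB50 - U+FDFF and U+FE70 - U+FEFF), then removes the tatweel (U+0640).

```python
import unicodedata

TATWEEL = "\u0640"  # ARABIC TATWEEL, a purely visual elongation character


def normalize_arabic(text: str) -> str:
    """Fold Arabic presentation forms to base letters, then drop tatweel.

    Hypothetical helper for illustration; not shipped with the KB.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        # Arabic Presentation Forms-A (U+FB50-U+FDFF) and -B (U+FE70-U+FEFF)
        if 0xFB50 <= cp <= 0xFDFF or 0xFE70 <= cp <= 0xFEFF:
            # NFKC maps a presentation form to its compatibility equivalent;
            # characters without a mapping are returned unchanged.
            ch = unicodedata.normalize("NFKC", ch)
        out.append(ch)
    # Remove tatweel last, since some presentation forms decompose to it.
    return "".join(out).replace(TATWEEL, "")
```

Restricting NFKC to the presentation-form blocks leaves all other characters in a name untouched; applying NFKC to the whole string would also work, but it rewrites unrelated compatibility characters (for example, full-width Latin letters), which may or may not be wanted.
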
5.2 Duplication of AUG entities with WLL/APB entities

In one batch of augmentation, entities were included which were already present in the WLL or APB portions of the KB. Because the GeoNames input KB already contains a number of duplicate or near-duplicate entities, it is already necessary to accommodate "ambiguous" references -- i.e. strings that can match two or more entity entries in the KB. The same approach can be used to handle the WLL and APB duplicates in the EDL annotation.

6. Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

7. Copyright Information

Portions (c) 2020 Trustees of the University of Pennsylvania

8. Contacts

If you have questions about this data release, please contact the following personnel at LDC.

  Stephanie Strassel - LORELEI PI
  Jennifer Tracey    - LORELEI Project Manager
  Jonathan Wright    - LORELEI Technical Lead

----------------------
README created by Mike Ciul, September 18, 2019
       updated by Jennifer Tracey, September 19, 2019