LORELEI Entity Detection and Linking Knowledge Base
Item Name: | LORELEI Entity Detection and Linking Knowledge Base |
Author(s): | Stephanie Strassel, Jennifer Tracey, Ann Bies, Neil Kuster, Michael Ciul |
LDC Catalog No.: | LDC2020T10 |
ISBN: | 1-58563-926-5 |
ISLRN: | 571-976-494-378-2 |
DOI: | https://doi.org/10.35111/8rdp-tq10 |
Release Date: | May 15, 2020 |
Member Year(s): | 2020 |
DCMI Type(s): | Text |
Data Source(s): | web collection, government documents |
Project(s): | LORELEI |
Application(s): | knowledge base population, entity extraction, information extraction, machine translation, cross-language transfer |
Language(s): | English |
Language ID(s): | eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2020T10 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Strassel, Stephanie, et al. LORELEI Entity Detection and Linking Knowledge Base LDC2020T10. Web Download. Philadelphia: Linguistic Data Consortium, 2020. |
Introduction
LORELEI Entity Detection and Linking Knowledge Base was developed by the Linguistic Data Consortium (LDC) and contains the full LORELEI Entity Detection and Linking (EDL) Knowledge Base (KB) used for all LORELEI Representative Language and Incident Language Pack entity linking annotation. The KB content was drawn from GeoNames, the CIA World Leaders List and the CIA World Factbook, and was supplemented with manually created KB entries developed specifically for LORELEI data.
The LORELEI (Low Resource Languages for Emergent Incidents) Program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.
Data
This corpus consists of an English knowledge base to support the EDL task in LORELEI for four entity types: geo-political entities (GPE), locations, including facilities (LOC), persons (PER) and organizations (ORG). There are four inputs to the KB, each designated by a unique "origin" code in the KB, as follows: GPE and LOC entities from a 2015 snapshot of GeoNames, PER entities from the CIA World Leaders List dated May 2015, ORG entities from Appendix B of the CIA World Factbook downloaded in 2015, and additional entities manually created by LDC for each of the representative and incident languages.
The KB contains a total of 10,216,832 entities and consists of three tab-delimited files, which are linked via the entityid in each entry. More information is contained in the included documentation.
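As a rough illustration of how tab-delimited files linked by entityid might be combined, the following Python sketch groups rows from several tables under their shared identifier. Only the tab-delimited layout and the entityid key come from the description above. The header line, the other column names (origin, entity_type, name, alternate_name) and the sample values are assumptions made for the example; the included documentation gives the actual file layout.

```python
import csv
import io
from collections import defaultdict


def read_tsv(handle):
    """Iterate over a tab-delimited file as dicts (assumes a header line)."""
    return csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)


def link_by_entityid(primary, *others):
    """Group rows from several tables under their shared entityid."""
    linked = defaultdict(lambda: {"entity": None, "related": []})
    for row in primary:
        linked[row["entityid"]]["entity"] = row
    for table in others:
        for row in table:
            linked[row["entityid"]]["related"].append(row)
    return dict(linked)


# Tiny in-memory stand-ins for two KB files (hypothetical columns and values).
entities_tsv = (
    "entityid\torigin\tentity_type\tname\n"
    "1001\tGEO\tGPE\tExampleland\n"
)
alt_names_tsv = (
    "entityid\talternate_name\n"
    "1001\tExample Republic\n"
    "1001\tExampleland (state)\n"
)

kb = link_by_entityid(
    read_tsv(io.StringIO(entities_tsv)),
    read_tsv(io.StringIO(alt_names_tsv)),
)
print(kb["1001"]["entity"]["name"], len(kb["1001"]["related"]))
```

With more than ten million entities, holding every row in a dictionary may use a lot of memory. Loading the files into a database table indexed on entityid is one alternative.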
Acknowledgement
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.
Updates
None at this time.