LORELEI Entity Detection and Linking Knowledge Base

Item Name: LORELEI Entity Detection and Linking Knowledge Base
Author(s): Stephanie Strassel, Jennifer Tracey, Ann Bies, Neil Kuster, Michael Ciul
LDC Catalog No.: LDC2020T10
ISBN: 1-58563-926-5
ISLRN: 571-976-494-378-2
DOI: https://doi.org/10.35111/8rdp-tq10
Release Date: May 15, 2020
Member Year(s): 2020
DCMI Type(s): Text
Data Source(s): web collection, government documents
Project(s): LORELEI
Application(s): knowledge base population, entity extraction, information extraction, machine translation, cross-language transfer
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2020T10 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Strassel, Stephanie, et al. LORELEI Entity Detection and Linking Knowledge Base LDC2020T10. Web Download. Philadelphia: Linguistic Data Consortium, 2020.


LORELEI Entity Detection and Linking Knowledge Base was developed by the Linguistic Data Consortium (LDC) and contains the full LORELEI Entity Detection and Linking (EDL) Knowledge Base (KB) used for all LORELEI Representative Language Pack and Incident Language Pack entity linking annotation. The KB content was drawn from GeoNames, the CIA World Leaders List and the CIA World Factbook, and was supplemented with manually created KB entries developed specifically for LORELEI data.

The LORELEI (Low Resource Languages for Emergent Incidents) Program was concerned with building human language technology for low resource languages in the context of emergent situations like natural disasters or disease outbreaks. Linguistic resources for LORELEI include Representative Language Packs and Incident Language Packs for over two dozen low resource languages, comprising data, annotations, basic natural language processing tools, lexicons and grammatical resources. Representative languages were selected to provide broad typological coverage, while incident languages were selected to evaluate system performance on a language whose identity was disclosed at the start of the evaluation.


This corpus consists of an English knowledge base that supports the EDL task in LORELEI for four entity types: geo-political entities (GPE), locations including facilities (LOC), persons (PER) and organizations (ORG). The KB has four inputs, each designated by a unique "origin" code: GPE and LOC entities from a 2015 snapshot of GeoNames; PER entities from the CIA World Leaders List dated May 2015; ORG entities from Appendix B of the CIA World Factbook, downloaded in 2015; and additional entities manually created by LDC for each of the representative and incident languages.

The KB contains a total of 10,216,832 entities and consists of three tab-delimited files, which are linked via the entityid in each entry. More information is contained in the included documentation.
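
As an illustration of how three tab-delimited files linked by entityid might be combined, the following Python sketch groups the rows of each file by entity ID and then merges the groups. It is not taken from the corpus documentation. It assumes that each file has a header row containing a column named entityid; the function names and the other column names are hypothetical, and the included documentation should be consulted for the real layout.

```python
import csv
from collections import defaultdict


def index_by_entityid(tsv_file, key="entityid"):
    """Group the rows of a tab-delimited file by entity ID.

    Assumption: the file has a header row that includes a column
    named `key`. Check the corpus documentation for the real layout.
    """
    rows_by_id = defaultdict(list)
    # QUOTE_NONE treats quote characters as ordinary data, which is
    # usually what plain tab-delimited text needs.
    reader = csv.DictReader(tsv_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        rows_by_id[row[key]].append(row)
    return rows_by_id


def link_tables(*tables):
    """Merge per-file indexes into one list of rows per entity ID."""
    linked = defaultdict(list)
    for table in tables:
        for entity_id, rows in table.items():
            linked[entity_id].extend(rows)
    return linked
```

Each file would be opened with open(path, encoding="utf-8", newline="") and passed to index_by_entityid. With more than ten million entities, holding every row in memory may not be practical, so loading the files into a database keyed on entityid is a reasonable alternative.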


This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-15-C-0123. Any opinions, findings and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA.

