TAC KBP Reference Knowledge Base
LDC2009E58

April 5, 2013
Linguistic Data Consortium

1. Overview

Text Analysis Conference (TAC) is a series of workshops organized by
the National Institute of Standards and Technology (NIST). TAC was
developed to encourage research in natural language processing (NLP)
and related applications by providing a large test collection, common
evaluation procedures, and a forum for researchers to share their
results. Through its various evaluations, the Knowledge Base Population
(KBP) track of TAC encourages the development of systems that can match
entities mentioned in natural texts with those appearing in a knowledge
base, extract novel information about entities from a document
collection, and add that information to a new or existing knowledge
base.

This package was originally released in 2009 as TAC 2009 KBP Evaluation
Reference Knowledge Base (LDC2009E58). At the time of this re-release
for LDC's general catalog, the Knowledge Base (KB) had been used in the
development of multiple training and evaluation data sets produced for
TAC KBP between 2009 and 2012. Additionally, this same KB is slated for
use in a number of corpora that will be produced for TAC KBP 2013,
specifically those pertaining to the Entity Linking, Slot Filling
(English and Spanish), Temporal Slot Filling, and Sentiment Slot Filling
tasks.

The Knowledge Base contains a set of entities, each with a canonical
name and title for the Wikipedia page, an entity type, an automatically
parsed version of the data from the infobox in the entity's Wikipedia
article, and a stripped version of the text of the Wikipedia article.
The Wikipedia infoboxes and entries are taken from an October 2008
snapshot of Wikipedia.

2. Knowledge Base Contents

Although all Wikipedia articles with infoboxes in the snapshot were
candidates for inclusion in the knowledge base, some articles were
discarded during processing, most commonly due to errors parsing the
wiki markup.
In addition, some types of infoboxes were discarded, specifically ones
which did not contain named values. For example, the infobox in the
article for the element Carbon is {{Infobox carbon}}, which doesn't
contain parsable key/value pairs.

Some KB fields were discarded during processing, most commonly ones
related to images (e.g., images of flags in GPE infoboxes, picture
captions) and HTML formatting. Note that while significant effort was
made to properly parse and format the data in the knowledge base, there
may be instances in which fields were improperly rendered.

If a given Wikipedia article contained more than one infobox, only the
first infobox found was included in the knowledge base.

Each entity in the knowledge base is assigned one of four types:

  * PER - person
  * ORG - organization
  * GPE - geo-political entity
  * UKN - unknown

By default an entity is of type UKN. As part of the process of
generating the knowledge base, LDC assigned types to entities based on
the type of infobox occurring in the article. This mapping was made by
determining the type most likely associated with a given infobox (e.g.,
Infobox_Actor is a person). Although care was taken to provide a good
mapping, it is possible that some entities may have type assignments
that are incorrect.

The table below gives a count of entities in the knowledge base by type
assignment:

    Entity Type     # of Entities
    ------------------------------
    GPE                    116498
    ORG                     55813
    PER                    114523
    UKN                    531907
    ------------------------------
    Total                  818741

3. File Format

The format is defined by knowledge_base.dtd located in the dtd
directory at the top level of the package. The dtd file contains
comments related to the purpose and intent of the markup.

4. Directory Structure

  ./README.txt    this file
  ./data/         contains the 88 knowledge base (KB) xml files
  ./docs/         files.md5 - contains md5 sums of KB xml files
  ./dtd/          contains DTD for KB xml

5. Data Validation

- All xml files have been validated against the DTD using xmllint.

    xmllint --noout --dtdvalid ../dtd/knowledge_base.dtd file.xml

- md5 sums have been generated for all xml files. On Unix-like systems,
  the following command can be used to verify the integrity of the xml
  files.

    md5sum -c docs/files.md5

- Confirmed that entity IDs referenced in the links exist in the
  provided knowledge base.

- Independent sanity checks have been performed on the completed
  package by members of the LDC technical staff.

6. Copyright Information

Portions © 2008-2009, 2014 Trustees of the University of Pennsylvania

7. Contact Information

For further information about this data release, or the TAC KBP
project, contact the following project staff at LDC:

  Joe Ellis, Project Manager
  Stephanie Strassel, Consultant

-----------------------------------------------------------------------------
README created by Joe Ellis on April 5, 2013
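
Supplementary note to section 5: the md5sum command shown there is only
available on Unix-like systems. The following is a minimal Python sketch
of the same integrity check for other platforms. It is not part of the
LDC tooling. It assumes that docs/files.md5 uses the standard md5sum
line format (hex digest, separator, file path) and that the listed paths
are relative to the top level of the package, as the md5sum invocation
in section 5 suggests. The function names (parse_md5_manifest,
md5_of_file, verify_manifest) are hypothetical and chosen only for this
illustration.

```python
import hashlib
from pathlib import Path


def parse_md5_manifest(text):
    """Parse md5sum-style lines into (digest, path) pairs.

    Assumed line format: "<hex digest>  <path>" (text mode) or
    "<hex digest> *<path>" (binary mode).
    """
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, _, name = line.partition(" ")
        # Drop the single mode character (" " or "*") that precedes the path.
        if name[:1] in (" ", "*"):
            name = name[1:]
        entries.append((digest.lower(), name))
    return entries


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            md5.update(chunk)
    return md5.hexdigest()


def verify_manifest(manifest_path, root="."):
    """Return (path, status) pairs; status is "OK", "FAILED" or "MISSING"."""
    results = []
    text = Path(manifest_path).read_text(encoding="utf-8")
    for expected, name in parse_md5_manifest(text):
        target = Path(root) / name
        if not target.is_file():
            results.append((name, "MISSING"))
        elif md5_of_file(target) == expected:
            results.append((name, "OK"))
        else:
            results.append((name, "FAILED"))
    return results


# Example, run from the top level of the package:
#     for name, status in verify_manifest("docs/files.md5"):
#         print(f"{name}: {status}")
```

Any file reported as FAILED or MISSING should be re-copied or
re-downloaded before the knowledge base is used.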