TAC KBP Reference Knowledge Base
			      LDC2009E58

			     April 5, 2013
		      Linguistic Data Consortium

1. Overview

The Text Analysis Conference (TAC) is a series of workshops organized
by the National Institute of Standards and Technology (NIST).  TAC was
developed to encourage research in natural language processing (NLP)
and related applications by providing a large test collection, common
evaluation procedures, and a forum for researchers to share their
results.  Through its various evaluations, the Knowledge Base
Population (KBP) track of TAC encourages the development of systems
that can match entities mentioned in natural texts with those
appearing in a knowledge base, extract novel information about
entities from a document collection, and add that information to a
new or existing knowledge base.

This package was originally released in 2009 as TAC 2009 KBP
Evaluation Reference Knowledge Base (LDC2009E58).  At the time of this
re-release for LDC's general catalog, the Knowledge Base (KB) had been
used in the development of multiple training and evaluation data sets
produced for TAC KBP between 2009 and 2012. Additionally, this same KB
is slated for use in a number of corpora that will be produced for TAC
KBP 2013, specifically those pertaining to the Entity Linking, Slot
Filling (English and Spanish), Temporal Slot Filling, and Sentiment
Slot Filling tasks.

The Knowledge Base contains a set of entities, each with a canonical
name and title for the Wikipedia page, an entity type, an
automatically parsed version of the data from the infobox in the
entity's Wikipedia article, and a stripped version of the text of the
Wikipedia article.  The Wikipedia infoboxes and entries are taken from
an October 2008 snapshot of Wikipedia.

2. Knowledge Base Contents

Although all Wikipedia articles with infoboxes in the snapshot were
candidates for inclusion in the knowledge base, some articles were
discarded during processing, most commonly due to errors parsing the
wiki markup.  In addition, some types of infoboxes were discarded,
specifically ones which did not contain named values.  For example,
the infobox in the article for the element Carbon is {{Infobox
carbon}}, which doesn't contain parsable key/value pairs.

Some KB fields were discarded during processing, most commonly ones
related to images (e.g., images of flags in GPE infoboxes, picture
captions) and HTML formatting.  Note that while significant effort was
made to properly parse and format the data in the knowledge base,
there may be instances in which fields were improperly rendered.  If
a Wikipedia article contained more than one infobox, only the first
infobox found was included in the knowledge base.

Each entity in the knowledge base is assigned one of four types:

* PER - person
* ORG - organization
* GPE - geo-political entity
* UKN - unknown

By default an entity is of type UKN.  As part of the process of
generating the knowledge base, LDC assigned types to entities based on
the type of infobox occurring in the article.  This mapping was made
by determining the type most likely associated with a given infobox
(e.g., Infobox_Actor indicates a person).  Although care was taken to
provide a good mapping, some entities may have incorrect type
assignments.

The table below gives a count of entities in the knowledge base by type
assignment:

 Entity Type      # of Entities
 ------------------------------
 GPE                     116498
 ORG                      55813
 PER                     114523
 UKN                     531907
 ------------------------------
 Total                   818741
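
A tally like the one above could be reproduced along the following
lines.  This is a minimal sketch only, not the procedure LDC used.  It
assumes that each entity is an "entity" element carrying a "type"
attribute; those names are not stated in this README and should be
checked against knowledge_base.dtd.  The function name
count_entity_types is hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def count_entity_types(paths):
    """Tally entities by type across KB XML files, streaming each file."""
    counts = Counter()
    for path in paths:
        # iterparse avoids building the whole document tree up front.
        for _event, elem in ET.iterparse(path, events=("end",)):
            # ASSUMPTION: entities are <entity> elements with a "type"
            # attribute; confirm the names against dtd/knowledge_base.dtd.
            if elem.tag == "entity":
                # Fall back to UKN when the attribute is absent.
                counts[elem.get("type", "UKN")] += 1
                elem.clear()  # drop the entity's children to save memory
    return counts
```

Passing the paths of the xml files in the data directory (for example,
sorted(glob.glob("data/*.xml")) after importing glob) would return a
Counter keyed by entity type, which can then be compared with the
table.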

3. File Format
 
The format is defined by knowledge_base.dtd located in the dtd
directory at the top level of the package.  The dtd file contains
comments related to the purpose and intent of the markup.

4. Directory Structure

  ./README.txt
      this file
  
  ./data/
      contains 88 knowledge base (KB) xml files
  
  ./docs/   
      files.md5 - contains md5 sums of KB xml files
  
  ./dtd/
      contains DTD for KB xml

5. Data Validation

- All xml files have been validated against the DTD using xmllint.
  The relative DTD path below assumes the command is run from within
  the data directory.

  xmllint --noout --dtdvalid ../dtd/knowledge_base.dtd file.xml

- md5 sums have been generated for all xml files.  On Unix-like
  systems, the following command can be used to verify the integrity
  of the xml files.

  md5sum -c docs/files.md5

- Entity IDs referenced in the links have been confirmed to exist in
  the provided knowledge base.

- Independent sanity checks have been performed on the completed
  package by members of the LDC technical staff.
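
Where md5sum is not available, the checksum step above could be
approximated in Python.  The following is a sketch under stated
assumptions rather than a supported tool: it assumes files.md5 uses
the usual md5sum line layout (a hex digest, a space, then a path
optionally prefixed with a space or "*"), and that the listed paths
are relative to a base directory supplied by the caller.  The function
names md5_of_file and verify_md5_list are hypothetical.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5_list(list_path, base_dir="."):
    """Return listed paths that are missing or whose checksum differs."""
    failures = []
    with open(list_path, encoding="utf-8") as listing:
        for line in listing:
            line = line.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            # ASSUMPTION: md5sum-style lines, "<hex digest>  <path>".
            expected, _, name = line.partition(" ")
            name = name.lstrip(" *")  # drop the text/binary mode marker
            target = os.path.join(base_dir, name)
            if (not os.path.isfile(target)
                    or md5_of_file(target) != expected.lower()):
                failures.append(name)
    return failures
```

An empty list returned by verify_md5_list("docs/files.md5", base_dir)
would indicate that every listed file was found and matched its
recorded checksum; the appropriate base_dir depends on how the paths
are written in files.md5.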

6. Copyright Information

Portions © 2008-2009, 2014 Trustees of the University of Pennsylvania

7. Contact Information

For further information about this data release, or the TAC KBP
project, contact the following project staff at LDC:

	Joe Ellis, Project Manager     	     <joellis@ldc.upenn.edu>
	Stephanie Strassel, Consultant	     <strassel@ldc.upenn.edu>

-----------------------------------------------------------------------------
README created by Joe Ellis on April 5, 2013