TAC KBP Spanish Cross-lingual Entity Linking
Comprehensive Training and Evaluation Data 2012-2014
Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel
1. Overview
This package contains training and evaluation data produced in support of
the TAC KBP Spanish Cross-lingual Entity Linking tasks in 2012, 2013 and
2014.
Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base and extract novel
information about entities from a document collection and add it to a new
or existing knowledge base.
Spanish cross-lingual Entity Linking was first conducted as part of the
2012 TAC KBP evaluations. The track was an extension of the monolingual
English Entity Linking track (EL) whose goal is to measure systems' ability
to determine whether an entity, specified by a query, has a matching node
in a reference knowledge base and, if so, to create a link between the
two. If there is no matching node for a query entity in the KB, EL systems
are required to cluster the mention together with others referencing the
same entity. More information about the TAC KBP Entity Linking task and
other TAC KBP evaluations can be found on the NIST TAC website,
http://www.nist.gov/tac/.
This package contains all evaluation and training data developed in support
of TAC KBP Spanish Cross-lingual Entity Linking during the three years in
which the task was conducted (2012-2014). This includes queries and
gold standard entity type information, KB links, and equivalence class
clusters for NIL entities (those for which there was no matching node in
the knowledge base). Source documents for the queries are also in this
corpus. The corresponding knowledge base is available as LDC2014T16 TAC KBP
Reference Knowledge Base.
The data included in this package were originally released to TAC KBP as:
LDC2012E67: TAC 2012 KBP Spanish Entity Linking Training Queries and Annotations V1.1
LDC2012E101: TAC 2012 KBP Spanish Entity Linking Evaluation Annotations V1.1
LDC2013E97: TAC 2013 KBP Spanish Entity Linking Evaluation Queries and Knowledge Base Links V1.1
LDC2014E46: TAC 2014 KBP Spanish Entity Linking Discussion Forum Training Data V1.1
LDC2014E84: TAC 2014 KBP Spanish Entity Linking Evaluation Queries and Knowledge Base Links
LDC2015E18: TAC KBP Spanish Entity Linking - Comprehensive Training and Evaluation Data 2012 - 2014
Summary of data included in this package (for more details see
/docs/tac_kbp_2012-2014_spanish_entity_linking_query_distribution_table.tsv):
+------+------------------+---------+
| Year | Source Documents | Queries |
+------+------------------+---------+
| 2012 | 3772 | 3890 |
| 2013 | 1832 | 2117 |
| 2014 | 2207 | 2596 |
+------+------------------+---------+
2. Contents
./README.txt
This file
./data/{2012,2013,2014}/contents.txt
The data in this package are organized by evaluation year in order to
clarify dependencies, highlight occasional differences in formats from one
year to another, and increase the readability of the documentation. The
contents.txt file within each year's root directory provides a list of the
contents for all subdirectories as well as details about file formats and
contents.
./docs/all_files.md5
Paths (relative to the root of the corpus) and md5 checksums for all files
included in the package.
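As a quick integrity check, the checksum list can be consumed with a short
script. This is a sketch only; it assumes the conventional two-column
md5sum layout ("<checksum>  <relative/path>" per line):

```python
# Sketch: verify package files against docs/all_files.md5.
# Assumes each line is "<md5hex>  <relative/path>" (standard md5sum
# layout); adjust the split if the columns are ordered differently.
import hashlib
from pathlib import Path

def verify(md5_list: str, root: str) -> list:
    """Return relative paths whose current checksum does not match."""
    mismatches = []
    for line in Path(md5_list).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(None, 1)
        rel_path = rel_path.strip()
        digest = hashlib.md5(Path(root, rel_path).read_bytes()).hexdigest()
        if digest != expected:
            mismatches.append(rel_path)
    return mismatches
```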
./docs/tac_kbp_2012-2014_spanish_entity_linking_query_distribution_table.tsv
Tab-delimited table containing the query distribution quantities for
all years and datasets, further broken down by language, source type,
KB-Link, and entity type.
./docs/guidelines/2012/TAC_KBP_2012_Entity_Selection_V1.1.pdf
The guidelines used by annotators in developing the 2012 Entity Linking queries
and gold standard data contained in this corpus.
./docs/guidelines/2013/TAC_KBP_2013_EL_Query_Development_Guidelines_V1.0.pdf
The guidelines used by annotators in developing the 2013 Entity Linking queries
and gold standard data contained in this corpus.
./docs/guidelines/2014/TAC_KBP_2014_EL_Query_Development_Guidelines_V1.0.pdf
The guidelines used by annotators in developing the 2014 Entity Linking queries
and gold standard data contained in this corpus.
./docs/task_descriptions/KBP2012_TaskDefinition_1.1.pdf
Task Description for all of the 2012 TAC KBP tracks, written by evaluation
track coordinators.
Note that this document also describes tasks not relevant to this
specific package.
./docs/task_descriptions/KBP2013_EntityLinkingTaskDescription_1.0.pdf
Task Description for the 2013 Entity Linking evaluation tracks,
written by evaluation track coordinators.
./docs/task_descriptions/KBP2014EL_V1.1.pdf
Task Description for the 2014 Entity Linking evaluation tracks,
written by track coordinators.
./dtd/clel_queries_2012-2014.dtd
DTD for
./data/2012/eval/tac_kbp_2012_spanish_entity_linking_evaluation_queries.xml
./data/2012/training/tac_kbp_2012_spanish_entity_linking_training_queries.xml
./data/2013/eval/tac_kbp_2013_spanish_entity_linking_evaluation_queries.xml
./data/2014/eval/tac_2014_kbp_spanish_entity_linking_evaluation_queries.xml
./data/2014/training/tac_kbp_2014_spanish_entity_linking_training_queries.xml
./tools/check_kbp2012_spanish-entity-linking.pl
Validator for 2012 entity linking submission files, as provided to LDC by
evaluation track coordinators, with no further testing.
./tools/check_kbp2013_2014_spanish-entity-linking.pl
Validator for 2013 and 2014 entity linking submission files, as provided to
LDC by evaluation track coordinators, with no further testing.
./tools/el_scorer_2012.py
Scorer for 2012 entity linking submission files, as provided to LDC by
evaluation track coordinators, with no further testing.
./tools/el_scorer_2013.py
Scorer for 2013 entity linking submission files, as provided to LDC by
evaluation track coordinators, with no further testing.
./tools/el_scorer_2014.py
Scorer for 2014 entity linking submission files, as provided to LDC by
evaluation track coordinators, with no further testing.
3. Query Development Annotation and Quality Control
Query development began with Entity Selection, which comprised three
stages: Namestring Annotation, KB Linking, and NIL Coreference (where a
NIL entity is an entity without a node in the KB). Bilingual Spanish/English-speaking
annotators searched the corpus for entities that would make suitable
queries, using an interface created by LDC for this task. To the extent
possible, an effort was made to balance queries across entity type, status
(NIL vs. non-NIL), and document genre. Most queries were drawn from
non-English documents, but mentions in English documents of entities
co-referential with other non-English queries were selected whenever
possible.
In Namestring Annotation, annotators searched for and selected named
mentions of entities in text, focusing on creating queries from
confusable named entity mentions. Confusability was measured both by the number of
distinct entities in the full query set referred to by the same name string
(polysemy) as well as the number of distinct entities in the set that were
referred to by multiple, unique named mentions (synonymy). For example,
the string "Smith" would make a polysemous query because an annotator could
probably find it in the corpus referring to different entities, while
"Barack Obama" would make a synonymous query because the entity is also
referred to in the corpus as "B. Hussein Obama" or "Bam Bam".
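The two measures above amount to simple counting over (name string,
entity) pairs. The following sketch, over hypothetical pairs, illustrates
them:

```python
# Sketch of the polysemy/synonymy measures described above, computed
# over hypothetical (name_string, entity_id) pairs from a query set.
from collections import defaultdict

def confusability(mentions):
    """mentions: iterable of (name_string, entity_id) pairs.
    Returns (polysemous_names, synonymous_entities):
      - names referring to more than one distinct entity (polysemy)
      - entities referred to by more than one distinct name (synonymy)
    """
    entities_per_name = defaultdict(set)
    names_per_entity = defaultdict(set)
    for name, entity in mentions:
        entities_per_name[name].add(entity)
        names_per_entity[entity].add(name)
    polysemous = {n for n, es in entities_per_name.items() if len(es) > 1}
    synonymous = {e for e, ns in names_per_entity.items() if len(ns) > 1}
    return polysemous, synonymous

# Usage, mirroring the "Smith" / "Barack Obama" example (entity IDs
# are invented):
pairs = [("Smith", "E1"), ("Smith", "E2"),
         ("Barack Obama", "E3"), ("B. Hussein Obama", "E3")]
poly, syn = confusability(pairs)
```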
In KB Linking, annotators searched the KB and indicated whether or not
it included pages on the entities they had selected during Namestring
Annotation. Annotators created a link between the query and the matching KB
node ID. If no matching node was found, the query was marked as NIL and
later coreferenced with other NIL entities. Annotators were allowed to use
online searching to assist in determining the KB link/NIL status. Queries
for which an annotator could not confidently determine the KB link status
were removed from the final data sets.
For NIL Coreference, selected entities that were not included in the KB
(i.e., NIL entities) were grouped into equivalence classes by annotators.
Mentions referring to the same entity were grouped into one equivalence
class.
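In the released gold-standard files, equivalence classes surface as shared
NIL identifiers. A minimal sketch of recovering the clusters, assuming
(query_id, kb_link) pairs where NIL links share an identifier such as
"NIL0001" (the exact file format varies by year):

```python
# Sketch: group queries into NIL equivalence classes from gold-standard
# links. Assumes (query_id, kb_link) pairs where queries in the same
# equivalence class share a "NIL..." identifier.
from collections import defaultdict

def nil_clusters(links):
    clusters = defaultdict(list)
    for query_id, kb_link in links:
        if kb_link.startswith("NIL"):
            clusters[kb_link].append(query_id)
    return dict(clusters)
```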
Senior annotators conducted quality control on query development to
correct errors and to identify areas of difficulty that could inform
future guidelines and annotator training. Annotators performing quality control
made sure that the extent of each selected namestring was correct and
checked that each entity was linked to the correct KB node or was properly
identified as NIL and coreferenced correctly.
4. Source Documents
The source data contained in this release comprises all documents
from which queries were drawn and is the complete data set used in
the Spanish EL evaluations. The source data was drawn from existing
LDC holdings, with no additional validation. An overall scan of
character content in the source collections indicates some relatively
small quantities of various problems, especially in the web and
discussion forum data, including language mismatch (characters from
Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding errors
(some documents have apparently undergone "double encoding" into UTF-8,
and others may have been "noisy" to begin with, or may have gone through
an improper encoding conversion, yielding occurrences of the
Unicode "replacement character" (U+FFFD) throughout the corpus); the web
collection also has characters whose Unicode code points lie
outside the "Basic Multilingual Plane" (BMP), i.e. above U+FFFF.
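The two character-level issues noted above are easy to flag
programmatically. A minimal sketch over a decoded document string:

```python
# Sketch: count the character-level issues described above -
# replacement characters (U+FFFD) left by bad encoding conversions,
# and code points outside the Basic Multilingual Plane (above U+FFFF).
def scan_text(text):
    report = {"replacement_chars": 0, "non_bmp_chars": 0}
    for ch in text:
        if ch == "\ufffd":
            report["replacement_chars"] += 1
        elif ord(ch) > 0xFFFF:
            report["non_bmp_chars"] += 1
    return report
```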
All source documents were originally released as XML but have been
converted to text files for this release. This change was made
primarily because the documents were used as text files during
data development but also because some fail XML parsing. All
documents that have filenames beginning with "eng-NG" are Web Document
data (WB) and some of these fail XML parsing (see below for details).
All files that start with "bolt-" are Discussion Forum threads (DF)
and have the XML structure described below. All other files are
Newswire data (NW) and have the newswire markup pattern detailed below.
Note as well that some source documents are duplicated across a few of
the separated source_documents directories, indicating that some queries
from different data sets originated from the same source documents. As
it is acceptable for source to be reused for Entity Linking queries, this
duplication is intentional and expected.
The subsections below go into more detail regarding the markup and
other properties of the three source data types:
4.1 Newswire Data
Newswire data use the following markup framework:
  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE tag is optional (not always present), and the TEXT
content may be "sentence-segmented" into "<P> ... </P>" units or may
simply be presented as a stream of characters separated by blank lines
(depending on whether or not the "doc_type_label" is "story").

All the newswire files, if converted back to XML, are parseable.

4.2 Discussion Forum Data

Discussion forum files use the following markup framework:

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post author="{author_string}" datetime="{datetime_string}" id="{post_id}">
  ...
  <quote orig_author="{author_string}">
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

where there may be arbitrarily deep nesting of quote elements, and other
kinds of tags may be present (e.g. "<a ...> ... </a>" anchor tags).
4.3 Web Document Data

Web document files use the following markup framework:

  <DOC>
  <DOCID> {doc_id_string} </DOCID>
  <DOCTYPE> ... </DOCTYPE>
  <DATETIME> ... </DATETIME>
  <BODY>
  <HEADLINE>
  ...
  </HEADLINE>
  <TEXT>
  <POST>
  <POSTER> ... </POSTER>
  <POSTDATE> ... </POSTDATE>
  ...
  </POST>
  </TEXT>
  </BODY>
  </DOC>

Other kinds of tags may be present ("<QUOTE ...>", "<A ...>", etc). Some
of the web source documents contain material that interferes with XML
parsing (e.g. unescaped "&", or "<QUOTE>" tags that lack a corresponding
"</QUOTE>").

5. Using the Data

5.1 Offset calculation

The values of the beg and end XML elements in the later queries.xml
files indicate character offsets that identify text extents in the
source. Offset counting starts from the initial opening angle bracket of
the <DOC> element (<doc> in DF sources), which is usually the initial
character (character 0) of the source. Note as well that character
counting includes newlines and all markup characters - that is, the
offsets are based on treating the source document file as "raw text",
with all its markup included.

Note that although strings included in the annotation files (queries and
gold standard mentions) generally match source documents, a few
characters are normalized in order to enhance readability: newlines were
converted to spaces, except where the preceding character was a hyphen
("-"), in which case the newline was removed, and runs of multiple
spaces were collapsed to a single space.

5.2 Proper ingesting of XML queries

While the character offsets are calculated by treating the source
document as "raw text", the "name" strings referenced by the queries
sometimes contain XML metacharacters, and these had to be "re-escaped"
for proper inclusion in the queries.xml file. For example, an actual
name like "AT&T" may show up in a source document file as "AT&amp;T"
(because the source document was originally formatted as XML data). But
since the source doc is being treated here as raw text, this name string
is treated in queries.xml as having 8 characters (i.e., the character
offsets, when provided, will point to a string of length 8). However,
the "name" element itself, as presented in the queries.xml file, will be
even longer - "AT&amp;amp;T" - because the queries.xml file is intended
to be handled by an XML parser, which will return "AT&amp;T" when this
"name" element is extracted.
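The offset and escaping conventions can be sketched with a small script
that reads a queries.xml file through a real XML parser and then slices
the raw source text by offsets. The <docid>/<beg>/<end> child elements,
the ".txt" suffix on converted source files, and the wrapper element name
are assumptions for illustration; check the DTD and contents.txt for the
actual formats:

```python
# Sketch: parse queries.xml with an XML parser (which unescapes one
# level of "&amp;..." in <name>), then extract each mention from the
# raw source text by its character offsets. Element names and the
# ".txt" source suffix are assumptions, not the documented format.
import xml.etree.ElementTree as ET
from pathlib import Path

def read_queries(queries_xml, source_dir):
    for query in ET.parse(queries_xml).getroot().iter("query"):
        name = query.findtext("name")      # parser has unescaped one level
        docid = query.findtext("docid")
        beg = int(query.findtext("beg"))
        end = int(query.findtext("end"))   # offsets are inclusive
        raw = Path(source_dir, docid + ".txt").read_text(encoding="utf-8")
        span = raw[beg:end + 1]            # raw text, markup included
        yield query.get("id"), name, span
```

For an "AT&T"-style query, the parsed name and the raw-text span both
come out as the 8-character escaped form found in the source document.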
Using the queries.xml data without XML parsing would yield a mismatch
between the "name" value and the corresponding string in the source
data.

6. Acknowledgements

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and
conclusions contained herein are those of the authors and should not be
interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of the Air Force Research
Laboratory and Defense Advanced Research Projects Agency or the U.S.
Government.

The authors acknowledge the following contributors to this data set:
  Dave Graff (LDC)
  Dana Fore (LDC)
  Heng Ji (RPI)
  Hoa Dang (NIST)
  Ralph Grishman (NYU)
  Javier Artiles (Slice Technologies)
  Boyan Onyshkevych (DARPA)

7. References

Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014. Overview of
Linguistic Resources for the TAC KBP 2014 Evaluations: Planning,
Execution, and Results. TAC KBP 2014 Workshop: National Institute of
Standards and Technology, Gaithersburg, MD, November 17-18.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-2014-overview.pdf

Joe Ellis, Jeremy Getman, Justin Mott, Xuansong Li, Kira Griffitt,
Stephanie M. Strassel, Jonathan Wright. 2013. Linguistic Resources for
2013 Knowledge Base Population Evaluations. TAC KBP 2013 Workshop:
National Institute of Standards and Technology, Gaithersburg, MD,
November 18-19.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-workshop2013-linguistic-resources-kbp-eval.pdf

Joe Ellis, Xuansong Li, Kira Griffitt, Stephanie M. Strassel, Jonathan
Wright.
2012. Linguistic Resources for 2012 Knowledge Base Population
Evaluations. TAC KBP 2012 Workshop: National Institute of Standards and
Technology, Gaithersburg, MD, November 5-6.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-workshop2012-linguistic-resources-kbp-eval.pdf

8. Copyright Information

(c) 2016 Trustees of the University of Pennsylvania

9. Contact Information

For further information about this data release, contact the following
project staff at LDC:
  Joseph Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

--------------------------------------------------------------------------
README created by Dana Fore on November 19, 2015
  updated by Jeremy Getman on November 20, 2015
  updated by Joe Ellis on November 22, 2015
  updated by Dana Fore on December 23, 2015
  updated by Jeremy Getman on December 23, 2015
  updated by Stephanie Strassel on January 21, 2016
  updated by Dana Fore on January 26, 2016
  updated by Joe Ellis on January 26, 2016