TAC KBP Chinese Regular Slot Filling
Comprehensive Training and Evaluation Data 2014
Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel
1. Overview
This package contains training and evaluation data produced in support of
the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 2014.
The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base and extract novel
information about entities from a document collection and add it to a new
or existing knowledge base.
The regular Slot Filling evaluation track (SF), for Chinese, as for English,
involves mining information about entities from text. SF can be viewed as
more traditional Information Extraction (IE), or alternatively, as a
Question Answering (QA) task, in which the questions are static but the
targets change. In completing the task, participating systems and LDC
annotators searched a corpus for information on certain attributes (slots)
of person (PER) and organization (ORG) entities and returned any valid
responses (slot fillers). For more information about Chinese SF, please
refer to the 2014 TAC KBP home page (2014 was the only year in which
Chinese SF was conducted) at http://www.nist.gov/tac.
This package contains all evaluation and training data developed in
support of TAC KBP Chinese Regular Slot Filling. This includes queries,
the manual run produced by LDC annotators, and the final assessment
results. This release also contains the complete set of Chinese source
documents for the Chinese Slot Filling evaluation.
The data included in this package were originally released by LDC
to TAC KBP coordinators and performers under the following ecorpora
catalog IDs and titles:
LDC2013E45: TAC 2013 KBP Source Corpus
LDC2014E29: TAC 2014 KBP Chinese Source Corpus
LDC2014E123: TAC KBP 2014 Chinese Regular Slot Filling Training Data
LDC2015E01: TAC KBP 2014 Chinese Regular Slot Filling Evaluation
Queries and Manual Run
LDC2015E67: TAC KBP 2014 Chinese Regular Slot Filling Evaluation
Assessment Results V2.0
LDC2016E35: TAC KBP Chinese Regular Slot Filling Comprehensive
Training and Evaluation Data 2014
Summary of data included in this package:
Query Data:
+------------+-----+-----+-------+
| set | PER | ORG | total |
+------------+-----+-----+-------+
| training | 17 | 15 | 32 |
| evaluation | 51 | 52 | 103 |
+------------+-----+-----+-------+
Manual Response Data:
+------------+------------------+
| set | manual responses |
+------------+------------------+
| training | 967 |
| evaluation | 2858 |
+------------+------------------+
Assessment Data:
+----------+-----------+
| assessed | assessed |
| files | responses |
+----------+-----------+
| 2107 | 2878 |
+----------+-----------+
2. Contents
./docs/README.txt
This file.
./data/eval/assessment/*
This directory holds 2107 assessment files, the combination of
which contains a total of 2878 assessed responses. Assessment
was performed on a set of pooled responses provided by NIST that
includes fillers and justification returned by both systems and LDC
annotators. Note that 1188 of the files in this directory are empty,
indicating that no responses were returned for the particular
query/slot combination.
There is one file for each combination of query entity and slot for
all of the queries found in
/data/eval/tac_kbp_2014_chinese_regular_slot_filling_evaluation_queries.xml
The assessment results files contain 8 tab-delimited fields. The field
definitions are as follows:
Column 1: Response ID
Column 2: A concatenation of SF query ID and the relevant slot
name, separated by a colon
Column 3: Provenance for the relation between the query entity and
slot filler, consisting of up to 4 triples in the format
'docid:startoffset-endoffset' separated by a comma.
Column 4: A slot filler (possibly normalized, e.g., for dates;
otherwise, should appear in the provenance document)
Column 5: Provenance for the slot filler string. This is either a
single span (docid:startoffset-endoffset) from the
document where the canonical slot filler string was
extracted, or (in the case when the slot filler string
has been normalized) a set of up to two
docid:startoffset-endoffset spans, separated by a comma,
that indicate the base strings that were used to
generate the normalized slot filler string.
Column 6: Assessment of slot filler with respect to the text regions
defined by the relation provenance and filler provenance.
Values can be:
C - Correct and not in live Wikipedia at the time the
data were created (late 2014)
R - Correct and Redundant with what was in live Wikipedia
in late 2014
X - Inexact
W - Wrong
I - Ignored because sum of lengths of provenance spans
is too long
Column 7: Assessment of relation provenance. Values can be:
C - Correct
L - Inexact (Long)
S - Inexact (Short)
W - Wrong
I - Ignored because sum of lengths of provenance spans
is too long
Column 8: LDC Equivalence class of Col 4 (slot filler) if column 6
is Correct or Redundant. If the response is marked wrong
('W' in column 6), this column contains a 0, indicating
that no coreference was performed.
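As an illustration (not part of the original distribution), the 8-column
layout described above can be loaded with a short Python sketch; the
dictionary keys used below are our own labels, not names defined by the
task:

```python
# Minimal sketch of reading one of the 8-column, tab-delimited assessment
# files described above (pass any file under ./data/eval/assessment/).
def read_assessment(path):
    rows = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:  # empty files mean no responses for the query/slot
                continue
            (resp_id, query_slot, rel_prov, filler, filler_prov,
             filler_judgment, prov_judgment, equiv) = line.split("\t")
            # Slot names themselves contain a colon (e.g. "per:title"),
            # so split Column 2 only on the first colon.
            query_id, slot = query_slot.split(":", 1)
            rows.append({"response_id": resp_id,
                         "query_id": query_id,
                         "slot": slot,
                         "relation_provenance": rel_prov,
                         "filler": filler,
                         "filler_provenance": filler_prov,
                         "filler_assessment": filler_judgment,
                         "provenance_assessment": prov_judgment,
                         "equivalence_class": equiv})
    return rows
```

Splitting Column 2 on the first colon only is the key detail, since
slot names like "per:title" contain a colon of their own.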
./data/eval/tac_kbp_2014_chinese_regular_slot_filling_evaluation_queries.xml
This file contains 51 person (PER) queries and 52 organization (ORG)
queries. Each query consists of the following 5 elements (see DTD for
more details):
1. The namestring of the entity
2. The entity's type (PER or ORG)
3. The source document
4. The start offset
5. The end offset
Note that each Chinese Slot Filling query has an identifier, formatted
as "SF14_CMN_" plus a three-digit integer value (e.g., "001").
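The queries file is ordinary XML and can be loaded with any standard
parser. The sketch below assumes child element names corresponding to
the 5 elements listed above; verify the actual names against
./dtd/kbpslotfill.dtd before relying on them:

```python
import xml.etree.ElementTree as ET

# Minimal sketch of loading an SF queries file into a dict keyed by
# query ID. The child tag names (name, enttype, docid, beg, end) are
# assumptions; check them against ./dtd/kbpslotfill.dtd.
def load_queries(path):
    queries = {}
    for query in ET.parse(path).getroot().iter("query"):
        queries[query.get("id")] = {child.tag: child.text for child in query}
    return queries
```

Note that the parser returns unescaped name strings (e.g. "AT&T" for an
escaped "AT&amp;T"), which matters when matching names against raw
source text (see section 5.2 below).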
./data/eval/tac_kbp_2014_chinese_regular_slot_filling_evaluation_manual_run.tab
This file contains the results of LDC's time-limited manual run
over the 2014 Chinese Slot Filling evaluation queries. These responses
represent all of the unique fillers that annotators were able to
find in the corpus for each of the queries.
The responses file is tab-delimited, with 7 fields total. The column
descriptions are as follows:
1. Query ID - the slot filling query list ID for the entity
2. Slot name - the name of the slot for the filler
3. System ID - the ID of the system that generated the response;
always "LDC" in these data
4. Relation provenance - Provenance for the relation between the
query entity and slot filler, consisting
of up to 4 triples in the format:
docid:startoffset-endoffset separated by a
comma. Each of these individual spans is
at most 150 UTF-8 characters. For each
slot for which annotators could not find
any new information, column 4 states "NIL"
and Columns 5 through 7 are empty.
5. Slot filler - The (possibly normalized) response. Note that, if
column 4 contains the word 'NIL' this column will
be blank.
6. Filler provenance - This is either a single span
(docid:startoffset-endoffset) from the
document indicating where the canonical
slot filler string was extracted, or (for
cases in which the slot filler string in
Column 5 has been normalized) a set of up
to two docid:startoffset-endoffset spans
for the base strings that were used to
generate the normalized slot filler
string. As in Column 4, multiple spans are
separated by commas.
7. Confidence score - a confidence score for the response; always
"1.0" in these data. Note that, if column 4
contains the word 'NIL' this column will be
blank.
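As an illustration (not part of the original distribution), the
7-column manual run file and its 'docid:startoffset-endoffset'
provenance fields can be read with a short Python sketch:

```python
# Minimal sketch of reading the 7-column manual run file, splitting each
# provenance field into (docid, start, end) spans.
def parse_provenance(field):
    spans = []
    for triple in field.split(","):
        docid, offsets = triple.rsplit(":", 1)  # docid may itself vary
        start, end = offsets.split("-")
        spans.append((docid, int(start), int(end)))
    return spans

def read_manual_run(path):
    responses = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            query_id, slot, system_id, rel_prov = fields[:4]
            if rel_prov == "NIL":  # no filler found; columns 5-7 are empty
                responses.append((query_id, slot, None, None))
                continue
            filler = fields[4]
            responses.append((query_id, slot, filler,
                              parse_provenance(rel_prov)))
    return responses
```

The NIL check mirrors the Column 4 convention above: when annotators
found no new information, only the first four columns carry content.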
./data/source_corpus/{cmn_df_doclist,cmn_nw_doclist,cmn_wb_doclist}
The three Chinese doclist files in this directory list all of the
Chinese source documents included in this package by genre type.
Note that the Chinese SF evaluation conducted in 2014 was monolingual
and so, for the purposes of recreating that evaluation, only the
sources included in these three lists should be used.
./data/source_corpus/{eng_ng_doclist,eng_nw_doclist}
The two English doclist files in this directory list all of the
English source documents included in this package by genre type.
Although the Chinese SF evaluation conducted in 2014 was monolingual,
a few English responses were returned by systems. In the interest of
producing training data for future cross-lingual evaluations, assessors
were instructed to treat the responses as potentially valid. Thus, the
responses and source documents are included in this package but should
not be used for the purpose of recreating the evaluation.
./data/source_corpus/chinese/discussion_forums/*
The concatenated files in this directory contain 199,321 Chinese
discussion forum documents selected from BOLT Phase 1 discussion
forums source data releases (LDC2012E04, LDC2012E16, LDC2012E21,
and LDC2012E54, which have not yet been released in the general
catalog). Each forum includes at least 5 posts.
./data/source_corpus/chinese/newswire/*
The concatenated files in this directory contain 2,000,256
documents selected from Chinese Gigaword Fifth Edition (LDC2011T13).
./data/source_corpus/chinese/web/*
The concatenated files in this directory contain 815,886 Chinese
(Mandarin) web documents selected from various GALE web collections.
./data/source_corpus/english/newswire/*
The concatenated file in this directory contains 34 documents
drawn from the TAC KBP 2014 English source corpus.
./data/source_corpus/english/newsgroup/*
The single XML file in this directory was drawn from the TAC KBP
2014 English source corpus.
./data/training/tac_kbp_2014_chinese_regular_slot_filling_training_queries.xml
This file contains 17 person (PER) queries and 15 organization (ORG)
queries. Each query is structured the same as the evaluation queries
described above and in the DTD. Note that each Chinese Slot Filling
training query has an identifier, formatted as "SF14_CMN_TRAINING"
plus a three-digit integer value (e.g., "001").
./data/training/tac_kbp_2014_chinese_regular_slot_filling_training_manual_run.tab
This file contains the results of LDC's time-limited manual run
over the 2014 Chinese Slot Filling training queries. These responses
represent all of the unique fillers that annotators were able to
find in the corpus for each of the queries. The manual run for the
training data is formatted the same as the evaluation data described
above.
./docs/all_files.md5
Paths (relative to the root of the corpus) and md5 checksums for all
files included in the package.
./docs/guidelines/*
The guidelines used by annotators in developing the 2014 Chinese
Regular Slot Filling queries, gold standard data, and assessments
contained in this corpus.
./docs/task_descriptions/KBP2014_TaskDefinition_EnglishSlotFilling_1.1.pdf
Task Description for the 2014 English Regular Slot Filling evaluation
track, written by track coordinators. Note that although English is
specified, it otherwise accurately describes the specifics and formats
of the 2014 Chinese SF task as well.
./dtd/kbpslotfill.dtd
The dtd against which to validate both the training and evaluation
queries files specified above.
./tools/SFScore2014.java
Scorer for 2014 regular slot filling submission files, as provided
to LDC by evaluation track coordinators, with no further testing.
3. Source Corpus Information
All the *.gz files, when uncompressed, comprise a concatenated stream
of document units, each of which is presented as an independent XML
(or XML-like) structure. There is no XML element that spans the
concatenated documents, so each *.gz file is not parseable as an XML
stream, but XML tags are used throughout to mark document boundaries.
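One workable way to recover the individual document units (a sketch,
assuming newswire-style uppercase "</DOC>" boundary tags; the
discussion forum files use lowercase "</doc>" instead) is to split the
uncompressed stream on the closing document tag:

```python
import gzip

# Sketch: iterate over document units in a concatenated *.gz stream by
# watching for the closing document tag on its own line. The default
# end_tag assumes uppercase newswire markup; pass "</doc>" for the
# discussion forum files.
def iter_docs(gz_path, end_tag="</DOC>"):
    buf = []
    with gzip.open(gz_path, "rt", encoding="utf-8") as f:
        for line in f:
            buf.append(line)
            if line.strip() == end_tag:
                yield "".join(buf)
                buf = []
```

Alternatively, the whole stream can be wrapped in a synthetic root
element and handed to an XML parser, which (as noted below for the
newswire data) succeeds when the per-document markup is well formed.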
All the text data in the *.gz files have been taken directly from
previous LDC corpus releases, and are being provided here essentially
"as-is", with little or no additional quality control. An overall
scan of character content in the uncompressed data indicates some
relatively small quantities of various problems, especially in the web
and discussion forum data, including language mismatch (characters
from Chinese, Korean, Japanese, Arabic, Russian, etc.), and encoding
errors (some documents have apparently undergone "double encoding"
into UTF-8, and others may have been "noisy" to begin with, or may
have gone through an improper encoding conversion, yielding 167,692
occurrences of the Unicode "replacement character" (U+FFFD) throughout
the corpus); there are over 1000 characters whose Unicode code points
lie outside the "Basic Multilingual Plane" (BMP), i.e. above U+FFFF.
Special note: the file './data/source_corpus/web/cmn-WL-31-111.gz'
contains two lines with a total of six UTF-8 "characters" that map
to the "Surrogate Pair" portion of the Unicode character space
(between 0xD800 and 0xDFFF); some UTF-8-aware processes (e.g.
modern Ruby scripts) will fail when trying to read data from this
file. The offending characters are on lines 262932 and 262953 of
the file when it is uncompressed.
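Such trouble spots can be located ahead of time with a simple pre-flight
scan (a sketch; note that decoding with errors="replace" folds invalid
byte sequences, including UTF-8-encoded surrogates, into U+FFFD, so the
first count combines genuine replacement characters with decoding
failures):

```python
import gzip

# Sketch: count Unicode trouble spots in an uncompressed *.gz stream.
# Decoding with errors="replace" means the scan itself cannot fail on
# bad input; invalid sequences simply show up as extra U+FFFD counts.
def scan_stream(gz_path):
    replacement = non_bmp = 0
    with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            replacement += line.count("\ufffd")
            non_bmp += sum(1 for ch in line if ord(ch) > 0xFFFF)
    return replacement, non_bmp
```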
When uncompressed, these documents as a whole yield 12.5 GB of data, and
over 5.32 billion UTF-8 characters.
The subsections below go into more detail regarding the markup and
other properties of the various data subsets:
3.1 Newswire Data

Newswire data use the following markup framework:

<DOC id="{doc_id_string}" type="{doc_type_label}">
<HEADLINE>
...
</HEADLINE>
<DATELINE>
...
</DATELINE>
<TEXT>
<P>
...
</P>
...
</TEXT>
</DOC>

where the HEADLINE and DATELINE tags are optional (not always present),
and the TEXT content may or may not include "<P> ... </P>" tags
(depending on whether or not the "doc_type_label" is "story"). If a
suitable "container" or "root" tag is applied at the beginning and end
of each *.gz stream, all the newswire files are parseable as XML.

3.2 Discussion Forum Data

Discussion forum files use the following markup framework:

<doc id="{doc_id_string}">
<headline>
...
</headline>
<post author="{author_string}" datetime="{date_time_string}" id="{post_id}">
...
<quote orig_author="{quoted_author_string}">
...
</quote>
...
</post>
...
</doc>
where <quote> elements may be nested within <post> elements, and other
types of tags may be present as well ("<a ...>", "<img ...>", etc).
359 of the 363 *.gz data streams contain material that interferes with
XML parsing (e.g. unescaped "&", or "<quote>" tags that lack a
corresponding "</quote>").

4. Annotation Tasks

The tasks conducted by LDC annotators in support of the Chinese Regular
Slot Filling (SF) evaluation were query and manual run development and
assessment of system- and human-produced responses to queries.

4.1 Query and Manual Run Development

Entities, which are the basis of SF queries, are selected based on
their level of non-confusability and productivity. A candidate query
entity mention is considered non-confusable if it is "canonical",
meaning that it is not an alias and includes more than just a first or
last name. Mentions with objectionable content are also excluded.
Productivity for candidate queries is determined by searching the
source corpus to find whether it contains at least two slot fillers
for the entity. Annotators were additionally required to check
Wikipedia when considering potential query entities so as to avoid
entities for which the online resource would indicate too many correct
responses.

Individual SF query entities are also selected based on the degree to
which they help to balance certain features across the full set of
queries. Specifically, these features are entity types (person and
organization) and response types for slots (i.e., those that take
named entities as fillers, those that take values (dates and numbers)
as fillers, and those that take strings as fillers).

Concurrent with query development for Chinese SF in 2014, LDC
annotators produced the "manual run", or the human-produced set of
responses for each of the evaluation queries. While these two tasks
had historically been conducted separately for English SF, in 2014
they were combined in order to leverage the knowledge about entities
that annotators acquire while performing research for query
development.
During the manual run, annotators are given up to two hours per query
to search the corpus and locate all valid fillers. In the event that
annotators feel they have found all unique responses in less than the
two hours provided, they also return some duplicate fillers in order
to provide more training data for systems in the future.

Justification - the minimum extents of provenance supporting the
validity of a slot filler - is also provided as part of the manual run
in order to pinpoint the sources of assertions. Valid justification
strings clearly identify all three elements of a relation (i.e., the
subject entity, the predicate slot, and the object filler), as well as
the relation between them, with minimal extraneous text. In 2014, up
to four justification strings were allowed per response, to support
relations that are difficult to justify with a single document.

Following initial query development, a quality control pass is
conducted to flag or correct as necessary any fillers that do not have
adequate justification in the source document, or that might be at
variance with the guidelines in any way. Any flagged fillers are then
adjudicated by senior annotators, who update, remove, or replace them
as appropriate.

4.2 Assessment

In assessment, annotators judge and coreference anonymized slot filler
responses returned for the query set from both the manual run and from
systems. Fillers are marked as correct if they are found to be both
compatible with the slot descriptions and supported in the provided
justification string(s) and/or their surrounding content. Fillers are
assessed as wrong if they do not meet both of the conditions for
correctness, or as inexact if insufficient or extraneous text was
selected for an otherwise correct response.
Justification is assessed as correct if it succinctly and completely
supports the relation; wrong if it does not support the relation at
all (or if the corresponding filler is marked wrong); inexact-short if
part but not all of the information necessary to support the relation
was provided; or inexact-long if it contains all information necessary
to support the relation but also a great deal of extraneous text.
Responses with justification comprising more than 600 characters in
total are automatically marked as ignored and not reviewed during
assessment.

After first passes of assessment are completed, quality control is
performed on the data by senior annotators. This quality control
ensures that the extents of each annotated filler and justification
are correct and that entities assessed as correct are included in the
appropriate equivalence classes.

5. Using the Data

5.1 Text normalization and offset calculation

Text normalization consisting of a 1-for-1 substitution of newline
(0x0A) and tab (0x09) characters with space (0x20) characters was
performed on the document text input to the response field.

The values of the beg and end XML elements in the queries.xml files
indicate character offsets that identify text extents in the source.
Offset counting starts from the initial character (character 0) of the
source document and includes newlines and all markup characters - that
is, the offsets are based on treating the source document file as "raw
text", with all its markup included.

5.2 Proper ingesting of XML queries

While the character offsets are calculated based on treating the
source document as "raw text", the "name" strings referenced by the
queries sometimes contain XML metacharacters, and these had to be
"re-escaped" for proper inclusion in the queries.xml file. For
example, an actual name like "AT&T" may show up in a source document
file as "AT&amp;T" (because the source document was originally
formatted as XML data).
But since the source doc is being treated here as raw text, this name
string is treated in queries.xml as having 8 characters (i.e., the
character offsets, when provided, will point to a string of length 8).
However, the "name" element itself, as presented in the queries.xml
file, will be even longer - "AT&amp;amp;T" - because the queries.xml
file is intended to be handled by an XML parser, which will return
"AT&amp;T" when this "name" element is extracted. Using the
queries.xml data without XML parsing would yield a mismatch between
the "name" value and the corresponding string in the source data.

6. Acknowledgements

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized
to reproduce and distribute reprints for Governmental purposes
notwithstanding any copyright notation thereon. The views and
conclusions contained herein are those of the authors and should not
be interpreted as necessarily representing the official policies or
endorsements, either expressed or implied, of the Air Force Research
Laboratory, the Defense Advanced Research Projects Agency, or the U.S.
Government.

The authors acknowledge the following contributors to this data set:
Neil Kuster (LDC)
Dave Graff (LDC)
Heather Simpson (LDC)
Robert Parker (LDC)
Hoa Dang (NIST)
Heng Ji (RPI)
Ralph Grishman (NYU)
James Mayfield (JHU)
Mihai Surdeanu (UA)
Paul McNamee (JHU)
Boyan Onyshkevych (DARPA)

7. References

Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014. Overview of
Linguistic Resources for the TAC KBP 2014 Evaluations: Planning,
Execution, and Results. TAC KBP 2014 Workshop: National Institute of
Standards and Technology, Gaithersburg, Maryland, November 17-18.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-2014-overview.pdf

8. Copyright Information

(c) 2016 Trustees of the University of Pennsylvania

9. Contact Information

For further information about this data release, contact the following
project staff at LDC:
Joe Ellis, Project Manager
Jeremy Getman, Lead Annotator
Stephanie Strassel, PI

-----------------------------------------------------------------------------
README created by Neil Kuster on February 12, 2016
  updated by Jeremy Getman on March 1, 2016
  updated by Neil Kuster on April 27, 2016
  updated by Neil Kuster on September 14, 2016
  updated by Joe Ellis on October 7, 2016
  updated by Joe Ellis on November 28, 2016