TAC KBP Chinese Regular Slot Filling Comprehensive Training and Evaluation
Data 2014

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains training and evaluation data produced in support of
the TAC KBP Chinese Regular Slot Filling evaluation track conducted in 2014.

The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base, as well as extract
novel information about entities from a document collection and add it to a
new or existing knowledge base.

The regular Slot Filling (SF) evaluation track, for Chinese as for English,
involves mining information about entities from text. SF can be viewed as
more traditional Information Extraction (IE) or, alternatively, as a
Question Answering (QA) task in which the questions are static but the
targets change. In completing the task, participating systems and LDC
annotators searched a corpus for information on certain attributes (slots)
of person (PER) and organization (ORG) entities and returned any valid
responses (slot fillers). For more information about Chinese SF, please
refer to the 2014 TAC KBP home page (2014 was the only year in which
Chinese SF was conducted) at http://www.nist.gov/tac.

This package contains all evaluation and training data developed in support
of TAC KBP Chinese Regular Slot Filling. This includes queries, the manual
run produced by LDC annotators, and the final assessment results. This
release also contains the complete set of Chinese source documents for the
Chinese Slot Filling evaluation.

The data included in this package were originally released by LDC to TAC
KBP coordinators and performers under the following ecorpora catalog IDs
and titles:

  LDC2013E45:  TAC 2013 KBP Source Corpus
  LDC2014E29:  TAC 2014 KBP Chinese Source Corpus
  LDC2014E123: TAC KBP 2014 Chinese Regular Slot Filling Training Data
  LDC2015E01:  TAC KBP 2014 Chinese Regular Slot Filling Evaluation Queries
               and Manual Run
  LDC2015E67:  TAC KBP 2014 Chinese Regular Slot Filling Evaluation
               Assessment Results V2.0
  LDC2016E35:  TAC KBP Chinese Regular Slot Filling Comprehensive Training
               and Evaluation Data 2014

Summary of data included in this package:

  Query Data:
  +------------+-----+-----+-------+
  | set        | PER | ORG | total |
  +------------+-----+-----+-------+
  | training   |  17 |  15 |    32 |
  | evaluation |  51 |  52 |   103 |
  +------------+-----+-----+-------+

  Manual Response Data:
  +------------+------------------+
  | set        | manual responses |
  +------------+------------------+
  | training   |              967 |
  | evaluation |             2858 |
  +------------+------------------+

  Assessment Data:
  +----------+-----------+
  | assessed | assessed  |
  | files    | responses |
  +----------+-----------+
  |     2107 |      2878 |
  +----------+-----------+

2. Contents

./docs/README.txt

  This file.

./data/eval/assessment/*

  This directory holds 2107 assessment files, which together contain a
  total of 2878 assessed responses. Assessment was performed on a set of
  pooled responses provided by NIST that includes fillers and justification
  returned by both systems and LDC annotators.
  Note that 1188 of the files in this directory are empty, indicating that
  no responses were returned for the particular query/slot combination.
  There is one file for each combination of query entity and slot for all
  of the queries found in
  ./data/eval/tac_kbp_2014_chinese_regular_slot_filling_evaluation_queries.xml

  The assessment results files contain 8 tab-delimited fields, defined as
  follows:

  Column 1: Response ID
  Column 2: A concatenation of the SF query ID and the relevant slot name,
            separated by a colon
  Column 3: Provenance for the relation between the query entity and the
            slot filler, consisting of up to 4 triples in the format
            'docid:startoffset-endoffset', separated by commas
  Column 4: A slot filler (possibly normalized, e.g., for dates; otherwise,
            it should appear in the provenance document)
  Column 5: Provenance for the slot filler string. This is either a single
            span (docid:startoffset-endoffset) from the document where the
            canonical slot filler string was extracted, or (in the case
            where the slot filler string has been normalized) a set of up
            to two docid:startoffset-endoffset spans, separated by a comma,
            that indicate the base strings used to generate the normalized
            slot filler string.
  Column 6: Assessment of the slot filler with respect to the text regions
            defined by the relation provenance and filler provenance.
            Values can be:
              C - Correct and not in live Wikipedia at the time the data
                  were created (late 2014)
              R - Correct and Redundant with what was in live Wikipedia in
                  late 2014
              X - Inexact
              W - Wrong
              I - Ignored because the sum of the lengths of the provenance
                  spans is too long
  Column 7: Assessment of the relation provenance. Values can be:
              C - Correct
              L - Inexact (Long)
              S - Inexact (Short)
              W - Wrong
              I - Ignored because the sum of the lengths of the provenance
                  spans is too long
  Column 8: LDC equivalence class of the Column 4 slot filler, if Column 6
            is Correct or Redundant. If the response is marked wrong ('W'
            in Column 6), this column contains a 0, indicating that no
            coreference was performed.
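  For orientation, the following is a minimal parsing sketch in Python for
  one assessment file, following the column definitions above. It is not
  part of the release (the official scorer is ./tools/SFScore2014.java),
  and the file name in the usage example is hypothetical.

    # Minimal sketch, not an official tool: read one tab-delimited
    # assessment file according to the 8-column definitions above.

    def read_assessment(path):
        """Yield one dict per assessed response; empty files yield nothing."""
        fields = ["response_id", "query_slot", "relation_provenance",
                  "filler", "filler_provenance", "filler_judgment",
                  "provenance_judgment", "equivalence_class"]
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.rstrip("\n")
                if not line:
                    continue
                rec = dict(zip(fields, line.split("\t")))
                # Column 2 is "<query_id>:<slot_name>"; slot names themselves
                # contain a colon (e.g., "per:title"), so split only once.
                rec["query_id"], rec["slot"] = rec["query_slot"].split(":", 1)
                yield rec

    # Hypothetical usage: tally filler judgments (Column 6) for one file.
    counts = {}
    for rec in read_assessment("data/eval/assessment/SF14_CMN_001_per_title"):
        counts[rec["filler_judgment"]] = counts.get(rec["filler_judgment"], 0) + 1
    print(counts)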
./data/eval/tac_kbp_2014_chinese_regular_slot_filling_evaluation_queries.xml

  This file contains 51 person (PER) queries and 52 organization (ORG)
  queries. Each query consists of the following 5 elements (see the DTD for
  more details):

    1. The namestring of the entity
    2. The entity's type (PER or ORG)
    3. The source document
    4. The start offset
    5. The end offset

  Note that each Chinese Slot Filling query has an identifier, formatted as
  "SF14_CMN_" plus a three-digit integer value (e.g., "001").

./data/eval/tac_kbp_2014_chinese_regular_slot_filling_evaluation_manual_run.tab

  This file contains the results of LDC's time-limited manual run over the
  2014 Chinese Slot Filling evaluation queries. These responses represent
  all of the unique fillers that annotators were able to find in the corpus
  for each of the queries.

  The responses file is tab-delimited, with 7 fields total; a short parsing
  sketch appears below, after the training data entries. The column
  descriptions are as follows:

    1. Query ID - the slot filling query list ID for the entity

    2. Slot name - the name of the slot for the filler

    3. System ID - the ID of the system that generated the response; always
       "LDC" in these data

    4. Relation provenance - provenance for the relation between the query
       entity and the slot filler, consisting of up to 4 triples in the
       format docid:startoffset-endoffset, separated by commas. Each of
       these individual spans is at most 150 UTF-8 characters. For each
       slot for which annotators could not find any new information,
       column 4 states "NIL" and columns 5 through 7 are empty.

    5. Slot filler - the (possibly normalized) response. Note that, if
       column 4 contains the word "NIL", this column will be blank.

    6. Filler provenance - either a single span (docid:startoffset-endoffset)
       from the document indicating where the canonical slot filler string
       was extracted, or (for cases in which the slot filler string in
       column 5 has been normalized) a set of up to two
       docid:startoffset-endoffset spans for the base strings used to
       generate the normalized slot filler string. As in column 4, multiple
       spans are separated by commas.

    7. Confidence score - a confidence score for the response; always "1.0"
       in these data. Note that, if column 4 contains the word "NIL", this
       column will be blank.

./data/source_corpus/{cmn_df_doclist,cmn_nw_doclist,cmn_wb_doclist}

  The three Chinese doclist files in this directory list all of the Chinese
  source documents included in this package, by genre type. Note that the
  Chinese SF evaluation conducted in 2014 was monolingual and so, for the
  purposes of recreating that evaluation, only the sources included in
  these three lists should be used.

./data/source_corpus/{eng_ng_doclist,eng_nw_doclist}

  The two English doclist files in this directory list all of the English
  source documents included in this package, by genre type. Although the
  Chinese SF evaluation conducted in 2014 was monolingual, a few English
  responses were returned by systems. In the interest of producing training
  data for future cross-lingual evaluations, assessors were instructed to
  treat these responses as potentially valid. Thus, the responses and
  source documents are included in this package but should not be used for
  the purpose of recreating the evaluation.

./data/source_corpus/chinese/discussion_forums/*

  The concatenated files in this directory contain 199,321 Chinese
  discussion forum documents selected from BOLT Phase 1 discussion forum
  source data releases (LDC2012E04, LDC2012E16, LDC2012E21, and LDC2012E54,
  which have not yet been released in the general catalog). Each forum
  thread includes at least 5 posts.

./data/source_corpus/chinese/newswire/*

  The concatenated files in this directory contain 2,000,256 documents
  selected from Chinese Gigaword Fifth Edition (LDC2011T13).

./data/source_corpus/chinese/web/*

  The concatenated files in this directory contain 815,886 Chinese
  (Mandarin) web documents selected from various GALE web collections.

./data/source_corpus/english/newswire/*

  The concatenated file in this directory contains 34 documents drawn from
  the TAC KBP 2014 English source corpus.

./data/source_corpus/english/newsgroup/*

  The single XML file in this directory was drawn from the TAC KBP 2014
  English source corpus.

./data/training/tac_kbp_2014_chinese_regular_slot_filling_training_queries.xml

  This file contains 17 person (PER) queries and 15 organization (ORG)
  queries. Each query is structured the same as the evaluation queries
  described above and in the DTD. Note that each Chinese Slot Filling
  training query has an identifier, formatted as "SF14_CMN_TRAINING" plus a
  three-digit integer value (e.g., "001").

./data/training/tac_kbp_2014_chinese_regular_slot_filling_training_manual_run.tab

  This file contains the results of LDC's time-limited manual run over the
  2014 Chinese Slot Filling training queries. These responses represent all
  of the unique fillers that annotators were able to find in the corpus for
  each of the queries. The manual run for the training data is formatted
  the same as the evaluation manual run described above.
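  The sketch below (not part of the release) shows one way to read either
  manual run .tab file in Python, splitting the provenance strings into
  (docid, start, end) spans and skipping NIL rows; the function names are
  illustrative only.

    # Minimal sketch, not an official tool: read a manual run .tab file
    # (7 columns, as described above) and split provenance spans.

    def parse_spans(prov):
        """'doc1:10-25,doc2:3-9' -> [('doc1', 10, 25), ('doc2', 3, 9)]"""
        spans = []
        for part in prov.split(","):
            docid, offsets = part.rsplit(":", 1)
            start, end = offsets.split("-")
            spans.append((docid, int(start), int(end)))
        return spans

    def read_manual_run(path):
        """Yield one dict per non-NIL response line."""
        with open(path, encoding="utf-8") as f:
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) < 4:
                    continue
                query_id, slot, system_id, relation_prov = cols[:4]
                if relation_prov == "NIL":  # no filler found; cols 5-7 empty
                    continue
                yield {
                    "query_id": query_id,
                    "slot": slot,
                    "system_id": system_id,
                    "relation_spans": parse_spans(relation_prov),
                    "filler": cols[4],
                    "filler_spans": parse_spans(cols[5]),
                    "confidence": cols[6],
                }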
./docs/all_files.md5

  Paths (relative to the root of the corpus) and md5 checksums for all
  files included in the package.

./docs/guidelines/*

  The guidelines used by annotators in developing the 2014 Chinese Regular
  Slot Filling queries, gold standard data, and assessments contained in
  this corpus.

./docs/task_descriptions/KBP2014_TaskDefinition_EnglishSlotFilling_1.1.pdf

  Task description for the 2014 English Regular Slot Filling evaluation
  track, written by the track coordinators. Note that although English is
  specified, it otherwise accurately describes the specifics and formats of
  the 2014 Chinese SF task as well.

./dtd/kbpslotfill.dtd

  The DTD against which to validate both the training and evaluation
  queries files specified above.

./tools/SFScore2014.java

  Scorer for 2014 regular slot filling submission files, as provided to LDC
  by the evaluation track coordinators, with no further testing.

3. Source Corpus Information

All the *.gz files, when uncompressed, comprise a concatenated stream of
document units, each of which is presented as an independent XML (or
XML-like) structure. There is no XML element that spans the concatenated
documents, so each *.gz file is not parseable as an XML stream, but XML
tags are used throughout to mark document boundaries.

All the text data in the *.gz files have been taken directly from previous
LDC corpus releases, and are being provided here essentially "as-is", with
little or no additional quality control. An overall scan of character
content in the uncompressed data indicates some relatively small quantities
of various problems, especially in the web and discussion forum data,
including language mismatch (characters from Chinese, Korean, Japanese,
Arabic, Russian, etc.) and encoding errors (some documents have apparently
undergone "double encoding" into UTF-8, and others may have been "noisy" to
begin with, or may have gone through an improper encoding conversion,
yielding 167,692 occurrences of the Unicode "replacement character"
(U+FFFD) throughout the corpus); there are over 1000 characters whose
Unicode code points lie outside the Basic Multilingual Plane (BMP), i.e.,
above U+FFFF.

Special note: the file './data/source_corpus/web/cmn-WL-31-111.gz' contains
two lines with a total of six UTF-8 "characters" that map to the "Surrogate
Pair" portion of the Unicode character space (between 0xD800 and 0xDFFF);
some UTF-8-aware processes (e.g., modern Ruby scripts) will fail when
trying to read data from this file. The offending characters are on lines
262932 and 262953 of the file when it is uncompressed.

When uncompressed, these documents as a whole yield 12.5 GB of data, and
over 5.32 billion UTF-8 characters.
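One way to work with these streams is sketched below (not part of the
release): it iterates over document units in a *.gz stream by watching for
DOC end tags rather than attempting a strict XML parse. Alternatively, as
noted in sections 3.1 and 3.2 below, applying a synthetic "root" tag at the
beginning and end of a stream makes the newswire and discussion forum data
parseable as XML. The file name in the usage line is illustrative only.

  # Minimal sketch, not an official tool: iterate over document units in
  # one concatenated *.gz stream without relying on strict XML parsing,
  # which the web data in particular can break.

  import gzip

  def iter_docs(gz_path):
      """Yield each document unit (markup included) as a string."""
      buf = []
      # errors="replace" guards against the encoding problems noted above
      # (e.g., the stray surrogate code points in cmn-WL-31-111.gz).
      with gzip.open(gz_path, "rt", encoding="utf-8", errors="replace") as f:
          for line in f:
              buf.append(line)
              # The document end tag may appear as </DOC> or </doc>
              # depending on the data set; compare case-insensitively.
              if line.strip().lower() == "</doc>":
                  yield "".join(buf)
                  buf = []

  # Illustrative usage: count document units in one stream.
  print(sum(1 for _ in iter_docs("data/source_corpus/chinese/newswire/example.gz")))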

The subsections below go into more detail regarding the markup and other
properties of the various data subsets.

3.1 Newswire Data

Newswire data use the following markup framework:

  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present), and
the TEXT content may or may not include "<P> ... </P>" tags (depending on
whether or not the "doc_type_label" is "story"). If a suitable "container"
or "root" tag is applied at the beginning and end of each *.gz stream, all
the newswire files are parseable as XML.

3.2 Discussion Forum Data

Discussion forum files use the following markup framework:

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post ...>
  ...
  <quote ...>
  ...
  </quote>
  ...
  </post>
  ...
  </doc>

where there may be arbitrarily deep nesting of quote elements, and other
elements may be present (e.g., "<a ...> ... </a>" anchor tags). As
mentioned in section 2 above, each unit contains at least five post
elements. If a suitable "container" or "root" tag is applied at the
beginning and end of each *.gz stream, all the discussion forum files are
parseable as XML.

3.3 Web Document Data

"Web" files use the following markup framework:

  <DOC>
  <DOCID> {doc_id_string} </DOCID>
  <DOCTYPE> ... </DOCTYPE>
  <DATETIME> ... </DATETIME>
  <BODY>
  <HEADLINE>
  ...
  </HEADLINE>
  <TEXT>
  <POST>
  <POSTER> ... </POSTER>
  <POSTDATE> ... </POSTDATE>
  ...
  </POST>
  </TEXT>
  </BODY>
  </DOC>

where other kinds of tags may also be present. 359 of the 363 *.gz data
streams contain material that interferes with XML parsing (e.g., unescaped
"&" characters, or start tags that lack a corresponding end tag).

4. Annotation tasks

The tasks conducted by LDC annotators in support of the Chinese Regular
Slot Filling (SF) evaluation were query and manual run development, and
assessment of system- and human-produced responses to queries.

4.1 Query and Manual Run Development

Entities, which are the basis of SF queries, are selected based on their
level of non-confusability and productivity. A candidate query entity
mention is considered non-confusable if it is "canonical", meaning that it
is not an alias and includes more than just a first or last name. Mentions
with objectionable content are also excluded. Productivity for candidate
queries is determined by searching the source corpus to find whether it
contains at least two slot fillers for the entity. Annotators were
additionally required to check Wikipedia when considering potential query
entities so as to avoid entities for which the online resource would
indicate too many correct responses.

Individual SF query entities are also selected based on the degree to which
they help to balance certain features across the full set of queries.
Specifically, these features are entity types (person and organization) and
response types for slots (i.e., those that take named entities as fillers,
those that take values (dates and numbers) as fillers, and those that take
strings as fillers).

Concurrent with query development for Chinese SF in 2014, LDC annotators
produced the "manual run", or the human-produced set of responses for each
of the evaluation queries. While these two tasks had historically been
conducted separately for English SF, in 2014 they were combined in order to
leverage the knowledge about entities that annotators acquire while
researching candidates during query development. During the manual run,
annotators are given up to two hours per query to search the corpus and
locate all valid fillers. In the event that annotators feel they have found
all unique responses in less than the two hours provided, they also return
some duplicate fillers in order to provide more training data for systems
in the future.

Justification - the minimum extent of provenance supporting the validity of
a slot filler - is also provided as part of the manual run in order to
pinpoint the sources of assertions. Valid justification strings clearly
identify the three elements of a relation (i.e., the subject entity, the
predicate slot, and the object filler) with minimal extraneous text.
In 2014, a response could include up to four justification strings, in
order to support relations that are difficult to justify within a single
document.

Following initial query development, a quality control pass is conducted to
flag or correct as necessary any fillers that do not have adequate
justification in the source document, or that might be at variance with the
guidelines in any way. Any flagged fillers are then adjudicated by senior
annotators, who update, remove, or replace them as appropriate.

4.2 Assessment

In assessment, annotators judge and coreference anonymized slot filler
responses returned for the query set from both the manual run and from
systems. Fillers are marked as correct if they are found to be both
compatible with the slot descriptions and supported by the provided
justification string(s) and/or their surrounding content. Fillers are
assessed as wrong if they do not meet both of the conditions for
correctness, or as inexact if insufficient or extraneous text was selected
for an otherwise correct response.

Justification is assessed as correct if it succinctly and completely
supports the relation; wrong if it does not support the relation at all (or
if the corresponding filler is marked wrong); inexact-short if part but not
all of the information necessary to support the relation was provided; or
inexact-long if it contains all of the information necessary to support the
relation but also a great deal of extraneous text. Responses with
justification comprising more than 600 characters in total are
automatically marked as ignored and are not reviewed during assessment.

After first passes of assessment are completed, quality control is
performed on the data by senior annotators. This quality control pass
ensures that the extents of each annotated filler and justification are
correct and that entities assessed as correct are included in the
appropriate equivalence classes.

5. Using the Data

5.1 Text normalization and offset calculation

Text normalization consisting of a 1-for-1 substitution of newline (0x0A)
and tab (0x09) characters with space (0x20) characters was performed on the
document text input to the response field.

The values of the beg and end XML elements in the queries.xml files
indicate character offsets that identify text extents in the source. Offset
counting starts from the initial character (character 0) of the source
document and includes newlines and all markup characters - that is, the
offsets are based on treating the source document file as "raw text", with
all its markup included.

5.2 Proper ingesting of XML queries

While the character offsets are calculated based on treating the source
document as "raw text", the "name" strings being referenced by the queries
sometimes contain XML metacharacters, and these had to be "re-escaped" for
proper inclusion in the queries.xml file. For example, an actual name like
"AT&T" may show up in a source document file as "AT&amp;T" (because the
source document was originally formatted as XML data). But since the source
document is being treated here as raw text, this name string is treated in
queries.xml as having 8 characters (i.e., the character offsets, when
provided, will point to a string of length 8). However, the "name" element
itself, as presented in the queries.xml file, will be even longer -
"AT&amp;amp;T" - because the queries.xml file is intended to be handled by
an XML parser, which will return "AT&amp;T" when this "name" element is
extracted. Using the queries.xml data without XML parsing would therefore
yield a mismatch between the "name" value and the corresponding string in
the source data.
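To make the two points above concrete, the sketch below (not part of the
release) loads the queries with an XML parser and checks each name against
the raw source text. The element and attribute names used here (query/@id,
name, docid, beg, end) are assumed from the usual KBP slot filling query
layout and should be confirmed against ./dtd/kbpslotfill.dtd, and
get_raw_doc_text() is an assumed helper that returns the raw text of a
source document, markup included.

  # Minimal sketch, not an official tool: parse a queries.xml file and
  # verify that each query's name matches the raw-text span given by
  # beg/end. Element and attribute names are assumptions; see the DTD.

  import xml.etree.ElementTree as ET

  def load_queries(queries_xml_path):
      """Return (query_id, name, docid, beg, end) tuples from a queries file."""
      root = ET.parse(queries_xml_path).getroot()
      out = []
      for q in root.findall("query"):
          out.append((
              q.get("id"),
              # The XML parser un-escapes the re-escaped name (e.g.,
              # "AT&amp;amp;T" in the file becomes "AT&amp;T", which is the
              # raw-text form found in the source document).
              q.findtext("name"),
              q.findtext("docid"),
              int(q.findtext("beg")),
              int(q.findtext("end")),
          ))
      return out

  def check_offsets(queries, get_raw_doc_text):
      """get_raw_doc_text(docid) is an assumed helper returning the source
      document as raw text, markup included."""
      for query_id, name, docid, beg, end in queries:
          raw = get_raw_doc_text(docid)
          span = raw[beg:end + 1]  # offsets count from character 0; end assumed inclusive
          if span != name:
              print(f"{query_id}: {span!r} != {name!r}")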
6. Acknowledgments

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes notwithstanding
any copyright notation thereon. The views and conclusions contained herein
are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or
implied, of the Air Force Research Laboratory, the Defense Advanced
Research Projects Agency, or the U.S. Government.

The authors acknowledge the following contributors to this data set:

  Neil Kuster (LDC)
  Dave Graff (LDC)
  Heather Simpson (LDC)
  Robert Parker (LDC)
  Hoa Dang (NIST)
  Heng Ji (RPI)
  Ralph Grishman (NYU)
  James Mayfield (JHU)
  Mihai Surdeanu (UA)
  Paul McNamee (JHU)
  Boyan Onyshkevych (DARPA)

7. References

Joe Ellis, Jeremy Getman, Stephanie M. Strassel. 2014. Overview of
Linguistic Resources for the TAC KBP 2014 Evaluations: Planning, Execution,
and Results. TAC KBP 2014 Workshop: National Institute of Standards and
Technology, Gaithersburg, Maryland, November 17-18.
https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/tackbp-2014-overview.pdf

8. Copyright Information

(c) 2016 Trustees of the University of Pennsylvania

9. Contact Information

For further information about this data release, contact the following
project staff at LDC:

  Joe Ellis, Project Manager
  Jeremy Getman, Lead Annotator
  Stephanie Strassel, PI

-----------------------------------------------------------------------------
README created by Neil Kuster on February 12, 2016
       updated by Jeremy Getman on March 1, 2016
       updated by Neil Kuster on April 27, 2016
       updated by Neil Kuster on September 14, 2016
       updated by Joe Ellis on October 7, 2016
       updated by Joe Ellis on November 28, 2016