TAC KBP Evaluation Source Corpora 2016-2017

Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel

1. Overview

This package contains the evaluation source corpora developed in support
of all TAC KBP evaluation tracks conducted in 2016 and 2017.

The Text Analysis Conference (TAC) is a series of workshops organized by
the National Institute of Standards and Technology (NIST). TAC was
developed to encourage research in natural language processing (NLP) and
related applications by providing a large test collection, common
evaluation procedures, and a forum for researchers to share their
results. Through its various evaluations, the Knowledge Base Population
(KBP) track of TAC encourages the development of systems that can match
entities mentioned in natural texts with those appearing in a knowledge
base, extract novel information about entities from a document
collection, and add it to a new or existing knowledge base. More
information about the TAC KBP evaluations can be found on the NIST TAC
website, http://www.nist.gov/tac/

This package contains the 90,003 source documents comprising the TAC KBP
2016 evaluation source corpus and the 90,000 source documents comprising
the TAC KBP 2017 evaluation source corpus. Of these, 1,005 documents (505
for the 2016 evaluations, 500 for the 2017 evaluations) were manually
selected by data scouts using a topic-driven approach to ensure
appropriate coverage of certain features in documents slated for
annotation. The other 178,998 documents were automatically selected based
on fuzzy string matches with namestrings annotated in the 1,005 manually
selected documents. The 1,005 core corpus documents are not presented
here as standalone sets, but are found within the total set of 180,003
documents. See section 2 below for more details.
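The document counts given above can be cross-checked with simple
arithmetic. The following Python sketch uses only the figures stated in
this overview; the variable names are illustrative and are not part of
the package.

```python
# Figures as stated in the overview (variable names are illustrative).
full_2016 = 90_003   # documents in the 2016 evaluation source corpus
full_2017 = 90_000   # documents in the 2017 evaluation source corpus
core_2016 = 505      # manually selected documents, 2016
core_2017 = 500      # manually selected documents, 2017

total_docs = full_2016 + full_2017      # all documents in the package
core_docs = core_2016 + core_2017       # all manually selected documents
auto_selected = total_docs - core_docs  # automatically selected documents

print(total_docs, core_docs, auto_selected)
```

The three printed values should agree with the totals of 180,003, 1,005,
and 178,998 documents quoted in the overview.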
The data included in this package were originally released by LDC to TAC
KBP coordinators and performers under the following ecorpora catalog IDs
and titles:

  LDC2016E63: TAC KBP 2016 Evaluation Source Corpus V1.1
  LDC2016E64: TAC KBP 2016 Evaluation Core Source Corpus
  LDC2017E25: TAC KBP 2017 Evaluation Source Corpus V1.1
  LDC2017E51: TAC KBP 2017 Evaluation Core Source Corpus

Summary of data included in this package:

  +------+----------+-------+------------------+
  | Year | Language | Genre | Source Documents |
  +------+----------+-------+------------------+
  | 2016 | CMN      | DF    |            15000 |
  +------+----------+-------+------------------+
  | 2016 | CMN      | NW    |            15001 |
  +------+----------+-------+------------------+
  | 2016 | ENG      | DF    |            15001 |
  +------+----------+-------+------------------+
  | 2016 | ENG      | NW    |            15001 |
  +------+----------+-------+------------------+
  | 2016 | SPA      | DF    |            14999 |
  +------+----------+-------+------------------+
  | 2016 | SPA      | NW    |            15001 |
  +------+----------+-------+------------------+
  | 2017 | CMN      | DF    |            15000 |
  +------+----------+-------+------------------+
  | 2017 | CMN      | NW    |            15000 |
  +------+----------+-------+------------------+
  | 2017 | ENG      | DF    |            15000 |
  +------+----------+-------+------------------+
  | 2017 | ENG      | NW    |            15000 |
  +------+----------+-------+------------------+
  | 2017 | SPA      | DF    |            15000 |
  +------+----------+-------+------------------+
  | 2017 | SPA      | NW    |            15000 |
  +------+----------+-------+------------------+

2. Contents

./docs/README.txt

  This file.

./data/2016/{cmn,eng,spa}/{df,nw}/*

  These data directories contain the 90,003 XML documents comprising the
  TAC KBP 2016 evaluation source corpus.

./data/2017/{cmn,eng,spa}/{df,nw}/*

  These data directories contain the 90,000 XML documents comprising the
  TAC KBP 2017 evaluation source corpus.
./docs/2016_full_corpus_character_counts.tsv

  This is a list of lengths (in characters) of all source files contained
  in ./data/2016/

./docs/2016_full_corpus_quote_regions.tsv

  This is a list of offset regions that align with quotes in the text,
  provided for each discussion forum document in ./data/2016/

./docs/2016_core_corpus_character_counts.tsv

  This is a list of lengths (in characters) of the 505 core corpus source
  files selected for the TAC KBP 2016 evaluations. Note that these
  documents are not presented as a standalone set, but are found within
  the full 2016 evaluation source corpus in ./data/2016/

./docs/2016_core_corpus_quote_regions.tsv

  This is a list of offset regions that align with quotes in the text,
  provided for each core corpus discussion forum document in
  ./data/2016/ As noted above, the core corpus documents are not
  presented as a standalone set, but are found within ./data/2016/

./docs/2017_full_corpus_character_counts.tsv

  This is a list of lengths (in characters) of all source files contained
  in ./data/2017/

./docs/2017_full_corpus_quote_regions.tsv

  This is a list of offset regions that align with quotes in the text,
  provided for each discussion forum document in ./data/2017/

./docs/2017_core_corpus_character_counts.tsv

  This is a list of lengths (in characters) of the 500 core corpus source
  files selected for the TAC KBP 2017 evaluations. Note that these
  documents are not presented as a standalone set, but are found within
  the full 2017 evaluation source corpus in ./data/2017/

./docs/2017_core_corpus_quote_regions.tsv

  This is a list of offset regions that align with quotes in the text,
  provided for each core corpus discussion forum document in
  ./data/2017/ As noted above, the core corpus documents are not
  presented as a standalone set, but are found within ./data/2017/

./dtd/kbp_2016-2017_source_df.dtd

  DTD for all discussion forum (DF) threads in this corpus

./dtd/kbp_2016-2017_source_newswire.dtd

  DTD for all newswire (NW) files in this corpus

3. Newswire Data

The following is a generalization of the newswire markup framework:

  <DOC id="..." type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always present),
and the TEXT content may or may not include "<P>...</P>" tags (depending
on whether or not the "doc_type_label" is "story").

All the newswire files are parseable as XML. See the relevant DTD for
exact details of newswire markup.

Note that there are 143 instances of double-escaped characters in the NW
source data. For purposes of correctly counting character offsets, these
should *not* be unescaped.

4. Discussion Forum Data

Discussion forum documents consist of a continuous run of posts from a
thread, but are limited to approximately 800 words in length (excluding
metadata and text within quote elements). When taken from a short thread,
a document may comprise the entire thread. However, when taken from a
longer thread, a document is a truncated version of its source, though it
will always start with the initial post.

The following is a generalization of the DF thread markup framework, in
which there may also be arbitrarily deep nesting of quote elements, and
other elements may be present (e.g. anchor tags):

  ... ... ... ... ...

All the DF files are parseable as XML. See the relevant DTD for exact
details of discussion forum markup.

Note that there are 754 instances of double-escaped characters in the DF
source data. For purposes of correctly counting character offsets, these
should *not* be unescaped.

5. Acknowledgements

This material is based on research sponsored by Air Force Research
Laboratory and Defense Advanced Research Projects Agency under agreement
number FA8750-13-2-0045. The U.S. Government is authorized to reproduce
and distribute reprints for Governmental purposes notwithstanding any
copyright notation thereon. The views and conclusions contained herein
are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or
implied, of Air Force Research Laboratory and Defense Advanced Research
Projects Agency or the U.S. Government.
The authors acknowledge the following contributors to this data set:

  Dave Graff (LDC)
  Hoa Dang (NIST)
  Boyan Onyshkevych (DARPA)

6. References

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, &
Stephanie M. Strassel. 2016
  Overview of Linguistic Resources for the TAC KBP 2016 Evaluations:
  Methodologies and Results
  TAC KBP 2016 Workshop: National Institute of Standards and Technology,
  Gaithersburg, MD, November 14-15

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, &
Stephanie M. Strassel. 2017
  Overview of Linguistic Resources for the TAC KBP 2017 Evaluations:
  Methodologies and Results
  TAC KBP 2017 Workshop: National Institute of Standards and Technology,
  Gaithersburg, MD, November 13-14

7. Copyright Information

(c) 2018 Trustees of the University of Pennsylvania

8. Contact Information

For further information about this data release, or the TAC KBP project,
contact the following project staff at LDC:

  Jeremy Getman, Project Manager
  Stephanie Strassel, PI

-----------------------------------------------------------------------------
README created by Joseph Carlough on March 23, 2018
       updated by Jeremy Getman on May 11, 2018
       updated by Jeremy Getman on May 18, 2018
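
As an illustration of the notes on double-escaped characters in sections
3 and 4, the following Python sketch shows how unescaping shifts
character offsets. The sample string is invented for illustration and is
not taken from the corpus, and the sketch assumes that offsets are
counted over the text exactly as it is stored in the source file.

```python
import html

# Hypothetical fragment containing a double-escaped ampersand ("&amp;amp;").
raw = "<P>AT&amp;amp;T shares rose.</P>"

# html.unescape resolves one level of escaping per call.
once = html.unescape(raw)    # one level removed: "&amp;amp;" becomes "&amp;"
twice = html.unescape(once)  # second level removed: "&amp;" becomes "&"

# Offset of the same word in each version of the text.
raw_offset = raw.index("shares")
once_offset = once.index("shares")
twice_offset = twice.index("shares")

# Each unescaping pass shortens the text, so later offsets drift.
print(len(raw), len(once), len(twice))
print(raw_offset, once_offset, twice_offset)
```

Under that assumption, only the offset computed from the raw string
stays aligned with the file contents, which is why the double-escaped
sequences should be left untouched when counting characters.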