TAC KBP Evaluation Source Corpora 2016-2017

            Authors: Joe Ellis, Jeremy Getman, Stephanie Strassel


1. Overview

This package contains the evaluation source corpora developed in support 
of all TAC KBP evaluation tracks conducted in 2016 and 2017.

The Text Analysis Conference (TAC) is a series of workshops organized by
the National Institute of Standards and Technology (NIST). TAC was
developed to encourage research in natural language processing (NLP) and
related applications by providing a large test collection, common
evaluation procedures, and a forum for researchers to share their
results. Through its various evaluations, the Knowledge Base Population
(KBP) track of TAC encourages the development of systems that can match
entities mentioned in natural texts with those appearing in a knowledge
base, extract novel information about entities from a document
collection, and add that information to a new or existing knowledge
base. More information about the TAC KBP evaluations can be found on the
NIST TAC website, http://www.nist.gov/tac/

This package contains the 90,003 source documents comprising the TAC KBP
2016 evaluation source corpus and the 90,000 source documents comprising
the TAC KBP 2017 evaluation source corpus. Of these, 1,005 documents (505
for the 2016 evaluations, 500 for the 2017 evaluations) were manually
selected by data scouts using a topic-driven approach to ensure
appropriate coverage of certain features in documents slated for
annotation. The other 178,998 documents were automatically selected based
on fuzzy string matches with namestrings annotated in the 1,005
manually selected documents. The 1,005 core corpus documents are not
presented here as standalone sets, but are found within the total set of
180,003 documents. See section 2 below for more details.
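
The exact matching procedure used for automatic selection is not
documented in this README. Purely as an illustration of the general idea,
the following Python sketch flags a document when some window of its text
approximately matches a namestring. The function name, the 0.85
threshold, and the use of difflib are all assumptions made for this
example, not a description of the actual selection pipeline.

```python
from difflib import SequenceMatcher


def fuzzy_contains(namestring, text, threshold=0.85):
    """Return True if some word window of `text` roughly matches `namestring`.

    Hypothetical helper: compares the namestring against every window of
    the same number of whitespace-separated words, case-insensitively.
    """
    name = namestring.lower()
    words = text.lower().split()
    width = len(name.split())
    for i in range(len(words) - width + 1):
        window = " ".join(words[i:i + width])
        # ratio() is a similarity score between 0.0 and 1.0
        if SequenceMatcher(None, name, window).ratio() >= threshold:
            return True
    return False


# A misspelled mention still counts as a match at this threshold.
print(fuzzy_contains("Barack Obama", "President Barak Obama spoke today"))
```

Whitespace-based windows are only meaningful for ENG and SPA text; CMN
text would need a different segmentation step.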

The data included in this package were originally released by LDC to TAC
KBP coordinators and performers under the following e-corpora catalog IDs
and titles:

LDC2016E63: TAC KBP 2016 Evaluation Source Corpus V1.1
LDC2016E64: TAC KBP 2016 Evaluation Core Source Corpus
LDC2017E25: TAC KBP 2017 Evaluation Source Corpus V1.1
LDC2017E51: TAC KBP 2017 Evaluation Core Source Corpus

Summary of data included in this package:
+------+----------+-------+------------------+
| Year | Language | Genre | Source Documents |
+------+----------+-------+------------------+
| 2016 |   CMN    |   DF  |      15000       |
+------+----------+-------+------------------+
| 2016 |   CMN    |   NW  |      15001       |
+------+----------+-------+------------------+
| 2016 |   ENG    |   DF  |      15001       |
+------+----------+-------+------------------+
| 2016 |   ENG    |   NW  |      15001       |
+------+----------+-------+------------------+
| 2016 |   SPA    |   DF  |      14999       |
+------+----------+-------+------------------+
| 2016 |   SPA    |   NW  |      15001       |
+------+----------+-------+------------------+
| 2017 |   CMN    |   DF  |      15000       |
+------+----------+-------+------------------+
| 2017 |   CMN    |   NW  |      15000       |
+------+----------+-------+------------------+
| 2017 |   ENG    |   DF  |      15000       |
+------+----------+-------+------------------+
| 2017 |   ENG    |   NW  |      15000       |
+------+----------+-------+------------------+
| 2017 |   SPA    |   DF  |      15000       |
+------+----------+-------+------------------+
| 2017 |   SPA    |   NW  |      15000       |
+------+----------+-------+------------------+
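
As a quick consistency check, the per-row counts in the table above can
be summed and compared with the yearly totals stated in this section. The
short Python sketch below simply transcribes the table; the variable and
function names are chosen for this example only.

```python
# Document counts transcribed from the summary table above.
COUNTS = {
    (2016, "CMN", "DF"): 15000,
    (2016, "CMN", "NW"): 15001,
    (2016, "ENG", "DF"): 15001,
    (2016, "ENG", "NW"): 15001,
    (2016, "SPA", "DF"): 14999,
    (2016, "SPA", "NW"): 15001,
    (2017, "CMN", "DF"): 15000,
    (2017, "CMN", "NW"): 15000,
    (2017, "ENG", "DF"): 15000,
    (2017, "ENG", "NW"): 15000,
    (2017, "SPA", "DF"): 15000,
    (2017, "SPA", "NW"): 15000,
}


def total_for_year(year):
    """Sum the document counts of all table rows for one evaluation year."""
    return sum(n for (y, _lang, _genre), n in COUNTS.items() if y == year)


print(total_for_year(2016))   # 90003
print(total_for_year(2017))   # 90000
print(sum(COUNTS.values()))   # 180003
```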


2. Contents

./docs/README.txt

  This file.

./data/2016/{cmn,eng,spa}/{df,nw}/*

  These data directories contain the 90,003 XML documents comprising the
  TAC KBP 2016 evaluation source corpus.

./data/2017/{cmn,eng,spa}/{df,nw}/*

  These data directories contain the 90,000 XML documents comprising the
  TAC KBP 2017 evaluation source corpus.

./docs/2016_full_corpus_character_counts.tsv

  This is a list of lengths (in characters) of all source files contained
  in ./data/2016/

./docs/2016_full_corpus_quote_regions.tsv

  This is a list of offset regions that align with quotes in the text,
  provided for each discussion forum document in ./data/2016/

./docs/2016_core_corpus_character_counts.tsv

  This is a list of lengths (in characters) of the 505 core corpus source 
  files selected for the TAC KBP 2016 evaluations. Note that these 
  documents are not presented as a standalone set, but are found within 
  the full 2016 evaluation source corpus in ./data/2016/

./docs/2016_core_corpus_quote_regions.tsv

  This is a list of offset regions that align with quotes in the text,
  provided for each core corpus discussion forum document in ./data/2016/
  As noted above, the core corpus documents are not presented as a 
  standalone set, but are found within ./data/2016/
  
./docs/2017_full_corpus_character_counts.tsv

  This is a list of lengths (in characters) of all source files contained
  in ./data/2017/

./docs/2017_full_corpus_quote_regions.tsv

  This is a list of offset regions that align with quotes in the text,
  provided for each discussion forum document in ./data/2017/

./docs/2017_core_corpus_character_counts.tsv

  This is a list of lengths (in characters) of the 500 core corpus source 
  files selected for the TAC KBP 2017 evaluations. Note that these 
  documents are not presented as a standalone set, but are found within 
  the full 2017 evaluation source corpus in ./data/2017/

./docs/2017_core_corpus_quote_regions.tsv

  This is a list of offset regions that align with quotes in the text,
  provided for each core corpus discussion forum document in ./data/2017/
  As noted above, the core corpus documents are not presented as a 
  standalone set, but are found within ./data/2017/

./dtd/kbp_2016-2017_source_df.dtd

  DTD for all discussion forum (DF) threads in this corpus

./dtd/kbp_2016-2017_source_newswire.dtd

  DTD for all newswire (NW) files in this corpus
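
The data directory layout listed above (year, then language, then genre)
can be walked with a few lines of Python. The sketch below counts the
files in each leaf directory so that the result can be compared with the
summary table in section 1. The function name and the choice to report
missing directories as zero are assumptions made for this example.

```python
from pathlib import Path

YEARS = ("2016", "2017")
LANGS = ("cmn", "eng", "spa")
GENRES = ("df", "nw")


def count_documents(data_root):
    """Return {(year, lang, genre): file count} for ./data/{year}/{lang}/{genre}/."""
    counts = {}
    for year in YEARS:
        for lang in LANGS:
            for genre in GENRES:
                directory = Path(data_root) / year / lang / genre
                if directory.is_dir():
                    counts[(year, lang, genre)] = sum(
                        1 for p in directory.iterdir() if p.is_file()
                    )
                else:
                    # Missing directory: report zero rather than failing.
                    counts[(year, lang, genre)] = 0
    return counts
```

For example, count_documents("./data") run from the package root would
return one entry for each of the twelve year/language/genre combinations.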


3. Newswire Data

The following is a generalization of the newswire markup framework:

  <DOC id="{doc_id_string}" type="{doc_type_label}">
  <HEADLINE>
  ...
  </HEADLINE>
  <DATELINE>
  ...
  </DATELINE>
  <TEXT>
  <P>
  ...
  </P>
  ...
  </TEXT>
  </DOC>

where the HEADLINE and DATELINE tags are optional (not always
present), and the TEXT content may or may not include "<P> ... </P>"
tags (depending on whether or not the "doc_type_label" is "story").

All the newswire files are parseable as XML.

See the relevant DTD (./dtd/kbp_2016-2017_source_newswire.dtd) for exact
details of newswire markup.
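
Because the newswire files are parseable as XML, a standard XML parser
can read them. The following Python sketch, using the standard library's
xml.etree.ElementTree, handles the optional HEADLINE and DATELINE
elements and the optional <P> tags described above. The sample document,
its id value, and the function name are invented for illustration and do
not come from the corpus.

```python
import xml.etree.ElementTree as ET

# Invented sample that follows the generalized markup shown above.
SAMPLE_NW = """<DOC id="EXAMPLE_NW_0001" type="story">
<HEADLINE>
Example headline
</HEADLINE>
<DATELINE>
Example City, Jan. 1
</DATELINE>
<TEXT>
<P>
First paragraph.
</P>
<P>
Second paragraph.
</P>
</TEXT>
</DOC>"""


def parse_newswire(xml_text):
    """Parse one newswire document into a plain dictionary."""
    root = ET.fromstring(xml_text)
    headline = root.findtext("HEADLINE")   # None when the tag is absent
    dateline = root.findtext("DATELINE")   # None when the tag is absent
    text_elem = root.find("TEXT")
    paragraphs = []
    if text_elem is not None:
        paragraphs = ["".join(p.itertext()).strip() for p in text_elem.findall("P")]
        if not paragraphs:
            # No <P> tags: treat the whole TEXT content as a single block.
            paragraphs = ["".join(text_elem.itertext()).strip()]
    return {
        "id": root.get("id"),
        "type": root.get("type"),
        "headline": headline.strip() if headline is not None else None,
        "dateline": dateline.strip() if dateline is not None else None,
        "paragraphs": paragraphs,
    }


doc = parse_newswire(SAMPLE_NW)
print(doc["id"], doc["type"], len(doc["paragraphs"]))
```

Note that an XML parser decodes entities and drops the tags, so positions
within the parsed strings do not correspond to positions in the file as
stored.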

Note that there are 143 instances of double-escaped characters in the 
NW source data. For purposes of correctly counting character offsets, 
these should *not* be unescaped.
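
To illustrate why unescaping matters for offsets, the Python sketch below
locates double-escaped sequences in a raw string and shows how much
shorter the string becomes after one round of unescaping. It assumes that
"double-escaped" refers to sequences such as "&amp;amp;" and that offsets
are counted over the text as stored in the files; the regular expression,
the sample string, and the function name are illustrative only.

```python
import html
import re

# Assumed pattern: an escaped ampersand immediately followed by the rest of
# an entity, e.g. "&amp;amp;" or "&amp;lt;". Illustrative, not exhaustive.
DOUBLE_ESCAPED = re.compile(r"&amp;(?:[A-Za-z]+|#[0-9]+|#[xX][0-9A-Fa-f]+);")


def find_double_escaped(raw_text):
    """Return (offset, matched_string) pairs for double-escaped sequences."""
    return [(m.start(), m.group(0)) for m in DOUBLE_ESCAPED.finditer(raw_text)]


raw = "<P>AT&amp;amp;T shares rose.</P>"   # invented sample line
hits = find_double_escaped(raw)

# One round of unescaping turns "&amp;amp;" into "&amp;", shortening the
# string, so every offset after the match would shift.
shift = len(raw) - len(html.unescape(raw))
print(hits, shift)   # [(5, '&amp;amp;')] 4
```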


4. Discussion Forum Data

Each discussion forum document consists of a continuous run of posts from
a single thread, truncated to approximately 800 words (excluding metadata
and text within <quote> elements). A document taken from a short thread
may comprise the entire thread. A document taken from a longer thread is
a truncated version of its source, though it always starts with the
thread's initial post.

The following is a generalization of the DF thread markup framework, in
which quote elements may be nested to arbitrary depth and other elements
may be present (e.g. "<a...>...</a>" anchor tags):

  <doc id="{doc_id_string}">
  <headline>
  ...
  </headline>
  <post ...>
  ...
  <quote ...>
  ...
  </quote>
  ...
  </post>
  ...
  </doc>
 
All the DF files are parseable as XML.

See the relevant DTD (./dtd/kbp_2016-2017_source_df.dtd) for exact
details of discussion forum markup.
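
Since quote elements can nest to any depth, collecting the text of a post
while leaving out quoted material calls for a small recursive walk. The
Python sketch below does this with xml.etree.ElementTree and then counts
whitespace-separated words per post. The sample thread is invented, the
post and quote attributes elided above are simply omitted from it, and
the function names are chosen for this example.

```python
import xml.etree.ElementTree as ET

# Invented sample that follows the generalized markup shown above.
SAMPLE_DF = """<doc id="EXAMPLE_DF_0001">
<headline>
Example thread title
</headline>
<post>
First post text.
<quote>
Quoted text <quote>nested quote</quote> more quoted text.
</quote>
Reply after the quote with a <a href="http://example.com">link</a>.
</post>
<post>
Second post.
</post>
</doc>"""


def text_outside_quotes(elem):
    """Return the text of `elem`, skipping every <quote> subtree."""
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "quote":
            parts.append(text_outside_quotes(child))
        # Text following a child belongs to the parent, even after a quote.
        parts.append(child.tail or "")
    return "".join(parts)


def post_word_counts(xml_text):
    """Return the number of non-quoted words in each post of one thread."""
    root = ET.fromstring(xml_text)
    return [len(text_outside_quotes(post).split()) for post in root.findall("post")]


print(post_word_counts(SAMPLE_DF))   # [10, 2]
```

Counting by whitespace is only a rough guide for ENG and SPA threads and
is not meaningful for CMN text.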

Note that there are 754 instances of double-escaped characters in the 
DF source data. For purposes of correctly counting character offsets, 
these should *not* be unescaped.


5. Acknowledgements

This material is based on research sponsored by Air Force Research 
Laboratory and Defense Advanced Research Projects Agency under agreement
number FA8750-13-2-0045. The U.S. Government is authorized to reproduce
and distribute reprints for Governmental purposes notwithstanding any 
copyright notation thereon. The views and conclusions contained herein 
are those of the authors and should not be interpreted as necessarily 
representing the official policies or endorsements, either expressed or 
implied, of Air Force Research Laboratory and Defense Advanced Research 
Projects Agency or the U.S. Government. 

The authors acknowledge the following contributors to this data set:
Dave Graff (LDC)
Hoa Dang (NIST)
Boyan Onyshkevych (DARPA)


6. References

Joe Ellis, Jeremy Getman, Neil Kuster, Zhiyi Song, Ann Bies, & Stephanie
M. Strassel. 2016
Overview of Linguistic Resources for the TAC KBP 2016 Evaluations: 
Methodologies and Results 
TAC KBP 2016 Workshop: National Institute of Standards and Technology, 
Gaithersburg, MD, November 14-15 

Jeremy Getman, Joe Ellis, Zhiyi Song, Jennifer Tracey, & Stephanie M.
Strassel. 2017
Overview of Linguistic Resources for the TAC KBP 2017 Evaluations: 
Methodologies and Results 
TAC KBP 2017 Workshop: National Institute of Standards and Technology, 
Gaithersburg, MD, November 13-14 


7. Copyright Information

(c) 2018 Trustees of the University of Pennsylvania


8. Contact Information

For further information about this data release, or the TAC KBP
project, contact the following project staff at LDC:

    Jeremy Getman, Project Manager       <jgetman@ldc.upenn.edu>
    Stephanie Strassel, PI               <strassel@ldc.upenn.edu>

-----------------------------------------------------------------------------
README created by Joseph Carlough on March 23, 2018
       updated by Jeremy Getman on May 11, 2018
       updated by Jeremy Getman on May 18, 2018