TAC KBP Comprehensive English Source Corpora 2009-2014

Authors: Joe Ellis, Jeremy Getman, Dave Graff, Stephanie Strassel


1. Overview

This package contains the English source documents used in support of the
TAC KBP tasks from 2009-2014.

The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base, and that can extract
novel information about entities from a document collection and add it to a
new or existing knowledge base.

This package contains the complete set of English source documents used in
multiple TAC KBP evaluations conducted from 2009 to 2014. The documents are
collected here as a companion to forthcoming TAC KBP data releases, all of
which will require these same sets of source documents. The data included in
this package were originally released by LDC to TAC KBP coordinators and
participants under the following ecorpora catalog IDs and titles:

  LDC2010E12: TAC 2010 KBP Source Data V1.1
  LDC2014E13: TAC 2014 KBP English Source Corpus


2. Introduction

This release of TAC KBP English Source text data comprises a total of
3,877,207 distinct documents, each with a unique identifier (docid). These
documents have been used in one or more NIST evaluations of TAC KBP systems
over the course of six years, from 2009 to 2014. During this span of time,
four distinct test sets were defined, as follows:

  +--------------+----------------+
  | Eval.Year(s) | Documents Used |
  +--------------+----------------+
  | 2009         |      1,289,649 |
  +--------------+----------------+
  | 2010-2011    |      1,777,888 |
  +--------------+----------------+
  | 2012         |      3,778,144 |
  +--------------+----------------+
  | 2013-2014    |      2,099,319 |
  +--------------+----------------+

Many of the documents were used in two or three of the evaluations, and had
been released to TAC KBP performers in two or three separate evaluation
packages. This is because, in 2010 and again in 2012, new documents were
added to the original 2009 collection but none were removed. In the 2013
collection, however, some documents used in previous evaluations were
removed.

Apart from partial overlaps of document inventory, the various releases also
differed with respect to how the data were organized and presented to users.

In this comprehensive release, all the documents are organized into a set of
637 "zip" archive files, such that each document appears as a separate data
file within one particular zip archive. (There is no duplication of
documents across the zip archive files.) The "docs" directory contains a set
of listings and tables (and the following sections give instructions) for
reconstituting each of the test sets from the zip files in this release.

The zip archive format was chosen for overall compactness, and for ease and
efficiency of use. Each zip archive is simply an assembly of text files with
bare docids as file names. There is no internal directory structure within
the zip archives.
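As a quick illustration of this layout, the standard Info-ZIP command line
tools can be used to inspect any of the archives directly (a minimal sketch,
assuming "unzip" and "zipinfo" are installed and that the corpus root is the
current working directory; the archive named here is one of those described
in section 3.1 below):

  # list the data files in one archive -- each entry is a bare docid,
  # with no internal directory paths
  unzip -l data/AFP_ENG_200701.zip | head

  # print just the docids, one per line, and count them
  zipinfo -1 data/AFP_ENG_200701.zip | wc -l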
Users who prefer to have the text files accessible in uncompressed form on a
file system can decide for themselves how to organize the extracted files
into directories of their own choosing. (Bear in mind that some file
systems, and some typical utility programs, such as unix/linux "ls" and
"find", will show degraded performance when there are nearly four million
files in a single directory.)

Another alternative is to keep the zip archive files themselves as the "file
system" for accessing the data, and to use any common "unzip" command line
utility, or any scripting language with a suitable zip-archive library
module, to access chosen documents (or all documents) as needed. Some
scenarios are explained in the following sections.


3.0 Data Structure

3.1 The "data" directory:

The "data" directory contains the 637 zip archive files; with only one
exception, each zip file is named using an initial substring of the docids
of all documents contained in the zip file. For example:

  zip file: AFP_ENG_200701.zip
  contains: all documents whose docids begin with "AFP_ENG_200701" (i.e.,
            documents from AFP_ENG published in January 2007), such as
            "AFP_ENG_20070101.0001"

  zip file: bolt-eng-DF-170.zip
  contains: all documents whose docids begin with "bolt-eng-DF-170", such
            as "bolt-eng-DF-170-181103-15978491"

  zip file: eng-NG-31-1000.zip
  contains: all documents whose docids begin with "eng-NG-31-1000", such
            as "eng-NG-31-100001-10757252"

Note that the docid substring used as the zip archive name is always 15
characters long for BOLT discussion-forum data (bolt-eng-DF-*, nine zip
files), and is always 14 characters long for the news-group / web-log and
newswire data (eng-NG-* / eng-WL-*, and [A-X]*_ENG_20*).

The only exception to that rule is a single zip archive called:

  2009_misc_docs.zip

This one archive contains 10,912 documents from a wide assortment of
sources, including transcripts of conversational telephone speech and
various TV/radio broadcasts. Some of these documents come from newswire
sources that are found in the other, more regular zip archives, but are
drawn from earlier epochs with very few documents from a given month. For
many of the documents in this set, there is substantial variation in the
size and nature of the docid strings that were assigned or maintained for
use in TAC KBP, with no common substring that serves to group them by genre
or epoch. In effect, this zip file contains all the docids from the 2009
TAC KBP test set whose initial substrings do not match the names of any
other zip archives in this release.

3.2 The "docs" directory:

The contents of the "docs" directory are listed and explained below. Files
whose names end in ".list" are just lists of zip-file names or docids, one
per line; file names ending in ".tab" are flat tables with two or three
tab-delimited columns per line.

-- all_zipid_ndocs_unzipmb.tab

   637 rows (one per zip archive file), 3 columns per row:
     (a) name of the zip archive file
     (b) number of document text files contained in the archive
     (c) total megabytes of data when all documents are uncompressed

-- all_zipid_docid_evalyrs.tab

   3,877,207 rows (one per document / docid), 3 columns per row:
     (a) name of the zip archive file containing the doc (minus ".zip")
     (b) full docid string (unique to each doc)
     (c) evaluation test year(s) that used this docid (comma separated)

-- all_files.md5

   Paths (relative to the root of the corpus) and md5 checksums for all
   files included in the package.
-- 20*_docid.list  (four files)

   Each file contains just the set of docids (one per line) used in each
   test set. (See section 2 above for line counts per file.)

-- 2009_docid_sourcecorpus.tab

   For the docids that make up the 2009 test set inventory, this table
   lists the docid (column 1) and the LDC catalog-ID of the corpus in which
   the original document was first released (column 2).

-- 2009_zipids_all.list, 2012_zipids_all.list

   These two files list the zip archive names that are needed to build the
   2009 and 2012 test sets. For each of these tests, the document inventory
   comprises all the contents of the zip files in the respective list (139
   and 628 zip archives, respectively).

-- 2010_2011_zipids_use_all.list, 2010_2011_zipids_use_part.list
   2013_2014_zipids_use_all.list, 2013_2014_zipids_use_part.list

   These four files list the zip archive names that are needed to build the
   2010-2011 and 2013-2014 test sets. The 2010-11 test set has no documents
   in common with the 2013-14 test set, but both draw material from the
   "eng-NG" and "eng-WL" sources. Due to the way that documents from these
   sources were partitioned into the two test sets, there are 363 zip
   archives that contain some documents used only in 2010-11 and some
   documents used only in 2013-14; the two "*_zipids_use_part" lists
   identify the 363 zip archives from which distinct subsets of documents
   must be extracted in order to complete each test set (in fact, these two
   lists are identical). The two "*_zipids_use_all" lists identify the zip
   files whose contents belong entirely to one test set or the other (149
   zip files for 2010-11, 125 zip files for 2013-14; no zip archives in
   common between these two lists).

-- 2010_2011_zipid_docid_subsets, 2013_2014_zipid_docid_subsets

   These two subdirectories each contain 363 files named as follows:
   "{zipid}_unzip.list" (e.g. "eng-NG-31-1000_unzip.list",
   "eng-WL-11-9924_unzip.list", etc.). Each file contains a list of docids
   to be extracted from the corresponding zip archive in order to complete
   the given test-set inventory. Section 4 below gives some examples of how
   to use these files.


4.0 Some procedures for extracting the document inventory for a given test set

4.1 Doing a full extraction of all zip archive contents

As indicated above, users may choose to simply extract all documents from
all the zip archives, and have them available as uncompressed text files,
each one having its docid as the file name. But depending on the user's
computing environment, some care may be needed to avoid having too many
files in one directory, or even too many files on one disk. Users can sum
the 2nd and 3rd columns of the table file "all_zipid_ndocs_unzipmb.tab" to
check file and megabyte counts for the corpus as a whole or for any chosen
subset (an example appears at the end of this section).

One simple and relatively safe approach would be to extract each zip
archive into a separate directory, using the zip archive name as the
directory name; this yields 637 directories, with each directory containing
between 8 and 49,162 document files. Here's one way to do this, via a
'bash' shell loop, with the "data" directory as the current working
directory:

  for i in *.zip
  do
    j=`basename $i .zip`
    mkdir $j
    unzip -q -d $j $i
  done

(The "-d path" option on the unzip command line causes all extracted data
files to be saved in the given path. The "-q" option avoids having long
lists of docids printed to the screen.)
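As an example of the disk-usage check mentioned above, the following
commands (a minimal sketch, assuming standard "awk" and "grep" and the
corpus root as the current working directory) sum columns 2 and 3 of
"all_zipid_ndocs_unzipmb.tab", first over the whole corpus and then over a
chosen subset of archives:

  # total document count and uncompressed megabytes for the whole corpus
  awk -F'\t' '{nd += $2; mb += $3} END {print nd, "docs,", mb, "MB"}' \
      docs/all_zipid_ndocs_unzipmb.tab

  # the same totals, restricted to the XIN_ENG archives
  grep '^XIN_ENG' docs/all_zipid_ndocs_unzipmb.tab | \
      awk -F'\t' '{nd += $2; mb += $3} END {print nd, "docs,", mb, "MB"}'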
4.2 Pulling documents from zip archives as needed

For processes that operate as "stdin - stdout" stream filters, the "unzip"
command supports extracting data files to stdout, by using the "-p" (pipe)
option. Here's an example extracting all the documents in one archive and
passing them as a stream to some other process:

  unzip -p data/XIN_ENG_201012.zip | some_other_process ...

This method can be used to extract a specific document, or any set of
documents, by supplying docid strings after the zip file name:

  unzip -p data/XIN_ENG_201012.zip XIN_ENG_20101201.0001 | some_process

The next example takes all the XIN_ENG docids from Christmas day 2010 (used
in the 2012 test set), and presents these to the "unzip" command (via the
unix/linux "xargs" utility):

  grep XIN_ENG_20101225 docs/2012_docid.list | \
    xargs unzip -p data/XIN_ENG_201012.zip | some_process

4.3 Extracting documents for a specific test set

The previous example demonstrates a method that, together with the
"*_use_part.list" files and "*_zipid_docid_subsets" directories described
in section 3.2 above, can be used to extract just the documents needed for
a particular test set. Here's an approach based on the shell loop in 4.1
above, but this time extracting all the 2010-11 test set documents:

  for i in docs/2010_2011_zipid_docid_subsets/*
  do
    j=`basename $i _unzip.list`
    mkdir -p 2010-11-testset/$j
    cat $i | xargs unzip -q -d 2010-11-testset/$j data/$j.zip
  done

That takes care of the 363 zip archives whose contents also include some
documents for the 2013-14 test set, without extracting the documents for
those later tests. Next, we still need to extract from the zip files that
contain only 2010-11 material:

  for i in `cat docs/2010_2011_zipids_use_all.list`
  do
    j=`basename $i .zip`
    mkdir 2010-11-testset/$j
    unzip -q -d 2010-11-testset/$j data/$i
  done

When that's done, there will be 512 subdirectories in 2010-11-testset/, and
taken together, these will contain all 1,777,888 documents used in the 2010
and 2011 test sets.

Note that because the zip archive format provides efficient indexing of
compressed content based on the names of the data files in the archive, it
is generally faster and more efficient to pull text as needed directly from
the zip archives, whether for bulk processing of the data or for retrieval
of a specific docid.


5. Acknowledgements

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes notwithstanding
any copyright notation thereon. The views and conclusions contained herein
are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or
implied, of the Air Force Research Laboratory and Defense Advanced Research
Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:

  Heather Simpson (LDC)
  Robert Parker (LDC)
  Hoa Dang (NIST)
  Heng Ji (RPI)
  Ralph Grishman (NYU)
  James Mayfield (JHU)
  Mihai Surdeanu (UA)
  Margaret Mitchell (Microsoft)
  Claire Cardie (Cornell)
  Javier Artiles (Slice Technologies)
  Paul McNamee (JHU)
  Boyan Onyshkevych (DARPA)


6. Copyright Information

(c) 2016 Trustees of the University of Pennsylvania
7. Contact Information

For further information about this data release, contact the following
project staff at LDC:

  Joseph Ellis, Project Manager
  Dave Graff, Technical Lead
  Stephanie Strassel, PI

--------------------------------------------------------------------------
README created by Dana Fore, December 1, 2015
       updated by Joe Ellis, December 2, 2015
       updated by Dana Fore, January 29, 2016