TAC KBP Comprehensive English Source Corpora 2009-2014

Authors: Joe Ellis, Jeremy Getman, Dave Graff, Stephanie Strassel


1. Overview

This package contains the English source documents used in support of the
TAC KBP tasks from 2009-2014.

The Text Analysis Conference (TAC) is a series of workshops organized by the
National Institute of Standards and Technology (NIST). TAC was developed to
encourage research in natural language processing (NLP) and related
applications by providing a large test collection, common evaluation
procedures, and a forum for researchers to share their results. Through its
various evaluations, the Knowledge Base Population (KBP) track of TAC
encourages the development of systems that can match entities mentioned in
natural texts with those appearing in a knowledge base, and that can extract
novel information about entities from a document collection and add it to a
new or existing knowledge base.

This package contains the complete set of English source documents used in
multiple TAC KBP evaluations conducted from 2009 to 2014. The documents are
collected here as a companion to forthcoming TAC KBP data releases, all of
which will require these same sets of source documents. The data included in
this package were originally released by LDC to TAC KBP coordinators and
participants under the following ecorpora catalog IDs and titles:

  LDC2010E12: TAC 2010 KBP Source Data V1.1
  LDC2014E13: TAC 2014 KBP English Source Corpus


2. Introduction

This release of TAC KBP English Source text data comprises a total of
3,877,207 distinct documents, each with a unique identifier (docid). These
documents have been used in one or more NIST evaluations of TAC KBP systems
over the course of six years, from 2009 to 2014. During this span of time,
four distinct test sets were defined, as follows:

  +--------------+----------------+
  | Eval.Year(s) | Documents Used |
  +--------------+----------------+
  | 2009         |      1,289,649 |
  +--------------+----------------+
  | 2010-2011    |      1,777,888 |
  +--------------+----------------+
  | 2012         |      3,778,144 |
  +--------------+----------------+
  | 2013-2014    |      2,099,319 |
  +--------------+----------------+

Many of the documents were used in two or three of the evaluations, and had
been released to TAC KBP performers in two or three separate evaluation
packages. This is because, in 2010 and again in 2012, new documents were
added to the original 2009 collection but none were removed. In the 2013
collection, however, some documents used in previous evaluations were
removed.

Apart from partial overlaps of document inventory, the various releases also
differed with respect to how the data were organized and presented to users.

In this comprehensive release, all the documents are organized into a set of
637 "zip" archive files, such that each document appears as a separate data
file within one particular zip archive. (There is no duplication of
documents across the zip archive files.) The "docs" directory contains a set
of listings and tables (and the following sections give instructions) for
reconstituting each of the test sets from the zip files in this release.

The zip archive format was chosen for overall compactness, and for ease and
efficiency of use. Each zip archive is simply an assembly of text files with
bare docids as file names. There is no internal directory structure within
the zip archives.
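As a quick illustration of this layout, the standard Info-ZIP command line
tools can be used to inspect any of the archives directly (a minimal sketch,
assuming "unzip" and "zipinfo" are installed and that the corpus root is the
current working directory; the archive named here is one of those described
in section 3.1 below):

  # list the data files in one archive -- each entry is a bare docid,
  # with no internal directory paths
  unzip -l data/AFP_ENG_200701.zip | head

  # print just the docids, one per line, and count them
  zipinfo -1 data/AFP_ENG_200701.zip | wc -l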
Users who prefer to have the text files accessible in uncompressed form on a
file system can decide for themselves how to organize the extracted files
into directories of their own choosing. (Bear in mind that some file
systems, and some typical utility programs, such as unix/linux "ls" and
"find", will show degraded performance when there are nearly four million
files in a single directory.)

Another alternative is to keep the zip archive files themselves as the "file
system" for accessing the data, and to use any common "unzip" command line
utility, or any scripting language with a suitable zip-archive library
module, to access chosen documents (or all documents) as needed. Some
scenarios are explained in the following sections.


3.0 Data Structure

3.1 The "data" directory:

The "data" directory contains the 637 zip archive files; with only one
exception, each zip file is named using an initial substring of the docids
of all documents contained in the zip file. For example:

  zip file: AFP_ENG_200701.zip
  contains: all documents whose docids begin with "AFP_ENG_200701" (i.e.,
            documents from AFP_ENG published in January 2007), such as
            "AFP_ENG_20070101.0001"

  zip file: bolt-eng-DF-170.zip
  contains: all documents whose docids begin with "bolt-eng-DF-170", such
            as "bolt-eng-DF-170-181103-15978491"

  zip file: eng-NG-31-1000.zip
  contains: all documents whose docids begin with "eng-NG-31-1000", such
            as "eng-NG-31-100001-10757252"

Note that the docid substring used as the zip archive name is always 15
characters long for BOLT discussion-forum data (bolt-eng-DF-*, nine zip
files), and is always 14 characters long for the news-group / web-log and
newswire data (eng-NG-* / eng-WL-*, and [A-X]*_ENG_20*).

The only exception to that rule is a single zip archive called:

  2009_misc_docs.zip

This one archive contains 10,912 documents from a wide assortment of
sources, including transcripts of conversational telephone speech and
various TV/radio broadcasts. Some of these documents come from newswire
sources that are found in the other, more regular zip archives, but are
drawn from earlier epochs with very few documents from a given month. For
many of the documents in this set, there is substantial variation in the
size and nature of the docid strings that were assigned or maintained for
use in TAC KBP, with no common substring that serves to group them by genre
or epoch. In effect, this zip file contains all the docids from the 2009
TAC KBP test set whose initial substrings do not match the names of any
other zip archives in this release.

3.2 The "docs" directory:

The contents of the "docs" directory are listed and explained below. Files
whose names end in ".list" are just lists of zip-file names or docids, one
per line; file names ending in ".tab" are flat tables with two or three
tab-delimited columns per line.

-- all_zipid_ndocs_unzipmb.tab

   637 rows (one per zip archive file), 3 columns per row:
     (a) name of the zip archive file
     (b) number of document text files contained in the archive
     (c) total megabytes of data when all documents are uncompressed

-- all_zipid_docid_evalyrs.tab

   3,877,207 rows (one per document / docid), 3 columns per row:
     (a) name of the zip archive file containing the doc (minus ".zip")
     (b) full docid string (unique to each doc)
     (c) evaluation test year(s) that used this docid (comma separated)

-- all_files.md5

   Paths (relative to the root of the corpus) and md5 checksums for all
   files included in the package.
-- 20*_docid.list  (four files)

   Each file contains just the set of docids (one per line) used in each
   test set. (See section 2 above for line counts per file.)

-- 2009_docid_sourcecorpus.tab

   For the docids that make up the 2009 test set inventory, this table
   lists the docid (column 1) and the LDC catalog-ID of the corpus in which
   the original document was first released (column 2).

-- 2009_zipids_all.list, 2012_zipids_all.list

   These two files list the zip archive names that are needed to build the
   2009 and 2012 test sets. For each of these tests, the document inventory
   comprises all the contents of the zip files in the respective list (139
   and 628 zip archives, respectively).

-- 2010_2011_zipids_use_all.list, 2010_2011_zipids_use_part.list
   2013_2014_zipids_use_all.list, 2013_2014_zipids_use_part.list

   These four files list the zip archive names that are needed to build the
   2010-2011 and 2013-2014 test sets. The 2010-11 test set has no documents
   in common with the 2013-14 test set, but both draw material from the
   "eng-NG" and "eng-WL" sources. Due to the way that documents from these
   sources were partitioned into the two test sets, there are 363 zip
   archives that contain some documents used only in 2010-11 and some
   documents used only in 2013-14; the two "*_zipids_use_part" lists
   identify the 363 zip archives from which distinct subsets of documents
   must be extracted in order to complete each test set (in fact, these two
   lists are identical). The two "*_zipids_use_all" lists identify the zip
   files whose contents belong entirely to one test set or the other (149
   zip files for 2010-11, 125 zip files for 2013-14; no zip archives in
   common between these two lists).

-- 2010_2011_zipid_docid_subsets, 2013_2014_zipid_docid_subsets

   These two subdirectories each contain 363 files named as follows:
   "{zipid}_unzip.list" (e.g. "eng-NG-31-1000_unzip.list",
   "eng-WL-11-9924_unzip.list", etc.). Each file contains a list of docids
   to be extracted from the corresponding zip archive in order to complete
   the given test-set inventory. Section 4 below gives some examples of how
   to use these files.


4.0 Some procedures for extracting the document inventory for a given test set

4.1 Doing a full extraction of all zip archive contents

As indicated above, users may choose to simply extract all documents from
all the zip archives, and have them available as uncompressed text files,
each one having its docid as the file name. But depending on the user's
computing environment, some care may be needed to avoid having too many
files in one directory, or even too many files on one disk. Users can sum
the 2nd and 3rd columns of the table file "all_zipid_ndocs_unzipmb.tab" to
check file and megabyte counts for the corpus as a whole or for any chosen
subset (an example appears at the end of this section).

One simple and relatively safe approach would be to extract each zip
archive into a separate directory, using the zip archive name as the
directory name; this yields 637 directories, with each directory containing
between 8 and 49,162 document files. Here's one way to do this, via a
'bash' shell loop, with the "data" directory as the current working
directory:

  for i in *.zip
  do
    j=`basename $i .zip`
    mkdir $j
    unzip -q -d $j $i
  done

(The "-d path" option on the unzip command line causes all extracted data
files to be saved in the given path. The "-q" option avoids having long
lists of docids printed to the screen.)
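As an example of the disk-usage check mentioned above, the following
commands (a minimal sketch, assuming standard "awk" and "grep" and the
corpus root as the current working directory) sum columns 2 and 3 of
"all_zipid_ndocs_unzipmb.tab", first over the whole corpus and then over a
chosen subset of archives:

  # total document count and uncompressed megabytes for the whole corpus
  awk -F'\t' '{nd += $2; mb += $3} END {print nd, "docs,", mb, "MB"}' \
      docs/all_zipid_ndocs_unzipmb.tab

  # the same totals, restricted to the XIN_ENG archives
  grep '^XIN_ENG' docs/all_zipid_ndocs_unzipmb.tab | \
      awk -F'\t' '{nd += $2; mb += $3} END {print nd, "docs,", mb, "MB"}'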
4.2 Pulling documents from zip archives as needed

For processes that operate as "stdin - stdout" stream filters, the "unzip"
command supports extracting data files to stdout, by using the "-p" (pipe)
option. Here's an example extracting all the documents in one archive and
passing them as a stream to some other process:

  unzip -p data/XIN_ENG_201012.zip | some_other_process ...

This method can be used to extract a specific document, or any set of
documents, by supplying docid strings after the zip file name:

  unzip -p data/XIN_ENG_201012.zip XIN_ENG_20101201.0001 | some_process

The next example takes all the XIN_ENG docids from Christmas day 2010 (used
in the 2012 test set), and presents these to the "unzip" command (via the
unix/linux "xargs" utility):

  grep XIN_ENG_20101225 docs/2012_docid.list | \
    xargs unzip -p data/XIN_ENG_201012.zip | some_process

4.3 Extracting documents for a specific test set

The previous example demonstrates a method that, together with the
"*_use_part.list" files and "*_zipid_docid_subsets" directories described
in section 3.2 above, can be used to extract just the documents needed for
a particular test set. Here's an approach based on the shell loop in 4.1
above, but this time extracting all the 2010-11 test set documents:

  for i in docs/2010_2011_zipid_docid_subsets/*
  do
    j=`basename $i _unzip.list`
    mkdir -p 2010-11-testset/$j
    cat $i | xargs unzip -q -d 2010-11-testset/$j data/$j.zip
  done

That takes care of the 363 zip archives whose contents also include some
documents for the 2013-14 test set, without extracting the documents for
those later tests. Next, we still need to extract from the zip files that
contain only 2010-11 material:

  for i in `cat docs/2010_2011_zipids_use_all.list`
  do
    j=`basename $i .zip`
    mkdir 2010-11-testset/$j
    unzip -q -d 2010-11-testset/$j data/$i
  done

When that's done, there will be 512 subdirectories in 2010-11-testset/, and
taken together, these will contain all 1,777,888 documents used in the 2010
and 2011 test sets.

Note that because the zip archive format provides efficient indexing of
compressed content based on the names of the data files in the archive, it
is generally faster and more efficient to pull text as needed directly from
the zip archives, whether for bulk processing of the data or for retrieval
of a specific docid.


5. Acknowledgements

This material is based on research sponsored by the Air Force Research
Laboratory and the Defense Advanced Research Projects Agency under
agreement number FA8750-13-2-0045. The U.S. Government is authorized to
reproduce and distribute reprints for Governmental purposes notwithstanding
any copyright notation thereon. The views and conclusions contained herein
are those of the authors and should not be interpreted as necessarily
representing the official policies or endorsements, either expressed or
implied, of the Air Force Research Laboratory and Defense Advanced Research
Projects Agency or the U.S. Government.

The authors acknowledge the following contributors to this data set:

  Heather Simpson (LDC)
  Robert Parker (LDC)
  Hoa Dang (NIST)
  Heng Ji (RPI)
  Ralph Grishman (NYU)
  James Mayfield (JHU)
  Mihai Surdeanu (UA)
  Margaret Mitchell (Microsoft)
  Claire Cardie (Cornell)
  Javier Artiles (Slice Technologies)
  Paul McNamee (JHU)
  Boyan Onyshkevych (DARPA)


6. Copyright Information

(c) 2016 Trustees of the University of Pennsylvania
7. Contact Information

For further information about this data release, contact the following
project staff at LDC:

  Joseph Ellis, Project Manager
  Dave Graff, Technical Lead
  Stephanie Strassel, PI

--------------------------------------------------------------------------
README created by Dana Fore, December 1, 2015
       updated by Joe Ellis, December 2, 2015
       updated by Dana Fore, January 29, 2016