GALE Phase 1 Distillation Training
LDC2007T20
Linguistic Data Consortium

1. Introduction

The GALE Phase 1 Distillation Training Corpus constitutes the final release of training data created by LDC for the DARPA GALE Program Phase 1 Distillation Go/No Go technology evaluation.

The release consists of 248 English, Chinese and/or Arabic queries and their responses, created by LDC annotators. Queries conform to one of ten template types. Query responses may include document and snippet relevance judgments, nuggets, nugs and supernugs. 158 of the 248 queries have been annotated for all features, while the remainder are labeled for only some features. In addition, not all queries have been exhaustively annotated for a given feature, owing to resource constraints during corpus development.

The table below indicates the number of queries that have been labeled for each template in each source language.

+-------------+---------+---------+---------+
|             | English | Chinese | Arabic  |
+-------------+---------+---------+---------+
| Template 1  | 15/28   | 9/17    | 12/16   |
| Template 3  | 16/29   | 9/29    | 13/29   |
| Template 4  | 15/23   | 7/18    | 11/18   |
| Template 5  | 21/39   | 10/39   | 20/36   |
| Template 6  | 15/20   | 7/19    | 7/20    |
| Template 8  | 12/14   | 6/13    | 5/14    |
| Template 9  | 14/23   | 7/21    | 10/21   |
| Template 11 | 11/22   | 8/15    | 2/14    |
| Template 15 | 12/21   | 8/11    | 5/11    |
| Template 16 | 13/24   | 10/12   | 8/12    |
+-------------+---------+---------+---------+
| Total       | 144/243 | 81/194  | 93/191  |
+-------------+---------+---------+---------+

2. Annotation

The annotation task involves responding to a series of user queries. For each query, annotators first find relevant documents and identify snippets (strings of contiguous text that answer the query) in the Arabic, Chinese or English source documents. Annotators then create a nugget for each fact expressed in a snippet. Semantically equivalent nuggets are grouped into cross-language, cross-document "supernugs". Finally, judges at BAE Systems provide relevance weights for each supernug.

Queries in this release have been annotated for the following tasks (a sketch of how these layers nest follows the list):

  - searching for relevant documents and providing yes/no judgments
  - extracting snippets
  - resolving pronouns and certain types of temporal and locative expressions contained in the snippets
  - creating nuggets, i.e. atomic pieces of information that an annotator considers a valid answer to the query
  - building nugs, i.e. clusters of semantically equivalent nuggets for each language
  - building supernugs, i.e. clusters of semantically equivalent nugs across languages
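To make this layering concrete, the minimal Python sketch below models how snippets, nuggets, nugs and supernugs nest. The class and field names are hypothetical illustrations only; they do not correspond to the XML elements or database tables distributed in this release.

  from dataclasses import dataclass, field
  from typing import List, Optional

  # Illustrative only: hypothetical names, not the schema used in this corpus.

  @dataclass
  class Nugget:
      snippet_id: str   # snippet (contiguous source text) the fact came from
      text: str         # one atomic fact that answers the query

  @dataclass
  class Nug:
      language: str     # "Arabic", "Chinese" or "English"
      nuggets: List[Nugget] = field(default_factory=list)   # equivalent nuggets within one language

  @dataclass
  class Supernug:
      nugs: List[Nug] = field(default_factory=list)   # equivalent nugs across languages
      relevance_weight: Optional[float] = None         # weight assigned by BAE Systems judges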
The following queries have been annotated for all annotation tasks for at least one language:

  Template 1:  LDC10, LDC37, LDC43, LDC67, T036, T037, T38, T039, T040, T041, T042, T043, T044, T046, T047

  Template 3:  3.2, 3.3, 3.4, 5.1, 6.2, LDC1, LDC2, LDC3, LDC12, LDC15, LDC16, LDC28, LDC45, LDC69, LDC71, LDC100, T006, T007, T008, T009, T010

  Template 4:  2.5, 3.1, LDC27, LDC33, LDC38, LDC44, LDC56, LDC57, LDC61, LDC99, T011, T012, T013, T014, T015

  Template 5:  1.3, 1.4, 3.6, 4.2, 4.3, 4.5, 5.3, 6.3, 6.4, 6.5, LDC17, LDC18, LDC32, LDC34, LDC39, LDC53, LDC54, T026, T027, T028, T029, T030, T031, T032, T033, T034, T035

  Template 6:  2.1, 2.7, 3.7, 4.6, 5.4, 5.9, LDC4, LDC35, LDC58, LDC59, LDC89, T017, T018, T019, T020

  Template 8:  1.6, 1.7, LDC31, LDC65, T061, T062, T063, T064, T065, T066, T067, T068

  Template 9:  1.2, 3.8, 4.7, 5.2, 5.6, LDC9, LDC30, LDC36, LDC40, LDC41, LDC48, LDC66, T021, T022, T023, T025

  Template 11: LDC50, LDC51, LDC52, T054, T055, T056, T057, T058, T059, T060, T069, T070

  Template 15: LDC77, LDC92, T071, T072, T073, T074, T075, T076, T077, T078, T079, T080

  Template 16: 3.9, LDC7, LDC90, LDC91, T081, T082, T083, T084, T085, T086, T087, T088, T089

Additional details of the annotation task are provided in the annotation task specification in the doc/ directory of this release. More information can also be found at LDC's GALE Distillation website:

  http://www.ldc.upenn.edu/Projects/GALE/Distillation

3. Training Data Sources

This corpus references tkn_sgm files contained in the following LDC publications:

  * LDC2005T16 TDT4 Multilingual Text and Annotations
    https://secure.ldc.upenn.edu/intranet/catalogDisplay.jsp?ldc_catalog_id=LDC2005T16

  * LDC2006T18 TDT5 Multilanguage Text
    https://secure.ldc.upenn.edu/intranet/catalogDisplay.jsp?ldc_catalog_id=LDC2006T18

4. Directory Structure

This release is structured as follows:

  doc/
    README.txt                            - this document
    DistillationTrainingDataSpecV1.1.pdf  - annotation guidelines
    templates.txt                         - description of template types
  data/
    database/  - raw database dump for query responses
    sql/       - SQL commands to create the database
    xml/       - a single XML file for each query response
    queries/   - XML-formatted queries
  dtd/
    response-1.3.dtd  - XML DTD for the distillation training data

5. File Format Description

Training data output is included in two formats: XML output and a raw database dump.

The XML output consists of a single XML file for each query. The translated query is included for each language, along with each search that annotators used to find relevant documents, all documents judged relevant or not relevant to the query, and all snippets that were identified in relevant documents. Pronouns and temporal and locative expressions that have been resolved in snippets are followed by their resolution text in single square brackets.
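As a small illustration of this bracketing convention, the Python sketch below separates resolution text from a snippet string. The example snippet is invented, and the exact placement of brackets in the corpus files is governed by the annotation specification; this sketch only follows the description above.

  import re

  # Hypothetical snippet: each resolved expression is immediately followed
  # by its resolution text in single square brackets.
  snippet = "He [the Premier] visited the region last week [April 2005]."

  # Pull out the bracketed resolutions and build a copy of the snippet
  # with the resolution text removed.
  resolutions = re.findall(r"\[([^\]]*)\]", snippet)
  plain_text = re.sub(r"\s*\[[^\]]*\]", "", snippet)

  print(resolutions)  # ['the Premier', 'April 2005']
  print(plain_text)   # He visited the region last week.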
The raw database dump is an output of the MySQL database that LDC uses to store all annotation information during the annotation process. The SQL commands to create a database with the same structure are included in the sql/ subdirectory. One file is included in the database/ subdirectory for each table in the MySQL database. These include:

  * Clarification: Resolution of pronouns and temporal and locative expressions.

  * Clarification_Type: Possible "type" values for each clarification. These are currently Name, Temporal, Location and Other.

  * Document: All documents that have been given a relevance judgment for some query. This table records the document number used to identify the document in LDC's internal search engine, along with its dockey (the Document Number identified in the TDT corpus) and its language.

  * Document_Relevance_Link: Relevance judgments for a document for a query.

  * Indirect_Speech: Stores the "who", "verb", "attribution" and "evaluation" values for a core nugget's indirect speech attributes.

  * Indirect_Speech_Evaluation: Possible "evaluation" values for each indirect speech entry. These are currently Positive, Negative and Neutral/Unknown.

  * Language: Possible "language" values for each document. These are currently "Arabic", "Chinese" and "English".

  * Modification_Type: Possible "modification" values for modifier nuggets. These are currently Temporal, Location and Other.

  * Nug: All cross-language nugs. These are equivalent to supernugs, since they may include nuggets from multiple documents and languages. An optional text attribute is included for nugs that do not have English components, and an optional modification type attribute is included for modifying nugs. BAE's relevance weight judgment for English nugs is also recorded here.

  * Nugget: All nuggets. This table records the snippet that a nugget came from, the nug that it belongs to, and its text. If the nugget modifies another nugget from the same snippet, this table also records the character offset at which the modifying text should be inserted into the text of the nugget being modified to obtain the full text of the nugget.

  * Query: All cross-language queries. This table records only the id and label for a query.

  * Query_Language_Link: All query-language pairs. This table records the text of each query for each language. Relevance judgments and searches are associated with entries in this table, not with entries in the cross-language Query table.

  * Relevance: Possible "relevance" values for document judgments. These are currently "Yes", "No" and "Dup" (duplicate document).

  * Search: All searches that were run against LDC's internal search engine when looking for relevant documents. For each search, the query that was being worked on and the text of the search are stored.

  * Snippet: All snippets recorded for all documents. For each snippet, this table records the start and end offsets, the document, and the query the snippet is relevant to.

  * Supernug: The use of this table has diverged from the meaning of "Supernug" in the Distillation community. It groups a nug with all of the modifying nugs related to that nug.

For both output formats, character offsets (not byte offsets) are calculated on UTF-8 versions of the TDT source data. Only characters inside the ... tags are counted. Each newline character is counted as one character (the source files use UNIX-style end-of-line characters). All characters inside the ... tags are counted, including any other XML tags that may be embedded. Offsets count the spaces between characters, not the characters themselves, so that a one-character string has a length (end - start) of one, not zero.

Search text in the raw database output is URL-encoded. Search text in the XML output is not URL-encoded; instead, XML special characters (e.g. "&") have been encoded as XML entities.
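As a rough illustration of these conventions, the Python sketch below extracts a snippet by character offsets from a UTF-8 source file and URL-decodes a search string from the database dump. The path, offsets and search string are placeholders, and locating the tagged text region over which offsets are counted is left out; the DTD and annotation specification remain the authoritative references.

  import urllib.parse

  # Hypothetical path and offsets; real values come from the Snippet table
  # or from the per-query XML files.
  source_path = "path/to/tdt_source.tkn_sgm"
  start, end = 1042, 1107

  # Offsets are character offsets (not byte offsets) over the UTF-8 decoded
  # text, with each newline counted as a single character.
  with open(source_path, encoding="utf-8", newline="") as f:
      text = f.read()

  # In the corpus, offsets are counted only over the characters inside the
  # tagged text region of the document (see above); this sketch uses the
  # whole file as a stand-in for that region.
  region = text
  snippet = region[start:end]   # a snippet of (end - start) characters

  # Search text in the raw database dump is URL-encoded.
  decoded_search = urllib.parse.unquote("human%20rights%20report")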
6. Contact Information

If you have questions about this data release, please contact the following personnel at LDC:

  Zhiyi Song         - GALE Distillation Project Manager
  Stephanie Strassel - GALE Project Manager
  Robert Parker      - GALE Distillation Programmer
  Kazuaki Maeda      - Technical Consultant/Manager

7. Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this release does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

---
README created June 12, 2006, Olga Babko-Malaya
       updated June 20, 2006, Julie Medero
       updated March 12, 2007, Zhiyi Song
       updated March 16, 2007, Stephanie Strassel