GALE Phase 1 Distillation Training
LDC2007T20
Linguistic Data Consortium
1. Introduction
The GALE Phase 1 Distillation Training Corpus constitutes the final release
of training data created by LDC for the DARPA GALE Program Phase 1 Distillation
Go/No Go technology evaluation.
The release consists of 248 English, Chinese and/or Arabic queries and
their responses, created by LDC annotators. Queries conform to one of ten
template types. Query responses may include document and snippet relevance
judgments, nuggets, nugs and supernugs. 158 of the 248 queries have been
annotated for all features, while the remainder are labeled for only some
features. In addition, not all queries have been exhaustively annotated
for a given feature, given resource constraints during corpus development.
The table below indicates the number of queries that have been labeled for
each template in each source language.
+-------------+---------+---------+---------+
| | English | Chinese | Arabic |
+-------------+---------+---------+---------+
| Template 1 | 15/28 | 9/17 | 12/16 |
| Template 3 | 16/29 | 9/29 | 13/29 |
| Template 4 | 15/23 | 7/18 | 11/18 |
| Template 5 | 21/39 | 10/39 | 20/36 |
| Template 6 | 15/20 | 7/19 | 7/20 |
| Template 8 | 12/14 | 6/13 | 5/14 |
| Template 9 | 14/23 | 7/21 | 10/21 |
| Template 11 | 11/22 | 8/15 | 2/14 |
| Template 15 | 12/21 | 8/11 | 5/11 |
| Template 16 | 13/24 | 10/12 | 8/12 |
+-------------+---------+---------+---------+
| Total | 144/243 | 81/194 | 93/191 |
+-------------+---------+---------+---------+
2. Annotation
The annotation task involves responding to a series of user queries. For
each query, annotators first find relevant documents and identify snippets
(strings of contiguous text that answer the query) in the Arabic, Chinese
or English source document. Annotators then create a nugget for each fact
expressed in the snippet. Semantically equivalent nuggets are grouped into
cross-language, cross-document "supernugs". Finally, judges at BAE Systems
provide relevance weights for each supernug.
Queries in this release have been annotated for the following tasks:
- searching for relevant documents and providing yes/no judgments
- extracting snippets
- resolving pronouns and certain types of temporal and locative
expressions contained in the snippets
- creating nuggets, i.e. atomic pieces of information that an annotator
considers a valid answer to the query
- building nugs, i.e. clusters of semantically-equivalent nuggets
for each language
- building supernugs, i.e. clusters of semantically-equivalent nugs
across languages
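The snippet-to-nugget-to-nug-to-supernug hierarchy amounts to nested
clustering. The sketch below illustrates that containment only; the class
and field names are invented for illustration and do not come from this
release:

```python
from dataclasses import dataclass, field

@dataclass
class Nugget:
    """An atomic fact extracted from a snippet (one language, one document)."""
    text: str
    language: str
    doc_id: str

@dataclass
class Nug:
    """A cluster of semantically equivalent nuggets within one language."""
    nuggets: list = field(default_factory=list)

@dataclass
class Supernug:
    """A cluster of semantically equivalent nugs across languages."""
    nugs: list = field(default_factory=list)

    def languages(self):
        # Languages represented across all member nuggets.
        return sorted({ng.language for nug in self.nugs for ng in nug.nuggets})

# Two nuggets expressing the same fact in different languages,
# grouped into per-language nugs and one cross-language supernug:
n_en = Nugget("X met Y in Paris", "English", "DOC001")
n_zh = Nugget("X met Y in Paris (Chinese source)", "Chinese", "DOC002")
sn = Supernug([Nug([n_en]), Nug([n_zh])])
print(sn.languages())  # ['Chinese', 'English']
```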
The following queries have been annotated for all annotation tasks for at
least one language:
Template 1: LDC10, LDC37, LDC43, LDC67, T036, T037, T38, T039, T040,
T041, T042, T043, T044, T046, T047
Template 3: 3.2, 3.3, 3.4, 5.1, 6.2, LDC1, LDC2, LDC3, LDC12, LDC15, LDC16,
LDC28, LDC45, LDC69, LDC71, LDC100, T006, T007, T008,
T009, T010
Template 4: 2.5, 3.1, LDC27, LDC33, LDC38, LDC44, LDC56, LDC57, LDC61,
LDC99, T011, T012, T013, T014, T015
Template 5: 1.3, 1.4, 3.6, 4.2, 4.3, 4.5, 5.3, 6.3, 6.4, 6.5, LDC17, LDC18,
LDC32, LDC34, LDC39, LDC53, LDC54, T026, T027, T028, T029,
T030, T031, T032, T033, T034, T035
Template 6: 2.1, 2.7, 3.7, 4.6, 5.4, 5.9, LDC4, LDC35, LDC58, LDC59, LDC89,
T017, T018, T019, T020
Template 8: 1.6, 1.7, LDC31, LDC65, T061, T062, T063, T064, T065, T066,
T067, T068
Template 9: 1.2, 3.8, 4.7, 5.2, 5.6, LDC9, LDC30, LDC36, LDC40, LDC41, LDC48,
LDC66, T021, T022, T023, T025
Template 11: LDC50, LDC51, LDC52, T054, T055, T056, T057, T058, T059, T060,
T069, T070
Template 15: LDC77, LDC92, T071, T072, T073, T074, T075, T076, T077, T078,
T079, T080
Template 16: 3.9, LDC7, LDC90, LDC91, T081, T082, T083, T084, T085, T086,
             T087, T088, T089
Additional details of the annotation task are provided in the annotation
task specification, within the doc/ directory for this release. More
information can also be found at LDC's GALE Distillation website:
http://www.ldc.upenn.edu/Projects/GALE/Distillation
3. Training Data Sources
This corpus references tkn_sgm files contained in the following LDC
publications:
* LDC2005T16 TDT4 Multilingual Text and Annotations
https://secure.ldc.upenn.edu/intranet/catalogDisplay.jsp?ldc_catalog_id=LDC2005T16
* LDC2006T18 TDT5 Multilanguage Text
https://secure.ldc.upenn.edu/intranet/catalogDisplay.jsp?ldc_catalog_id=LDC2006T18
4. Directory Structure
This release is structured as follows:
doc/ README.txt - this document
DistillationTrainingDataSpecV1.1.pdf - annotation guidelines
templates.txt - description of template types
data/
database/ - raw database dump for query responses
sql/ - SQL commands to create database
xml/ - single XML file for each query response
queries/ - XML formatted queries
dtd/ response-1.3.dtd - XML DTD for distillation training data
5. File Format Description
Training data output is included in two formats: XML output and
a raw database dump.
The XML output consists of a single XML file for each query. Each file
contains the query text in each language, every search that annotators
ran to find relevant documents, all documents judged relevant or not
relevant to the query, and all snippets identified in relevant
documents. Pronouns and temporal and locative expressions that have
been resolved in snippets are followed by their resolution text in
single square brackets.
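Assuming the resolution text literally follows the resolved expression in
single square brackets, a simple regular expression can recover
(expression, resolution) pairs from snippet text. The snippet string below
is invented for illustration:

```python
import re

def extract_resolutions(snippet: str):
    """Return (resolved word, resolution text) pairs for bracketed resolutions."""
    # A token immediately followed by [resolution text] in square brackets.
    return re.findall(r"(\S+)\s*\[([^\]]+)\]", snippet)

snippet = "He [George W. Bush] visited the city [Baghdad] in 2006."
print(extract_resolutions(snippet))
# [('He', 'George W. Bush'), ('city', 'Baghdad')]
```

Note that this sketch would need refinement for multi-word resolved
expressions; it is only meant to show the bracketing convention.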
The raw database dump is an export of the MySQL database that LDC
uses to store all annotation information during the annotation
process. The SQL commands to create a database with the same
structure are included in the sql/ subdirectory. One file is
included in the database/ subdirectory for each table in the MySQL
database. These include:
* Clarification: Resolution of pronouns and temporal and locative
expressions
* Clarification_Type: Possible "type" values for each
clarification. These are currently Name, Temporal, Location and
Other.
* Document: All documents that have been given a relevance judgment
for some query. This table records the document number used to
identify the document in LDC's internal search engine along with
its dockey (which is the Document Number identified in the TDT
corpus) and its language.
* Document_Relevance_Link: Relevance judgments for a document for a
query.
* Indirect_Speech: Stores the "who", "verb", "attribution" and
"evaluation" values for a core nugget's indirect speech
attributes.
* Indirect_Speech_Evaluation: Possible "evaluation" values for each
indirect speech. These are currently Positive, Negative and
Neutral/Unknown.
* Language: Possible "language" values for each document. These are
currently "Arabic", "Chinese" and "English".
* Modification_Type: Possible "modification" values for modifier
nuggets. These are currently Temporal, Location and Other.
* Nug: All cross-language nugs. These are equivalent to Supernugs,
since they may include nuggets from multiple documents and languages.
An optional text attribute is included for nugs that do not have
English components. An optional modification type attribute is
included for modifying nugs. Also included here is BAE's relevance
weight judgment for English nugs.
* Nugget: All nuggets. This table records the snippet that a nugget came from,
the nug that it belongs to, and its text. If the nugget modifies
another nugget from the same snippet, this table also records the
character offset at which the modifying text should be inserted into
the text of the nugget being modified to get the full text of the
nugget.
* Query: All cross-language queries. This table only records the id
and label for a query.
* Query_Language_Link: All query-language pairs. This table records
the text of each query for each language. Relevance judgments
and searches are associated with entries in this table, not with
entries in the cross-language Query table.
* Relevance: Possible "relevance" values for document
judgments. There are currently "Yes", "No", and "Dup" (duplicate
document).
* Search: All searches that were run against LDC's internal search
engine when looking for relevant documents. For each search, the
query that was being worked on and the text of the search are
stored.
* Snippet: All snippets recorded for all documents. For each
snippet, this table records the start and end offset, the
document, and the query the snippet is relevant to.
* Supernug: Despite its name, this table's use has diverged from the
meaning of "Supernug" in the Distillation community: it groups a
nug with all of the modifying nugs related to that nug.
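Reconstructing annotations from the dump typically means walking Snippet to
Nugget to Nug. The sqlite3 sketch below stands in for the MySQL dump and
uses guessed column names (id, snippet_id, nug_id, text, start_off,
end_off) that may not match the actual tables; it only illustrates the
join pattern:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
# Minimal stand-ins for three of the dump tables; schemas are guesses.
cur.executescript("""
CREATE TABLE Nug     (id INTEGER PRIMARY KEY, text TEXT);
CREATE TABLE Snippet (id INTEGER PRIMARY KEY, doc TEXT,
                      start_off INTEGER, end_off INTEGER);
CREATE TABLE Nugget  (id INTEGER PRIMARY KEY, snippet_id INTEGER,
                      nug_id INTEGER, text TEXT);
INSERT INTO Nug     VALUES (1, 'meeting in Paris');
INSERT INTO Snippet VALUES (1, 'DOC001', 120, 180);
INSERT INTO Nugget  VALUES (1, 1, 1, 'X met Y in Paris');
""")
# All nuggets grouped under a nug, with their source snippet extents:
rows = cur.execute("""
SELECT Nug.text, Nugget.text, Snippet.doc, Snippet.start_off, Snippet.end_off
FROM Nugget
JOIN Nug     ON Nugget.nug_id     = Nug.id
JOIN Snippet ON Nugget.snippet_id = Snippet.id
""").fetchall()
print(rows)
```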
For both output formats, character offsets (not byte offsets) are
calculated on utf-8 versions of the TDT source data. Only
characters inside the <TEXT> ... </TEXT> tags are counted. Each
newline character is counted as one (the source files use
UNIX-style end-of-line characters). All characters inside the
<TEXT> ... </TEXT> tags are counted, including any other XML tags
that may be embedded. Offsets actually count the positions between
characters, not the characters themselves, so that a one-character
string has a length (end - start) of one, not zero.
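Under this convention, offsets behave exactly like Python slice indices.
A minimal sketch (the sample string is invented):

```python
text = "abc\n"  # a sample line of source text with a UNIX newline

def span(text: str, start: int, end: int) -> str:
    """Offsets count positions between characters, as in Python slicing."""
    return text[start:end]

# A one-character string spans offsets (0, 1): length end - start == 1.
assert span(text, 0, 1) == "a"
assert 1 - 0 == len("a")
# The newline counts as a single character, at offsets (3, 4).
assert span(text, 3, 4) == "\n"
```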
Search text in the raw database output is url-encoded. Search text
in the XML output is not url-encoded. Instead, XML special
characters (e.g. "&") have been encoded as XML entities.
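Search text from the database dump can therefore be recovered with a
standard URL decoder, while an XML parser already unescapes entities in
the XML files. A sketch with invented search strings:

```python
from urllib.parse import unquote
from xml.sax.saxutils import unescape

# Database dump: search text is url-encoded.
db_search = "bird%20flu%20outbreak%20%22hong%20kong%22"
print(unquote(db_search))    # bird flu outbreak "hong kong"

# XML output: XML special characters are entity-encoded instead.
xml_search = "AT&amp;T merger"
print(unescape(xml_search))  # AT&T merger
```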
6. Contact Information
If you have questions about this data release, please contact the
following personnel at the LDC.
Zhiyi Song - GALE Distillation
Project Manager
Stephanie Strassel - GALE Project Manager
Robert Parker - GALE Distillation Programmer
Kazuaki Maeda - Technical Consultant/Manager
7. Sponsorship
This work was supported in part by the Defense Advanced Research Projects
Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this release
does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.
---
README created June 12, 2006 Olga Babko-Malaya
updated June 20, 2006 Julie Medero
updated March 12, 2007 Zhiyi Song
updated March 16, 2007 Stephanie Strassel