GALE Phase 1 Distillation Training
LDC2007T20
Linguistic Data Consortium

1. Introduction

The GALE Phase 1 Distillation Training Corpus constitutes the final release of training data created by LDC for the DARPA GALE Program Phase 1 Distillation Go/No Go technology evaluation.

The release consists of 248 English, Chinese and/or Arabic queries and their responses, created by LDC annotators. Queries conform to one of ten template types. Query responses may include document and snippet relevance judgments, nuggets, nugs and supernugs. 158 of the 248 queries have been annotated for all features, while the remainder are labeled for only some features. In addition, not all queries have been exhaustively annotated for a given feature, owing to resource constraints during corpus development.

The table below indicates the number of queries that have been labeled for each template in each source language.

+-------------+---------+---------+---------+
|             | English | Chinese | Arabic  |
+-------------+---------+---------+---------+
| Template 1  | 15/28   | 9/17    | 12/16   |
| Template 3  | 16/29   | 9/29    | 13/29   |
| Template 4  | 15/23   | 7/18    | 11/18   |
| Template 5  | 21/39   | 10/39   | 20/36   |
| Template 6  | 15/20   | 7/19    | 7/20    |
| Template 8  | 12/14   | 6/13    | 5/14    |
| Template 9  | 14/23   | 7/21    | 10/21   |
| Template 11 | 11/22   | 8/15    | 2/14    |
| Template 15 | 12/21   | 8/11    | 5/11    |
| Template 16 | 13/24   | 10/12   | 8/12    |
+-------------+---------+---------+---------+
| Total       | 144/243 | 81/194  | 93/191  |
+-------------+---------+---------+---------+

2. Annotation

The annotation task involves responding to a series of user queries. For each query, annotators first find relevant documents and identify snippets (strings of contiguous text that answer the query) in the Arabic, Chinese or English source documents. Annotators then create a nugget for each fact expressed in a snippet. Semantically equivalent nuggets are grouped into cross-language, cross-document "supernugs". Finally, judges at BAE Systems provide relevance weights for each supernug.

Queries in this release have been annotated for the following tasks (a sketch of how these layers nest follows the list):

  - searching for relevant documents and providing yes/no judgments
  - extracting snippets
  - resolving pronouns and certain types of temporal and locative expressions contained in the snippets
  - creating nuggets, i.e. atomic pieces of information that an annotator considers a valid answer to the query
  - building nugs, i.e. clusters of semantically equivalent nuggets for each language
  - building supernugs, i.e. clusters of semantically equivalent nugs across languages
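To make this layering concrete, the minimal Python sketch below models how snippets, nuggets, nugs and supernugs nest. The class and field names are hypothetical illustrations only; they do not correspond to the XML elements or database tables distributed in this release.

  from dataclasses import dataclass, field
  from typing import List, Optional

  # Illustrative only: hypothetical names, not the schema used in this corpus.

  @dataclass
  class Nugget:
      snippet_id: str   # snippet (contiguous source text) the fact came from
      text: str         # one atomic fact that answers the query

  @dataclass
  class Nug:
      language: str     # "Arabic", "Chinese" or "English"
      nuggets: List[Nugget] = field(default_factory=list)   # equivalent nuggets within one language

  @dataclass
  class Supernug:
      nugs: List[Nug] = field(default_factory=list)   # equivalent nugs across languages
      relevance_weight: Optional[float] = None         # weight assigned by BAE Systems judges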
The following queries have been annotated for all annotation tasks for at least one language:

  Template 1:  LDC10, LDC37, LDC43, LDC67, T036, T037, T38, T039, T040, T041, T042, T043, T044, T046, T047

  Template 3:  3.2, 3.3, 3.4, 5.1, 6.2, LDC1, LDC2, LDC3, LDC12, LDC15, LDC16, LDC28, LDC45, LDC69, LDC71, LDC100, T006, T007, T008, T009, T010

  Template 4:  2.5, 3.1, LDC27, LDC33, LDC38, LDC44, LDC56, LDC57, LDC61, LDC99, T011, T012, T013, T014, T015

  Template 5:  1.3, 1.4, 3.6, 4.2, 4.3, 4.5, 5.3, 6.3, 6.4, 6.5, LDC17, LDC18, LDC32, LDC34, LDC39, LDC53, LDC54, T026, T027, T028, T029, T030, T031, T032, T033, T034, T035

  Template 6:  2.1, 2.7, 3.7, 4.6, 5.4, 5.9, LDC4, LDC35, LDC58, LDC59, LDC89, T017, T018, T019, T020

  Template 8:  1.6, 1.7, LDC31, LDC65, T061, T062, T063, T064, T065, T066, T067, T068

  Template 9:  1.2, 3.8, 4.7, 5.2, 5.6, LDC9, LDC30, LDC36, LDC40, LDC41, LDC48, LDC66, T021, T022, T023, T025

  Template 11: LDC50, LDC51, LDC52, T054, T055, T056, T057, T058, T059, T060, T069, T070

  Template 15: LDC77, LDC92, T071, T072, T073, T074, T075, T076, T077, T078, T079, T080

  Template 16: 3.9, LDC7, LDC90, LDC91, T081, T082, T083, T084, T085, T086, T087, T088, T089

Additional details of the annotation task are provided in the annotation task specification in the doc/ directory of this release. More information can also be found at LDC's GALE Distillation website:

  http://www.ldc.upenn.edu/Projects/GALE/Distillation

3. Training Data Sources

This corpus references tkn_sgm files contained in the following LDC publications:

  * LDC2005T16 TDT4 Multilingual Text and Annotations
    https://secure.ldc.upenn.edu/intranet/catalogDisplay.jsp?ldc_catalog_id=LDC2005T16

  * LDC2006T18 TDT5 Multilanguage Text
    https://secure.ldc.upenn.edu/intranet/catalogDisplay.jsp?ldc_catalog_id=LDC2006T18

4. Directory Structure

This release is structured as follows:

  doc/
    README.txt                            - this document
    DistillationTrainingDataSpecV1.1.pdf  - annotation guidelines
    templates.txt                         - description of template types
  data/
    database/  - raw database dump for query responses
    sql/       - SQL commands to create the database
    xml/       - a single XML file for each query response
    queries/   - XML-formatted queries
  dtd/
    response-1.3.dtd  - XML DTD for the distillation training data

5. File Format Description

Training data output is included in two formats: XML output and a raw database dump.

The XML output consists of a single XML file for each query. The translated query is included for each language, along with each search that annotators used to find relevant documents, all documents judged relevant or not relevant to the query, and all snippets that were identified in relevant documents. Pronouns and temporal and locative expressions that have been resolved in snippets are followed by their resolution text in single square brackets.
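As a small illustration of this bracketing convention, the Python sketch below separates resolution text from a snippet string. The example snippet is invented, and the exact placement of brackets in the corpus files is governed by the annotation specification; this sketch only follows the description above.

  import re

  # Hypothetical snippet: each resolved expression is immediately followed
  # by its resolution text in single square brackets.
  snippet = "He [the Premier] visited the region last week [April 2005]."

  # Pull out the bracketed resolutions and build a copy of the snippet
  # with the resolution text removed.
  resolutions = re.findall(r"\[([^\]]*)\]", snippet)
  plain_text = re.sub(r"\s*\[[^\]]*\]", "", snippet)

  print(resolutions)  # ['the Premier', 'April 2005']
  print(plain_text)   # He visited the region last week.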
The raw database dump is an output of the MySQL database that LDC uses to store all annotation information during the annotation process. The SQL commands to create a database with the same structure are included in the sql/ subdirectory. One file is included in the database/ subdirectory for each table in the MySQL database. These include:

  * Clarification: Resolution of pronouns and temporal and locative expressions.

  * Clarification_Type: Possible "type" values for each clarification. These are currently Name, Temporal, Location and Other.

  * Document: All documents that have been given a relevance judgment for some query. This table records the document number used to identify the document in LDC's internal search engine, along with its dockey (the Document Number identified in the TDT corpus) and its language.

  * Document_Relevance_Link: Relevance judgments for a document for a query.

  * Indirect_Speech: Stores the "who", "verb", "attribution" and "evaluation" values for a core nugget's indirect speech attributes.

  * Indirect_Speech_Evaluation: Possible "evaluation" values for each indirect speech entry. These are currently Positive, Negative and Neutral/Unknown.

  * Language: Possible "language" values for each document. These are currently "Arabic", "Chinese" and "English".

  * Modification_Type: Possible "modification" values for modifier nuggets. These are currently Temporal, Location and Other.

  * Nug: All cross-language nugs. These are equivalent to supernugs, since they may include nuggets from multiple documents and languages. An optional text attribute is included for nugs that do not have English components, and an optional modification type attribute is included for modifying nugs. BAE's relevance weight judgment for English nugs is also recorded here.

  * Nugget: All nuggets. This table records the snippet that a nugget came from, the nug that it belongs to, and its text. If the nugget modifies another nugget from the same snippet, this table also records the character offset at which the modifying text should be inserted into the text of the nugget being modified to obtain the full text of the nugget.

  * Query: All cross-language queries. This table records only the id and label for a query.

  * Query_Language_Link: All query-language pairs. This table records the text of each query for each language. Relevance judgments and searches are associated with entries in this table, not with entries in the cross-language Query table.

  * Relevance: Possible "relevance" values for document judgments. These are currently "Yes", "No" and "Dup" (duplicate document).

  * Search: All searches that were run against LDC's internal search engine when looking for relevant documents. For each search, the query that was being worked on and the text of the search are stored.

  * Snippet: All snippets recorded for all documents. For each snippet, this table records the start and end offsets, the document, and the query the snippet is relevant to.

  * Supernug: The use of this table has diverged from the meaning of "Supernug" in the Distillation community. It groups a nug with all of the modifying nugs related to that nug.

For both output formats, character offsets (not byte offsets) are calculated on UTF-8 versions of the TDT source data. Only characters inside the ... tags are counted. Each newline character is counted as one character (the source files use UNIX-style end-of-line characters). All characters inside the ... tags are counted, including any other XML tags that may be embedded. Offsets count the spaces between characters, not the characters themselves, so that a one-character string has a length (end - start) of one, not zero.

Search text in the raw database output is URL-encoded. Search text in the XML output is not URL-encoded; instead, XML special characters (e.g. "&") have been encoded as XML entities.
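As a rough illustration of these conventions, the Python sketch below extracts a snippet by character offsets from a UTF-8 source file and URL-decodes a search string from the database dump. The path, offsets and search string are placeholders, and locating the tagged text region over which offsets are counted is left out; the DTD and annotation specification remain the authoritative references.

  import urllib.parse

  # Hypothetical path and offsets; real values come from the Snippet table
  # or from the per-query XML files.
  source_path = "path/to/tdt_source.tkn_sgm"
  start, end = 1042, 1107

  # Offsets are character offsets (not byte offsets) over the UTF-8 decoded
  # text, with each newline counted as a single character.
  with open(source_path, encoding="utf-8", newline="") as f:
      text = f.read()

  # In the corpus, offsets are counted only over the characters inside the
  # tagged text region of the document (see above); this sketch uses the
  # whole file as a stand-in for that region.
  region = text
  snippet = region[start:end]   # a snippet of (end - start) characters

  # Search text in the raw database dump is URL-encoded.
  decoded_search = urllib.parse.unquote("human%20rights%20report")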
6. Contact Information

If you have questions about this data release, please contact the following personnel at LDC:

  Zhiyi Song         - GALE Distillation Project Manager
  Stephanie Strassel - GALE Project Manager
  Robert Parker      - GALE Distillation Programmer
  Kazuaki Maeda      - Technical Consultant/Manager

7. Sponsorship

This work was supported in part by the Defense Advanced Research Projects Agency, GALE Program Grant No. HR0011-06-1-0003. The content of this release does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

---
README created June 12, 2006, Olga Babko-Malaya
       updated June 20, 2006, Julie Medero
       updated March 12, 2007, Zhiyi Song
       updated March 16, 2007, Stephanie Strassel