Machine Reading (MR) Phase 1 IC Training Data
Linguistic Data Consortium

1.0 Overview

This package constitutes the complete version of Machine Reading Phase 1
IC (Core Domain) Training data. This release contains 248 source
documents and 116 standoff annotation files created by LDC, in both
formal knowledge and traditional annotation representations.

The Machine Reading (MR) program aimed to develop automated reading
systems to bridge the gap between knowledge contained in natural
language texts and knowledge accessible to formal reasoning systems. The
reading systems designed by Machine Reading participants were required
to extract and reason about facts from text in multiple domains.

In Phase 1 of Machine Reading, the IC Use Cases (also referred to as the
"Core Domain" or "Use Cases 3-6") tested the core domain of the MR
program by extracting information about Entities (people,
organizations, geopolitical entities or "GPEs") and their involvement in
four types of Relations: Attack Relations (e.g. bombings), Biographical
Relations (e.g. being a citizen of a country), Affiliation Relations
(e.g. being a leader of an organization), and Family Relations (e.g.
having a spouse) as described in newswire text. This information was
then aligned with an IC Use Cases ontology (formal knowledge
representation) that would allow automated reasoning about the extracted
Entities and Relations. The set of Machine Reading components required
for this effort comprises the IC/Core Domain Use Cases.

Annotation categories were defined in alignment with the IC Use Cases
ontology, and formal knowledge output was incorporated into the
configuration of the Machine Reading IC Use Cases annotation tool.

The data in this package was provided to Machine Reading participants as
training data for the IC Use Cases evaluation.
Summary of data included in this package:

+-------------+-------------+-------------+-------------+-----------------+
| source data | source data | annotations | annotations | RDF statements* |
| (files)     | (words)     |             | (extended)  |                 |
+-------------+-------------+-------------+-------------+-----------------+
| 248         | 108960      | 34943       | 35802       | 60055           |
+-------------+-------------+-------------+-------------+-----------------+

* NOTE: RDF statements are produced from manual text annotations (in
  accordance with the MR IC Use Cases ontology), and thus encode
  knowledge about a text annotation at several levels of abstraction. As
  such, there is not a one-to-one correspondence between text
  annotations and RDF statements. (See ./docs/IC-use-cases_20100617.rdf
  for details about the MR IC Use Cases ontology.)

2.0 Contents

This release comprises the following components and directories:

./data/annotation/

    This directory contains 116 standoff annotation files in both GUI
    XML (traditional annotation) and RDF XML (formal knowledge
    representation) formats.

./data/annotation/gui_xml/

    This directory contains 116 LDC GUI XML files produced
    simultaneously with the annotation files in the
    ./data/annotation/rdf_xml/ directory. These gui_xml files were
    created by removing annotations that were inconsistent with the Use
    Cases 3-6 ontology from the files in the
    ./data/annotation/gui_xml_extended/ directory.

./data/annotation/gui_xml_extended/

    This directory contains 116 LDC GUI XML files with additional,
    unofficial annotations that would have been invalid once converted
    to RDF. These annotations are provided because they are considered
    interesting for research in the IC/Core Domain.

./data/annotation/rdf_xml/

    This directory contains 116 RDF XML files produced simultaneously
    with the annotation files in the ./data/annotation/gui_xml/
    directory.

./data/source/src_xml/

    This directory contains 248 source data files in Machine Reading
    source data XML format.
    NOTE: Only a subset (116) of these documents were tagged for
    IC/Core Use Cases annotation categories. However, the remaining
    (132) source documents have been provided because they were
    determined to be on-topic for the IC/Core Domain, and may be of
    interest or use to researchers.

./docs/files.md5

    Checksum of all files under the ./data/ directory in this release.

./docs/property-histogram.txt

    Histogram of RDF/OWL properties.

./docs/IC-use-cases.cfg

    Annotation tool configuration file. See Section 3 for more
    information. Text files describing the annotation tool and its
    output.

./docs/IC-use-cases.rng

    RELAX NG XML schema for the GUI XML annotation files.

./docs/MR_IC_Guidelines_V2.2.pdf

    Annotation guidelines under which the IC/Core Domain annotations in
    this corpus were produced.

./docs/mr-source-0-6.dtd

    DTD for validating the source data files in the
    ./data/source/src_xml/ directory.

./docs/IC-use-cases_20100617.rdf

    Latest version of the ontology under which the RDF XML files in the
    ./data/annotation/rdf_xml/ directory were produced.

./docs/README.txt

    This file.

3.0 Annotation Format Details

The annotation tool config file, IC-use-cases.cfg, was used by the
annotation tool to specify the structure of the GUI XML as well as to
create the RDF XML. The elements defined there, and the tree structure
defined via the "children" attributes, are replicated in the GUI XML.
The "rdf" function in each element of the config file is used to map
elements of the GUI XML into RDF statements. Each rdf function is a
series of case statements that conditionally output the numbered RDF
triples stated within. Statements beginning with "provenance" produce
text provenance rather than assumption set triples.

4.0 Annotations and Character Offsets

All annotations are standoff annotations.
Although the source files are valid XML, for the purposes of annotation
they are considered unstructured UTF-8 character arrays, where each
character offset N points to the Nth character (NB: not byte) in the
file, beginning at 0. Note that this includes newlines; all newlines are
Unix-style, and therefore count as one character. Since there is one
document per file, there is no distinction between the two in terms of
annotation.

Consider the vacuous document:

    blah

An annotation file might contain

    blah

There are 13 characters preceding "blah", and the length of "blah" is
equal to end-beg+1.

The same offset counting approach is used in both GUI XML and RDF XML.
Text extents are also included in both.

Annotations appear as elements in the .gui.xml files, as in the example
above. Besides the offset attributes and the ID attribute, all elements
have a "type" attribute. A type="manual" element represents text
selected by annotators, while a type="sentence" element represents text
determined automatically to be the containing sentence of the "manual"
text. An element such as indicates that the annotator selected the
"Inferred" checkbox.

5.0 Annotation Approach

Annotation is non-exhaustive, but an attempt was made to provide
instances of all relations and their arguments where explicitly stated
in a single sentence, as well as some non-explicit relations, which were
marked with an "Inferred" tag by the annotator. Relations and arguments
were marked "Inferred" if the annotator determined that a relation or an
argument was taggable according to the Reasonable Interpretation rule,
but only if information from outside of the current sentence was taken
into account.

Please refer to ./docs/MR_IC_Guidelines_V2.2.pdf for more information
about the Reasonable Interpretation rule, and tagging Inferred relations
and arguments.
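As an illustration of the character-offset convention described in
Section 4.0, the following Python sketch recovers an annotated extent
from 0-based, inclusive character offsets. It is only a sketch under
stated assumptions: the wrapper markup in the sample document is
hypothetical (chosen so that exactly 13 characters precede "blah"), and
the names extent, read_chars, beg and end are illustrative rather than
part of this release.

```python
# Sketch of the offset convention: offsets count characters (not bytes),
# start at 0, and the end offset is inclusive.

# Hypothetical document. The wrapper tags are an assumption, chosen only
# so that 13 characters (including two newlines) precede "blah".
source = "<DOC>\n<TEXT>\nblah\n</TEXT>\n</DOC>\n"


def extent(text: str, beg: int, end: int) -> str:
    """Return the characters from beg to end, both inclusive."""
    return text[beg:end + 1]


def read_chars(path: str) -> str:
    """Read a source file as a character array.

    newline="" turns off newline translation, so each stored newline
    is kept exactly as it appears in the file.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


beg = source.index("blah")      # 13
end = beg + len("blah") - 1     # 16, so the length is end - beg + 1

# Character offsets and byte offsets diverge after any non-ASCII
# character, which is why the file must be decoded before slicing.
accented = "caf\u00e9 blah"
char_offset = accented.index("blah")                   # 5
byte_offset = accented.encode("utf-8").index(b"blah")  # 6
```

With a real source file, read_chars("some-file.xml") (a placeholder
path) would supply the text, and the offsets stored in an annotation
file could then be checked against the text extent recorded alongside
them.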
6.0 Acknowledgments

Linguistic Data Consortium (LDC) gratefully acknowledges the support of
the Defense Advanced Research Projects Agency (DARPA) Machine Reading
Program under Air Force Research Laboratory (AFRL) prime contract no.
FA8750-09 C-xxxx. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s)
and do not necessarily reflect the view of DARPA, AFRL, or the US
government.

Our thanks to Global InfoTek (GITI) for developing the ontology for
mapping from text annotations in this corpus to formal knowledge, and
for granting permission for the ontology to be redistributed with this
corpus.

Finally, our thanks to Science Applications International Corp (SAIC)
Advanced Systems and Concepts for their work designing and coordinating
evaluations for the Machine Reading Program.

7.0 Copyright Information

Portions © 1994-1997, 2001-2006 Agence France Presse,
© 2002 An Nahar,
© 1995-1998, 2000-2001, 2005-2006 The Associated Press,
© 1996-1998, 2004, 2006 Los Angeles Times-Washington Post News Service,
Inc.,
© 1994-2002, 2004-2006 New York Times,
© 1994 Reuters America, Inc.,
© 1995-2006 Xinhua News Agency,
© 2019 Trustees of the University of Pennsylvania

8.0 Authors

For further information about the contents of this corpus, please
contact the following project staff at LDC:

    Stephanie Strassel, PI
    Jonathan Wright, Technical Lead
    Kira Griffitt, Lead Annotator

--------------------------------------------------------------------------
README created by Kira Griffitt on April 18, 2017
README updated by Kira Griffitt on May 1, 2017
README updated by Kira Griffitt on May 2, 2017
README updated by Kira Griffitt on May 3, 2017
README updated by Kira Griffitt on October 26, 2018
README updated by Kira Griffitt on March 24, 2019
README updated by Daniel Jaquette on February 4, 2020