Machine Reading (MR) Phase 1 IC Training Data
Linguistic Data Consortium
1.0 Overview
This package constitutes the complete version of Machine Reading Phase
1 IC (Core Domain) Training data. This release contains 248 source
documents and 116 standoff annotation files created by LDC, in both
formal knowledge and traditional annotation representations.
The Machine Reading (MR) program aimed to develop automated reading
systems to bridge the gap between knowledge contained in natural
language texts and knowledge accessible to formal reasoning
systems. The reading systems designed by Machine Reading participants
were required to extract and reason about facts from text in multiple
domains.
In Phase 1 of Machine Reading, the IC Use Cases (also referred to as
the "Core Domain" or "Use Cases 3-6") tested the core domain of the MR
program by extracting information about Entities (people,
organizations, geopolitical entities or "GPEs") and their involvement
in four types of Relations: Attack Relations (e.g. bombings),
Biographical Relations (e.g. being a citizen of a country),
Affiliation Relations (e.g. being a leader of an organization), and
Family Relations (e.g. having a spouse) as described in newswire
text. This information was then aligned with an IC Use Cases ontology
(formal knowledge representation) that would allow automated reasoning
about the extracted Entities and Relations.
The set of Machine Reading components required for this effort
comprises the IC/Core Domain Use Cases. Annotation categories were
defined in alignment with the IC Use Cases ontology, and formal
knowledge output was incorporated into the configuration of the
Machine Reading IC Use Cases annotation tool.
The data in this package was provided to Machine Reading participants
as training data for the IC Use Cases evaluation.
Summary of data included in this package:
+-------------+-------------+-------------+-------------+-----------------+
| source data | source data | annotations | annotations | RDF statements* |
| (files)     | (words)     |             | (extended)  |                 |
+-------------+-------------+-------------+-------------+-----------------+
| 248         | 108960      | 34943       | 35802       | 60055           |
+-------------+-------------+-------------+-------------+-----------------+
* NOTE: RDF statements are produced from manual text annotations (in
accordance with MR IC Use Cases ontology), and thus encode knowledge
about a text annotation at several levels of abstraction. As such,
there is not a one-to-one correspondence between text annotations
and RDF statements.
(See ./docs/IC-use-cases_20100617.rdf for details about the MR IC
Use Cases ontology).
2.0 Contents
This release comprises the following components and directories:
./data/annotation/
This directory contains 116 standoff annotation files in both GUI XML
(traditional annotation) and RDF XML (formal knowledge representation)
formats.
./data/annotation/gui_xml/
This directory contains 116 LDC GUI XML files produced simultaneously
with the annotation files in the ./data/annotation/rdf_xml/
directory. These gui_xml files were created by removing annotations
that were inconsistent with the Use Cases 3-6 ontology from the files
in the ./data/annotation/gui_xml_extended/ directory.
./data/annotation/gui_xml_extended/
This directory contains 116 LDC GUI XML files with additional,
unofficial annotations that would have been invalid once converted to
RDF. These annotations are provided because they are considered
interesting for research in the IC/Core Domain.
./data/annotation/rdf_xml/
This directory contains 116 RDF XML files produced simultaneously with
the annotation files in the ./data/annotation/gui_xml/ directory.
./data/source/src_xml/
This directory contains 248 source data files in Machine Reading
source data XML format.
NOTE: Only a subset (116) of these documents was tagged for IC/Core
Use Cases annotation categories. However, the remaining (132) source
documents have been provided because they were determined to be
on-topic for the IC/Core Domain, and may be of interest or use to
researchers.
./docs/files.md5
Checksum of all files under the ./data/ directory in this release.
./docs/property-histogram.txt
Histogram of RDF/OWL properties.
./docs/IC-use-cases.cfg
Annotation tool configuration file. See Section 3 for more
information.
Text files describing the annotation tool and its output.
./docs/IC-use-cases.rng
RELAX NG XML schema for the GUI XML annotation files.
./docs/MR_IC_Guidelines_V2.2.pdf
Annotation guidelines under which the IC/Core Domain annotations in
this corpus were produced.
./docs/mr-source-0-6.dtd
DTD for validating the source data files in the ./data/source/src_xml/
directory.
./docs/IC-use-cases_20100617.rdf
Latest version of the ontology under which the RDF XML files in the
./data/annotation/rdf_xml/ directory were produced.
./docs/README.txt
This file.
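The checksum list in ./docs/files.md5 can be used to confirm that the
files under ./data/ were transferred intact. The following is a minimal
Python sketch, not an LDC-provided tool. It assumes that files.md5
follows the common md5sum layout (one hex digest, whitespace, then a
path relative to the package root per line); that layout is an
assumption and is not documented in this README. The function names
(md5_of, parse_manifest, verify) are hypothetical.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_manifest(text):
    """Parse md5sum-style lines into (hexdigest, path) pairs.

    ASSUMPTION: each non-empty line is "<digest><whitespace><path>",
    optionally with the "*" binary-mode marker before the path.
    """
    entries = []
    for line in text.splitlines():
        if not line.strip():
            continue
        digest, path = line.split(None, 1)
        entries.append((digest.lower(), path.strip().lstrip("*")))
    return entries


def verify(manifest_path, root="."):
    """Return the listed paths that are missing or whose digest differs."""
    failures = []
    text = Path(manifest_path).read_text(encoding="utf-8")
    for expected, rel in parse_manifest(text):
        target = Path(root) / rel
        if not target.is_file() or md5_of(target) != expected:
            failures.append(rel)
    return failures
```

Run from the package root, verify("docs/files.md5") would return an
empty list if every listed file is present and matches its checksum,
provided the layout assumption above holds.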
3.0 Annotation Format Details
The annotation tool config file, IC-use-cases.cfg, was used by the
annotation tool to specify the structure of the GUI XML as well as
create the RDF XML.
The elements defined there, and the tree structure defined via the
"children" attributes, are replicated in the GUI XML. The "rdf"
function in each element of the config file is used to map elements of
the GUI XML into RDF statements. Each rdf function is a series of case
statements that conditionally output the numbered RDF triples stated
within. Statements beginning with "provenance" produce text provenance
rather than assumption set triples.
4.0 Annotations and Character Offsets
All annotations are standoff annotations. Although the source files
are valid XML, for the purposes of annotation they are considered
unstructured UTF-8 character arrays, where each character offset N
points to the Nth character (NB: not byte) in the file, beginning at
0. Note that this includes newlines; all newlines are Unix-style,
therefore one character. Since there is one document per file, there
is no distinction between the two in terms of annotation. Consider the
vacuous document:
blah
An annotation file might contain blah.
There are 13 characters preceding "blah", so its beg offset is 13 and
its end offset is 16. End offsets are inclusive, so the length of
"blah" is equal to end-beg+1. The same offset counting approach is used
in both GUI/XML and RDF/XML. Text extents are also included in both.
Annotations appear as elements in the .gui.xml files, as in the
example above. Besides the offset attributes and the ID attribute, all
elements have a "type" attribute. A type="manual" element
represents text selected by annotators, while a type="sentence"
element represents text determined automatically to be the containing
sentence of the "manual" text. An element such as indicates that the annotator selected the "Inferred"
checkbox.
5.0 Annotation Approach
Annotation is non-exhaustive, but an attempt was made to provide
instances of all relations and their arguments where explicitly stated
in a single sentence, as well as some non-explicit relations, which
were marked with an "Inferred" tag by the annotator.
Relations and arguments were marked "Inferred" if the annotator
determined that a relation or an argument was taggable according to the
Reasonable Interpretation rule, but only if information from outside
of the current sentence was taken into account.
Please refer to ./docs/MR_IC_Guidelines_V2.2.pdf for more information
about the Reasonable Interpretation rule, and tagging Inferred
relations and arguments.
6.0 Acknowledgments
Linguistic Data Consortium (LDC) gratefully acknowledges the support
of Defense Advanced Research Projects Agency (DARPA) Machine Reading
Program under Air Force Research Laboratory (AFRL) prime contract
no. FA8750-09 C-xxxx. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the
author(s) and do not necessarily reflect the view of DARPA,
AFRL, or the US government.
Our thanks to Global InfoTek (GITI) for developing the ontology for
mapping from text annotations in this corpus to formal knowledge,
and granting permission for the ontology to be redistributed with
this corpus.
Finally, our thanks to Science Applications International Corp
(SAIC) Advanced Systems and Concepts for their work designing and
coordinating evaluations for the Machine Reading Program.
7.0 Copyright Information
Portions © 1994-1997, 2001-2006 Agence France Presse, © 2002 An
Nahar, © 1995-1998, 2000-2001, 2005-2006 The Associated Press, ©
1996-1998, 2004, 2006 Los Angeles Times-Washington Post News
Service, Inc., © 1994-2002, 2004-2006 New York Times, © 1994 Reuters
America, Inc., © 1995-2006 Xinhua News Agency, © 2019 Trustees of
the University of Pennsylvania
8.0 Authors
For further information about the contents of this corpus, please
contact the following project staff at LDC:
Stephanie Strassel, PI
Jonathan Wright, Technical Lead
Kira Griffitt, Lead Annotator
--------------------------------------------------------------------------
README created by Kira Griffitt on April 18, 2017
README updated by Kira Griffitt on May 1, 2017
README updated by Kira Griffitt on May 2, 2017
README updated by Kira Griffitt on May 3, 2017
README updated by Kira Griffitt on October 26, 2018
README updated by Kira Griffitt on March 24, 2019
README updated by Daniel Jaquette on February 4, 2020