Machine Reading Phase 1 IC Training Data

Item Name: Machine Reading Phase 1 IC Training Data
Author(s): Heather Simpson, Stephanie Strassel, Jonathan Wright, Kira Griffitt
LDC Catalog No.: LDC2020T04
ISBN: 1-58563-916-8
ISLRN: 013-884-229-405-9
Release Date: February 17, 2020
Member Year(s): 2020
DCMI Type(s): Text
Data Source(s): newswire
Project(s): Machine Reading
Application(s): machine reading, knowledge representation, information extraction
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2020T04 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Simpson, Heather, et al. Machine Reading Phase 1 IC Training Data LDC2020T04. Web Download. Philadelphia: Linguistic Data Consortium, 2020.


Machine Reading Phase 1 IC Training Data was developed by the Linguistic Data Consortium and contains 248 English source documents and 116 standoff annotation files used in the DARPA (Defense Advanced Research Projects Agency) Machine Reading program.

The Machine Reading (MR) program aimed to develop automated reading systems to bridge the gap between knowledge contained in natural language texts and knowledge accessible to formal reasoning systems. The reading systems designed by program participants were required to extract and reason about facts from text in multiple domains.

The data in this release constitutes the training data for the IC (Core Domain) task. The IC Use Cases tested the core domain by extracting information about Entities (people, organizations, and geopolitical entities, or "GPEs") and their involvement in four types of Relations, as described in newswire text: Attack Relations (e.g., bombings), Biographical Relations (e.g., being a citizen of a country), Affiliation Relations (e.g., being a leader of an organization), and Family Relations (e.g., having a spouse). This information was then aligned with an IC Use Cases ontology that would allow automated reasoning about the extracted Entities and Relations.


This release contains 248 source documents (108,960 words) from English newswire stories in English Gigaword Fourth Edition (LDC2009T13). Roughly half of those documents (116) were annotated for IC/Core Use Cases. Annotation was non-exhaustive, but an attempt was made to provide instances of all relations and their arguments where explicitly stated in a single sentence, as well as some non-explicit relations, which were marked with an "Inferred" tag by the annotator.
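As a rough illustration of the annotation scheme described above, the following Python sketch models the three Entity types, the four Relation types, and the "Inferred" tag. All class, field, and function names here are hypothetical and chosen for illustration only; they are not taken from the release's DTDs, schemas, or ontology, and the example entities are invented.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List, Tuple


class EntityType(Enum):
    # The three Entity types named in the task description.
    PERSON = "person"
    ORGANIZATION = "organization"
    GPE = "geopolitical entity"


class RelationType(Enum):
    # The four Relation types named in the task description.
    ATTACK = "Attack"
    BIOGRAPHICAL = "Biographical"
    AFFILIATION = "Affiliation"
    FAMILY = "Family"


@dataclass(frozen=True)
class Entity:
    name: str
    entity_type: EntityType


@dataclass(frozen=True)
class Relation:
    relation_type: RelationType
    arguments: Tuple[Entity, ...]
    # Hypothetical flag mirroring the annotators' "Inferred" tag for
    # relations that are not explicitly stated in a single sentence.
    inferred: bool = False


def explicit_only(relations: List[Relation]) -> List[Relation]:
    """Return only the relations that were not marked as inferred."""
    return [r for r in relations if not r.inferred]


# Invented example data, not drawn from the corpus.
jane = Entity("Jane Doe", EntityType.PERSON)
john = Entity("John Doe", EntityType.PERSON)
corp = Entity("Example Corp", EntityType.ORGANIZATION)

relations = [
    Relation(RelationType.AFFILIATION, (jane, corp)),
    Relation(RelationType.FAMILY, (jane, john), inferred=True),
]

explicit = explicit_only(relations)
```

Keeping the inferred flag on the relation itself, rather than in a separate list, makes it straightforward to train or evaluate on explicit relations only.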

Annotations are in GUI XML (traditional annotation) and RDF XML (formal knowledge representation) formats. A second set of GUI XML is provided with additional, unofficial annotations. All source and annotation files are presented as UTF-8 encoded XML files with associated DTDs, schemas, or ontologies.


The Linguistic Data Consortium gratefully acknowledges the support of the Defense Advanced Research Projects Agency (DARPA) Machine Reading Program under Air Force Research Laboratory (AFRL) prime contract no. FA8750-09 C-xxxx. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the author(s) and do not necessarily reflect the views of DARPA, AFRL, or the US government.


