Machine Reading (MR) Phase 1 NFL Scoring Training Data
Linguistic Data Consortium

1.0 Overview

This package constitutes the complete version of Machine Reading Phase 1
NFL Scoring Training data. This release contains 110 NFL scoring source
documents and 110 standoff annotation files created by LDC, in both formal
knowledge and traditional annotation representations.

The Machine Reading (MR) program aimed to develop automated reading systems
to bridge the gap between knowledge contained in natural language texts and
knowledge accessible to formal reasoning systems. The reading systems
designed by Machine Reading participants were required to extract and
reason about facts from text in multiple domains.

In Phase 1 of the program, the NFL Scoring Use Case tested the domain of
sports by extracting information about scoring events and outcomes of games
in NFL football, and aligning that information with an NFL Scoring ontology
(formal knowledge representation). The set of Machine Reading components
required for this effort comprises the NFL Scoring Use Cases (also referred
to as Use Cases 1 and 2).

Annotation categories were defined in alignment with the NFL Scoring
ontology, and formal knowledge output was incorporated into the
configuration of the Machine Reading NFL Scoring annotation tool.

The data in this package was used by Machine Reading participants as
training data for the NFL Scoring Use Cases evaluation.

Summary of data included in this package:

+-------------+-------------+-------------+-----------------+
| source data | source data | annotations | RDF statements* |
| (files)     | (words)     |             |                 |
+-------------+-------------+-------------+-----------------+
| 110         | 70233       | 9029        | 75552           |
+-------------+-------------+-------------+-----------------+

* Note that RDF statements are produced from manual text annotations (in
accordance with MR NFL ontology), and thus encode knowledge about a text
annotation at several levels of abstraction.
As such, there is not a one-to-one correspondence between text annotations
and RDF statements. (See docs/NFL-P2dryrun-scoring.rdf for details about
the MR NFL scoring ontology.)

2.0 Contents

This release comprises the following components and directories:

data/annotation

  This directory contains 110 standoff annotation files, corresponding to
  the 110 source data files, in both GUI XML (traditional annotation) and
  RDF XML (formal knowledge representation) formats. These files were read
  and manually annotated for instances of NFL Scoring annotation
  categories.

  Please note that the following 5 source files did not contain any
  instances of NFLScoring relations, and thus their corresponding gui.xml
  and rdf.xml files do not contain any annotations or RDF statements,
  respectively.

    APW_ENG_19980401.1875
    NYT_ENG_19980111.0176
    NYT_ENG_19980111.0254
    NYT_ENG_19981229.0365
    VOA20010111.2000.1065

data/annotation/gui_xml/

  This directory contains 110 GUI XML files produced simultaneously with
  the annotation files in data/annotation/rdf_xml/. All files were
  validated against the provided mr-annotation-0-1.dtd.

data/annotation/rdf_xml/

  This directory contains 110 RDF XML files produced simultaneously with
  the annotation files in data/annotation/gui_xml/.

data/source/src_xml

  This directory contains 110 source data files in Machine Reading source
  data XML format (validated by mr-source-0-6.dtd).

docs/gui_classes_hierarchy.txt
docs/gui_classes_output.txt
docs/gui_readme.txt

  Text files describing the annotation tool and its output.

docs/mr-annotation-0-1.dtd

  DTD for validating the GUI XML annotation files in the
  data/annotation/gui_xml/ directory.

docs/MR_P1_NFLScoring_Annotation_Guidelines_V1.0.pdf

  Annotation guidelines under which the NFL scoring annotations in this
  corpus were produced.

docs/mr-source-0-6.dtd

  DTD for validating the source data files in the data/source/src_xml
  directory.

docs/NFL-P2dryrun-scoring.rdf

  Latest version of the ontology under which the RDF XML files in the
  data/annotation/rdf_xml directory were produced.

docs/README.txt

  This file.

3.0 Annotations and Character Offsets

All annotations are standoff annotations. Although the source files are
valid XML, for the purposes of annotation they are considered unstructured
UTF-8 character arrays, where each character offset N points to the Nth
character (NB: not byte) in the file, beginning at 0. Note that this
includes newlines; all newlines are Unix-style, therefore one character.
Since there is one document per file, there is no distinction between the
two in terms of annotation.

Consider the vacuous document:

  blah

An annotation file might contain blah. There are 13 characters previous to
"blah" and the length of blah is equal to end-beg+1.

The same offset counting approach is used in both GUI/XML and RDF/XML.
Text extents are also included in both.

Annotations appear as elements in the .gui.xml files, as in the example
above. Besides the offset attributes and the ID attribute, all elements
have a "type" attribute. A type="manual" element represents text selected
by annotators, while a type="sentence" element represents text determined
automatically to be the containing sentence of the "manual" text. An
element such as indicates that the annotator selected the "Inferred"
checkbox.

4.0 Annotation Approach

Annotation is non-exhaustive, but an attempt was made to provide instances
of all relations and their arguments where explicitly stated in a single
sentence, as well as some non-explicit relations. Non-explicit relations
were provided at the annotator's discretion, and are marked with an
"Inferred" tag by the annotator. Explicitness was considered a subjective
judgment on the part of the annotator, with the exception of
ScoringCounts, where annotators were instructed to provide an "Inferred"
tag if the NFLTeam argument was not explicit in the sentence.
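
The offset convention described in section 3.0 (0-based character offsets,
inclusive end offset, newlines counted as one character) can be sketched in
Python as follows. This is only an illustration under stated assumptions,
not code shipped with the corpus: the wrapper markup in the sample text and
the helper names read_source and extent_text are hypothetical, chosen so
that exactly 13 characters precede "blah" as in the example in section 3.0.

```python
# Hypothetical source text. The wrapper markup is an assumption, chosen only
# so that exactly 13 characters precede "blah" as in the README's example.
source = "<DOC>\n<TEXT>\nblah\n</TEXT>\n</DOC>\n"


def read_source(path: str) -> str:
    """Decode a source file as UTF-8 so that indices count characters.

    newline="" disables newline translation, so each newline in the file
    stays exactly as stored (one character for Unix-style files).
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def extent_text(text: str, beg: int, end: int) -> str:
    """Return the characters from offset beg through offset end, inclusive."""
    return text[beg:end + 1]


beg = source.index("blah")      # 13: 0-based offset of the first character
end = beg + len("blah") - 1     # 16: offset of the last character (inclusive)

print(beg, end, end - beg + 1)          # 13 16 4
print(extent_text(source, beg, end))    # blah

# Offsets count characters, not bytes: one accented character is one offset
# position even though it occupies two bytes in UTF-8.
accented = "caf\u00e9"
print(len(accented), len(accented.encode("utf-8")))  # 4 5
```

Because the end offset is inclusive, slicing needs end + 1. Indexing the raw
bytes of a file instead of the decoded text would shift offsets after any
multi-byte character.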
Please refer to docs/MR_P1_NFLScoring_Annotation_Guidelines_V1.0.pdf for
more information.

5.0 Domain-Specific Reasoning System (DSRS)

In the Machine Reading program, an official Domain-Specific Reasoning
System (DSRS) was provided to performers to allow them to access background
knowledge and make inferences about a specific reading task (domain).

A subsequent version of this corpus will include links to an unofficial
DSRS interface that non-MR researchers can use to access background
knowledge about the NFL Scoring domain, and make inferences about the
formal NFL Scoring knowledge encoded in the RDF XML annotation files in
this package.

This DSRS interface, along with the NFL scoring ontology file in the docs/
directory, should allow researchers outside the Machine Reading program to
interact with and reason about the NFL Scoring data in this package in a
way similar to MR researchers.

6.0 Acknowledgments

Linguistic Data Consortium (LDC) gratefully acknowledges the support of
Defense Advanced Research Projects Agency (DARPA) Machine Reading Program
under Air Force Research Laboratory (AFRL) prime contract no.
FA8750-09 C-xxxx. Any opinions, findings, and conclusions or
recommendations expressed in this material are those of the author(s) and
do not necessarily reflect the view of the DARPA, AFRL, or the US
government.

Our thanks to Global InfoTek (GITI) for developing the ontology for mapping
from text annotations in this corpus to formal knowledge, and granting
permission for the ontology to be redistributed with this corpus.

Finally, our thanks to Science Applications International Corp (SAIC)
Advanced Systems and Concepts for their work designing and coordinating
evaluations for the Machine Reading Program.

7.0 Copyright Information

Portions © 1995-1996, 2002-2005 Agence France Presse,
© 1998, 2000-2001 The Associated Press,
© 1994, 1996, 1998, 2005 New York Times,
© 2019 Trustees of the University of Pennsylvania

8.0 Authors

For further information about the contents of this corpus, please contact
the following project staff at LDC:

  Heather Simpson, Project Manager
  Stephanie Strassel, PI
  Jonathan Wright, Technical Lead
  Kira Griffitt, Lead Annotator

--------------------------------------------------------------------------
README created by Kira Griffitt on April 5, 2013
README updated by Kira Griffitt on April 10, 2013
README updated by Kira Griffitt on November 10, 2015
README updated by Kira Griffitt on November 13, 2015
README updated by Jonathan Wright on November 13, 2015
README updated by Kira Griffitt on November 8, 2016
README updated by Kira Griffitt on November 9, 2016
README updated by Kira Griffitt on October 26, 2018
README updated by Kira Griffitt on March 24, 2019