Machine Reading (MR) Phase 1 NFL Scoring Training Data
                      Linguistic Data Consortium

1.0 Overview

This package constitutes the complete version of the Machine Reading
Phase 1 NFL Scoring Training data. This release contains 110 NFL
scoring source documents and the corresponding standoff annotation
files created by LDC, provided in two representations: formal knowledge
(110 files) and traditional annotation (110 files).

The Machine Reading (MR) program aimed to develop automated reading
systems to bridge the gap between knowledge contained in natural
language texts and knowledge accessible to formal reasoning
systems. The reading systems designed by Machine Reading participants
were required to extract and reason about facts from text in multiple
domains.

In Phase 1 of the program, the NFL Scoring Use Case tested the domain
of sports by extracting information about scoring events and outcomes
of games in NFL football, and aligning that information with an NFL
Scoring ontology (formal knowledge representation).

The set of Machine Reading components required for this effort
comprises the NFL Scoring Use Cases (also referred to as Use Cases 1
and 2). Annotation categories were defined in alignment with the NFL
Scoring ontology, and formal knowledge output was incorporated into
the configuration of the Machine Reading NFL Scoring annotation tool.

The data in this package was used by Machine Reading participants as
training data for the NFL Scoring Use Cases evaluation.

Summary of data included in this package:

+-------------+-------------+-------------+-----------------+
| source data | source data | annotations | RDF statements* |
|   (files)   |   (words)   |             |                 |
+-------------+-------------+-------------+-----------------+
|         110 |       70233 |        9029 |           75552 |
+-------------+-------------+-------------+-----------------+

* Note that RDF statements are produced from manual text annotations
  (in accordance with the MR NFL ontology), and thus encode knowledge
  about a text annotation at several levels of abstraction. As such,
  there is not a one-to-one correspondence between text annotations
  and RDF statements. (See docs/NFL-P2dryrun-scoring.rdf for details
  about the MR NFL scoring ontology.)

2.0 Contents

This release comprises the following components and directories:

  data/annotation

This directory contains 110 standoff annotation files, corresponding
to the 110 source data files, in both GUI XML (traditional annotation)
and RDF XML (formal knowledge representation) formats. The source
files were read and manually annotated for instances of NFL Scoring
annotation categories.

Please note that the following 5 source files did not contain any
instances of NFLScoring relations, and thus their corresponding
gui.xml and rdf.xml files do not contain any annotations or RDF
statements, respectively.

APW_ENG_19980401.1875
NYT_ENG_19980111.0176
NYT_ENG_19980111.0254
NYT_ENG_19981229.0365
VOA20010111.2000.1065

  data/annotation/gui_xml/

This directory contains 110 GUI XML files produced simultaneously
with the annotation files in data/annotation/rdf_xml/. All files were
validated against the provided mr-annotation-0-1.dtd.

  data/annotation/rdf_xml/

This directory contains 110 RDF XML files produced simultaneously with
the annotation files in data/annotation/gui_xml/.

  data/source/src_xml

This directory contains 110 source data files in Machine Reading
source data XML format (validated by mr-source-0-6.dtd).

  docs/gui_classes_hierarchy.txt
  docs/gui_classes_output.txt
  docs/gui_readme.txt

Text files describing the annotation tool and its output.

  docs/mr-annotation-0-1.dtd     

DTD for validating the GUI XML annotation files in the
data/annotation/gui_xml/ directory.

  docs/MR_P1_NFLScoring_Annotation_Guidelines_V1.0.pdf

Annotation guidelines under which the NFL scoring annotations in
this corpus were produced.

  docs/mr-source-0-6.dtd

DTD for validating the source data files in the data/source/src_xml
directory.

  docs/NFL-P2dryrun-scoring.rdf

Latest version of the ontology under which the RDF XML files in the
data/annotation/rdf_xml directory were produced.

  docs/README.txt

This file.

3.0 Annotations and Character Offsets

All annotations are standoff annotations. Although the source files
are valid XML, for the purposes of annotation they are considered
unstructured UTF-8 character arrays, where each character offset N
points to the Nth character (NB: not byte) in the file, beginning at
0. Note that this includes newlines; all newlines are Unix-style,
therefore one character. Since there is one document per file, there
is no distinction between the two in terms of annotation. Consider the
vacuous document:

  <DOC>
  <TEXT>
  blah
  </TEXT>
  </DOC>

An annotation file might contain <text beg="13" end="16">blah</text>.
There are 13 characters preceding "blah" (including the two newlines),
and both offsets are inclusive, so the length of "blah" is equal to
end-beg+1. The same offset counting approach is used in both GUI XML
and RDF XML. Text extents are also included in both.
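
The offset convention above can be illustrated with a minimal Python
sketch. It is not part of the release tooling: the function names
read_source and extent are hypothetical, and the sample string simply
reproduces the vacuous document without the README's indentation.

```python
def read_source(path: str) -> str:
    """Read a source file as a UTF-8 character array (hypothetical helper).

    newline="" turns off newline translation, so each Unix newline stays
    a single character and offsets are not shifted.
    """
    with open(path, encoding="utf-8", newline="") as handle:
        return handle.read()


def extent(source_text: str, beg: int, end: int) -> str:
    """Return the characters from beg to end, both inclusive, 0-based."""
    return source_text[beg:end + 1]


# The vacuous document from this README, one tag or word per line.
source_text = "<DOC>\n<TEXT>\nblah\n</TEXT>\n</DOC>\n"

span = extent(source_text, 13, 16)
print(span, len(span))
```

Because offsets count characters rather than bytes, the file should be
decoded as UTF-8 before slicing; slicing the raw bytes would give wrong
extents for any document containing non-ASCII characters.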

Annotations appear as <text> elements in the .gui.xml files, as in the
example above. Besides the offset attributes and the ID attribute,
all <text> elements have a "type" attribute. A type="manual" element
represents text selected by annotators, while a type="sentence"
element represents text determined automatically to be the containing
sentence of the "manual" text. An element such as <text type="manual"
inferred="true"> indicates that the annotator selected the "Inferred"
checkbox.
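
As a rough illustration of reading these attributes, the following
Python sketch collects the type="manual" elements from an annotation
string. Only the <text> element and its beg, end, type and inferred
attributes are taken from this README; the <annotations> wrapper, the
sample content and the function name manual_extents are assumptions
made for the example, and the real .gui.xml layout is defined by
docs/mr-annotation-0-1.dtd.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the wrapper element is invented for illustration.
sample = """<annotations>
<text beg="13" end="16" type="manual" inferred="true">blah</text>
<text beg="13" end="16" type="sentence">blah</text>
</annotations>"""


def manual_extents(xml_string: str) -> list:
    """Return one dict per <text type="manual"> element in the string."""
    root = ET.fromstring(xml_string)
    results = []
    for element in root.iter("text"):
        if element.get("type") != "manual":
            continue  # skip automatically determined sentence extents
        results.append({
            "beg": int(element.get("beg")),
            "end": int(element.get("end")),
            "text": element.text,
            # The attribute is treated as absent unless it equals "true".
            "inferred": element.get("inferred") == "true",
        })
    return results


manual = manual_extents(sample)
for item in manual:
    print(item["beg"], item["end"], item["text"], item["inferred"])
```

Filtering on type keeps annotator-selected extents separate from the
automatically determined sentence extents that contain them.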

4.0 Annotation Approach

Annotation is non-exhaustive, but an attempt was made to provide
instances of all relations and their arguments where explicitly stated
in a single sentence, as well as some non-explicit
relations. Non-explicit relations were provided at the annotator's
discretion, and are marked with an "Inferred" tag by the annotator.

Explicitness was considered a subjective judgment on the part of the
annotator, with the exception of ScoringCounts, where annotators were
instructed to provide an "Inferred" tag if the NFLTeam argument was
not explicit in the sentence. 

Please refer to docs/MR_P1_NFLScoring_Annotation_Guidelines_V1.0.pdf
for more information.

5.0 Domain-Specific Reasoning System (DSRS)

In the Machine Reading program, an official Domain-Specific Reasoning
System (DSRS) was provided to performers to allow them to access
background knowledge and make inferences about a specific reading-task
(domain).

A subsequent version of this corpus will include links to an unofficial
DSRS interface that non-MR researchers can use to access background
knowledge about the NFL Scoring domain, and make inferences about the
formal NFL Scoring knowledge encoded in the RDF XML annotation files
in this package.

This DSRS interface, along with the NFL scoring ontology file in the
docs/ directory, should allow researchers outside the Machine Reading
program to interact with and reason about the NFL Scoring data in this
package in a way similar to MR researchers.

6.0 Acknowledgments

  Linguistic Data Consortium (LDC) gratefully acknowledges the support
  of the Defense Advanced Research Projects Agency (DARPA) Machine
  Reading Program under Air Force Research Laboratory (AFRL) prime
  contract no. FA8750-09 C-xxxx. Any opinions, findings, and conclusions
  or recommendations expressed in this material are those of the
  author(s) and do not necessarily reflect the views of DARPA, AFRL, or
  the US government.

  Our thanks to Global InfoTek (GITI) for developing the ontology for
  mapping from text annotations in this corpus to formal knowledge,
  and granting permission for the ontology to be redistributed with
  this corpus.

  Finally, our thanks to Science Applications International Corp
  (SAIC) Advanced Systems and Concepts for their work designing and
  coordinating evaluations for the Machine Reading Program.

7.0 Copyright Information

  Portions © 1995-1996, 2002-2005 Agence France Presse, © 1998,
  2000-2001 The Associated Press, © 1994, 1996, 1998, 2005 New York
  Times, © 2019 Trustees of the University of Pennsylvania

8.0 Authors

For further information about the contents of this corpus, please contact
the following project staff at LDC:

  Heather Simpson, Project Manager <hsimpson@ldc.upenn.edu> 
  Stephanie Strassel, PI           <strassel@ldc.upenn.edu> 
  Jonathan Wright, Technical Lead  <jdwright@ldc.upenn.edu> 
  Kira Griffitt, Lead Annotator    <kiragrif@ldc.upenn.edu>

--------------------------------------------------------------------------
README created by Kira Griffitt on April 5, 2013
README updated by Kira Griffitt on April 10, 2013
README updated by Kira Griffitt on November 10, 2015
README updated by Kira Griffitt on November 13, 2015  
README updated by Jonathan Wright on November 13, 2015  
README updated by Kira Griffitt on November 8, 2016
README updated by Kira Griffitt on November 9, 2016
README updated by Kira Griffitt on October 26, 2018
README updated by Kira Griffitt on March 24, 2019