BOLT English Co-reference -- DF, SMS/Chat, and CTS

Raytheon BBN Technologies

Authors: Nitin Agarwal, Michelle Franchini, Michelle Kappler,
         Linnea Micciulla, Sameer Pradhan, Lance Ramshaw


1. Introduction

The DARPA BOLT Program created new techniques for automated translation and
linguistic analysis that can be applied to the informal genres of text and
speech common to online and in-person communications. The BOLT data team,
led by the Linguistic Data Consortium (LDC), was responsible for collecting
informal data sources, including discussion forums, text messaging and chat,
and conversational telephone speech, in English, Chinese, and Egyptian
Arabic, and for applying annotations including translation, word alignment,
Treebanking, PropBanking, co-reference, and queries/responses.

This corpus contains the data produced for the co-reference annotation task
led by Raytheon BBN Technologies. Co-reference annotation aims to fill in
all of the connections between specific mentions in the text that refer to
the same entities and events in the discourse context. Co-reference here is
limited to noun phrases (including proper nouns, nominals, pronouns, and
null arguments), possessives, proper noun pre-modifiers, and verbs.

Co-reference annotation for this release was performed on the BOLT Treebank
annotation. Tokens resulting from Treebank annotation (including empty
categories/traces) are used directly for co-reference annotation. The data
covers three genres: Discussion Forum (DF), SMS/Chat, and Conversational
Telephone Speech (CTS).


2. Annotation Data Profile

  Language   Genre       Files   SourceTokens
  --------------------------------------------
  English    DF            848         415159
  English    SMS/Chat      483         100291
  English    CTS            53         109718
  --------------------------------------------
  Total                   1384         625168

  Note: SourceTokens = tree tokens


3. Annotation

The annotation guidelines are included in this package and can be found
under docs/. The guidelines were largely developed under the OntoNotes
effort, which was part of the DARPA GALE project, and were extended as part
of this BOLT effort to better cover the new data genres. The annotation data
is stored in data/.

The co-reference annotation was performed using the Callisto annotation
tool, developed at MITRE, with a configuration customized for the
co-reference task. Spans were created automatically for each noun phrase
found in the Treebank parse, and annotators linked those spans together to
create co-reference chains. The tool also allowed annotators to add spans
for verbs, pronouns, or proper pre-modifiers whenever they are co-referent
with a noun phrase.


4. File Format

Co-reference annotation output data is stored in an XML format with the
following structure: each line in a co-reference annotation output file
lists the tokens from one sentence in the underlying Treebank release. XML
brackets are wrapped around each mention of a coreferent entity, for
example:

  They report on just how difficult it *EXP* will be *T* *PRO* to regulate
  the amount of CO2 being released * .

The ID attribute tells which entity each mention belongs to. "IDENT"-type
tags are used for normal linguistic co-reference. "APPOS" tags are used to
mark the elements in an appositive construction, as described in the
guidelines.

During annotation, speaker labels that are not part of the Treebank layer
were still provided to the annotators. If the annotators marked a chain of
mentions as being co-referent with one of the speakers, a SPEAKER attribute
on the COREF tag provides that information.
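For reference, entity chains can be recovered from these attributes with a
standard XML parser. The following is a minimal sketch, not part of the
release: it assumes each .coref file parses as a single well-formed XML
document and uses only the element and attribute names described above
(COREF, ID, TYPE); the exact structure should be checked against
docs/coref.dtd.

  # Minimal sketch: group mention strings by entity ID in one *.coref file.
  # Assumes the file is a single well-formed XML document; element and
  # attribute names follow the description above.
  import xml.etree.ElementTree as ET
  from collections import defaultdict

  def mentions_by_entity(coref_path):
      chains = defaultdict(list)
      for mention in ET.parse(coref_path).getroot().iter("COREF"):
          text = "".join(mention.itertext()).strip()
          chains[mention.get("ID")].append((mention.get("TYPE"), text))
      return chains

  # Usage (hypothetical file name):
  # for entity_id, mentions in mentions_by_entity("data/example.coref").items():
  #     print(entity_id, mentions)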
In a few cases, the co-referent mention is only a sub-portion of the token.
In such cases, an additional S-OFF (start offset) or E-OFF (end offset)
attribute is used to specify the character offset where the substring begins
or ends. Example of output:

  And Iran will never dare *PRO* nuke US , not even *PRO* using terrorists .

A few of the longer files were split into multiple subfiles to ease the
annotation.


5. Data Directory Structure

- data/*.coref: co-reference annotation data


6. Documentation

- docs/filelist.txt: the list of files showing the package structure
- docs/English_OntoNotes_guidelines_revised_10-11-07.pdf: annotation
  guidelines
- docs/coref.dtd: data format file


7. Data Validation and Sanity Check

- Validate the XML files against the DTD in docs/ (a minimal example appears
  at the end of this README)
- Verify that co-reference tokens match the tree tokens from the Treebank
  annotation
- Verify that .coref filename stems are consistent with the tree filename
  stems


8. Acknowledgements

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content
does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.

Stephen Grimes, Xuansong Li, and Stephanie Strassel from LDC contributed to
the co-reference data by drafting documentation, sanity-checking data,
specifying the data format, and streamlining the data release process.


9. Copyright Info

(c) 2012, 2013, 2014, 2015, 2020 Trustees of the University of Pennsylvania.


10. Contact Information

If you have questions about this data release, please contact the following
personnel:

  Lance Ramshaw
  Stephanie Strassel
  Xuansong Li

--------------------------------------------------------------------------
README Created August 27, 2015 by Lance Ramshaw and Xuansong Li
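
Validation example (referenced in section 7): a minimal sketch of the DTD
check, assuming lxml is installed and the package layout described above
(data/*.coref, docs/coref.dtd). The token and filename-stem checks also
require the corresponding Treebank release and are not shown.

  # Minimal sketch of the DTD check from section 7, run from the package root.
  import glob
  from lxml import etree

  dtd = etree.DTD(open("docs/coref.dtd"))
  for path in sorted(glob.glob("data/*.coref")):
      doc = etree.parse(path)
      if not dtd.validate(doc):
          print(path)
          print(dtd.error_log.filter_from_errors())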