BOLT English Co-reference -- DF, SMS/Chat, and CTS

Raytheon BBN Technologies

Authors: Nitin Agarwal, Michelle Franchini, Michelle Kappler,
         Linnea Micciulla, Sameer Pradhan, Lance Ramshaw


1. Introduction

The DARPA BOLT Program created new techniques for automated translation and
linguistic analysis that can be applied to the informal genres of text and
speech common to online and in-person communications. The BOLT data team,
led by the Linguistic Data Consortium (LDC), was responsible for collecting
informal data sources, including discussion forums, text messaging and chat,
and conversational telephone speech, in English, Chinese, and Egyptian
Arabic, and for applying annotations including translation, word alignment,
Treebanking, PropBanking, co-reference, and queries/responses.

This corpus contains the data produced for the co-reference annotation task
led by Raytheon BBN Technologies. Co-reference annotation aims to fill in
all of the connections between specific mentions in the text that refer to
the same entities and events in the discourse context. Co-reference here is
limited to noun phrases (including proper nouns, nominals, pronouns, and
null arguments), possessives, proper noun pre-modifiers, and verbs.

Co-reference annotation for this release was performed on the BOLT Treebank
annotation. Tokens resulting from Treebank annotation (including empty
categories/traces) are used directly for co-reference annotation. The data
covers three genres: Discussion Forum (DF), SMS/Chat, and Conversational
Telephone Speech (CTS).


2. Annotation Data Profile

  Language   Genre       Files   SourceTokens
  --------------------------------------------
  English    DF            848         415159
  English    SMS/Chat      483         100291
  English    CTS            53         109718
  --------------------------------------------
  Total                   1384         625168

  Note: SourceTokens = tree tokens


3. Annotation

The annotation guidelines are included in this package and can be found
under docs/. The guidelines were largely developed under the OntoNotes
effort, which was part of the DARPA GALE project, and were extended as part
of this BOLT effort to better cover the new data genres. The annotation data
is stored in data/.

The co-reference annotation was performed using the Callisto annotation
tool, developed at MITRE, with a configuration customized for the
co-reference task. Spans were created automatically for each noun phrase
found in the Treebank parse, and annotators linked those spans together to
create co-reference chains. The tool also allowed annotators to add spans
for verbs, pronouns, or proper pre-modifiers whenever they are co-referent
with a noun phrase.


4. File Format

Co-reference annotation output data is stored in an XML format with the
following structure: each line in a co-reference annotation output file
lists the tokens from one sentence in the underlying Treebank release. XML
brackets are wrapped around each mention of a coreferent entity, for
example:

  They report on just how difficult it *EXP* will be *T* *PRO* to regulate
  the amount of CO2 being released * .

The ID attribute tells which entity each mention belongs to. "IDENT"-type
tags are used for normal linguistic co-reference. "APPOS" tags are used to
mark the elements in an appositive construction, as described in the
guidelines.

During annotation, speaker labels that are not part of the Treebank layer
were still provided to the annotators. If the annotators marked a chain of
mentions as being co-referent with one of the speakers, a SPEAKER attribute
on the COREF tag provides that information.
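For reference, entity chains can be recovered from these attributes with a
standard XML parser. The following is a minimal sketch, not part of the
release: it assumes each .coref file parses as a single well-formed XML
document and uses only the element and attribute names described above
(COREF, ID, TYPE); the exact structure should be checked against
docs/coref.dtd.

  # Minimal sketch: group mention strings by entity ID in one *.coref file.
  # Assumes the file is a single well-formed XML document; element and
  # attribute names follow the description above.
  import xml.etree.ElementTree as ET
  from collections import defaultdict

  def mentions_by_entity(coref_path):
      chains = defaultdict(list)
      for mention in ET.parse(coref_path).getroot().iter("COREF"):
          text = "".join(mention.itertext()).strip()
          chains[mention.get("ID")].append((mention.get("TYPE"), text))
      return chains

  # Usage (hypothetical file name):
  # for entity_id, mentions in mentions_by_entity("data/example.coref").items():
  #     print(entity_id, mentions)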
In a few cases, the co-referent mention is only a sub-portion of the token.
In such cases, an additional S-OFF (start offset) or E-OFF (end offset)
attribute is used to specify the character offset where the substring begins
or ends. Example of output:

  And Iran will never dare *PRO* nuke US , not even *PRO* using terrorists .

A few of the longer files were split into multiple subfiles to ease the
annotation.


5. Data Directory Structure

- data/*.coref: co-reference annotation data


6. Documentation

- docs/filelist.txt: the list of files showing the package structure
- docs/English_OntoNotes_guidelines_revised_10-11-07.pdf: annotation
  guidelines
- docs/coref.dtd: data format file


7. Data Validation and Sanity Check

- Validate the XML files against the DTD in docs/ (a minimal example appears
  at the end of this README)
- Verify that co-reference tokens match the tree tokens from the Treebank
  annotation
- Verify that .coref filename stems are consistent with the tree filename
  stems


8. Acknowledgements

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content
does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.

Stephen Grimes, Xuansong Li, and Stephanie Strassel from LDC contributed to
the co-reference data by drafting documentation, sanity-checking data,
specifying the data format, and streamlining the data release process.


9. Copyright Info

(c) 2012, 2013, 2014, 2015, 2020 Trustees of the University of Pennsylvania.


10. Contact Information

If you have questions about this data release, please contact the following
personnel:

  Lance Ramshaw
  Stephanie Strassel
  Xuansong Li

--------------------------------------------------------------------------
README Created August 27, 2015 by Lance Ramshaw and Xuansong Li
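
Validation example (referenced in section 7): a minimal sketch of the DTD
check, assuming lxml is installed and the package layout described above
(data/*.coref, docs/coref.dtd). The token and filename-stem checks also
require the corresponding Treebank release and are not shown.

  # Minimal sketch of the DTD check from section 7, run from the package root.
  import glob
  from lxml import etree

  dtd = etree.DTD(open("docs/coref.dtd"))
  for path in sorted(glob.glob("data/*.coref")):
      doc = etree.parse(path)
      if not dtd.validate(doc):
          print(path)
          print(dtd.error_log.filter_from_errors())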