BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech

Item Name: BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
Author(s): Nitin Agarwal, Michelle Franchini, Michelle Kappler, Linnea Micciulla, Sameer Pradhan, Lance Ramshaw
LDC Catalog No.: LDC2020T20
ISBN: 1-58563-951-6
ISLRN: 494-155-932-422-8
DOI: https://doi.org/10.35111/8wq1-d250
Release Date: December 15, 2020
Member Year(s): 2020
DCMI Type(s): Text
Data Source(s): discussion forum, telephone conversations, text chat conversations
Project(s): BOLT
Application(s): coreference resolution
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2020T20 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Agarwal, Nitin, et al. BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech LDC2020T20. Web Download. Philadelphia: Linguistic Data Consortium, 2020.

Introduction

BOLT English Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation of English discussion forum (DF), SMS/Chat, and conversational telephone speech (CTS) data.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. The Linguistic Data Consortium (LDC) supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.

Data

DF data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. CTS data was taken from LDC's Arabic and Chinese CALLHOME and CALLFRIEND telephone collections; the audio files were transcribed and translated into English.

Co-reference annotation aims to link all specific mentions in a text that refer to the same entities and events in the discourse context. BOLT co-reference annotation was performed on top of the BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers and verbs.

Annotation files are presented in UTF-8 encoded XML format.
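
Because the element names used in the annotation files are not described here, a reasonable first step is to inspect a file's structure before writing any extraction code. The following is a minimal sketch in Python that assumes only that a file is well-formed, UTF-8 encoded XML. The function name tally_tags and the tag names in the stand-in document are hypothetical and are not the corpus schema.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from io import BytesIO


def tally_tags(source):
    """Count how often each element tag appears in one XML document.

    `source` may be a file path or a binary file-like object.
    """
    counts = Counter()
    # iterparse streams the document instead of loading it all at once.
    for _event, elem in ET.iterparse(source, events=("end",)):
        counts[elem.tag] += 1
        elem.clear()  # children were already counted on their own "end" events
    return counts


# Hypothetical stand-in document; these tag names are NOT the corpus schema.
sample = BytesIO(
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<doc><chain><mention>she</mention><mention>Ana</mention></chain></doc>'
    .encode("utf-8")
)
print(tally_tags(sample))
```

Passing the path of an annotation file in place of the stand-in object gives a quick inventory of the tags that actually occur, which can then guide a schema-specific reader.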

Acknowledgements

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Updates

None at this time.
