BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
| Item Name: | BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech |
| Author(s): | Nitin Agarwal, Michelle Francini, Michelle Kappler, Linnea Micciulla, Sameer Pradhan, Lance Ramshaw |
| LDC Catalog No.: | LDC2021T14 |
| ISBN: | 1-58563-969-9 |
| ISLRN: | 176-795-802-758-5 |
| DOI: | https://doi.org/10.35111/4z7r-vh07 |
| Release Date: | July 15, 2021 |
| Member Year(s): | 2021 |
| DCMI Type(s): | Text |
| Data Source(s): | discussion forum, telephone conversations, text chat conversations |
| Project(s): | BOLT |
| Application(s): | coreference resolution |
| Language(s): | Egyptian Arabic |
| Language ID(s): | arz |
| License(s): | LDC User Agreement for Non-Members |
| Online Documentation: | LDC2021T14 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Agarwal, Nitin, et al. BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech LDC2021T14. Web Download. Philadelphia: Linguistic Data Consortium, 2021. |
Introduction
BOLT Egyptian Arabic Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech was developed by Raytheon BBN Technologies and consists of co-reference annotation of Egyptian Arabic discussion forum (DF), SMS/Chat and conversational telephone speech (CTS) data.
The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. The Linguistic Data Consortium (LDC) supported the BOLT program by collecting informal data sources -- discussion forums, text messaging and chat -- in Chinese, Egyptian Arabic and English. The collected data was translated and annotated for various tasks including word alignment, treebanking, propbanking and co-reference.
Data
DF data was collected from the web using a combination of manual and automatic processes. SMS/Chat material was donated or collected via live platforms. CTS data was taken from LDC's Egyptian Arabic CALLHOME and CALLFRIEND telephone collections.
Co-reference annotation aims to link all mentions in the text that refer to the same entity or event in the discourse context. BOLT co-reference annotation was performed on top of the BOLT treebank annotation. It covers noun phrases (including proper nouns, nominals, pronouns and null arguments), possessives, proper noun pre-modifiers and verbs.
Annotation files are presented in UTF-8 encoded XML format.
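As a rough illustration of how co-reference annotation in XML can be consumed, the Python sketch below groups mentions into chains, where each chain collects the mentions that refer to the same entity or event. The element name `mention`, the attributes `chain` and `type`, and the sample text are all hypothetical and were invented for this example; they are not the schema used in this release, which is described in the corpus documentation.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Invented sample document. The element and attribute names are
# assumptions for illustration only, NOT the actual corpus schema.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<document id="example">
  <mention chain="1" type="proper">منى</mention>
  <mention chain="2" type="nominal">الكتاب</mention>
  <mention chain="1" type="pronoun">هي</mention>
</document>
"""


def group_mentions(xml_text):
    """Return a dict mapping chain id -> list of (type, text) mentions."""
    # Parse from UTF-8 bytes so the encoding declaration is honored.
    root = ET.fromstring(xml_text.encode("utf-8"))
    chains = defaultdict(list)
    for mention in root.iter("mention"):
        chains[mention.get("chain")].append((mention.get("type"), mention.text))
    return dict(chains)


chains = group_mentions(SAMPLE)
for chain_id, mentions in sorted(chains.items()):
    print(chain_id, [mention_type for mention_type, _ in mentions])
```

In this made-up sample, the proper noun and the pronoun share chain `1`, so they are treated as referring to the same entity, while the nominal forms its own chain.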
Sponsorship
This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.
Updates
None at this time.