BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations

Item Name:	BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations
Author(s):	Jennifer Tracey, Song Chen, Dana Delgado, Stephanie Strassel
LDC Catalog No.:	LDC2025T05
ISLRN:	075-534-579-254-4
DOI:	https://doi.org/10.35111/jsx-25t05
Release Date:	May 15, 2025
Member Year(s):	2025
DCMI Type(s):	Text
Data Source(s):	telephone conversations, transcribed speech
Project(s):	BOLT
Application(s):	cross-lingual information retrieval, information retrieval, machine translation, speaker identification
Language(s):	English, Mandarin Chinese
Language ID(s):	eng, cmn
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2025T05 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Tracey, Jennifer, et al. BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations LDC2025T05. Web Download. Philadelphia: Linguistic Data Consortium, 2025.
Related Works:
isAnnotationOf
	LDC2025S04 BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Audio
hasAnnotation
	LDC2016T13 Chinese Treebank 9.0
	LDC2020T15 BOLT Chinese-English Word Alignment and Tagging -- Conversational Telephone Speech Training
	LDC2021T07 BOLT Chinese Co-reference -- Discussion Forum, SMS/Chat, and Conversational Telephone Speech
isSimilarWith
	LDC2016T19 BOLT Chinese-English Word Alignment and Tagging -- Discussion Forum Training
	LDC2019T13 BOLT Chinese-English Word Alignment and Tagging -- SMS/Chat Training
	LDC2021T11 BOLT Chinese SMS/Chat Parallel Training Data
	LDC2025T14 BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations

Introduction

BOLT CTS CALLFRIEND CALLHOME Mainland Mandarin Chinese Transcripts and Translations was developed by the Linguistic Data Consortium (LDC) and consists of transcripts and their corresponding English translations for 93 hours of conversational telephone speech between native speakers of the Mandarin Chinese dialect spoken in mainland China.

The DARPA BOLT (Broad Operational Language Translation) program developed machine translation and information retrieval for less formal genres, focusing particularly on user-generated content. LDC supported the BOLT program by collecting informal data sources -- discussion forums, conversational telephone speech, text messaging and chat -- in Chinese, Egyptian Arabic and English. The telephone data was transcribed, translated and annotated for various tasks including word alignment, treebanking, and co-reference.

Data

The source audio recordings consist of 236 telephone conversations taken from LDC's multilingual CALLFRIEND and CALLHOME series, which were developed to support speech recognition and language identification technology development.

Transcribers were required to produce a verbatim transcript of all speech within a file using simplified Chinese orthography and to add minimal markup to capture salient features of the speech. Some transcripts include redactions for potential personally identifying information. Further information about the transcription methodology is contained in the transcription guidelines accompanying this release. All speech data was transcribed.

The goal of the BOLT translation task was to translate the Chinese transcripts into fluent English while preserving the meaning present in the original Chinese text. Transcripts in the development and evaluation partitions received first pass and gold standard translations. Further information about the translation methodology is contained in the translation guidelines accompanying this release. 89% of the transcripts were translated into English.

The transcripts are divided into training, development and evaluation partitions as follows:

partition	doc count	su count	src ntoken	src nword	eng nword
train	30	7,490	72,242	48,161	60,303
dev	101	31,429	332,201	221,467	215,174
eval	170	113,149	1,189,415	792,943	880,202
total	301	152,068	1,593,858	1,062,572	1,155,679
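
As a quick sanity check on the partition sizes, the following minimal sketch recomputes the totals for the first three numeric columns and reports each partition's percentage share. The figures are copied from the table above; the dictionary layout and the function names (`column_totals`, `shares`) are illustrative assumptions and are not part of the release.

```python
# Figures copied from the partition table (doc count, su count, src ntoken only).
PARTITIONS = {
    "train": {"docs": 30, "su": 7_490, "src_ntoken": 72_242},
    "dev": {"docs": 101, "su": 31_429, "src_ntoken": 332_201},
    "eval": {"docs": 170, "su": 113_149, "src_ntoken": 1_189_415},
}
# The "total" row as printed in the table.
PUBLISHED_TOTAL = {"docs": 301, "su": 152_068, "src_ntoken": 1_593_858}


def column_totals(partitions):
    """Sum each column across all partitions."""
    totals = {}
    for row in partitions.values():
        for column, value in row.items():
            totals[column] = totals.get(column, 0) + value
    return totals


def shares(partitions, column):
    """Return each partition's percentage share of one column, rounded to 0.1."""
    total = sum(row[column] for row in partitions.values())
    return {
        name: round(100 * row[column] / total, 1)
        for name, row in partitions.items()
    }


if __name__ == "__main__":
    # The recomputed sums should agree with the printed total row.
    print(column_totals(PARTITIONS) == PUBLISHED_TOTAL)
    for column in PUBLISHED_TOTAL:
        print(column, shares(PARTITIONS, column))
```

By every one of these three measures the eval partition is the largest and the train partition the smallest, which is worth keeping in mind when planning experiments on this release.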

Transcripts and translations are presented in XML format, UTF-8 encoded.
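
This page does not describe the XML schema, so a reasonable first step with the files is to list which element tags they actually contain. The sketch below does that with Python's standard library only. It makes no assumption about element names in the release; the function names, the directory argument, and the tags in the in-memory sample (`doc`, `seg`) are hypothetical and serve only to show the output shape.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def tag_inventory(xml_data):
    """Count how often each element tag occurs in one XML document.

    Accepts str or bytes; with bytes, the parser honors the document's
    own encoding declaration (UTF-8 for this release, per the page).
    """
    root = ET.fromstring(xml_data)
    return Counter(element.tag for element in root.iter())


def inventory_directory(directory):
    """Aggregate tag counts over every .xml file below a directory."""
    counts = Counter()
    for path in sorted(Path(directory).rglob("*.xml")):
        counts.update(tag_inventory(path.read_bytes()))
    return counts


if __name__ == "__main__":
    # Hypothetical sample; the real files may use entirely different tags.
    sample = "<doc><seg>你好</seg><seg>hello</seg></doc>"
    for tag, count in sorted(tag_inventory(sample).items()):
        print(tag, count)
```

Running `inventory_directory` on the unpacked release (the path is whatever location you extracted it to) gives a tag list that can then be checked against the documentation accompanying the release before writing a real parser.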

Acknowledgement

This material is based upon work supported by the Defense Advanced Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content does not necessarily reflect the position or the policy of the Government, and no official endorsement should be inferred.

Updates

No updates at this time.
