Title: BOLT CTS CallFriend CallHome Mainland Mandarin Chinese Transcripts and Translations

Authors: Jennifer Tracey, Song Chen, Dana Delgado, Stephanie Strassel

1.0 Introduction

The DARPA BOLT (Broad Operational Language Translation) program is aimed
at developing technology that enables English speakers to retrieve and
understand information from informal foreign language sources including
chat, text messaging and spoken conversations. Linguistic Data Consortium
(LDC) collected, transcribed, translated and annotated training,
development and evaluation data in English, Mandarin Chinese and Egyptian
Arabic to support BOLT.

This corpus consists of transcripts and translations of Mainland Mandarin
Chinese conversational telephone speech (CTS) data used in the BOLT
Phase 3 Evaluation. The recordings were originally collected as part of
the CallHome and CallFriend projects and subsequently transcribed and
translated for BOLT, but have not been published previously outside of
the BOLT program.

Audio recordings corresponding to the transcripts and translations in
this corpus are available in a separate release:

  LDC2025S04 BOLT CTS CallFriend and CallHome Mainland Mandarin Chinese Audio

2.0 Directory structure

docs/README.txt - this file
data/{dev,eval,train}

The data in this package are organized by partition as used by the BOLT
program: development data (dev), training data (train) and evaluation
data (eval). Dev and eval data are further separated into gold standard
(gs) and first pass (1p); see section 6.1 for more details.

Throughout this README, "su" refers to sentence/segment unit.

  /train/transcripts/..su.xml
  /train/translation/..su.xml

and

  /{dev,eval}/1p/transcripts/..su.xml
  /{dev,eval}/1p/translation/..su.xml

where the file name fields are:

  - the original CH/CF audio file name
  - cmn, which stands for Mandarin Chinese
  - eng, which stands for English

and

  /{dev,eval}/gs/transcripts/...su.xml
  /{dev,eval}/gs/translation/...su.xml

where the file name fields are:

  - the original CH/CF audio file name
  - head, middle or tail
  - cmn, which stands for Mandarin Chinese
  - eng, which stands for English

Each gold standard file represents a chunk, which is one part of the
original conversation (head, middle or tail). Multiple chunks may be
drawn from the same original conversation.

docs/

  filestats.tab
      - inventory of files, including an SU count, token count, word
        count, duration in seconds, and source release for the
        corresponding audio for each file
  BOLT_Phase3_Chinese_CTS_Transcription_guidelines_V1.6.pdf
      - transcription guidelines for Mandarin Chinese
  BOLT_Alternative_Translation_Guideline_Chinese_V1.2.pdf
      - gold standard translation guidelines for Mandarin Chinese
  BOLT_Phase3_Chinese_CTS_translation_guidelines_V1.pdf
      - first pass (1p) translation guidelines for Mandarin Chinese
  su-cts.dtd
      - a DTD for CTS su.xml files

3.0 Data profile

This release includes Mandarin Chinese transcript files and English
translations. The following table shows the data volume of this package.

+-----------+-----------+----------+------------+-----------+-----------+
| partition | doc_count | su_count | src_ntoken | src_nword | eng_nword |
+-----------+-----------+----------+------------+-----------+-----------+
| dev       |        30 |    7,490 |     72,242 |    48,161 |    60,303 |
+-----------+-----------+----------+------------+-----------+-----------+
| eval      |       101 |   31,429 |    332,201 |   221,467 |   215,714 |
+-----------+-----------+----------+------------+-----------+-----------+
| train     |       170 |  113,149 |  1,189,415 |   792,943 |   880,202 |
+-----------+-----------+----------+------------+-----------+-----------+
| total     |       301 |  152,068 |  1,593,858 | 1,062,572 | 1,155,679 |
+-----------+-----------+----------+------------+-----------+-----------+

Note that source word counts (src_nword) are estimates, obtained by
dividing the source token count (src_ntoken) by an average rate of 1.5
Chinese characters per word.

4.0 Data format

All transcript and translation documents are in XML format. Note that the
speaker IDs are based on the side of the phone call. For example,
speakers A and A1 are on the same side of the call, while speaker B is on
the other end. See docs/su-cts.dtd for more details.

5.0 Transcription

Transcribers used standard Simplified Chinese orthography for
transcription. See the transcription guidelines in /docs for details.

5.1 Transcription Redaction

During transcription, annotators were instructed to indicate regions that
may contain potentially sensitive personal identifying information (PII).
In this package, transcripts in the dev and eval partitions, as well as
portions of the training partition, have been redacted for these regions
using the method below. The remainder of the training partition files
were transcribed and released prior to BOLT transcription, so redaction
was not applied to those transcripts.

(1) Annotators segment and transcribe in the normal way, covering all
    content. In addition, they indicate regions that may contain
    sensitive PII.

(2) Time stamps of segments that are marked as containing sensitive PII
    are adjusted if necessary, so that each segment covers the minimal
    region on the given channel necessary to contain just the PII, along
    with any text that is spoken continuously (within a single prosodic
    phrase) adjacent to the PII string.

(3) In the transcript and translation files, the full text of each PII
    segment is replaced with the string "[redacted]".

5.2 Transcription markup

Below are the metacharacters used to indicate certain features in the
transcripts:

  - partial words               -
  - incomplete utterance        --
  - mispronounced words         +
  - speaker noise               {cough},{laugh},{lipsmack},{sneeze}
  - semi-intelligible speech    ((text))
  - unintelligible speech       (())
  - idiosyncratic words         *
  - foreign language text
      note: "language" can be any language or "unknown"
  - non-Putonghua dialect text
  - PII                         [redacted]

6.0 Translation

Transcripts serve as the input for translation.

6.1 First Pass and Gold Standard Translations

Dev and eval transcripts in this release include both first pass and gold
standard translation versions. After transcription was completed on the
full file, a first pass translation was done. Selected portions of the
conversations were then subjected to additional quality control (QC) in
order to obtain gold standard translations.

LDC performed gold standard translation quality control on selected
conversations in three QC iterations:

  - The first iteration was done by junior bilingual annotators to
    correct translation errors.
  - The second iteration was done by senior bilingual annotators to add
    translation alternatives and correct any remaining errors.
  - The third iteration was done by native English-speaking annotators
    to improve target language use.

6.2 Translation mark-ups

Below are the translation tags that are used to mark certain features in
the target translation:

  - translation alternatives          [intended meaning | literal meaning]
  - correction of typo                =text
  - best guess translation            ((text))
  - partial word                      %pw
  - filled pause                      %text (closest English equivalent
                                      such as um, uh, etc.)
  - personal identifying information  [redacted]

In addition, there are some mark-up features that occur in the
transcription of the source. Most of the transcription mark-ups reflect
pronunciation features and are not carried over into the translation.

7.0 Known Issues

25 files included in this package have transcripts but do not have
translations:

  ma_0834 ma_1289 ma_1755 ma_1887 ma_2187
  ma_4011 ma_4030 ma_4034 ma_4041 ma_4050
  ma_4066 ma_4085 ma_4097 ma_4103 ma_4151
  ma_4201 ma_4203 ma_4204 ma_4268 ma_5606
  ma_5807 ma_5813 ma_5817 ma_5837 ma_5923

8.0 Sponsorship

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The
content does not necessarily reflect the position or the policy of the
Government, and no official endorsement should be inferred.

9.0 Contact Information

If you have any questions about the data in this release, please contact
the following personnel at the LDC.

  Song Chen - BOLT Manager
  Dana Delgado - BOLT Coordinator
  Stephanie Strassel - BOLT PI

-----------
README created by Song Chen on May 14, 2024
       updated by Song Chen on July 11, 2024
       updated by Stephanie Strassel on August 1, 2024
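
Example: normalizing translation mark-up

Users who want plain English text from the translation files may need to
resolve the tags listed under "Translation mark-ups". The Python sketch
below is one possible way to do that on a single SU string. It is not
part of this release. The function name normalize_translation and the
regular expressions are assumptions based only on the tag list in this
README; the exact surface forms in the data should be checked against the
translation guidelines in docs/. The "=text" typo correction tag is left
untouched because its exact form is not specified here.

```python
import re

# Assumed surface forms, taken from the tag list in this README:
#   [intended meaning | literal meaning], ((best guess)), %pw, %um, [redacted]
ALTERNATIVE = re.compile(r"\[([^\[\]|]*)\|([^\[\]]*)\]")
BEST_GUESS = re.compile(r"\(\((.*?)\)\)")
PARTIAL_WORD = re.compile(r"%pw\b")
FILLED_PAUSE = re.compile(r"%(\w+)")


def normalize_translation(text: str) -> str:
    """Return a plain-text rendering of one translated SU (hypothetical helper)."""
    # Keep the first (intended) alternative. "[redacted]" has no "|" and is untouched.
    text = ALTERNATIVE.sub(lambda m: m.group(1).strip(), text)
    # Unwrap best guess translations, keeping the guessed words.
    text = BEST_GUESS.sub(r"\1", text)
    # Drop partial word markers first, so "%pw" is not treated as a filled pause.
    text = PARTIAL_WORD.sub("", text)
    # Keep the filled pause itself but remove the leading "%".
    text = FILLED_PAUSE.sub(r"\1", text)
    # Collapse whitespace left behind by removed tokens.
    return " ".join(text.split())


if __name__ == "__main__":
    sample = "%um I ((think)) he [died | kicked the bucket] %pw yesterday"
    print(normalize_translation(sample))
```

Keeping the intended meaning is only one choice; a system that prefers
literal renderings could return the second alternative instead, and a
scoring setup might keep both as separate references.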