Title: BOLT CTS CallFriend And CallHome Mainland Mandarin Chinese Audio
Authors: David Graff, Song Chen, Stephanie Strassel

1.0 Introduction

The DARPA BOLT (Broad Operational Language Translation) program is aimed at
developing technology that enables English speakers to retrieve and
understand information from informal foreign language sources including
chat, text messaging and spoken conversations. Linguistic Data Consortium
collected, transcribed, translated and annotated training, development and
evaluation data in English, Mandarin Chinese and Egyptian Arabic to support
BOLT.

This corpus consists of Mainland Mandarin Chinese conversational telephone
speech (CTS) recordings used in the BOLT Phase 3 Evaluation. The data was
originally collected as part of the CallHome and CallFriend projects but
has not been published previously outside of the BOLT program.

Transcripts and translations for the audio recordings in this corpus are
available in a separate release: LDC2025T05 BOLT CTS CallFriend and
CallHome Mainland Mandarin Chinese Transcripts and Translation.

2.0 Directory structure

  docs/README.txt - this file
  data/{dev,eval,train}

The data in this package are organized by partition as used by the BOLT
program: development data (dev), training data (train) and evaluation
data (eval).

  /{train,dev,eval}/.flac
      where is the original CallHome/CallFriend audio file name

  docs/
      filestats.tab - inventory of files, including partition,
                      collection (CallFriend or CallHome), duration

3.0 Data profile

The following table shows the data volume for this package.

  +-----------+-------------+-------------+
  | partition | audio files | total hours |
  +-----------+-------------+-------------+
  | train     |         170 |       71.28 |
  +-----------+-------------+-------------+
  | dev       |          18 |        4.12 |
  +-----------+-------------+-------------+
  | eval      |          48 |       18.48 |
  +-----------+-------------+-------------+
  | total     |         236 |       93.88 |
  +-----------+-------------+-------------+

4.0 Data format

All audio files are presented in FLAC-compressed MS-WAV/RIFF format. Each
file contains two interleaved channels (representing the two independently
recorded sides of the telephone conversation) of 16-bit PCM samples
(converted from the original mu-law samples) at 8000 samples per second.

4.1 Echo Cancellation

Any audio files that contain cross-channel "echo" (caused by the public
telephone network on short-distance calls involving land-line phones) were
conditioned by a standard echo-cancellation process.

4.2 Audio Redaction

All calls in this package were transcribed as part of the BOLT program.
During transcription of most calls, annotators indicated regions containing
potentially sensitive personal identifying information (PII). In this
package, audio files in the dev and eval partitions, as well as portions of
the training data, have been redacted for these regions using the method
described below. The remaining training partition files did not have audio
redaction applied, since they were originally published prior to BOLT
transcription.

(1) During transcription, annotators segment and transcribe in the normal
    way, covering all content. In addition, they indicate regions that may
    contain sensitive PII.
(2) Time stamps of segments that are marked as containing sensitive PII
    are adjusted if necessary, so that they cover the minimal amount of
    signal on the given channel necessary to contain just the PII, along
    with any speech that is spoken continuously (within a single prosodic
    phrase) adjacent to the PII string.
(3) Based on the adjusted time stamps of these segments, the audio files
    are processed to replace those segments with silence; in particular,
    each sample value on the affected channel is set to a new value,
    selected at random from the set -2, -1, 1, 2.

With regard to this handling of the audio data, note that all audio files
were originally recorded as mu-law samples. When converting to 16-bit PCM
in the standard way, the relevant mu-law encoded values relate to integer
values as follows:

    mu-law   integer
    0xff         0
    0xfe         8
    0xfd        16
    ...
    0x7d       -16
    0x7e        -8
    0x7f         0

By replacing original values with randomized, non-zero values in the range
-2 .. 2, the redacted sensitive PII segments will be digitally distinct
from the rest of the data in each audio file, but will be effectively
indistinguishable from "total silence" in terms of typical signal analysis
measures (RMS energy, spectral coefficients, etc.), and will not induce any
undesirable side effects (zero-energy frame buffers, frequency-domain
artifacts, etc.).

5.0 Documentation included in this package

The ./docs directory (relative to the root directory of this package)
contains a tab-delimited table file, file_info.tab, which provides
additional information for each audio file included in the package. The
initial line of the file provides the column labels as shown below:

  Col.#  Content
  1.     lang - the language of the file (all files have the value of arb)
  2.     partition - the partition of the file (train | dev | eval)
  3.     file_id - the unique file ID
  4.     original_audio_package - the catalogue number of the package in
         which the audio file was originally released
  5.     collection - the collection that the file came from, in which CF
         stands for CallFriend and CH stands for CallHome
  6.     duration - the duration of the file in seconds

6.0 Sponsorship

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content
does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.

7.0 Contact Information

If you have any questions about the data in this release, please contact
the following personnel at the LDC.

  Song Chen - BOLT Manager
  Stephanie Strassel - BOLT PI

-----------
README created by Song Chen on May 14, 2024
       updated by Song Chen on June 13, 2024
       updated by Song Chen on July 11, 2024
       updated by Stephanie Strassel on August 1, 2024
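
Illustrative note on section 4.2: the Python sketch below shows one way
the redaction scheme could be checked. It decodes mu-law bytes with the
standard G.711 expansion, which reproduces the mapping listed in section
4.2, and then locates runs of the redaction values -2, -1, 1, 2 in a list
of decoded samples. The function names mulaw_to_pcm16 and
find_redacted_runs are hypothetical and are not part of this package. The
sketch assumes that the samples of one channel have already been read into
a list of integers, and that the file was not altered by echo cancellation
(section 4.1), which may introduce other small sample values.

```python
# Hypothetical helpers for illustration only; not part of the LDC package.

REDACTION_VALUES = frozenset({-2, -1, 1, 2})


def mulaw_to_pcm16(byte):
    """Decode one G.711 mu-law byte to a linear PCM integer."""
    u = ~byte & 0xFF                 # mu-law bytes are stored bit-inverted
    sign = u & 0x80                  # top bit set means a negative sample
    exponent = (u >> 4) & 0x07       # 3-bit segment number
    mantissa = u & 0x0F              # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def find_redacted_runs(samples, min_len=1):
    """Return (start, end) index pairs, end exclusive, for each run of
    consecutive samples whose values are all in REDACTION_VALUES."""
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value in REDACTION_VALUES:
            if start is None:
                start = i            # a new run begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Close a run that extends to the end of the sample list.
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples)))
    return runs
```

Because the smallest non-zero magnitude produced by mu-law decoding is 8,
a sample value of 1 or 2 (positive or negative) cannot come from an
unmodified mu-law recording, which is what makes the redacted regions
digitally distinct. Multiplying a run's start and end indices by 1/8000
gives its position in seconds at this package's sample rate.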