Title: BOLT CTS CallFriend And CallHome Mainland Mandarin Chinese Audio
Authors: David Graff, Song Chen, Stephanie Strassel

1.0 Introduction

The DARPA BOLT (Broad Operational Language Translation) program is aimed at
developing technology that enables English speakers to retrieve and
understand information from informal foreign language sources including
chat, text messaging and spoken conversations. Linguistic Data Consortium
collected, transcribed, translated and annotated training, development and
evaluation data in English, Mandarin Chinese and Egyptian Arabic to support
BOLT. 

This corpus consists of Mainland Mandarin Chinese conversational telephone
speech (CTS) recordings used in the BOLT Phase 3 Evaluation. The data was
originally collected as part of the CallHome and CallFriend projects but
has not been published previously outside of the BOLT program. Transcripts
and translations for the audio recordings in this corpus are available in a
separate release:

  LDC2025T05 BOLT CTS CallFriend and CallHome Mainland Mandarin Chinese
	     Transcripts and Translation.

2.0 Directory structure

 docs/README.txt         - this file

 data/{dev,eval,train}

The data in this package are organized by partition, as used in the BOLT
program: development data (dev), training data (train) and evaluation data
(eval).

  data/{train,dev,eval}/<original_filename>.flac

where
    <original_filename> is the original CallHome/CallFriend audio file name

 docs/
    filestats.tab   - inventory of files, including partition, collection
                      (CallFriend or CallHome) and duration

3.0 Data profile

The following table shows the data volume for this package.

   +-----------+-------------+-------------+
   | partition | audio files | total hours |
   +-----------+-------------+-------------+
   | train     |    170      |    71.28    |
   +-----------+-------------+-------------+
   | dev       |     18      |     4.12    |
   +-----------+-------------+-------------+
   | eval      |     48      |    18.48    |
   +-----------+-------------+-------------+
   | total     |    236      |    93.88    |
   +-----------+-------------+-------------+

4.0 Data format

All audio files are presented in FLAC-compressed MS-WAV/RIFF format. Each
file contains two interleaved channels (representing the two independently
recorded sides of the telephone conversation) of 16-bit PCM samples
(converted from the original mu-law samples) at 8000 samples per second.
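
As an illustration of the channel layout described above, the following
Python sketch separates the two sides of a conversation. It is not part of
the package and rests on assumptions: the FLAC file has already been decoded
to MS-WAV by an external decoder (the Python standard library does not read
FLAC), and the function name split_channels is hypothetical.

```python
import sys
import wave
from array import array


def split_channels(wav_file):
    """Return (side_a, side_b) as arrays of signed 16-bit samples.

    wav_file is a path or a binary file object holding a decoded WAV file.
    """
    with wave.open(wav_file, "rb") as w:
        if w.getnchannels() != 2 or w.getsampwidth() != 2:
            raise ValueError("expected two-channel, 16-bit audio")
        frames = w.readframes(w.getnframes())

    samples = array("h")
    samples.frombytes(frames)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Interleaved layout: even indices are channel 1, odd indices channel 2.
    return samples[0::2], samples[1::2]
```

At 8000 samples per second, the duration of a call in seconds is the length
of either returned array divided by 8000.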

4.1 Echo Cancellation

Any audio files that happen to contain cross-channel "echo" (caused by the
public telephone network on short-distance calls involving land-line phones)
were conditioned by a standard echo-cancellation process.

4.2 Audio Redaction

All calls in this package were transcribed as part of the BOLT program.
During transcription of most calls, annotators indicated regions containing
potentially sensitive personal identifying information (PII). In this
package, audio files in the dev and eval partitions as well as portions of
the training data have been redacted for these regions using the following
method. The remainder of the training partition files did not have audio
redaction applied since they were originally published prior to BOLT
transcription.

(1) During transcription, annotators segment and transcribe in the normal way,
covering all content. In addition, they indicate regions that may contain
sensitive PII.

(2) Time stamps of segments that are marked as containing sensitive PII are
adjusted if necessary, so that they cover the minimal amount of signal on
the given channel necessary to contain just the PII, along with any speech
that is spoken continuously (within a single prosodic phrase) adjacent to
the PII string.

(3) Based on the adjusted time stamps of these segments, the audio files are
processed to replace those segments with silence; in particular, each sample
value on the affected channel will be set to a new value, selected at random
from the set -2, -1, 1, 2.
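
Step (3) can be sketched in Python as follows. This is a minimal
illustration, not the tool actually used to prepare the package: the
function name redact, the rounding of time stamps to sample indices, and the
choice of random number generator are all assumptions.

```python
import random
from array import array

# The four replacement values named in step (3).
REDACTION_VALUES = (-2, -1, 1, 2)


def redact(samples, start_sec, end_sec, rate=8000, rng=random):
    """Overwrite one channel's samples in [start_sec, end_sec) in place.

    samples is a mutable sequence of 16-bit sample values for a single
    channel; the other channel of the call is left untouched.
    """
    first = max(0, int(round(start_sec * rate)))
    last = min(len(samples), int(round(end_sec * rate)))
    for i in range(first, last):
        samples[i] = rng.choice(REDACTION_VALUES)
    return samples


# Example: redact 0.5 s to 1.0 s of a two-second channel.
channel = array("h", [1000] * 16000)
redact(channel, 0.5, 1.0)
```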

With regard to this handling of the audio data, note that all audio files
were originally recorded as mu-law samples.  When converting to 16-bit PCM
in the standard way, the relevant mu-law encoded values relate to integer
values as follows:

 mu-law integer
 0xff   0
 0xfe   8
 0xfd   16
 ...
 0x7d   -16
 0x7e   -8
 0x7f   0
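
The mapping above can be reproduced with the standard G.711 mu-law
expansion. The Python sketch below is illustrative only (the function name
is hypothetical). It also shows that the smallest non-zero magnitude
produced by mu-law decoding is 8, so the values -2, -1, 1 and 2 never occur
in samples decoded directly from mu-law.

```python
def ulaw_to_linear(code):
    """Expand one 8-bit mu-law code to a signed 16-bit linear value."""
    code = ~code & 0xFF            # mu-law codes are stored bit-inverted
    sign = code & 0x80
    exponent = (code >> 4) & 0x07
    mantissa = code & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


decoded = [ulaw_to_linear(c) for c in range(256)]
smallest_nonzero = min(abs(v) for v in decoded if v != 0)
```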

By replacing original values with randomized, non-zero values in the range
-2 .. 2, the redacted sensitive PII segments will be digitally distinct
from the rest of the data in each audio file, but will be effectively
indistinguishable from "total silence" in terms of typical signal analysis
measures (RMS energy, spectral coefficients, etc.), and will not induce any
undesirable side-effects (zero-energy frame buffers, frequency-domain
artifacts, etc.).
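
Because redacted regions are digitally distinct, they can be located by
scanning one channel for long runs of samples drawn only from -2, -1, 1
and 2. The Python sketch below is one possible approach, not a tool shipped
with the package. The function name and the minimum run length (800
samples, i.e. 0.1 second) are assumptions; the threshold is a heuristic
meant to skip isolated low-valued samples that other processing, such as
echo cancellation, might leave behind.

```python
REDACTION_VALUES = frozenset((-2, -1, 1, 2))


def find_redacted_runs(samples, min_len=800):
    """Return (start, end) sample-index pairs for likely redacted regions.

    samples holds the 16-bit values of a single channel; end is exclusive.
    """
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value in REDACTION_VALUES:
            if start is None:
                start = i          # a candidate run begins here
        elif start is not None:
            if i - start >= min_len:
                runs.append((start, i))
            start = None
    # Handle a run that extends to the end of the channel.
    if start is not None and len(samples) - start >= min_len:
        runs.append((start, len(samples)))
    return runs
```

Dividing the returned indices by 8000 gives the region boundaries in
seconds.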

5.0 Documentation included in this package

The ./docs directory (relative to the root directory of this package)
contains a tab-delimited table file, file_info.tab, that provides additional
information for each audio file included in the package.

The initial line of the file provides the column labels, as shown below:

 Col.#  Content
 1.     lang    - the language of the file (all files have the value of arb)
 2.     partition       - the partition of the file (train | dev | eval)
 3.     file_id - the unique file ID
 4.     original_audio_package  - the catalogue number of the package in
                                  which the audio file was originally
                                  released
 5.     collection      - the collection that the file came from, in which
                          CF stands for CallFriend and CH stands for CallHome
 6.     duration        - the duration of the file in seconds
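
As an illustration, the table can be summarized per partition with the
Python standard library. This sketch assumes that the header line uses
exactly the column labels listed above; the function name is hypothetical.

```python
import csv
from collections import defaultdict


def hours_by_partition(lines):
    """Map each partition to (file count, total hours).

    lines is an open text file, or any iterable of lines, whose first
    line holds the tab-delimited column labels.
    """
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for row in csv.DictReader(lines, delimiter="\t"):
        counts[row["partition"]] += 1
        seconds[row["partition"]] += float(row["duration"])
    return {p: (counts[p], seconds[p] / 3600.0) for p in counts}
```

To use it, open the table in text mode with newline="" and pass the file
object to hours_by_partition; the result can be compared against the data
volume table in section 3.0.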

6.0 Sponsorship

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content
does not necessarily reflect the position or the policy of the Government,
and no official endorsement should be inferred.

7.0 Contact Information

If you have any questions about the data in this release, please contact
the following personnel at the LDC.

Song Chen <zhiyi@ldc.upenn.edu> - BOLT Manager
Stephanie Strassel <strassel@ldc.upenn.edu> - BOLT PI


-----------
README created by Song Chen on May 14, 2024
       updated by Song Chen on June 13, 2024
       updated by Song Chen on July 11, 2024
       updated by Stephanie Strassel on August 1, 2024