Title: BOLT CTS CallFriend CallHome Mainland Mandarin Chinese Transcripts and Translations

Authors: Jennifer Tracey, Song Chen, Dana Delgado, Stephanie Strassel

1.0 Introduction

The DARPA BOLT (Broad Operational Language Translation) program is aimed at
developing technology that enables English speakers to retrieve and
understand information from informal foreign language sources including
chat, text messaging and spoken conversations. The Linguistic Data
Consortium (LDC) collected, transcribed, translated and annotated training,
development and evaluation data in English, Mandarin Chinese and Egyptian
Arabic to support BOLT.

This corpus consists of transcripts and translations of Mainland Mandarin
Chinese conversational telephone speech (CTS) data used in the BOLT Phase 3
Evaluation. The recordings were originally collected as part of the
CallHome and CallFriend projects and subsequently transcribed and
translated for BOLT, but have not been published previously outside of the
BOLT Program.

Audio recordings corresponding to the transcripts and translations in this
corpus are available in a separate release:

  LDC2025S04 BOLT CTS CallFriend and CallHome Mainland Mandarin Chinese Audio

2.0 Directory structure

 docs/README.txt         - this file

 data/{dev,eval,train}

The data in this package are organized by partition as used in the
BOLT program: development data (dev), training data (train) and
evaluation data (eval). Dev and eval data are further separated into
gold standard (gs) and first pass (1p); see the section "First Pass
and Gold Standard Translations" for more details. Throughout this
README, "su" refers to sentence/segment unit.

  /train/transcripts/<original_filename>.<transcript_lang>.su.xml
  /train/translation/<original_filename>.<translation_lang>.su.xml

and

  /{dev,eval}/1p/transcripts/<original_filename>.<transcript_lang>.su.xml
  /{dev,eval}/1p/translation/<original_filename>.<translation_lang>.su.xml

where
    <original_filename>      is the original CH/CF audio file name
    <transcript_lang>        is cmn which stands for Mandarin Chinese
    <translation_lang>       is eng which stands for English

and

/{dev,eval}/gs/transcripts/<original_filename>.<chunk>.<transcript_lang>.su.xml
/{dev,eval}/gs/translation/<original_filename>.<chunk>.<translation_lang>.su.xml

where
    <original_filename>      is the original CH/CF audio file name
    <chunk>                  is head, middle or tail
    <transcript_lang>        is cmn which stands for Mandarin Chinese
    <translation_lang>       is eng which stands for English

Each gold-standard file represents a chunk which is a part of the original
conversation (head, middle or tail). Multiple chunks may be drawn from the
same original conversation.
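
The naming scheme above can be split into its parts with a short script.
The following is a minimal sketch, not part of the release: the names
SuFile and parse_su_name are hypothetical, the file name "ma_0000" is a
made-up example, and the pattern assumes that <original_filename> itself
contains no periods.

```python
import re
from typing import NamedTuple, Optional

# Pattern inferred from the naming scheme in this section.
# Assumption: <original_filename> contains no "." characters.
SU_NAME = re.compile(
    r"^(?P<base>[^.]+)"                    # original CH/CF audio file name
    r"(?:\.(?P<chunk>head|middle|tail))?"  # present only for gold-standard files
    r"\.(?P<lang>cmn|eng)\.su\.xml$"       # cmn = transcript, eng = translation
)


class SuFile(NamedTuple):
    base: str
    chunk: Optional[str]  # None for train and first pass (1p) files
    lang: str


def parse_su_name(name: str) -> SuFile:
    """Split an su.xml file name into base name, chunk and language."""
    m = SU_NAME.match(name)
    if m is None:
        raise ValueError(f"not an su.xml file name: {name!r}")
    return SuFile(m["base"], m["chunk"], m["lang"])


# "ma_0000" is a made-up file name used only for illustration.
print(parse_su_name("ma_0000.cmn.su.xml"))
print(parse_su_name("ma_0000.head.eng.su.xml"))
```

A file whose chunk field is None comes from train or from a 1p directory;
a file with a chunk value comes from a gs directory.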

docs/
  filestats.tab   - inventory of files, including an SU count, token
                    count, word count, duration in seconds, and source
                    release for the corresponding audio for each file.
  BOLT_Phase3_Chinese_CTS_Transcription_guidelines_V1.6.pdf
                  - Transcription guidelines for Mandarin Chinese
  BOLT_Alternative_Translation_Guideline_Chinese_V1.2.pdf
                  - Gold standard translation guidelines for Mandarin Chinese 
  BOLT_Phase3_Chinese_CTS_translation_guidelines_V1.pdf
                  - First pass (1p) translation guidelines for Mandarin Chinese
  su-cts.dtd      - a DTD for CTS su.xml files


3.0 Data profile

This release includes Mandarin Chinese transcript files and English
translations.

The following table shows the data volume of this package.
 +-----------+----------+----------+------------+-----------+-----------+
 | partition |doc_count | su_count | src_ntoken | src_nword | eng_nword |
 +-----------+----------+----------+------------+-----------+-----------+
 |  dev      |   30     |   7,490  |   72,242   |   48,161  |   60,303  |
 +-----------+----------+----------+------------+-----------+-----------+
 |  eval     |   101    |  31,429  |  332,201   |  221,467  |  215,714  |
 +-----------+----------+----------+------------+-----------+-----------+
 |  train    |   170    | 113,149  | 1,189,415  |  792,943  |  880,202  |
 +-----------+----------+----------+------------+-----------+-----------+
 |  total    |   301    | 152,068  | 1,593,858  | 1,062,572 | 1,155,679 |
 +-----------+----------+----------+------------+-----------+-----------+

Note that source word counts (src_nword) are estimated based on an average
rate of 1.5 Chinese characters per word.  
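
The estimate can be reproduced from the table. The sketch below assumes
that src_ntoken counts Chinese characters and that the result is rounded
to the nearest integer; neither detail is stated explicitly in this
README, and the function name estimate_src_nword is hypothetical.

```python
CHARS_PER_WORD = 1.5  # average rate stated in this section


def estimate_src_nword(src_ntoken: int) -> int:
    """Estimate the source word count from the source token count.

    Assumption: tokens are Chinese characters and the estimate is
    rounded to the nearest integer.
    """
    return round(src_ntoken / CHARS_PER_WORD)


# src_ntoken values copied from the data volume table.
SRC_NTOKEN = {
    "dev": 72_242,
    "eval": 332_201,
    "train": 1_189_415,
    "total": 1_593_858,
}

for partition, ntoken in SRC_NTOKEN.items():
    print(partition, estimate_src_nword(ntoken))
```

Under these assumptions the sketch reproduces the src_nword column. The
total row is derived from the total token count, which is why it is one
higher than the sum of the three partition estimates.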

4.0 Data format

All transcript and translation documents are in XML format.

Note that the Speaker IDs are based on the side of the phone call.
For example: Speaker A and A1 are on the same side of the call, while
speaker B is on the other end.

See docs/su-cts.dtd for more details.
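
As an illustration of the speaker ID convention, the sketch below maps a
speaker ID to its side of the call. It assumes that every speaker ID
begins with the side letter (A or B), optionally followed by a number,
as in the example above; the function name call_side is hypothetical.

```python
def call_side(speaker_id: str) -> str:
    """Return the side of the call ("A" or "B") for a speaker ID.

    Assumption: the ID starts with the side letter, e.g. "A", "A1", "B".
    """
    side = speaker_id[:1].upper()
    if side not in ("A", "B"):
        raise ValueError(f"unexpected speaker ID: {speaker_id!r}")
    return side


print(call_side("A"), call_side("A1"), call_side("B"))
```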

5.0 Transcription

Transcribers used standard Simplified Chinese orthography for transcription.
See the transcription guidelines in docs/ for details.

5.1 Transcription Redaction

During transcription, annotators were instructed to indicate regions
that may contain potentially sensitive personal identifying information
(PII). In this package, transcripts in the dev and eval partitions, as
well as portions of the training partition, have been redacted for these
regions using the following method. The remaining training partition files
were transcribed and released prior to BOLT transcription, so redaction
was not applied to those transcripts.

 (1) Annotators segment and transcribe in the normal way, covering all
 content. In addition, they indicate regions that may contain sensitive
 PII.

 (2) Time stamps of segments that are marked as containing sensitive
 PII are adjusted if necessary, so that each segment covers the minimal
 region on the given channel necessary to contain just the PII, along
 with any text that is spoken continuously (within a single prosodic
 phrase) adjacent to the PII string.

 (3) In the transcript and translation files, the full text of each PII
 segment is replaced with the string "[redacted]".

5.2 Transcription markup

Below are the metacharacters used to indicate certain features in the
transcripts:

    - partial words             -
    - incomplete utterance      --
    - mispronounced words       +
    - speaker noise             {cough},{laugh},{lipsmack},{sneeze}
    - semi-intelligible speech  ((text))
    - unintelligible speech     (())
    - idiosyncratic words       *
    - foreign language          <foreign lang="language"> text </foreign>
      note: "language" can be any language or "unknown"
    - non-Putonghua dialect     <foreign lang="non-PTH"> text </foreign>
    - PII                       [redacted]
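
Users who need plain text may want to remove this markup. The following
is a minimal sketch, assuming the transcript text has already been
extracted from the XML as a plain string that contains the markup
literally as listed above. The function name strip_markup is
hypothetical and the sample string is made up. The single hyphen marking
partial words is left in place, because this README does not say on
which side of the word it appears.

```python
import re


def strip_markup(text: str) -> str:
    """Remove transcription metacharacters, keeping the spoken text."""
    text = re.sub(r"\{[a-z]+\}", " ", text)        # speaker noise, e.g. {laugh}
    text = re.sub(r"</?foreign[^>]*>", " ", text)  # drop tags, keep foreign text
    text = text.replace("(())", " ")               # unintelligible speech
    text = re.sub(r"\(\((.*?)\)\)", r"\1", text)   # keep semi-intelligible text
    text = text.replace("--", " ")                 # incomplete utterance marker
    text = re.sub(r"[+*]", "", text)               # mispronounced / idiosyncratic
    return " ".join(text.split())                  # normalize whitespace


# Made-up example string, not taken from the corpus.
sample = '{laugh} 我 +知道 ((他来)) (()) <foreign lang="English"> ok </foreign> 好 --'
print(strip_markup(sample))
```

The "[redacted]" string is left untouched so that redacted regions stay
visible in the cleaned text.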

6.0 Translation

Transcripts serve as the input for translation.

6.1 First Pass and Gold Standard Translations

Dev and Eval transcripts in this release include both first pass and gold
standard translation versions. After transcription was completed on the
full file, a first pass translation was done. After first pass translation
was completed, selected portions of the conversations were then subjected
to additional quality control in order to obtain gold standard
translation.

LDC performed gold standard translation quality control on selected
conversations in three QC iterations. The first iteration of additional QC
was done by junior bilingual annotators to correct translation errors. The
second iteration of QC was done by senior bilingual annotators to add
translation alternatives and correct any remaining errors. The third
iteration of QC was done by native English-speaking annotators to improve
target language use.

6.2 Translation mark-ups

Below are the translation tags that are used to mark certain features in the
target translation:

    - translation alternatives         [intended meaning | literal meaning]
    - correction of typo               =text
    - best guess translation           ((text))
    - partial word                     %pw
    - filled pause                     %text (closest English equivalent
                                       such as um, uh, etc.)
    - personal identifying information [redacted]

In addition, there are some mark-up features that occur in the transcription
of the source. Most of the transcription mark-ups reflect pronunciation
features and are not carried over into the translation.
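
Translation alternatives can be reduced to a single reading when only
one version of the text is needed. The sketch below assumes that each
alternative pair has exactly two parts separated by "|" and contains no
nested brackets; the function name resolve_alternatives and the sample
sentence are hypothetical.

```python
import re

# Matches "[intended | literal]". "[redacted]" has no "|" and is not matched.
ALT = re.compile(r"\[([^\[\]|]*)\|([^\[\]|]*)\]")


def resolve_alternatives(text: str, prefer: str = "intended") -> str:
    """Replace each alternative pair with the preferred reading."""
    if prefer not in ("intended", "literal"):
        raise ValueError("prefer must be 'intended' or 'literal'")
    group = 1 if prefer == "intended" else 2
    return ALT.sub(lambda m: m.group(group).strip(), text)


# Made-up example sentence, not taken from the corpus.
sample = "It is [very easy | a small dish] for [redacted]"
print(resolve_alternatives(sample))
print(resolve_alternatives(sample, prefer="literal"))
```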

7.0 Known Issues

The following 25 files included in this package have transcripts but no
translations:

ma_0834
ma_1289
ma_1755
ma_1887
ma_2187
ma_4011
ma_4030
ma_4034
ma_4041
ma_4050
ma_4066
ma_4085
ma_4097
ma_4103
ma_4151
ma_4201
ma_4203
ma_4204
ma_4268
ma_5606
ma_5807
ma_5813
ma_5817
ma_5837
ma_5923
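
A list like the one above can be checked against a local copy of the
package. The following is a minimal sketch, assuming the directory
layout and file suffixes described in section 2.0; the function name
untranslated is hypothetical.

```python
from pathlib import Path

SRC_SUFFIX = ".cmn.su.xml"  # transcript files
TGT_SUFFIX = ".eng.su.xml"  # translation files


def untranslated(transcripts_dir, translation_dir) -> list[str]:
    """Return base names that have a transcript but no translation."""
    src = {
        p.name[: -len(SRC_SUFFIX)]
        for p in Path(transcripts_dir).glob("*" + SRC_SUFFIX)
    }
    tgt = {
        p.name[: -len(TGT_SUFFIX)]
        for p in Path(translation_dir).glob("*" + TGT_SUFFIX)
    }
    return sorted(src - tgt)


# Example call (paths depend on where the package is unpacked):
# untranslated("data/train/transcripts", "data/train/translation")
```

For gold-standard directories the chunk label stays part of the base
name, so head, middle and tail files are compared separately.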

8.0 Sponsorship

This material is based upon work supported by the Defense Advanced Research
Projects Agency (DARPA) under Contract No. HR0011-11-C-0145. The content
does not necessarily reflect the position or the policy of the Government, 
and no official endorsement should be inferred.

9.0 Contact Information

If you have any questions about the data in this release, please
contact the following personnel at the LDC.

Song Chen <zhiyi@ldc.upenn.edu> - BOLT Manager
Dana Delgado <foredana@ldc.upenn.edu> - BOLT Coordinator
Stephanie Strassel <strassel@ldc.upenn.edu> - BOLT PI

-----------
README created by Song Chen on May 14, 2024
       updated by Song Chen on July 11, 2024
       updated by Stephanie Strassel on August 1, 2024