CALLHOME Japanese Omnibus Release

                               October 17, 2023

                          Linguistic Data Consortium


1. Overview
===========
This is an updated release of the CALLHOME Japanese corpus. The original
CALLHOME corpus was collected and transcribed by the Linguistic Data
Consortium primarily in support of the project on Large Vocabulary
Conversational Speech Recognition (LVCSR), sponsored by the U.S. Department
of Defense.

This re-release combines the original CALLHOME Japanese Speech (LDC96S37) and
Transcripts (LDC96T18) corpora, and updates the directory structure, file
formats, documentation, etc. to modern standards.


2. Directory structure
======================
- data/flac/  --  FLAC files containing call audio
- data/transcripts/orig/  --  original transcripts in WebTrans TSV format
- data/transcripts/updated/  --  updated transcripts in WebTrans TSV format
- docs/calldata.tbl  --  basic information about each call, including which
  partition (train/dev/test) it belongs to and results from the call quality
  audit
- docs/doc_calldata.txt  --  documentation of "calldata.tbl"
- docs/speakerdata.tbl  --  audit-derived information about the transcribed
  speakers
- docs/doc_speakerdata.txt  --  documentation of "speakerdata.tbl"
- docs/pindata.tbl  --  participant-supplied demographics for the initiator of
  each call
- docs/doc_pindata.txt  --  documentation of "pindata.tbl"
- docs/lex_segm.txt  --  documentation of the general principles for Japanese
  word segmentation
- docs/file.tbl  --  listing of md5 checksums, sizes, dates, and file names
- docs/README.txt  --  this file
- docs/transcription_specs_orig.txt  --  documentation of the original
  transcription specifications
- docs/transcription_specs_updated.pdf  --  documentation of updated
  transcription specifications


3. CALLHOME
===========
3.1 Data acquisition
--------------------
Speakers were solicited by Rutgers and the LDC to participate in this
telephone speech collection effort through personal contacts and via the
internet. A total of 200 call originators were found, each of whom placed a
telephone call via a toll-free robot operator maintained originally by Rutgers
University, and later by the LDC. Access to the robot operator was possible
via a unique Personal Identification Number (PIN) issued by the recruiting
staff at Rutgers or the LDC when the caller enrolled in the project. The
participants were made aware that their telephone call would be recorded, as
were the call recipients. The call was allowed only if both parties agreed to
being recorded. Each caller was allowed to place only one telephone call and
to talk for up to 30 minutes.

In all, 200 calls were transcribed. Of these, 80 were designated as training
calls, 20 as development test calls, and 100 as evaluation test calls. Of
these 100 evaluation test calls, 20 were exposed in the original CALLHOME
releases (LDC96S37 and LDC96T18); the remainder were withheld for use in
future evaluations and remain unexposed in this release. For each of the
training and development test calls, a contiguous 10-minute region was
selected for transcription; for the evaluation test calls, a 5-minute region
was transcribed.


3.2 Data verification
---------------------
After a successful call was completed, a human audit of each telephone call
was conducted to verify that the proper language was spoken, to check the
quality of the recording, and to select and describe the region to be
transcribed. The description of the transcribed region provides information
about channel quality, number of speakers, their gender, and other attributes.

The information about each call may be found in the file
"docs/calldata.tbl", and its contents are described in greater detail in
"docs/doc_calldata.txt". The audit-derived information about the transcribed
speakers may be found in the file "docs/speakerdata.tbl", whose contents are
described in the file "docs/doc_speakerdata.txt".


3.3 Speaker demographics
------------------------
Information on speaker demographics can be found in the file
"docs/pindata.tbl", whose contents are described in the file
"docs/doc_pindata.txt".


4. Word segmentation
====================
Segmentation of the Japanese transcripts was performed by hand at the LDC by
Megumi Kobayashi and Masayo Kaneko. Word segmentation principles for Japanese
were formulated in collaboration with LVCSR CALLHOME contractors, especially
Yoshiko Ito and Paul Bamberg at Dragon Systems. These principles are described
in "docs/lex_segm.txt". Certain dialect words, tagged with "dia", are
exceptions to these principles; dialect-specific contractions never occur in
uncontracted form.


5. File formats
===============
5.1 Audio
---------
Audio is provided as 8 kHz, 16-bit, two-channel FLAC files converted from the
original SHORTEN-compressed SPHERE files. No resampling or additional
processing was performed.


5.2 Transcripts
---------------
The transcripts are released as UTF-8 TSV files in the format output by
WebTrans (a web-based transcription tool in use at LDC). Each file consists of
a sequence of transcribed speech segments, one per line, each line having the
following six tab-delimited fields:

- Audio  --  basename of audio file
- Channel  --  channel of the audio file that the segment is on (1-indexed)
- Beg  --  onset of speaker turn in seconds from beginning of audio file
- End  --  offset of speaker turn in seconds from beginning of audio file
- Text  --  transcript
- Speaker  --  speaker id; within CALLHOME speaker ids are only guaranteed to
  be unique within the scope of a call


6. Transcription
================
In this release we provide two versions of the transcripts:

- the version previously released in LDC96T18 (Section 6.1)
- an "updated" version which has been transformed to more closely resemble
  output of current LDC transcription tasks (Section 6.2)


6.1 Original transcription
--------------------------
The transcripts are identical to those in the LDC96T18 release except that the
text encoding has been updated to UTF-8.

For details regarding the original transcription guidelines, please see the
document:

    docs/transcription_specs_orig.txt


6.2 Updated transcription
-------------------------
The updated transcripts conform to the most recent version of in-house LDC
transcription specifications as described in:

    docs/transcription_specs_updated.pdf

with the following exceptions:

- When the initial portion of a word is elided, this is indicated by '-'; e.g.

      Three senators -stained from the vote.

- {noise} indicates a background noise not made by a speaker
- Speech in a foreign language is marked by inline XML elements. The element
  is named "foreign" and has a single mandatory "lang" attribute. The value of
  this attribute is the ISO 639-3 code for the language. E.g.:

    <foreign lang="eng"> hello </foreign>
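
The "foreign" markup described above can be located with a regular expression.
The sketch below is only an illustration: the function names are hypothetical,
and it assumes that elements are not nested and that the "lang" value is always
written as three lowercase letters in double quotes, as in the example.

```python
import re

# Assumption: non-nested elements with a three-letter lowercase lang code.
FOREIGN_RE = re.compile(r'<foreign lang="([a-z]{3})">\s*(.*?)\s*</foreign>')


def foreign_spans(text):
    """Return (language code, enclosed text) pairs found in a transcript."""
    return [(m.group(1), m.group(2)) for m in FOREIGN_RE.finditer(text)]


def strip_foreign_markup(text):
    """Replace each element by its enclosed text, dropping the tags."""
    return FOREIGN_RE.sub(lambda m: m.group(2), text)


line = 'w1 <foreign lang="eng"> hello world </foreign> w2'
spans = foreign_spans(line)
plain = strip_foreign_markup(line)
```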

A nonexhaustive list of the transformations applied:

- Normalized annotations of unintelligible regions, e.g., by removing
  leading/trailing whitespace:

      (( text)) -> ((text))

- Normalized speaker-produced noises to the set allowed in modern guidelines:

  - {laugh}
  - {cough}
  - {breath}
  - {lipsmack}
  - {NSV} (catch all for all other noises)

  E.g., {breath_noise} -> {breath}

- Mapped all background noises not made by speaker to {noise}; e.g.

      [click]  -->  {noise}
      [static] -->  {noise}

- Removed background or channel sound annotations, e.g.,

      [echoing] text [/echoing] -> text

- Removed inline comments, e.g.,

      [[distortion]] -> ''

- Marked speech in foreign languages with inline XML, using an element named
  "foreign" with a single mandatory attribute "lang" whose value is the
  three-letter ISO 639-3 language code. E.g.

      <English_w1_w2> --> <foreign lang="eng"> w1 w2 </foreign>

- Marked initialisms and spoken words using '~'; e.g.

      C E O   -->  ~CEO
      CD ROM  -->  ~CD ROM

- Removed redundant whitespace, e.g., w1  w2 -> w1 w2
- Miscellaneous typo fixes.
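
To make a few of the transformations listed above concrete, here is a rough
Python sketch that reproduces only the examples given in this section. It is
not the procedure used at LDC to produce the updated transcripts: the function
name "normalize" is hypothetical, and the patterns cover just the sample tags
shown ("echoing", "click", "static", "breath_noise"), not the full original
tag inventory.

```python
import re


def normalize(text: str) -> str:
    """Apply a handful of the listed transformations to one transcript line."""
    # Removed inline comments:  [[distortion]] -> ''
    text = re.sub(r"\[\[.*?\]\]", "", text)
    # Removed channel sound annotations:  [echoing] text [/echoing] -> text
    text = re.sub(r"\[/?echoing\]", "", text)
    # Mapped background noises to {noise} (only the two documented examples)
    text = re.sub(r"\[(?:click|static)\]", "{noise}", text)
    # Normalized speaker-produced noises:  {breath_noise} -> {breath}
    text = text.replace("{breath_noise}", "{breath}")
    # Normalized unintelligible regions:  (( text)) -> ((text))
    text = re.sub(r"\(\(\s*(.*?)\s*\)\)", r"((\1))", text)
    # Removed redundant whitespace (ASCII spaces and tabs only)
    return re.sub(r"[ \t]+", " ", text).strip()


before = ("[echoing] w1  w2 [/echoing] [click] (( text)) "
          "[[distortion]] {breath_noise}")
after = normalize(before)
```

The order of the steps matters in this sketch: double-bracket comments are
removed before the single-bracket patterns are applied, so that a comment is
never partially rewritten as a noise tag.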


7. Metadata
===========
7.1 calldata.tbl
----------------
This is a tab-delimited file containing metadata for all calls. Please see the
document "docs/doc_calldata.txt" for details.


7.2 speakerdata.tbl
-------------------
This is a tab-delimited file containing metadata for all speakers in the
corpus. Please see the document "docs/doc_speakerdata.txt" for details.


7.3 pindata.tbl
---------------
This is a tab-delimited file containing metadata for all participants who
initiated a call. Please see the document "docs/doc_pindata.txt" for
details.


7.4 file.tbl
------------
Expected sizes, modification times, and MD5 checksums for all files within the
"data/" directory are recorded in "docs/file.tbl". This is a tab-delimited
table containing one entry per line, each line having the following four
fields:

- checksum  --  MD5 checksum of file
- size  --  size of file in bytes
- datetime  --  last modification date in YYYY-MM-DD_HH:MM:SS format
- path  --  path to file relative to root of release directory
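
The table can be used to check a local copy of the release. The following is a
minimal Python sketch of such a check, under the assumptions that the table
has no header line and that paths use forward slashes; the function names
"md5sum" and "find_mismatches" are hypothetical, and the demonstration at the
end runs on a small made-up directory rather than on the release itself.

```python
import hashlib
import os
import tempfile


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(table_path, root):
    """Return listed paths that are missing or differ in size or MD5."""
    bad = []
    with open(table_path, encoding="utf-8") as table:
        for line in table:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4:
                continue  # assumption: skip blank or malformed lines
            checksum, size, _datetime, rel_path = fields
            full = os.path.join(root, rel_path)
            if (not os.path.isfile(full)
                    or os.path.getsize(full) != int(size)
                    or md5sum(full) != checksum.lower()):
                bad.append(rel_path)
    return bad


# Demonstration on a made-up directory: one intact file, one missing file.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, "data"))
    payload = b"example"
    with open(os.path.join(root, "data", "a.bin"), "wb") as f:
        f.write(payload)
    good = hashlib.md5(payload).hexdigest()
    table = os.path.join(root, "file.tbl")
    with open(table, "w", encoding="utf-8") as f:
        f.write(f"{good}\t{len(payload)}\t2023-10-17_00:00:00\tdata/a.bin\n")
        f.write(f"{'0' * 32}\t{len(payload)}\t2023-10-17_00:00:00\t"
                "data/missing.bin\n")
    mismatches = find_mismatches(table, root)
```

The modification time is deliberately not compared here, since copying or
unpacking a release commonly changes timestamps without changing content.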


8. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant
    <nryant@ldc.upenn.edu>