CALLHOME Japanese Omnibus Release
October 17, 2023
Linguistic Data Consortium


1. Overview
===========

This is an updated release of the CALLHOME Japanese corpus. The original
CALLHOME corpus was collected and transcribed by the Linguistic Data
Consortium primarily in support of the project on Large Vocabulary
Conversational Speech Recognition (LVCSR), sponsored by the U.S.
Department of Defense.

This re-release combines the original CALLHOME Japanese Speech
(LDC96S37) and Transcripts (LDC96T18) corpora, and updates the directory
structure, file formats, documentation, etc. to modern standards.


2. Directory structure
======================

- data/flac/ -- FLAC files containing call audio
- data/transcripts/orig/ -- original transcripts in WebTrans TSV format
- data/transcripts/updated/ -- updated transcripts in WebTrans TSV
  format
- docs/calldata.tbl -- basic information about each call, including
  which partition (train/dev/test) it belongs to and results from the
  call quality audit
- docs/doc_calldata.txt -- documentation of "calldata.tbl"
- docs/speakerdata.tbl -- audit-derived information about the
  transcribed speakers
- docs/doc_speakerdata.txt -- documentation of "speakerdata.tbl"
- docs/pindata.tbl -- participant-supplied demographics for the
  initiator of each call
- docs/doc_pindata.txt -- documentation of "pindata.tbl"
- docs/lex_segm.txt -- documentation of general principles for Japanese
  word segmentation
- docs/file.tbl -- listing of md5 checksums, sizes, dates, and file
  names
- docs/README.txt -- this file
- docs/transcription_specs_orig.txt -- documentation of the original
  transcription specifications
- docs/transcription_specs_updated.pdf -- documentation of updated
  transcription specifications


3. CALLHOME
===========

3.1 Data acquisition
--------------------

Speakers were solicited by Rutgers and the LDC to participate in this
telephone speech collection effort through personal contacts and via the
internet.
A total of 200 call originators were found, each of whom placed a
telephone call via a toll-free robot operator maintained originally by
Rutgers University, and later by the LDC. Access to the robot operator
was possible via a unique Personal Identification Number (PIN) issued by
the recruiting staff at Rutgers or the LDC when the caller enrolled in
the project. The participants were made aware that their telephone call
would be recorded, as were the call recipients. The call was allowed
only if both parties agreed to being recorded. Each caller was allowed
to place only one telephone call and to talk for up to 30 minutes.

In all, 200 calls were transcribed. Of these, 80 were designated as
training calls, 20 as development test calls, and 100 as evaluation test
calls. Of these 100 evaluation test calls, 20 were exposed in the
original CALLHOME releases (LDC96S37 and LDC96T18); the remainder were
withheld for use in future evaluations and remain unexposed in this
release. For each of the training and development test calls, a
contiguous 10-minute region was selected for transcription; for the
evaluation test calls, a 5-minute region was transcribed.

3.2 Data verification
---------------------

After a successful call was completed, a human audit of each telephone
call was conducted to verify that the proper language was spoken, to
check the quality of the recording, and to select and describe the
region to be transcribed. The description of the transcribed region
provides information about channel quality, number of speakers, their
gender, and other attributes.

The information about each call may be found in the file
"docs/calldata.tbl", and its contents are described in greater detail in
"docs/doc_calldata.txt". The audit-derived information about the
transcribed speakers may be found in the file "docs/speakerdata.tbl",
whose contents are described in the file "docs/doc_speakerdata.txt".
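
As a rough worked example, the partition figures given in Section 3.1
can be tallied to estimate how many calls are exposed and how much
transcribed audio that nominally amounts to. The partition labels and
the dictionary layout below are illustrative assumptions, not names used
in the release, and the minute total assumes every transcribed region is
exactly its stated length.

```python
# Figures copied from Section 3.1. The labels "train", "devtest" and
# "evltest" are hypothetical and are used here only as dictionary keys.
TOTAL_TRANSCRIBED_CALLS = 200

# label -> (number of exposed calls, transcribed minutes per call)
exposed = {
    "train": (80, 10),
    "devtest": (20, 10),
    "evltest": (20, 5),  # only 20 of the 100 evaluation calls are exposed
}

exposed_calls = sum(calls for calls, _ in exposed.values())
withheld_calls = TOTAL_TRANSCRIBED_CALLS - exposed_calls
nominal_minutes = sum(calls * minutes for calls, minutes in exposed.values())

print(exposed_calls, withheld_calls, nominal_minutes)
# prints: 120 80 1100
```

Under these assumptions the exposed calls carry at most about 18.3 hours
of transcribed audio; the actual amount depends on the region lengths
recorded for each call.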

3.3 Speaker demographics
------------------------

Information on speaker demographics can be found in the file
"docs/pindata.tbl", whose contents are described in the file
"docs/doc_pindata.txt".


4. Word segmentation
====================

Segmentation of the Japanese transcripts was performed by hand at the
LDC by Megumi Kobayashi and Masayo Kaneko. Word segmentation principles
for Japanese were formulated in collaboration with LVCSR CALLHOME
contractors, especially Yoshiko Ito and Paul Bamberg at Dragon Systems.
These principles are described in "docs/lex_segm.txt". Certain dialect
words, tagged with "dia", are exceptions to these principles;
dialect-specific contractions never occur in uncontracted form.


5. File formats
===============

5.1 Audio
---------

Audio is provided as 8 kHz, 16-bit, two-channel FLAC files converted
from the original SHORTEN-compressed SPHERE files. No resampling or
additional processing was performed.

5.2 Transcripts
---------------

The transcripts are released as UTF-8 TSV files in the format output by
WebTrans (a web-based transcription tool in use at LDC). Each file
consists of a sequence of transcribed speech segments, one per line,
each line having the following six tab-delimited fields:

- Audio -- basename of audio file
- Channel -- channel the segment is on in the audio file (1-indexed)
- Beg -- onset of speaker turn in seconds from beginning of audio file
- End -- offset of speaker turn in seconds from beginning of audio file
- Text -- transcript
- Speaker -- speaker id; within CALLHOME, speaker ids are only
  guaranteed to be unique within the scope of a call


6. Transcription
================

In this release we provide two versions of the transcripts:

- the version previously released in LDC96T18 (Section 6.1)
- an "updated" version which has been transformed to more closely
  resemble output of current LDC transcription tasks (Section 6.2)

6.1 Original transcription
--------------------------

The transcripts are identical to those in the LDC96T18 release except
that the text encoding has been updated to UTF-8. For details regarding
the original transcription guidelines, please see the document:

    docs/transcription_specs_orig.txt

6.2 Updated transcription
-------------------------

The updated transcripts conform to the most recent version of in-house
LDC transcription specifications as described in:

    docs/transcription_specs_updated.pdf

with the following exceptions:

- When the initial portion of a word is elided, this is indicated by
  '-'; e.g.

      Three senators -stained from the vote.

- {noise} indicates a background noise not made by a speaker
- Speech in a foreign language is marked by inline XML elements. The
  element is named "foreign" and has a single mandatory "lang"
  attribute. The value of this attribute is the ISO 639-3 code for the
  language. E.g.:

      <foreign lang="eng">hello</foreign>

A nonexhaustive list of the transformations applied:

- Normalized annotations of unintelligible regions; e.g., by removing
  leading/trailing whitespace:

      (( text)) -> ((text))

- Normalized speaker-produced noises to the set allowed in modern
  guidelines:

    - {laugh}
    - {cough}
    - {breath}
    - {lipsmack}
    - {NSV} (catch-all for all other noises)

  E.g.,

      {breath_noise} -> {breath}

- Mapped all background noises not made by a speaker to {noise}; e.g.

      [click] --> {noise}
      [static] --> {noise}

- Removed background or channel sound annotations, e.g.,

      [echoing] text [/echoing] -> text

- Removed inline comments, e.g.,

      [[distortion]] -> ''

- Used inline XML to mark speech from foreign languages. Such speech is
  marked using an element named "foreign" with a single mandatory
  attribute "lang" that indicates a three-letter ISO 639-3 language
  code. E.g., for English the result has the form:

      <foreign lang="eng">w1 w2</foreign>

- Marked initialisms and spoken words using '~'; e.g.

      C E O --> ~CEO
      CD ROM --> ~CD ROM

- Removed redundant whitespace, e.g.,

      w1   w2 -> w1 w2

- Miscellaneous typo fixes.


7. Metadata
===========

7.1 calldata.tbl
----------------

This is a tab-delimited file containing metadata for all calls. Please
see the document "docs/doc_calldata.txt" for details.

7.2 speakerdata.tbl
-------------------

This is a tab-delimited file containing metadata for all speakers in the
corpus. Please see the document "docs/doc_speakerdata.txt" for details.

7.3 pindata.tbl
---------------

This is a tab-delimited file containing metadata for all participants
who initiated a call. Please see the document "docs/doc_pindata.txt" for
details.

7.4 file.tbl
------------

Expected sizes, modification times, and MD5 checksums for all files
within the "data/" directory are recorded in "docs/file.tbl". This is a
tab-delimited table containing one file per line, each line having the
following four fields:

- checksum -- MD5 checksum of file
- size -- size of file in bytes
- datetime -- last modification date in YYYY-MM-DD_HH:MM:SS format
- path -- path to file relative to root of release directory


8. Contacts
===========

If you have questions about this data release, please contact the
following LDC personnel:

    Neville Ryant
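
To illustrate the six-field transcript layout described in Section 5.2,
the following is a minimal Python sketch of a reader. The names
"Segment" and "read_segments" and the sample rows (including the file
basename "ja_0001") are hypothetical. Whether the released files begin
with a header line is not stated here, so the sketch simply skips any
line whose numeric fields do not parse.

```python
import csv
import io
from typing import Iterator, NamedTuple, TextIO


class Segment(NamedTuple):
    """One transcript line; field order follows Section 5.2."""
    audio: str
    channel: int  # 1-indexed channel in the audio file
    beg: float    # onset in seconds
    end: float    # offset in seconds
    text: str
    speaker: str


def read_segments(stream: TextIO) -> Iterator[Segment]:
    """Yield segments from a tab-delimited transcript stream."""
    # QUOTE_NONE keeps any quote characters in the transcript text as-is.
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 6:
            continue  # blank or malformed line
        audio, channel, beg, end, text, speaker = row
        try:
            yield Segment(audio, int(channel), float(beg), float(end),
                          text, speaker)
        except ValueError:
            # Assumption: a line with non-numeric Channel/Beg/End is a
            # header or otherwise not a segment, so it is skipped.
            continue


# Hypothetical sample data, not taken from the corpus.
sample = (
    "ja_0001\t1\t0.50\t2.25\tもしもし\tA\n"
    "ja_0001\t2\t2.30\t3.10\t{laugh}\tB\n"
)
segments = list(read_segments(io.StringIO(sample)))
total_speech = sum(s.end - s.beg for s in segments)
```

For a real file, the same function could be called on a handle opened
with open(path, encoding="utf-8", newline="").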
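
The "docs/file.tbl" layout described in Section 7.4 lends itself to an
integrity check after download. The sketch below is one possible way to
do this in Python; the function names "md5sum" and "verify" are
hypothetical, and the demonstration builds a small temporary directory
rather than reading the real release. It assumes that lines without
exactly four fields, or with a non-numeric size, can be skipped.

```python
import hashlib
import tempfile
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(root, table):
    """Compare files under root against a file.tbl-style table.

    Returns a list of (relative path, problem) pairs; empty if all match.
    """
    problems = []
    with open(table, encoding="utf-8") as handle:
        for line in handle:
            fields = line.rstrip("\n").split("\t")
            # Assumption: anything that is not checksum/size/datetime/path
            # with a numeric size (e.g. a header line) is ignored.
            if len(fields) != 4 or not fields[1].isdigit():
                continue
            checksum, size, _datetime, rel_path = fields
            target = Path(root) / rel_path
            if not target.is_file():
                problems.append((rel_path, "missing"))
            elif target.stat().st_size != int(size):
                problems.append((rel_path, "size mismatch"))
            elif md5sum(target) != checksum.lower():
                problems.append((rel_path, "checksum mismatch"))
    return problems


# Demonstration on a throwaway directory with one hypothetical file.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "docs").mkdir()
    payload = b"example"
    (root / "data" / "a.bin").write_bytes(payload)
    row = "\t".join([hashlib.md5(payload).hexdigest(), str(len(payload)),
                     "2023-10-17_00:00:00", "data/a.bin"])
    table = root / "docs" / "file.tbl"
    table.write_text(row + "\n", encoding="utf-8")

    clean = verify(root, table)
    # Overwrite with same-length but different content.
    (root / "data" / "a.bin").write_bytes(b"exbmple")
    damaged = verify(root, table)
```

The size check runs before the checksum so that truncated downloads are
reported without hashing the whole file.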