README FILE FOR CALLFRIEND RUSSIAN SPEECH
LDC Catalog-ID: LDC2023S08

Authors: David Miller, Kevin Walker, David Graff, Alexandra Canavan


1. Introduction and Background

This corpus contains audio recordings of 100 telephone conversations among
native speakers of Russian.  These calls were recorded by the Linguistic Data
Consortium in 1999.  One hundred native Russian speakers living in the
continental U.S. (mostly in the greater Philadelphia region) were recruited
and given incentives to make a single phone call, lasting up to 30 minutes, to
a native Russian-speaking family member or friend also living in the U.S.

Though carried out as a separate project, CALLFRIEND Russian has much in
common with the LDC's earlier CALLFRIEND and CALLHOME collections (conducted
between 1995 and 1997).  Recruited speakers provided little or no demographic
information, and no information at all was provided regarding the persons they
chose to call.  All participants provided consent to be recorded for language
research purposes as part of the call collection, and were instructed to speak
on any topics they chose.  As in other CALLFRIEND collections, all callers and
callees were located within the continental U.S. (unlike CALLHOME, where all
callees were overseas).  Unlike most other CALLFRIEND collections, the Russian
collection has been (mostly) transcribed, and a pronouncing lexicon was
developed from the transcripts (similar to what was done in CALLHOME) -- the
transcript and lexicon are available as a separate corpus (LDC2023T09).  Also,
unlike the earlier collections, the LDC did not divide the call inventory into
partitions for training, development testing and evaluation.

All recordings were routed through the LDC's call collection platform in
Philadelphia, and stored as 2-channel ("4-wire"), 8-kHz mu-law samples taken
directly from the public telephone network via a T-1 circuit.  Many recruits
in the Philadelphia area chose to call acquaintances who lived nearby, and
most of the calls involved land-lines; as a result, many calls had noticeable
cross-channel echo.  For this release, all calls have been converted from the
original mu-law to 16-bit PCM samples, and have undergone an echo-cancellation
process.  (Some cross-channel echo may not have been eliminated completely,
but on the whole it has been reduced.)
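
For readers unfamiliar with mu-law coding, the expansion of one 8-bit mu-law
byte to a 16-bit linear value can be sketched as below.  This is the standard
G.711 mu-law decoding rule, shown for illustration only; that the conversion
used for this release was equivalent is an assumption, and the function name
is hypothetical.  The sketch does not cover the echo-cancellation step.

```python
def ulaw_to_linear(byte: int) -> int:
    """Expand one G.711 mu-law byte (0-255) to a signed 16-bit PCM value.

    Illustrative sketch only; not the tool used to prepare this corpus.
    """
    BIAS = 0x84                    # 132, the bias used by G.711 mu-law
    u = ~byte & 0xFF               # mu-law bytes are stored bit-inverted
    sign = u & 0x80                # top bit: set means a negative sample
    exponent = (u >> 4) & 0x07     # 3-bit segment number
    mantissa = u & 0x0F            # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude
```

Every one of the 256 possible input bytes maps to a value that fits in a
signed 16-bit sample, so the expansion itself loses no information.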


2. Directory Structure

  docs/
    README.txt        -- this file
    all_call_info.tab -- see section 3.1 below
    file.tbl          -- list of MD5 checksums for data files

  data/
    speech/      -- contains 100 ru_*.flac files -- see section 3.2
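
A downloaded audio file can be checked against its MD5 checksum along the
lines of the sketch below.  The column layout of file.tbl is not described in
this README, so the sketch takes the expected checksum as a plain hex string
rather than parsing file.tbl; the function names are hypothetical.

```python
import hashlib


def md5_of_file(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large audio files need not fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_checksum(path: str, expected_hex: str) -> bool:
    """Compare a file's MD5 digest with an expected hex string."""
    return md5_of_file(path) == expected_hex.strip().lower()
```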


3. File Formats

3.1 all_call_info.tab

This plain-text, tab-delimited table contains column headings on the first
line, followed by one line (or row) for each of the 100 calls in the corpus.
The columns are as follows:

  1  file_id      ru_####
  2  audio_len    flac file duration in seconds
  3  tr_bgn       seconds from start of audio to start of first transcript segment
  4  tr_end       seconds from start of audio to end of last transcript segment
  5  tr_len       total duration of span covered by transcript (col.4 - col.3)
  6  nsegs        number of transcribed segments
  7  seg_sec_sum  sum of segment durations
  8  ntokens      total number of word tokens in all segments
  9  nspkrs       total number of speakers present
  10 spkrA_info   gender, segment count, total segment duration for A-channel speaker
  11 spkrB_info   gender, segment count, total segment duration for B-channel speaker(s)
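
As a minimal sketch, the table can be loaded with Python's standard csv
module.  The sketch assumes that the headings on the first line match the
column names listed above; the function names and the 0.01-second tolerance
are hypothetical choices made for illustration.

```python
import csv


def read_call_info(lines):
    """Return one dict per call from an iterable of tab-delimited lines.

    Keys are taken from the header line (assumed to match the column
    names listed in this README); all values are left as strings.
    """
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def tr_len_is_consistent(row, tol=0.01):
    """Check that tr_len equals tr_end - tr_bgn (col.4 - col.3)."""
    expected = float(row["tr_end"]) - float(row["tr_bgn"])
    return abs(float(row["tr_len"]) - expected) <= tol
```

A typical call would pass an open file handle, for example
read_call_info(open("docs/all_call_info.tab", newline="")).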

Columns 10 and 11 contain multiple sub-parts separated by colons -- e.g.:

   A:F:191:585.6

This example shows that on channel A, the speaker is female, there are 191
segments transcribed, and these sum to 585.6 seconds of speech.  There are two
calls in which more than one speaker was present on channel B; in these cases,
column 11 contains multiple, space-separated tokens, with distinct digits
after the "B" -- e.g.:

   B1:F:90:240.80 B2:F:90:109.43

This example shows two speakers on channel B (labeled "B1" and "B2" in the
corresponding transcript file); both are female, both have 90 segments
transcribed, and the total durations of their segments are 240.80 and 109.43
seconds, respectively.
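
The colon-separated layout described above can be split into its parts as in
the sketch below.  The names SpeakerInfo and parse_spkr_info are hypothetical,
and the sketch assumes every token has exactly four colon-separated parts, as
in the two examples given in this section.

```python
from typing import List, NamedTuple


class SpeakerInfo(NamedTuple):
    label: str      # "A" or "B", or "B1", "B2" when channel B has two speakers
    gender: str     # gender code as it appears in the table, e.g. "F"
    nsegs: int      # number of transcribed segments for this speaker
    seg_sec: float  # total duration of those segments, in seconds


def parse_spkr_info(field: str) -> List[SpeakerInfo]:
    """Parse a column-10 or column-11 value into one entry per speaker."""
    speakers = []
    for token in field.split():  # multiple speakers are space-separated
        label, gender, nsegs, seg_sec = token.split(":")
        speakers.append(SpeakerInfo(label, gender, int(nsegs), float(seg_sec)))
    return speakers


# The two example values quoted in this section:
single = parse_spkr_info("A:F:191:585.6")
double = parse_spkr_info("B1:F:90:240.80 B2:F:90:109.43")
```

Here single holds one entry for speaker A, and double holds two entries,
labeled B1 and B2.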

3.2 audio files (data/speech/ru_*.flac)

Each audio file is a FLAC-compressed MS-WAV (RIFF) format audio file
containing 2-channel, 8-kHz, 16-bit PCM sample data.

Because many calls involved land-line circuits that did not extend far from
the recording platform, echo cancellation was not applied by the public
telephone network, and in some calls, the amount of cross-channel echo was
relatively extreme.  To reduce this problem, we applied an echo-cancellation
process, which implemented the algorithm described here:

	Messerschmitt, David; Hedberg, David; Cole, Christopher;
	Haoui, Amine; Winship, Peter; "Digital Voice Echo Canceller
	with a TMS32020," in Digital Signal Processing Applications
	with the TMS320 Family, pp. 415-437, Texas Instruments, Inc., 1986.

The software used to run this process at the LDC was provided in 1997 by Joe
Picone and Aravind Ganapathiraju at Mississippi State University.  While the
process may have left behind some residual cross-channel echo in some calls,
the overall reduction of echo was significant.


---------------
README file created by David Graff, June 4, 2021