README FILE FOR CALLFRIEND RUSSIAN TEXT
LDC Catalog-ID: LDC2023T09

Authors: David Miller, Kevin Walker, David Graff, Alexandra Canavan


1. Introduction and Background

This corpus contains transcripts of 100 telephone conversations among native
speakers of Russian.  These calls were recorded by the Linguistic Data
Consortium in 1999, and the audio data are available as a separate corpus
(LDC2023S08).  One hundred native Russian speakers living in the continental
U.S. (mostly in the greater Philadelphia region) were recruited and given
incentives to make a single phone call, lasting up to 30 minutes, to a native
Russian-speaking family member or friend also living in the U.S.

Though carried out as a separate project, CALLFRIEND Russian has much in
common with the LDC's earlier CALLFRIEND and CALLHOME collections (conducted
between 1995 and 1997).  Recruited speakers provided little or no demographic
information, and no information at all was provided regarding the persons they
chose to call.  All participants provided consent to be recorded for language
research purposes as part of the call collection, and were instructed to speak
on any topics they chose.  Like other CALLFRIEND collections, all callers and
callees were located within the continental U.S. (unlike CALLHOME, where all
callees were overseas).  Unlike most other CALLFRIEND collections, the Russian
collection has been (mostly) transcribed, and a pronouncing lexicon was
developed from the transcripts (similar to what was done in CALLHOME).  Also,
unlike the earlier collections, the LDC did not divide the call inventory into
partitions for training, development testing and evaluation.

All recordings were routed through the LDC's call collection platform in
Philadelphia, and stored as 2-channel ("4-wire"), 8-kHz mu-law samples taken
directly from the public telephone network via a T-1 circuit.  Many recruits
in the Philadelphia area chose to call acquaintances who lived nearby, and
most of the calls involved land-lines; as a result, many calls had noticeable
cross-channel echo.  For publication, all calls have been converted from the
original mu-law to 16-bit PCM samples, and have undergone an echo-cancellation
process.  (Some cross-channel echo may not have been eliminated completely,
but on the whole it has been reduced.)

The transcripts and lexicon are presented as plain-text, tab-delimited files
with UTF-8 character encoding.  Further details are provided below.


2. Directory Structure

  docs/
    README.txt        -- this file
    all_call_info.tab -- see section 3.1
    file.tbl          -- list of MD5 checksums for data files

  data/
    transcripts/ -- contains 100 ru_*.txt files -- see section 3.2
    lexicon/     -- see section 3.3


3. File Formats

3.1 all_call_info.tab

This plain-text, tab-delimited table contains column headings on the first
line, followed by one line (or row) for each of the 100 calls in the corpus.
The columns are as follows:

  1  file_id      ru_####
  2  audio_len    flac file duration in seconds
  3  tr_bgn       seconds from start of audio to start of first transcript segment
  4  tr_end       seconds from start of audio to end of last transcript segment
  5  tr_len       total duration of span covered by transcript (col.4 - col.3)
  6  nsegs        number of transcribed segments
  7  seg_sec_sum  sum of segment durations
  8  ntokens      total of word tokens in all segments
  9  nspkrs       total number of speakers present
  10 spkrA_info   gender, segment count, total segment duration for A-channel speaker
  11 spkrB_info   gender, segment count, total segment duration for B-channel speaker(s)

Columns 10 and 11 contain multiple sub-parts separated by colons -- e.g.:

   A:F:191:585.6

This example shows that on channel A, the speaker is female, there are 191
segments transcribed, and these sum to 585.6 seconds of speech.  There are two
calls in which more than one speaker was present on channel B; in these cases,
column 11 contains multiple, space-separated tokens, with distinct digits
after the "B" -- e.g.:

   B1:F:90:240.80 B2:F:90:109.43

This example shows two speakers on channel B (labeled "B1" and "B2" in the
corresponding transcript file); both are female, both have 90 segments
transcribed, and the total durations of their segments are 240.8 and 109.43
seconds, respectively.
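
As an illustration only, the colon-separated sub-parts of columns 10 and 11
could be unpacked along the following lines.  This is a minimal Python sketch,
not a tool distributed with the corpus; the names SpeakerInfo and
parse_speaker_info are hypothetical, and the sketch assumes that every token
has exactly four colon-separated parts, as in the two examples above.

```python
from typing import List, NamedTuple


class SpeakerInfo(NamedTuple):
    """One speaker entry from column 10 or 11 (hypothetical helper type)."""
    label: str      # "A", "B", or "B1", "B2", ... when channel B is shared
    gender: str     # gender code as given in the table, e.g. "F"
    nsegs: int      # number of transcribed segments for this speaker
    seg_sec: float  # summed duration of those segments, in seconds


def parse_speaker_info(field: str) -> List[SpeakerInfo]:
    """Split a spkrA_info / spkrB_info value into one record per speaker.

    Assumes space-separated tokens, each of the form label:gender:count:seconds.
    """
    speakers = []
    for token in field.split():
        label, gender, nsegs, seg_sec = token.split(":")
        speakers.append(SpeakerInfo(label, gender, int(nsegs), float(seg_sec)))
    return speakers


# The two example values shown in this section:
single = parse_speaker_info("A:F:191:585.6")
shared = parse_speaker_info("B1:F:90:240.80 B2:F:90:109.43")
```

Returning a list in both cases means that the two calls with a shared B
channel need no special handling by the caller.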

3.2 transcript files (data/transcripts/ru_*.txt)

Each transcript file is presented as a plain-text, tab-delimited table of 4
columns per row, with no header line (i.e. the first line/row of each
transcript file is the first transcribed segment).  The four columns are:

  1  begin_offset  -- in seconds, from start of audio file
  2  end_offset    -- in seconds (segment duration = col.2 - col.1)
  3  speaker_label -- usually "A:" or "B:" (see note below)
  4  transcript_text -- space-separated tokens, UTF-8 encoded

There are two calls in which channel B contains more than one speaker
(i.e. the telephone handset was shared sequentially by two or more people);
in these cases, a digit is included with the speaker label: "B1:", "B2:",
etc.
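
The four-column layout described above can be read with a few lines of
Python.  The following is a sketch under stated assumptions rather than an
official reader: the names Segment and read_segments are hypothetical, the
sample rows are invented for illustration (they are not taken from the
corpus), and the trailing colon is stripped from the speaker label purely
for convenience.

```python
from typing import Iterable, Iterator, NamedTuple


class Segment(NamedTuple):
    """One transcript row (hypothetical helper type)."""
    begin: float   # column 1: begin offset in seconds
    end: float     # column 2: end offset in seconds
    speaker: str   # column 3 without the trailing colon: "A", "B", "B1", ...
    text: str      # column 4: space-separated tokens, left unparsed here

    @property
    def duration(self) -> float:
        # segment duration = col.2 - col.1
        return self.end - self.begin


def read_segments(lines: Iterable[str]) -> Iterator[Segment]:
    """Yield one Segment per non-empty, tab-delimited line (no header line)."""
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        begin, end, speaker, text = line.split("\t", 3)
        yield Segment(float(begin), float(end), speaker.rstrip(":"), text)


# Invented sample rows, for illustration only:
sample = [
    "0.50\t2.75\tA:\t\u0430\u043b\u043b\u043e {breath}\n",
    "2.80\t4.00\tB1:\t\u0434\u0430 \u043f\u0440\u0438\u0432\u0435\u0442\n",
]
segments = list(read_segments(sample))
```

To read a real file, pass an open file handle instead of the sample list,
opening it with encoding="utf-8" so that the Cyrillic text is decoded
correctly.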

The transcript_text tokens include the following patterns of markup:

  <English word(s)>  -- portions spoken in English (or other non-Russian language)
  [background], [distortion], [static] -- audible interference in the channel
  [[skip]] -- portions of speech left untranscribed
  {breath}, {laugh}, {lipsmack}, {cough}, {sneeze} -- non-speech noise from speaker

Also, some categories of "word" tokens are marked by these initial characters
attached to each such token:

  * "nonce" word (made up by the speaker, peculiar to the context)
  + mispronunciation of the word (spelling indicates intended word)
  ~ single alphabetic letter pronounced as such (e.g. "~T ~V")
  @ acronym pronounced as a word (e.g. @UNICEF)
  ^ proper name
  % hesitation sound (non-lexeme)
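
The markup patterns and prefix characters listed above can be told apart
with a single regular expression.  The sketch below is illustrative only:
the category names and the function classify_tokens are hypothetical, and it
assumes that markup spans are delimited exactly as shown (angle brackets,
square brackets, curly braces) and are set off from neighboring tokens by
white space.  Angle-bracket spans are kept whole, since they may contain
more than one word, and their content is not interpreted.

```python
import re
from typing import List, Tuple

# Alternatives are tried left to right, so "[[skip]]" is tested before the
# general "[...]" pattern that would otherwise match part of it.
TOKEN_RE = re.compile(
    r"(?P<skip>\[\[skip\]\])"           # untranscribed portion
    r"|(?P<foreign><[^>]*>)"            # English / non-Russian span
    r"|(?P<channel_noise>\[[^\]]*\])"   # [background], [distortion], [static]
    r"|(?P<speaker_noise>\{[^}]*\})"    # {breath}, {laugh}, ...
    r"|(?P<word>\S+)"                   # anything else is a word token
)

# Initial characters attached to special categories of word tokens.
PREFIXES = {
    "*": "nonce",
    "+": "mispronounced",
    "~": "letter",
    "@": "acronym",
    "^": "proper_name",
    "%": "hesitation",
}


def classify_tokens(text: str) -> List[Tuple[str, str]]:
    """Return (category, token) pairs for one transcript_text value."""
    result = []
    for match in TOKEN_RE.finditer(text):
        kind = match.lastgroup      # name of the alternative that matched
        token = match.group()
        if kind == "word":
            # Plain words keep the category "word"; prefixed ones are mapped.
            kind = PREFIXES.get(token[0], "word")
        result.append((kind, token))
    return result


# Invented example line, for illustration only:
pairs = classify_tokens("^\u0418\u0432\u0430\u043d {laugh} [[skip]] <okay fine> ~T [static]")
```

A plain text.split() would break a multi-word angle-bracket span into
separate pieces, which is why the regular expression is applied to the whole
column-4 string instead.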

The following three transcript files cover only a small portion of the
corresponding speech files (the transcription project ended before these could
be completed):

   ru_9301 -- less than 1 minute covered from 30-minute call
   ru_9360 -- about 2 minutes covered from 30-minute call
   ru_9366 -- about 1 minute covered from 30-minute call

3.3 lexicon data

Unlike the CALLHOME lexicons, which sought to provide reasonable coverage for
each language, including common words that didn't happen to occur in the
transcripts of collected calls, the CALLFRIEND Russian lexicon covers only the
word forms that appear in the 97 transcript files.

The data/lexicon/ directory contains its own README.1st file, which explains
the content in full detail.


---------------
README file created by David Graff, June 4, 2021