README FILE FOR CALLFRIEND RUSSIAN SPEECH

LDC Catalog-ID: LDC2023S08

Authors: David Miller, Kevin Walker, David Graff, Alexandra Canavan


1. Introduction and Background

This corpus contains audio recordings of 100 telephone conversations
among native speakers of Russian. These calls were recorded by the
Linguistic Data Consortium in 1999.

One hundred native Russian speakers living in the continental
U.S. (mostly in the greater Philadelphia region) were recruited and
given incentives to make a single phone call, lasting up to 30 minutes,
to a native Russian-speaking family member or friend also living in the
U.S.

Though carried out as a separate project, CALLFRIEND Russian has much
in common with the LDC's earlier CALLFRIEND and CALLHOME collections
(conducted between 1995 and 1997). Recruited speakers provided little
or no demographic information, and no information at all was provided
regarding the persons they chose to call. All participants provided
consent to be recorded for language research purposes as part of the
call collection, and were instructed to speak on any topics they chose.

Like other CALLFRIEND collections, all callers and callees were located
within the continental U.S. (unlike CALLHOME, where all callees were
overseas).

Unlike most other CALLFRIEND collections, the Russian collection has
been (mostly) transcribed, and a pronouncing lexicon was developed from
the transcripts (similar to what was done in CALLHOME) -- the
transcripts and lexicon are available as a separate corpus
(LDC2023T09). Also, unlike the earlier collections, the LDC did not
divide the call inventory into partitions for training, development
testing and evaluation.

All recordings were routed through the LDC's call collection platform
in Philadelphia, and stored as 2-channel ("4-wire"), 8-kHz mu-law
samples taken directly from the public telephone network via a T-1
circuit.
Many recruits in the Philadelphia area chose to call acquaintances who
lived nearby, and most of the calls involved land-lines; as a result,
many calls had noticeable cross-channel echo.

For this release, all calls have been converted from the original
mu-law to 16-bit PCM samples, and have undergone an echo-cancellation
process. (Some cross-channel echo may not have been eliminated
completely, but on the whole it has been reduced.)


2. Directory Structure

  docs/
    README.txt         -- this file
    all_call_info.tab  -- see 3.1 below
    file.tbl           -- list of MD5 checksums for data files
  data/
    speech/            -- contains 100 ru_*.flac files -- see section 3.2


3. File Formats

3.1 all_call_info.tab

This plain-text, tab-delimited table contains column headings on the
first line, followed by one line (or row) for each of the 100 calls in
the corpus. The columns are as follows:

   1  file_id      ru_####
   2  audio_len    flac file duration in seconds
   3  tr_bgn       seconds from start of audio to start of first
                   transcript segment
   4  tr_end       seconds from start of audio to end of last
                   transcript segment
   5  tr_len       total duration of span covered by transcript
                   (col.4 - col.3)
   6  nsegs        number of transcribed segments
   7  seg_sec_sum  sum of segment durations
   8  ntokens      total number of word tokens in all segments
   9  nspkrs       total number of speakers present
  10  spkrA_info   gender, segment count, total segment duration for
                   A-channel speaker
  11  spkrB_info   gender, segment count, total segment duration for
                   B-channel speaker(s)

Columns 10 and 11 contain multiple sub-parts separated by colons --
e.g.:

   A:F:191:585.6

This example shows that on channel A, the speaker is female, there are
191 segments transcribed, and these sum to 585.6 seconds of speech.
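
As an illustration only (this is not part of any LDC tooling), the
following Python sketch shows one way to unpack a column 10 or 11
value. The names SpeakerInfo, parse_speaker_field and read_call_info
are hypothetical, and the sketch assumes that the header row of the
table uses the column names listed above. A field is split on
whitespace first, so a value holding more than one colon-separated
token is also handled.

```python
import csv
from typing import Dict, Iterable, List, NamedTuple


class SpeakerInfo(NamedTuple):
    """One colon-separated token from spkrA_info or spkrB_info."""
    label: str      # channel label, e.g. "A"
    gender: str     # gender code as it appears in the file, e.g. "F"
    nsegs: int      # number of transcribed segments
    seg_sec: float  # total duration of those segments, in seconds


def parse_speaker_field(field: str) -> List[SpeakerInfo]:
    """Split a speaker-info field into one SpeakerInfo per token."""
    speakers = []
    for token in field.split():
        label, gender, nsegs, seg_sec = token.split(":")
        speakers.append(SpeakerInfo(label, gender, int(nsegs), float(seg_sec)))
    return speakers


def read_call_info(lines: Iterable[str]) -> List[Dict[str, object]]:
    """Read the tab-delimited table from an open file or list of lines.

    Assumption: the header names match the column names in this README.
    All other columns are left as strings.
    """
    rows = []
    for row in csv.DictReader(lines, delimiter="\t"):
        row["spkrA_info"] = parse_speaker_field(row["spkrA_info"])
        row["spkrB_info"] = parse_speaker_field(row["spkrB_info"])
        rows.append(row)
    return rows
```

For instance, parse_speaker_field("A:F:191:585.6") yields a single
SpeakerInfo with label "A", gender "F", nsegs 191 and seg_sec 585.6.
To process the whole table, pass an open file handle for
docs/all_call_info.tab to read_call_info.
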

There are two calls in which more than one speaker was present on
channel B; in these cases, column 11 contains multiple, space-separated
tokens, with distinct digits after the "B" -- e.g.:

   B1:F:90:240.80 B2:F:90:109.43

This example shows two speakers on channel B (labeled "B1" and "B2" in
the corresponding transcript file); both are female, both have 90
segments transcribed, and the total durations of their segments are
240.80 and 109.43 seconds, respectively.

3.2 audio files (data/speech/ru_*.flac)

Each audio file is a FLAC-compressed MS-WAV (RIFF) file containing
2-channel, 8-kHz, 16-bit PCM sample data.

Because many calls involved land-line circuits that did not extend far
from the recording platform, echo cancellation was not applied by the
public telephone network, and in some calls the cross-channel echo was
severe. To reduce this problem, we applied an echo-cancellation
process, which implemented the algorithm described here:

   Messerschmitt, David; Hedberg, David; Cole, Christopher; Haoui,
   Amine; Winship, Peter; "Digital Voice Echo Canceller with a
   TMS32020," in Digital Signal Processing Applications with the TMS320
   Family, pp. 415-437, Texas Instruments, Inc., 1986.

The software used to run this process at the LDC was provided in 1997
by Joe Picone and Aravind Ganapathiraju at Mississippi State
University. While the process may have left behind some residual
cross-channel echo in some calls, the overall reduction of echo was
significant.

---------------
README file created by David Graff, June 4, 2021