                README FILE FOR CALLFRIEND RUSSIAN TEXT

LDC Catalog-ID: LDC2023T09

Authors: David Miller, Kevin Walker, David Graff, Alexandra Canavan


1. Introduction and Background

This corpus contains transcripts of 100 telephone conversations among
native speakers of Russian.  These calls were recorded by the
Linguistic Data Consortium in 1999, and the audio data are available
as a separate corpus (LDC2023S08).

One hundred native Russian speakers living in the continental
U.S. (mostly in the greater Philadelphia region) were recruited and
given incentives to make a single phone call, lasting up to 30
minutes, to a native Russian-speaking family member or friend also
living in the U.S.

Though carried out as a separate project, CALLFRIEND Russian has much
in common with the LDC's earlier CALLFRIEND and CALLHOME collections
(conducted between 1995 and 1997).  Recruited speakers provided little
or no demographic information, and no information at all was provided
regarding the persons they chose to call.  All participants provided
consent to be recorded for language research purposes as part of the
call collection, and were instructed to speak on any topics they
chose.

Like other CALLFRIEND collections, all callers and callees were
located within the continental U.S. (unlike CALLHOME, where all
callees were overseas).  Unlike most other CALLFRIEND collections, the
Russian collection has been (mostly) transcribed, and a pronouncing
lexicon was developed from the transcripts (similar to what was done
in CALLHOME).  Also, unlike the earlier collections, the LDC did not
divide the call inventory into partitions for training, development
testing and evaluation.

All recordings were routed through the LDC's call collection platform
in Philadelphia, and stored as 2-channel ("4-wire"), 8-kHz mu-law
samples taken directly from the public telephone network via a T-1
circuit.
Many recruits in the Philadelphia area chose to call acquaintances who
lived nearby, and most of the calls involved land-lines; as a result,
many calls had noticeable cross-channel echo.  For publication, all
calls have been converted from the original mu-law to 16-bit PCM
samples, and have undergone an echo-cancellation process.  (Some
cross-channel echo may not have been eliminated completely, but on the
whole it has been reduced.)

The transcripts and lexicon are presented as plain-text, tab-delimited
files with UTF-8 character encoding.  Further details are provided
below.


2. Directory Structure

  docs/
    README.txt         -- this file
    all_call_info.tab  -- see section 3.1
    file.tbl           -- list of MD5 checksums for data files

  data/
    transcripts/       -- contains 100 ru_*.txt files -- see section 3.2
    lexicon/           -- see section 3.3


3. File Formats

3.1 all_call_info.tab

This plain-text, tab-delimited table contains column headings on the
first line, followed by one line (or row) for each of the 100 calls in
the corpus.  The columns are as follows:

   1  file_id      ru_####
   2  audio_len    flac file duration in seconds
   3  tr_bgn       seconds from start of audio to start of first transcript segment
   4  tr_end       seconds from start of audio to end of last transcript segment
   5  tr_len       total duration of span covered by transcript (col.4 - col.3)
   6  nsegs        number of transcribed segments
   7  seg_sec_sum  sum of segment durations
   8  ntokens      total of word tokens in all segments
   9  nspkrs       total number of speakers present
  10  spkrA_info   gender, segment count, total segment duration for A-channel speaker
  11  spkrB_info   gender, segment count, total segment duration for B-channel speaker(s)

Columns 10 and 11 contain multiple sub-parts separated by colons --
e.g.:

   A:F:191:585.6

This example shows that on channel A, the speaker is female, there are
191 segments transcribed, and these sum to 585.6 seconds of speech.

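As an illustration only, the colon-separated sub-parts of columns 10
and 11 could be unpacked along the following lines.  This is a minimal
Python sketch, not part of the corpus tooling; the names SpeakerInfo
and parse_spkr_info are hypothetical, and the sketch assumes every
token has exactly four sub-parts (label, gender, segment count,
duration).  Splitting on whitespace first means a field holding
several space-separated tokens is handled the same way as a field
holding one.

```python
from typing import NamedTuple


class SpeakerInfo(NamedTuple):
    """One token from column 10 or 11 (hypothetical helper type)."""
    label: str      # speaker label, e.g. "A", "B" or "B1"
    gender: str     # gender code as given in the table, e.g. "F"
    nsegs: int      # number of transcribed segments
    seg_sec: float  # total duration of those segments, in seconds


def parse_spkr_info(field: str) -> list[SpeakerInfo]:
    """Split a spkrA_info/spkrB_info field into one record per token."""
    speakers = []
    for token in field.split():  # tokens are space-separated
        # Assumption: exactly four colon-separated sub-parts per token.
        label, gender, nsegs, seg_sec = token.split(":")
        speakers.append(SpeakerInfo(label, gender, int(nsegs), float(seg_sec)))
    return speakers


info = parse_spkr_info("A:F:191:585.6")
print(info[0].label, info[0].gender, info[0].nsegs, info[0].seg_sec)
```

A token with a different number of sub-parts raises ValueError, which
seems preferable to silently mis-assigning fields.
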
There are two calls in which more than one speaker was present on
channel B; in these cases, column 11 contains multiple,
space-separated tokens, with distinct digits after the "B" -- e.g.:

   B1:F:90:240.80 B2:F:90:109.43

This example shows two speakers on channel B (labeled "B1" and "B2" in
the corresponding transcript file); both are female, both have 90
segments transcribed, and the total durations of their segments are
240.80 and 109.43 seconds, respectively.

3.2 transcript files (data/transcripts/ru_*.txt)

Each transcript file is presented as a plain-text, tab-delimited table
of 4 columns per row, with no header line (i.e. the first line/row of
each transcript file is the first transcribed segment).  The four
columns are:

   1  begin_offset     -- in seconds (from start of audio file)
   2  end_offset       -- in seconds (segment duration = col.2 - col.1)
   3  speaker_label    -- usually "A:" or "B:" (see note below)
   4  transcript_text  -- space-separated tokens, UTF-8 encoded

There are two calls in which channel B contains more than one speaker
(i.e. the telephone handset was shared sequentially by two or more
people); in these cases, a digit is included with the speaker label:
"B1:", "B2:", etc.

The transcript_text tokens include the following patterns of markup:

   -- portions spoken in English (or other non-Russian language)
  [background], [distortion], [static] -- audible interference in the channel
  [[skip]] -- portions of speech left untranscribed
  {breath}, {laugh}, {lipsmack}, {cough}, {sneeze} -- non-speech noise from speaker

Also, some categories of "word" tokens are marked by these initial
characters attached to each such token:

   *  "nonce" word (made up by the speaker, peculiar to the context)
   +  mispronunciation of the word (spelling indicates intended word)
   ~  single alphabetic letter pronounced as such (e.g. "~T ~V")
   @  acronym pronounced as a word (e.g. @UNICEF)
   ^  proper name
   %  hesitation sound (non-lexeme)

The following three transcript files cover only a small portion of the
corresponding speech files (the transcription project ended before
these could be completed):

   ru_9301 -- less than 1 minute covered from 30-minute call
   ru_9360 -- about 2 minutes covered from 30-minute call
   ru_9366 -- about 1 minute covered from 30-minute call

3.3 lexicon data

Unlike the CALLHOME lexicons, which sought to provide reasonable
coverage for each language, including common words that didn't happen
to occur in the transcripts of collected calls, the CALLFRIEND Russian
lexicon covers only the word forms that appear in the 97 transcript
files.

The data/lexicon/ directory contains its own README.1st file, which
explains the content in full detail.

---------------
README file created by David Graff, June 4, 2021
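
Illustrative sketch: reading a transcript row (section 3.2)

The four-column transcript rows and the token markup described in
section 3.2 could be read roughly as follows.  This is a minimal
Python sketch under stated assumptions, not an official LDC tool.  The
names Segment, parse_segment and classify are hypothetical, the sample
line is made up for illustration and is not taken from the corpus, and
each bracketed or braced marker is assumed to be a single token with
no internal spaces.  Markup for non-Russian portions is not handled.

```python
from typing import NamedTuple

# Initial characters that mark special "word" tokens (from section 3.2).
PREFIX_CLASSES = {
    "*": "nonce",
    "+": "mispronunciation",
    "~": "letter",
    "@": "acronym",
    "^": "proper_name",
    "%": "hesitation",
}


class Segment(NamedTuple):
    """One transcript row (hypothetical helper type)."""
    begin: float       # seconds from start of audio
    end: float         # seconds from start of audio
    speaker: str       # "A", "B", "B1", ... (trailing colon removed)
    tokens: list[str]  # space-separated transcript tokens


def parse_segment(line: str) -> Segment:
    """Split one tab-delimited transcript row into its four columns."""
    begin, end, speaker, text = line.rstrip("\n").split("\t", 3)
    return Segment(float(begin), float(end), speaker.rstrip(":"), text.split())


def classify(token: str) -> str:
    """Label a token by its markup; plain words fall through to 'word'."""
    if token.startswith("[[") and token.endswith("]]"):
        return "untranscribed"   # e.g. [[skip]]
    if token.startswith("[") and token.endswith("]"):
        return "channel_noise"   # e.g. [static]
    if token.startswith("{") and token.endswith("}"):
        return "speaker_noise"   # e.g. {laugh}
    return PREFIX_CLASSES.get(token[:1], "word")


# Made-up sample row, for illustration only.
sample = "12.34\t15.00\tA:\t%а ^Москва {laugh} [[skip]]"
seg = parse_segment(sample)
print(seg.speaker, round(seg.end - seg.begin, 2))
print([classify(t) for t in seg.tokens])
```

The double-bracket check comes before the single-bracket check because
"[[skip]]" also starts with "[" and ends with "]".
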