README FILE FOR: Mixer 4 and 5 Speech
LDC Catalog ID: LDC2020S03

1.0 Introduction

The Mixer-4 and Mixer-5 Speech Collections comprise recordings made via the
public telephone network (a total of 2568 calls) and recordings made with
multiple microphones in office-room settings (a total of 2152 sessions). The
offices used for recording sessions were located at the Linguistic Data
Consortium (LDC, hosted by the University of Pennsylvania, Philadelphia PA)
and at the International Computer Science Institute (ICSI, affiliated with
the University of California at Berkeley).

The Mixer-4 and Mixer-5 collections were conducted simultaneously, as a
collaborative, carefully coordinated activity at both recording sites. The
respective protocols of the two collections share some features and differ
in others, as follows:

- Both Mixer-4 and Mixer-5 involved collection of unscripted telephone
  conversations via a "robot operator" recording platform, which both dialed
  out automatically to, and handled incoming calls from, the set of speakers
  recruited and paid to participate in conversational telephone speech (CTS).
  Telephone calls were not differentiated by project: in a given call, a
  Mixer-4 participant could be connected with a Mixer-5 participant.

- Some Mixer-4 speakers and all Mixer-5 speakers were recruited from the
  Philadelphia PA and Berkeley CA regions, in order to participate in
  multi-channel recording sessions conducted at the LDC and ICSI. Other
  Mixer-4 speakers, who would engage only in CTS recordings, were recruited
  from across the continental U.S.

- In Mixer-4, local-area speakers were asked to come to ICSI or the LDC so
  that some of their CTS calls could be conducted in an office equipped with
  multiple microphones. As a result, some of the telephone calls in the
  Mixer-4 collection have multi-channel audio data, including the 8-kHz
  telephone channel captured via the robot operator and 14 channels of
  16-kHz audio recorded via fixed microphones in the recording room. There
  are 252 such calls.

- In Mixer-5, speakers were asked to come to ICSI or the LDC to record a
  series of up to six 30-minute sessions per speaker. During these sessions,
  the speaker was guided by an interviewer through a sequence of activities,
  described in more detail below. Each session included a CTS call in one of
  three vocal-effort conditions: (a) normal, (b) high vocal effort, (c) low
  vocal effort. These phone calls were NOT routed through the LDC's robot
  operator, so the collection has only the 14 channels from room
  microphones, recorded at 16 kHz. There are 1900 such sessions.

Nearly all speakers have English as their native language.

The telephone audio portion of the corpus is similar to earlier Switchboard
collections: recruited speakers are connected through a robot operator to
carry on casual conversations lasting up to 10 minutes, usually about a
daily topic that is announced by the robot operator at the start of the
call. (Speakers were not required to discuss the announced topic.) The raw
digital audio content for each call side was captured as a separate channel;
each full conversation is presented as a 2-channel interleaved audio file
with a NIST SPHERE header, 8000 samples/second and u-law sample encoding.
Each speaker was asked to complete up to 15 calls, and most completed at
least 10.
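For reference, the NIST SPHERE header is plain text and can be inspected
directly. The following is a minimal Python sketch (not part of the corpus;
the path in the usage note is a hypothetical placeholder) that reads the
header fields of a .sph file, assuming the usual SPHERE layout of a
"NIST_1A" magic line, a header-size line, and "name type value" lines
terminated by "end_head":

    # Minimal sketch: inspect the plain-text NIST SPHERE header of a .sph file.
    # Assumes the standard layout: "NIST_1A", header size (typically 1024),
    # then "name -type value" lines terminated by "end_head".
    def read_sphere_header(path):
        with open(path, "rb") as f:
            f.readline()                          # b"NIST_1A"
            size = int(f.readline().strip())      # header size in bytes
            f.seek(0)
            text = f.read(size).decode("ascii", errors="replace")
        fields = {}
        for line in text.splitlines()[2:]:
            if line.strip() == "end_head":
                break
            parts = line.split(None, 2)           # e.g. "sample_rate -i 8000"
            if len(parts) == 3:
                fields[parts[0]] = parts[2]
        return fields

    # e.g. read_sphere_header("data/ulaw_sphere/SOME_FILE.sph")["sample_rate"]
    # would return "8000" for the telephone audio described above.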
The multi-microphone portion involves 14 distinct microphones, set up
identically at each recording site in terms of distance from the recruited
speaker. The 14 channels were recorded synchronously into separate
single-channel files, using 16-bit PCM sample encoding at 16000
samples/second. Each multi-channel session was guided by an LDC staff
person, who used specialized prompting and recording software to manage the
session. For Mixer-5, each session contained a mixture of distinct
activities:

1. Repeating questions
2. Informal conversation
3. Transcript reading
4. Telephone call (with varied vocal-effort conditions)

The recordings in this corpus were used in the NIST Speaker Recognition
Evaluation (SRE) test sets for 2008. Researchers interested in applying
those benchmark test sets should consult the NIST Evaluation Plan (available
at http://nist.gov/itl/iad/mig/sre.cfm) for guidelines on allowable training
data for those tests.

2.0 Directory Structure and Data Files

As described in the introduction, there are two types of audio data, and
these are kept in separate directories, as follows:

  data/
    ulaw_sphere/   - contains 2568 8-kHz 2-channel NIST SPHERE files
    mc_cts_flac/   - Mixer-4 14-channel CTS recordings, 1 channel per directory:
      CH01/        - 252 16-kHz 1-channel flac/ms-wav files
      ...          -   (one directory for each channel)
      CH14/
    mc_ivs_flac/   - Mixer-5 14-channel interview recordings:
      CH01/        - 1900 16-kHz 1-channel flac/ms-wav files
      ...          -   (one directory for each channel)
      CH14/

All audio files have names that indicate the date and time when the
recording began, along with other identifying information, as follows:

  ulaw_sphere/:        {yyyymmdd}_{hrmnsc}_{callid}.sph
  mc_cts_flac/CH{nn}/: {yyyymmdd}_{hrmnsc}_{loc}_{subjid}_{callid}_{A|B}_CH{nn}.flac
  mc_ivs_flac/CH{nn}/: {yyyymmdd}_{hrmnsc}_{loc}_{subjid}.flac

where:

  yyyymmdd  is the year, month and day of the recording
  hrmnsc    is the hour, minute and second when recording began
  loc       is either "PHL" or "BER", indicating where the recording was done
  subjid    is a numeric identifier assigned to the speaker
  callid    is a unique, incremental number assigned to each call
  A|B       is the "call-side" (channel 1 or 2, respectively) in the ulaw data
  nn        is a two-digit microphone channel identifier (01-14)

(A small Python sketch for parsing these file names appears at the end of
this section.)

Note that the "hrmnsc" values are local to the recording location; in
particular, for Mixer-4 "mc_cts_flac" files that were recorded in Berkeley,
the "hrmnsc" values are offset by approximately 3 hours relative to the
corresponding "ulaw_sphere" 2-channel telephone recording (which was
captured simultaneously by the LDC's robot operator).

When the flac files are uncompressed, they become ms-wav/RIFF files (flac
compression does not presently support the SPHERE file format). The
telephone audio is presented in SPHERE format because (a) this is consistent
with other telephone audio releases from the LDC, and (b) flac does not
support u-law sample encoding. The current release of the open-source "sox"
utility is able to handle both formats as input; other utilities are
available for both the flac and SPHERE formats.
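As a convenience, here is a minimal Python sketch of how the multi-channel
CTS file names described above could be decomposed into their component
fields. The helper and the example file name are hypothetical and only the
naming pattern documented above is assumed:

    # Minimal sketch: split a mc_cts_flac file name into its component fields,
    # following the pattern documented in section 2.0. The example name in the
    # usage note is hypothetical.
    import re

    MC_CTS_PATTERN = re.compile(
        r"(?P<date>\d{8})_(?P<time>\d{6})_(?P<loc>PHL|BER)_"
        r"(?P<subjid>\d+)_(?P<callid>\d+)_(?P<side>[AB])_CH(?P<chan>\d{2})\.flac$")

    def parse_mc_cts_name(name):
        m = MC_CTS_PATTERN.search(name)
        return m.groupdict() if m else None

    # e.g. parse_mc_cts_name("20070115_103000_PHL_12345_6789_A_CH01.flac")
    #   -> {'date': '20070115', 'time': '103000', 'loc': 'PHL',
    #       'subjid': '12345', 'callid': '6789', 'side': 'A', 'chan': '01'}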
3.0 Related Documentation

The "docs" directory contains the following files; their contents are
explained in the subsections below:

  1 mx45_call_info.csv  -- details on CTS calls
  2 mx45_spkr_info.csv  -- details on speakers
  3 mx4_mc_cts_info.csv -- Mixer-4 multi-channel sessions
  4 mx5_mc_ivs_info.csv -- Mixer-5 multi-channel sessions
  5 updated_NIST_KEYS   -- modified versions of SRE08 "answer keys"
  6 iv_seg_tdf          -- files with time stamps for IV session events

All the "*.csv" files are comma-delimited, plain-text "flat files". Care has
been taken to ensure that no field values contain commas as part of the
field data, so quotation marks are never used around field values, and
nothing is done to mark or "escape" special characters (such as
apostrophes). Consecutive commas on a line, or a comma at the end of a line,
indicate empty or null field values (which are fairly common in the "spkr"
table).

3.1 Calls table (mx45_call_info.csv)

Each row in this table provides the available information about a 2-channel
telephone conversation. A couple of fields ("lang", "eng_stat") are actually
irrelevant to Mixer-4/5, because this was an English-only collection, but
the field inventory is fixed in order to provide consistency with other
telephone corpus releases from the LDC.

   1 call_id    - numeric identifier, relates to audio file name
   2 call_date  - relates to audio file name
   3 lang       - "USE" (U.S. English) if both speakers are native, "ENG" otherwise
   4 eng_stat   - always has the value "All_ENG"
   5 sid_a      - subjid of the speaker on channel A
   6 phid_a     - telephone ID on channel A
   7 ph_categ_a - one of: M (main phone), O (other phone)
   8 phtyp_a    - one of: 1 (cell phone), 2 (cordless), 3 (standard)
   9 phmic_a    - one of: 1 (speaker-phone), 2 (headset), 3 (earbud), 4 (handheld)
  10 cnvq_a     - audit judgment of conversation quality (Good, Acceptable, Unsuitable)
  11 sigq_a     - audit judgment of signal quality (Good, Acceptable, Unsuitable)
  12 tbug_a     - Y or N: auditor found a technical problem on channel A
  13-20         - same as 5-12, applied to channel B
  21 topic      - numeric ID of the topic announced to the callers

3.2 Subjects table (mx45_spkr_info.csv)

Each row in this table provides demographic information about one of the
speakers in the collection. LDC staffers who led the interview sessions and
were involved in many of the phone calls conducted during those sessions are
also included in this table.

   1 subjid        - numeric identifier, links to calls and interviews
   2 sex           - M or F
   3 yob           - year of birth
   4 edu_years     - years of formal education
   5 edu_degree    - highest education degree earned
   6 edu_deg_yr    - year in which highest degree was earned
   7 edu_contig    - Y or N: were all edu_years spent contiguously?
   8 esl_age       - for ESL speakers, age when English was learned
   9 ntv_lg        - native language (ISO 639-3 code)
  10 oth_lgs       - other languages (ISO 639-3 codes, '/'-separated)
  11 occup         - occupation
  12 cntry_born    - country where born
  13 state_born    - state where born
  14 city_born     - city where born
  15 cntry_rsd     - country where raised
  16 state_rsd     - state where raised
  17 city_rsd      - city where raised
  18 ethnic        - ethnicity
  19 smoker        - Y or N
  20 ht_cm         - height in centimeters
  21 wt_kg         - weight in kilograms
  22 mother_born   - country (state city) where mother was born
  23 mother_raised - country (state city) where mother was raised
  24 mother_lang   - mother's native language
  25 mother_edu    - mother's years of formal education
  26 father_born   - country (state city) where father was born
  27 father_raised - country (state city) where father was raised
  28 father_lang   - father's native language
  29 father_edu    - father's years of formal education
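As an illustration, here is a minimal Python sketch for loading one of these
comma-delimited tables, using the field order given above for the speaker
table. Whether the files include a header row is an assumption to verify
against the actual data before use:

    # Minimal sketch: load mx45_spkr_info.csv using the field order listed in
    # section 3.2. Assumes the file has no header row (verify before use);
    # empty strings correspond to the null/empty field values described above.
    import csv

    SPKR_FIELDS = [
        "subjid", "sex", "yob", "edu_years", "edu_degree", "edu_deg_yr",
        "edu_contig", "esl_age", "ntv_lg", "oth_lgs", "occup",
        "cntry_born", "state_born", "city_born",
        "cntry_rsd", "state_rsd", "city_rsd",
        "ethnic", "smoker", "ht_cm", "wt_kg",
        "mother_born", "mother_raised", "mother_lang", "mother_edu",
        "father_born", "father_raised", "father_lang", "father_edu",
    ]

    speakers = {}
    with open("docs/mx45_spkr_info.csv", newline="") as f:
        for row in csv.reader(f):
            rec = dict(zip(SPKR_FIELDS, row))
            speakers[rec["subjid"]] = rec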
3.3 Mixer-4 multi-channel CTS sessions table (mx4_mc_cts_info.csv)

Each row in this table provides information about a single in-office
multi-channel recording session for a Mixer-4 telephone call. Fields 2 and 5
establish the relative timing of the beginnings of the two sets of audio
data: sessions varied in terms of which device began recording first (the
"ulaw" telephone robot operator or the "mc" 14-channel microphone system),
and in terms of how much time elapsed until the other device began
recording. Whichever device began recording later has an "_ofs" value of
0.0; the other device has a value greater than zero, indicating how many
seconds you must seek into that device's recording in order to reach the
point where the later-starting recording begins. For example, if the
microphone system started 3.2 seconds before the robot operator, then
ulaw_ofs is 0.0 and mc_ofs is 3.2, and skipping the first 3.2 seconds of
each microphone channel aligns it with the start of the telephone recording.

  1 mc_cts_flac_id  - common portion of multi-channel file name
  2 mc_ofs          - alignment offset (in seconds) relative to ulaw data
  3 ulaw_sphere_id  - file-ID of 2-channel telephone audio data
  4 ulaw_ch         - A or B (call-side corresponding to the MC data)
  5 ulaw_ofs        - alignment offset (in seconds) relative to MC data

3.4 Mixer-5 multi-channel interview sessions table (mx5_mc_ivs_info.csv)

Each row in this table provides information about a single in-office
multi-channel recording session for a Mixer-5 interview.

  1 common_audio_fileid
  2 session_num  - ranges from 1 to 6
  3 intervwr_id  - subject_id of the staff person asking questions
  4 session_dur

3.5 Updated NIST 2008 answer key data

This directory contains modified versions of three table files that were
originally circulated by NIST following the SRE 2008 evaluation. This is
only a subset of the complete NIST distribution of results and answer keys;
in particular, it consists of the three tables that happen to contain
references to the original LDC data file names, as delivered to NIST in
preparation for creating the SRE 2008 test set. In preparing the data for
general release, we have modified the file names of the multi-channel
session data to make them more consistent and informative. Changes to the
NIST tables were simply a matter of replacing the file name strings as
needed.

3.6 Time-stamped information about Mixer-5 interview sessions

For most of the Mixer-5 interview sessions, there is a "tdf" file (a
tab-delimited format used by the LDC's "xtrans" audio transcription tool)
with time-stamped entries for the major component portions of the recording
session: the initial set of "repeating questions", the "informal interview",
the reading aloud of sentences, and the telephone call in one or another
"vocal effort" condition. In the case of the sentence readings, each
sentence prompt is shown together with the time stamp for when that sentence
was presented to the speaker. While most of the time stamps are reasonably
accurate, they all derive from the software used to present sentence prompts
and to guide the interviewer through the session. As such, there may be
mistakes in the alignment of sentences to time stamps in some cases, and the
sentence text may not match what the speaker actually said (e.g. if they
stumbled or erred in the reading).

4.0 Known Problems and Difficulties

The following subsections describe some issues where the published data fall
short of expectations.

4.1 Some channels missing from some in-office recording sessions

A few of the multi-channel recording sessions (two in the Mixer-4 CTS set
and three in the Mixer-5 interview set) do not have complete sets of 14
audio channels, due to the occasional failure of one or another microphone
during the affected sessions.
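For sessions affected by this, the gaps can be detected by checking the
channel directories directly. The following is a minimal Python sketch (the
session file-ID in the usage note is a hypothetical placeholder) that
reports which of the 14 channels are absent for a given Mixer-5 interview
session, assuming only the directory layout and naming pattern documented in
section 2.0:

    # Minimal sketch: report which of the 14 microphone channel files are
    # missing for a Mixer-5 interview session. Note that the Mixer-4 CTS
    # file names additionally carry a _CH{nn} suffix (see section 2.0).
    import os

    def missing_channels(fileid, root="data/mc_ivs_flac"):
        absent = []
        for n in range(1, 15):
            ch = "CH%02d" % n
            path = os.path.join(root, ch, fileid + ".flac")
            if not os.path.exists(path):
                absent.append(ch)
        return absent

    # e.g. missing_channels("20070115_103000_PHL_12345") -> [] when all
    # 14 channels were recorded for that (hypothetical) session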
4.2 Significant variability in recording levels on some in-office mics

There were unexpected difficulties in monitoring recording levels on all
channels, in both the Berkeley and Philadelphia office setups, across the
entire calendar duration of the collection project. As a result, a given
channel may show considerable variation in signal level from session to
session.

Some of the microphones in the recording rooms were deliberately placed at
distances beyond the stated performance specifications of the given mic;
this was done so that some channels would serve as "stress tests" for
relevant speech technologies. In addition, there were difficulties on other
channels in balancing conflicting goals: (a) maintain a constant gain
setting on each channel across all sessions; (b) take as much care as
possible in setting levels in order to avoid clipping; (c) employ techniques
during the Mixer-5 phone-call components to explicitly evoke both unusually
loud and unusually soft speech. Of course, there are also wide differences
among speakers in terms of their intrinsic loudness, and during informal
conversation they would often cover a wide dynamic range. Taken together,
these factors tended to lead to lower gain settings than would have been
optimal in many cases.

4.3 Variable performance in transcript reading task

While reading transcript sentences aloud, speakers would sometimes stumble,
restart, or misread the prompt text (adding, skipping or substituting one or
more words); there were also some cases where speakers interjected a remark
or question between the sentence prompts. The recording process was not
stopped for problems of this sort, and no attempt has been made to edit them
out of the audio files. As a result, there will be some difficulty in
aligning the sentence prompt text with the corresponding audio in some
number of sessions. (It is likely that a detailed alignment of text to the
read speech will be made available as a supplemental annotation corpus.)

4.4 Resampling applied to Mixer-4 multi-channel recordings

The 14 channels presented in the "mc_cts_flac" and "mc_ivs_flac" directories
were originally recorded via matched pairs of 8-channel A/D converters
running from a common clock signal. But relative to the 8-kHz ulaw telephone
channel (recorded via the public telephone network), it was found that the
nominal 16000 Hz sample rate applied to channels 01-14 was actually closer
to 15899 Hz. Because the Mixer-4 multi-channel recordings are intended for
use with the ulaw telephone channel data, the discrepancy in sample rates
was measured and confirmed manually over numerous sessions, and we also
confirmed manually that digital resampling from 15899 Hz to 16000 Hz brings
channels 01-14 into correct alignment with the 8-kHz ulaw signal.

This resampling has been applied _only_ to the 14 flac channels of the
Mixer-4 CTS session data. The 14 channels of Mixer-5 interview audio have
been left as-is, because there were no other simultaneous recordings of
these sessions at other sample rates. Note that by leaving the Mixer-5 flac
data as-is, any reference to time stamps used by NIST for evaluation test
segments from these recordings remains correct and appropriate.

However, NIST SRE 2008 test segments drawn from Mixer-4 multi-channel CTS
sessions were originally based on speech-detection time stamps that were
computed from the ulaw telephone data (at the 8000 Hz sample rate). The
sample-rate discrepancy in the multi-channel data (assumed to be 16000 Hz
but actually 15899 Hz) was not recognized until after the evaluation, when
it became evident that short segments extracted near the beginning of a
recording session scored reasonably well, but system performance degraded as
segments were drawn closer to the end of the 10-minute recording, where the
office microphone channels were off by nearly 4 seconds relative to the ulaw
telephone channel. Now that the 14 channels of office-mic data have been
resampled for all Mixer-4 CTS sessions, using the NIST segment
specifications for SRE08 should yield noticeably better scores with the
present version of the data.
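To illustrate the size of the drift that motivated the resampling, here is a
small Python sketch (not part of the corpus) computing the misalignment that
accumulates when audio actually sampled at 15899 Hz is treated as if it were
16000 Hz:

    # Minimal sketch: how far out of alignment the un-resampled office-mic
    # audio drifts, relative to the telephone channel, at a given time into
    # the recording.
    NOMINAL_HZ = 16000.0
    ACTUAL_HZ = 15899.0

    def drift_seconds(t_seconds):
        # Treating ACTUAL_HZ samples per second as if they were NOMINAL_HZ
        # makes events appear earlier by this amount at time t.
        return t_seconds * (NOMINAL_HZ - ACTUAL_HZ) / NOMINAL_HZ

    # At the end of a 10-minute call:
    # drift_seconds(600) -> about 3.79 seconds, i.e. the "nearly 4 seconds"
    # of misalignment noted above.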
4.5 Some speakers completed interview sessions but no phone calls

In order to provide as much useful data for as many speakers as possible,
this release includes 60 Mixer-5 speakers who completed one or more
interview sessions but did not complete any telephone calls.

-------
README prepared 2017-04-21 by David Graff