README FILE FOR:  Mixer 7 Spanish Speech
LDC Catalog ID:   LDC2023S04

1.0 Introduction

The Mixer-7 Spanish Speech Corpus comprises recordings made via the
public telephone network (total of 2583 calls) and multiple microphones
in office-room settings (total of 678 sessions conducted by LDC staff).
All recruited speakers and some LDC staff have Spanish as their native
language (214 distinct speakers, including LDC staff). Recordings took
place between December 2010 and March 2012.

The telephone audio portion of the corpus is similar to earlier Mixer
collections: recruited speakers are connected through a robot operator
to carry on casual conversations lasting up to 10 minutes, usually about
a daily topic that is announced by the robot operator at the start of
the call. The raw digital audio content for each call side is captured
as a separate channel, and each full conversation is presented as a
2-channel interleaved audio file, with 8000 samples/second and u-law
sample encoding. Each speaker was asked to complete 15 calls.

The multi-microphone portion involves 14 distinct microphones set up
identically in two distinct office rooms at the Linguistic Data
Consortium. The same speakers who were taking part in the telephone
collection were also brought in to record up to 4 sessions on distinct
days, with each session lasting up to 90 minutes, typically producing
75 minutes of speech of various types. The 14 channels were recorded
synchronously into separate single-channel files, using 16-bit PCM
sample encoding at 16000 samples/second.

Each multi-channel session was guided by an LDC staff person, who used
specialized prompting and recording software to manage the session.
Activities recorded in each session consisted of seven components:

  1. Repeating questions - usually less than 1 minute
  2. Informal conversation, "near" condition - 15 minutes
  3. Telephone call, low or high vocal effort condition - 10 minutes
  4. Transcript reading - 15 minutes
  5. Telephone call, cell or speaker phone condition - 10 minutes
  6. Informal conversation, "far" condition - 15 minutes
  7. Telephone call, varied condition - 10 minutes

More information about the session protocol is provided in the file
"mx7_collection_doc.pdf".

The recordings in this corpus were used in the NIST Speaker Recognition
Evaluation (SRE) test set for 2012. Researchers interested in applying
that benchmark test set should consult the corresponding NIST Evaluation
Plan (available at http://nist.gov/itl/iad/mig/sre.cfm) for guidelines
on allowable training data for that test.

2.0 Directory Structure and Data Files

As described in the introduction, there are two types of audio data,
and these are kept in separate directories, as follows:

  data/
    ulaw_sphere/ - contains 2583 8-kHz 2-channel NIST SPHERE files
    pcm_flac/    - contains 14 directories, 1 microphone / directory:
      CH01/      - up to 4159 16-kHz 1-channel flac/ms-wav files
      ...        - for each channel
      CH14/

All audio files have names that indicate the date and time when the
recording began, along with other identifying information, as follows:

  ulaw_sphere/:     {yyyymmdd}_{hrmnsc}_{callid}.sph
  pcm_flac/CH{nn}/: {yyyymmdd}_{hrmnsc}_{room}_{comp}_{subjid}_CH{nn}.flac

where:

  yyyymmdd  is the year, month and day of recording
  hrmnsc    is the hour, minute and second when recording began
  callid    is a unique, incremental number assigned to each call
  room      is either "LDC" or "HRM", indicating which office was used
  comp      is an abbreviated label for the interview component
            (see below)
  subjid    is a numeric identifier assigned to the speaker
  nn        is a two-digit microphone channel identifier (01-14)

When the flac files are uncompressed, they become ms-wav/RIFF files
(flac compression does not presently support SPHERE file format).

For each 14-channel set of original full-length interview recordings,
the various component segments have been extracted into separate audio
files, leaving out the transition periods between components.
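
The two naming templates above can be decoded mechanically. The
following is a minimal Python sketch, not part of the corpus tooling.
The function name parse_audio_name is hypothetical, the digit widths of
{callid} and {subjid} are not stated in this README (any run of digits
is accepted), and the {comp} pattern simply assumes lowercase letters
with an optional trailing digit.

```python
import re
from datetime import datetime

# Patterns mirror the two file-name templates in section 2.0.
# Assumption: {callid} and {subjid} are runs of digits of unspecified width.
SPH_RE = re.compile(r"^(?P<date>\d{8})_(?P<time>\d{6})_(?P<callid>\d+)\.sph$")
FLAC_RE = re.compile(
    r"^(?P<date>\d{8})_(?P<time>\d{6})_(?P<room>LDC|HRM)_"
    r"(?P<comp>[a-z]+\d?)_(?P<subjid>\d+)_CH(?P<chan>\d{2})\.flac$"
)


def parse_audio_name(name):
    """Return a dict of the fields encoded in a file name, or None."""
    for kind, pattern in (("sph", SPH_RE), ("flac", FLAC_RE)):
        m = pattern.match(name)
        if m:
            fields = m.groupdict()
            fields["kind"] = kind
            # Combine the date and time parts into one recording start time.
            stamp = fields.pop("date") + fields.pop("time")
            fields["start"] = datetime.strptime(stamp, "%Y%m%d%H%M%S")
            return fields
    return None
```

For example, parse_audio_name("20101201_150149_HRM_rptq_121169_CH03.flac")
yields room "HRM", comp "rptq", subjid "121169" and chan "03". Names
that fit neither template return None, which makes stray files easy to
spot.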
As a result, within each CH{nn} directory, there is a set of up to seven
files with the same "{yyyymmdd}_{hrmnsc}_{room}" and "{subjid}_CH{nn}"
portions in their file names, and different strings for the "{comp}"
portion, as follows:

  ivfr       -- interview, "far" condition
  ivnr       -- interview, "near" condition
  phce[123]  -- phone call, "cell" condition
  phhv[12]   -- phone call, "high vocal effort" condition
  phlv[123]  -- phone call, "low vocal effort" condition
  phsp[12]   -- phone call, "speaker-phone" condition
  rdtr       -- transcript reading
  rptq       -- repeating questions

The digits appearing next to the various "ph.." labels reflect the
relative order of the phone call component within the full session.

The telephone audio is presented in SPHERE format because (a) this is
consistent with other telephone audio releases from the LDC, and (b)
flac does not support ulaw sample encoding. The current release of the
open-source "sox" utility is able to handle both formats as input;
other utilities are available for both flac and SPHERE formats.

3.0 Related Documentation

The "docs" directory contains the following files (along with this
"readme.txt" file); their various contents are explained in the
subsections below:

  1 mx7spa_subjs.csv
  2 mx7spa_calls.csv
  3 mx7spa_ivcomponents.csv
  4 mx7_transcript_sentences.txt
  5 mx7_collection_doc.pdf

All the "*.csv" files are comma-delimited, plain-text "flat files".
Care has been taken to ensure that no field values contain commas as
part of the field data, so quotation marks are never used around field
values, and nothing is done to mark or "escape" certain characters
(such as apostrophes). Consecutive commas on a line of data indicate
empty or null field values (which tend to be fairly common in the
subjects table).

3.1 Subjects table (mx7spa_subjs.csv)

Each row in this table provides demographic information about one of
the speakers in the collection.
LDC staffers who led the interview sessions and were involved in many
of the phone calls that were conducted during those sessions are
included in this table.

   1 subjid        - numeric identifier, links to calls and interviews
   2 sex           - M or F
   3 yob           - year of birth
   4 edu_years     - years of formal education
   5 edu_degree    - highest education degree earned
   6 edu_deg_yr    - year in which highest degree was earned
   7 edu_contig    - Y or N: were all edu_years spent contiguously?
   8 esl_age       - for ESL speakers, age when English was learned
   9 ntv_lg        - native language (ISO 639-3 code)
  10 oth_lgs       - other languages (ISO 639-3 codes, '/'-separated)
  11 occup         - occupation
  12 cntry_born    - country where born
  13 state_born    - state where born
  14 city_born     - city where born
  15 cntry_rsd     - country where raised
  16 state_rsd     - state where raised
  17 city_rsd      - city where raised
  18 ethnic        - ethnicity
  19 smoker        - Y or N
  20 ht_cm         - height in centimeters
  21 wt_kg         - weight in kilograms
  22 mother_born   - country (state city) where mother was born
  23 mother_raised - country (state city) where mother was raised
  24 mother_lang   - mother's native language
  25 mother_edu    - mother's years of formal education
  26 father_born   - country (state city) where father was born
  27 father_raised - country (state city) where father was raised
  28 father_lang   - father's native language
  29 father_edu    - father's years of formal education

3.2 Calls table (mx7spa_calls.csv)

Each row in this table provides the available information about a
2-channel telephone conversation. A couple of fields ("lang",
"eng_stat") are actually irrelevant to Mixer-6, because this was an
English-only collection, but the field inventory is fixed in order to
provide consistency with other telephone corpus releases from the LDC.
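
Given the 21 columns enumerated in this section, one row of the calls
table can be split into call-level fields plus one record per channel.
This is only a sketch under stated assumptions: the function name
parse_call_row is hypothetical, the B-side names are inferred from the
"same as 5-12" note, and the position of the "*" marker on a speaker ID
is not specified here, so the marker is removed wherever it occurs.

```python
# Per-side fields, in the order of columns 5-12 (channel A) and 13-20 (channel B).
SIDE_FIELDS = ["sid", "phid", "ph_categ", "phtyp", "phmic", "cnvq", "sigq", "tbug"]


def parse_call_row(line):
    """Split one data row of the calls table into call and per-side fields."""
    # The csv files use no quoting or escaping, so a plain split is enough.
    cols = line.rstrip("\r\n").split(",")
    if len(cols) != 21:
        raise ValueError(f"expected 21 fields, got {len(cols)}")
    call = dict(zip(["call_id", "call_date", "lang", "eng_stat"], cols[:4]))
    call["topic"] = cols[20]
    for side, start in (("a", 4), ("b", 12)):
        rec = dict(zip(SIDE_FIELDS, cols[start:start + 8]))
        # An asterisk flags a speaker ID not fully confirmed by audit (section 4.5).
        rec["sid_confirmed"] = "*" not in rec["sid"]
        rec["sid"] = rec["sid"].replace("*", "")
        call[side] = rec
    return call
```

Keeping the two sides as separate records lets the same code handle a
speaker regardless of which channel they were on. The length check
guards against rows that do not match the documented inventory.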

   1 call_id     - numeric identifier, links to audio file name
   2 call_date   - links to audio file name
   3 lang        - language in which the conversation was conducted
   4 eng_stat    - one of: AllENG, SomeENG, NoENG
   5 sid_a       - subjid of the speaker on channel A
                   (may be marked by "*"; see 4.5 below)
   6 phid_a      - telephone ID on channel A
   7 ph_categ_a  - one of: M (main phone), O (other phone)
   8 phtyp_a     - one of: 1 (cell phone), 2 (cordless), 3 (standard)
   9 phmic_a     - one of: 1 (spkr-phone), 2 (headset), 3 (earbud),
                   4 (handheld)
  10 cnvq_a      - audit judgment of conversation quality
                   (Good,Acceptable,Unsuitable)
  11 sigq_a      - audit judgment of signal quality
                   (Good,Acceptable,Unsuitable)
  12 tbug_a      - Y or N: auditor found a technical problem on channel A
  13-20          - same as 5-12, applied to channel B
  21 topic       - numeric ID of the topic announced to the callers

3.3 Interview components table (mx7spa_ivcomponents.csv)

Each row in this table provides time stamps for partitioning a single
in-office multi-channel recording session into its distinct components.

The time stamps provided for the call components are all based on
careful alignment between the 10-minute ulaw telephone audio and the
45-minute session recording, so when the call portions of the session
audio files are extracted using the stated time stamps, the resulting
excerpt should align within a few milliseconds with the corresponding
ulaw channel.

The time stamps for the other components (rptq, intv, rdtr) are based
on session log files (or have been set by manual annotation, in about
50 cases where log files were not available). The time stamps from log
files are prone to have varying margins of accuracy relative to the
actual start and end of the given component. (See the discussion in
section 4.3 below about the "rdtr" time stamps.)

   1 comp_type       - one of: rptq, ivnr, ivfr, phce, phsp, phhv, phlv,
                       rdtr
   2 date            - 8-digit year-month-day (YYYYMMDD)
   3 iv_bgntime      - 6-digit hour-minute-second (HRMNSC)
   4 place           - recording-room, one of: HRM, LDC
   5 comp_id         - same as comp_type, incl. one-digit ID# for ph..
                       types (e.g. "phce1")
   6 subj_id         - six-digit ID# references col.1 in mx7spa_subjs.csv
   7 duration        - in seconds
   8 interlocutor_id - six-digit ID# references col.1 in mx7spa_subjs.csv
   9 language        - ENG, SPA or MXD
  10 call_id         - for ph.. types, 4-digit ID# references col.1 in
                       mx7spa_calls.csv
  11 call_chan       - for ph.. types, "A" or "B"
  12 ivchans_missing - semi-colon-delimited list of missing channels,
                       if any

Columns 2-6 can be concatenated with underscore characters to form the
file-ID (minus the channel specification) for the corresponding set of
pcm_flac audio files -- e.g. taking the first row following the table
header:

  rptq,20101201,150149,HRM,rptq,121169,23.363,121114,ENG,na,na,na,na,all_present

columns 2-6 can be joined as follows to form the file-ID:

  20101201_150149_HRM_rptq_121169

Since column 12 in that row shows "all_present", we can locate 14
channel files for this interview component (repeating questions for
subj_id 121169) as follows:

  ./data/pcm_flac/CH*/20101201_150149_HRM_rptq_121169*

In cases where one or more channels are missing (due to problems in the
recording session), the missing channels are listed in column 12; if
more than one channel is missing for a given component, the missing
channels are conjoined with semi-colons in ascending order (e.g.
"CH08;CH11;CH13").

3.5 List of Transcript sentences (mx7_transcript_sentences.txt)

This is simply a list of the 335 sentence prompts displayed to speakers
in order to be read aloud. All the sentences are in English, and have
been drawn from transcripts of spontaneous conversations in earlier LDC
data collections.
In every session, the transcript-reading component always presented
this list in the order shown (one sentence at a time), and always
started at the first sentence. If the speaker got to the end of the
list quickly, with time remaining in the session schedule for
transcript reading, the list was simply presented again, starting over
at the first sentence. (So, some sessions may contain more than 335
sentence readings in this component, and in this case, sentences at the
start of the list will have been read twice.)

3.6 Corpus collection specifications (mx7_collection_doc.pdf)

This pdf file provides more detailed information about the Mixer 6
collection project: the procedures for recruiting and recording,
microphone specifications, interview and auditing protocols, etc.

4.0 Known Problems and Difficulties

The following subsections list some issues where the data being
published fall short of expectations.

4.1 Some channels missing from some in-office recording sessions

As detailed in the mx7spa_ivcomponents.csv table (col.12,
"ivchans_missing"), there were a number of sessions where one or more
channels failed to record as intended. In some cases a channel failure
occurred mid-session, such that all channels were recorded for initial
components in the session, and some were missing for later components.
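
Tallies of the "ivchans_missing" values can be recomputed from the
table itself. The Python sketch below is illustrative only: the
function names are hypothetical, and it reads the last field of each
row rather than a fixed index, because the sample row quoted in section
3.3 carries more comma-separated fields than the twelve listed columns.

```python
import csv
from collections import Counter


def missing_channels(row):
    """Return the list of missing channel labels for one table row."""
    value = row[-1].strip()  # assumption: ivchans_missing is the last field
    return [] if value == "all_present" else value.split(";")


def tally_missing(rows):
    """Count rows per missing channel and per co-occurrence pattern."""
    per_channel, per_pattern = Counter(), Counter()
    for row in rows:
        chans = missing_channels(row)
        if not chans:
            per_channel["all_present"] += 1
            continue
        per_channel.update(chans)            # one count per channel named
        per_pattern[";".join(chans)] += 1    # one count per whole pattern
    return per_channel, per_pattern


def read_rows(path):
    """Read the data rows of a comma-delimited table, skipping the header."""
    with open(path, newline="") as f:
        reader = csv.reader(f)
        next(reader, None)
        return [r for r in reader if r]
```

A call such as tally_missing(read_rows("docs/mx7spa_ivcomponents.csv"))
returns two counters, one keyed by channel (plus "all_present") and one
keyed by the full semi-colon-joined pattern.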

The following list summarizes how many session components are lacking a
given channel (how many rows of the "ivcomponents" table contain the
given channel in col.12):

  4075 all_present (components with no missing channels)
    17 CH01
    10 CH02
     7 CH03
     7 CH04
    39 CH05
    20 CH06
     8 CH07
    23 CH08
    27 CH09
     8 CH10
    25 CH11
    25 CH12
    22 CH13
     9 CH14

There was also variation in the number and distribution of failed
channels in a given session; the following list summarizes the patterns
of co-occurrences in channel failures:

    10 CH01;CH02
     7 CH01;CH05;CH07;CH09;CH10
     7 CH03;CH05;CH08;CH11;CH12
     7 CH04
    10 CH05
     7 CH05;CH06;CH09
     1 CH05;CH07;CH10;CH11;CH12
     7 CH05;CH09;CH12
     6 CH06;CH08;CH09;CH13
     7 CH06;CH11
     3 CH08;CH11;CH12;CH13;CH14
     7 CH08;CH11;CH13
     7 CH12
     6 CH13;CH14

4.2 Variable performance in transcript reading task

While reading transcript sentences aloud, speakers would sometimes
stumble, restart, or misread the prompt text (adding, skipping or
substituting one or more words); there were also some cases where
speakers interjected a remark or question in between the sentence
prompts. Another issue was a relatively low level of English reading
proficiency among some native Spanish speakers, causing the transcript
reading task to be difficult and error-prone.

The recording process would not be stopped for problems of this sort,
and no attempt has been made to edit them out of the audio files. As a
result, there will be some difficulty in trying to align the sentence
prompt text with the corresponding audio portion in some number of
sessions.

4.3 Relatively low recording levels in some channels or some sessions

Some of the microphones in the recording rooms were purposefully set at
distances beyond the stated performance specifications for the given
mic. This was done to cause some channels to serve as "stress tests"
for relevant speech technologies.
In addition, there were difficulties with other channels in balancing
between conflicting goals:

  (a) always maintain a constant gain setting on each channel across
      all sessions;
  (b) take as much care as possible in setting levels in order to avoid
      clipping;
  (c) employ techniques during the phone call components to explicitly
      evoke both unusually loud and unusually soft speech.

Of course, there are also wide differences among speakers in terms of
their intrinsic loudness, and during informal conversation, they would
often cover a wide dynamic range. Taken together, these factors tended
to lead to relatively lower gain settings than would have been optimal
in many cases.

4.4 Resampling applied to multi-channel recordings

The 14 channels presented in the "pcm_flac" directory were originally
recorded via matched pairs of 8-channel A/D converters, running from a
common clock signal. But relative to the 8-kHz ulaw telephone channel
(recorded via the public telephone network), and a pair of "reserved"
channels recorded via a separate, 22-kHz A/D device, it was found that
the nominal 16000 Hz sample rate applied to channels 01-14 was actually
closer to 15899 Hz.

This was measured and confirmed manually over numerous sessions, and we
also manually confirmed that by doing a digital resampling from 15899
to 16000, the alignment of channels 01-14 to both the 8-kHz ulaw and
22-kHz signals was correct. This resampling has been applied to all the
pcm_flac data, and should have no more than a negligible effect on
signal analysis.

4.5 Some call-side speaker-IDs not fully confirmed by audits

As indicated above in the description of the mx7spa_calls table, some
rows have an asterisk attached to the A- or B-side subj_id value. These
are cases where the given call side was not audited at all, or the
audit judgment was uncertain as to actual speaker-ID. This affects
either side A or side B (never both) in 29 calls.

-------
README prepared 2022-04-08 by David Graff