README FILE FOR:  Mixer-6 Speech Corpus
LDC Catalog ID:

1.0 Introduction

The Mixer-6 Speech corpus comprises recordings made via the public telephone network (total of 4410 calls) and multiple microphones in office-room settings (total of 1425 sessions). The speakers all have English as their native language, and were mostly raised in the Philadelphia region (594 distinct speakers, including LDC staff).

The telephone audio portion of the corpus is similar to earlier Switchboard collections: recruited speakers are connected through a robot operator to carry on casual conversations lasting up to 10 minutes, usually about a daily topic that is announced by the robot operator at the start of the call. The raw digital audio content for each call side is captured as a separate channel, and each full conversation is presented as a 2-channel interleaved audio file, with 8000 samples/second and u-law sample encoding. Each speaker was asked to complete 15 calls.

The multi-microphone portion involves 14 distinct microphones set up identically in two distinct office rooms at the Linguistic Data Consortium. The same speakers who were taking part in the telephone collection were also brought in to record up to 3 sessions on distinct days, with each session lasting 45 minutes on average. The 14 channels were recorded synchronously into separate single-channel files, using 16-bit PCM sample encoding at 16000 samples/second.

Each multi-channel session was guided by an LDC staff person, who used specialized prompting and recording software to manage the session. Activities recorded in each session consisted of four components:

   1. Repeating questions   - usually less than 1 minute
   2. Informal conversation - about 15 minutes
   3. Transcript reading    - usually about 15 minutes
   4. Telephone call        - 10 minutes

More information about the session protocol is provided in the file "mx6_collection_doc.pdf".

The recordings in this corpus were used in NIST Speaker Recognition Evaluation (SRE) test sets for 2010 and 2012. Researchers interested in applying those benchmark test sets should consult the respective NIST Evaluation Plans (available at http://nist.gov/itl/iad/mig/sre.cfm) for guidelines on allowable training data for those tests.


2.0 Directory Structure and Data Files

As described in the introduction, there are two types of audio data, and these are kept in separate directories, as follows:

   data/
     ulaw_sphere/  - contains 4410 8-kHz 2-channel NIST SPHERE files
     pcm_flac/     - contains 14 directories, 1 microphone / directory:
       CH01/       - usually 1425 16-kHz 1-channel flac/ms-wav files
       ...         - for each channel
       CH14/

All audio files have names that indicate the date and time when the recording began, along with other identifying information, as follows:

   ulaw_sphere/:      {yyyymmdd}_{hrmnsc}_{callid}.sph
   pcm_flac/CH{nn}/:  {yyyymmdd}_{hrmnsc}_{room}_{subjid}_CH{nn}.flac

where:

   yyyymmdd is the year, month and day of recording
   hrmnsc   is the hour, minute and second when recording began
   callid   is a unique, incremental number assigned to each call
   room     is either "LDC" or "HRM", indicating which office was used
   subjid   is a numeric identifier assigned to the speaker
   nn       is a two-digit microphone channel identifier (01-14)

When the flac files are uncompressed, they become ms-wav/RIFF files (flac compression does not presently support SPHERE file format). The telephone audio is presented in SPHERE format because (a) this is consistent with other telephone audio releases from the LDC, and (b) flac does not support ulaw sample encoding. The current release of the open-source "sox" utility is able to handle both formats as input; other utilities are available for both flac and SPHERE formats.
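
The naming templates above can be unpacked programmatically. The following is a minimal Python sketch, not part of the corpus tooling: the function name "parse_mx6_name", the returned key names, and the assumption that callid and subjid consist only of digits are illustrative choices based on the field descriptions above.

```python
import re
from datetime import datetime

# Regular expressions derived from the two file name templates above.
# Assumption: callid and subjid are plain digit strings of unspecified width.
SPH_RE = re.compile(r"^(\d{8})_(\d{6})_(\d+)\.sph$")
FLAC_RE = re.compile(r"^(\d{8})_(\d{6})_(LDC|HRM)_(\d+)_CH(\d{2})\.flac$")


def parse_mx6_name(filename):
    """Return the fields encoded in a Mixer-6 audio file name (hypothetical helper)."""
    m = SPH_RE.match(filename)
    if m:
        date, time, callid = m.groups()
        return {
            "kind": "call",
            "start": datetime.strptime(date + time, "%Y%m%d%H%M%S"),
            "callid": callid,
        }
    m = FLAC_RE.match(filename)
    if m:
        date, time, room, subjid, nn = m.groups()
        return {
            "kind": "session",
            "start": datetime.strptime(date + time, "%Y%m%d%H%M%S"),
            "room": room,
            "subjid": subjid,
            "channel": int(nn),
        }
    raise ValueError("not a recognized Mixer-6 audio file name: " + filename)


if __name__ == "__main__":
    # The session prefix below appears in section 4.1; the channel is arbitrary.
    print(parse_mx6_name("20090908_110608_LDC_120333_CH01.flac"))
```

Identifiers are kept as strings so that they can be compared directly with the values in the "*.csv" tables described in section 3.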


3.0 Related Documentation

The "docs" directory contains the following files (along with this "readme.txt" file); their various contents are explained in the subsections below:

   1  mx6_subjs.csv
   2  mx6_calls.csv
   3  mx6_intvs.csv
   4  mx6_ivcomponents.csv
   5  mx6_transcript_sentences.txt
   6  mx6_collection_doc.pdf

All the "*.csv" files are comma-delimited, plain-text "flat files". Care has been taken to ensure that no field values contain commas as part of the field data, so quotation marks are never used around field values, and nothing is done to mark or "escape" certain characters (such as apostrophes). Consecutive commas on a line of data indicate empty or null field values (which tend to be fairly common in the subjects table).

3.1 Subjects table (mx6_subjs.csv)

Each row in this table provides demographic information about one of the speakers in the collection. LDC staffers who led the interview sessions and were involved in many of the phone calls that were conducted during those sessions are included in this table.

    1  subjid        - numeric identifier, links to calls and interviews
    2  sex           - M or F
    3  yob           - year of birth
    4  edu_years     - years of formal education
    5  edu_degree    - highest education degree earned
    6  edu_deg_yr    - year in which highest degree was earned
    7  edu_contig    - Y or N: were all edu_years spent contiguously?
    8  esl_age       - for ESL speakers, age when English was learned
    9  ntv_lg        - native language (ISO 639-3 code)
   10  oth_lgs       - other languages (ISO 639-3 codes, '/'-separated)
   11  occup         - occupation
   12  cntry_born    - country where born
   13  state_born    - state where born
   14  city_born     - city where born
   15  cntry_rsd     - country where raised
   16  state_rsd     - state where raised
   17  city_rsd      - city where raised
   18  ethnic        - ethnicity
   19  smoker        - Y or N
   20  ht_cm         - height in centimeters
   21  wt_kg         - weight in kilograms
   22  mother_born   - country (state city) where mother was born
   23  mother_raised - country (state city) where mother was raised
   24  mother_lang   - mother's native language
   25  mother_edu    - mother's years of formal education
   26  father_born   - country (state city) where father was born
   27  father_raised - country (state city) where father was raised
   28  father_lang   - father's native language
   29  father_edu    - father's years of formal education

3.2 Calls table (mx6_calls.csv)

Each row in this table provides the available information about a 2-channel telephone conversation. A couple of fields ("lang", "eng_stat") are actually irrelevant to Mixer-6, because this was an English-only collection, but the field inventory is fixed in order to provide consistency with other telephone corpus releases from the LDC.

    1  call_id    - numeric identifier, links to audio file name
    2  call_date  - links to audio file name
    3  lang       - language in which the conversation was conducted
    4  eng_stat   - one of: AllENG, SomeENG, NoENG
    5  sid_a      - subjid of the speaker on channel A
    6  phid_a     - telephone ID on channel A
    7  ph_categ_a - one of: M (main phone), O (other phone)
    8  phtyp_a    - one of: 1 (cell phone), 2 (cordless), 3 (standard)
    9  phmic_a    - one of: 1 (spkr-phone), 2 (headset), 3 (earbud), 4 (handheld)
   10  cnvq_a     - audit judgment of conversation quality (Good,Acceptable,Unsuitable)
   11  sigq_a     - audit judgment of signal quality (Good,Acceptable,Unsuitable)
   12  tbug_a     - Y or N: auditor found a technical problem on channel A
   13-20          - same as 5-12, applied to channel B
   21  topic      - numeric ID of the topic announced to the callers
                    (refer to mx6_collection_doc.pdf for the numbered list of topics)

3.3 Interviews table (mx6_intvs.csv)

Each row in this table provides available information about a single in-office multi-channel recording session. In field 5 below, "high_ve" and "low_ve" refer to high and low vocal effort call conditions, respectively.

    1  subj_id         - numeric identifier, links to subjects table
    2  session_fileid  - audio file name, includes date, time, location
    3  duration        - in seconds, for the entire session recording
    4  interviewer_id  - subjid of LDC staff person conducting the session
    5  call_type       - one of: high_ve, low_ve, cell, normal
    6  call_id         - numeric identifier, links to calls table
    7  call_chan       - A or B: side of 2-channel ulaw audio matching IV audio
    8  wb_tconv_offset - seconds from start of IV audio where call begins

3.4 Interview components table (mx6_ivcomponents.csv)

Each row in this table provides time stamps for partitioning a single in-office multi-channel recording session into its distinct components.

The time stamps provided for the call components are all based on careful alignment between the 10-minute ulaw telephone audio and the 45-minute session recording, so when the call portions of the session audio files are extracted using the stated time stamps, the resulting excerpt should align within a few milliseconds with the corresponding ulaw channel.

The time stamps for the other components (rptq, intv, rdtr) are based on session log files (or have been set by manual annotation, in about 50 cases where log files were not available). The time stamps from log files are prone to have varying margins of accuracy relative to the actual start and end of the given component. (See the discussion in section 4.3 below about the "rdtr" time stamps.)

    1  session_id - matches session_fileid in Interviews table
    2  rptq_bgn   - offset to start of "Repeating Questions"
    3  rptq_end
    4  intv_bgn   - offset to start of "Informal Conversation"
    5  intv_end
    6  rdtr_bgn   - offset to start of "Transcript Reading"
    7  rdtr_end
    8  call_bgn   - offset to start of telephone call
    9  call_end
   10  call_type  - matches call_type field in Interviews table

3.5 List of Transcript sentences (mx6_transcript_sentences.txt)

This is simply a list of the 335 sentence prompts displayed to speakers in order to be read aloud. In every session, this component always presented the list in the order shown (one sentence at a time), and always started at the first sentence. If the speaker got to the end of the list quickly, with time remaining in the session schedule for transcript reading, the list was simply presented again, starting over at the first sentence. (So, some sessions contain more than 335 sentence readings in this component, and in this case, sentences at the start of the list will have been read twice.)
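
The wrap-around behavior described above amounts to simple modular arithmetic. The following Python sketch is illustrative only: the function name "prompt_number" is hypothetical, and it assumes that prompts were presented strictly in list order with no prompt skipped or repeated by the session software.

```python
NUM_PROMPTS = 335  # number of sentence prompts in the list, as stated above


def prompt_number(presentation_index):
    """Map the 1-based position of a prompt within the reading component
    to its sentence number in the list (hypothetical helper)."""
    if presentation_index < 1:
        raise ValueError("presentation_index must be 1 or greater")
    # After the last sentence, presentation starts over at sentence 1.
    return (presentation_index - 1) % NUM_PROMPTS + 1


if __name__ == "__main__":
    for k in (1, 335, 336, 340):
        print(k, "->", prompt_number(k))
```

Under these assumptions, the 336th prompt presented in a session is sentence 1 again.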

Note that the sentences in this text file are numbered, but during the recording sessions, the numbers were not presented to the people being recorded, and so were not spoken; speakers read only the sentence text.

3.6 Corpus collection specifications (mx6_collection_doc.pdf)

This pdf file provides more detailed information about the Mixer 6 collection project: the procedures for recruiting and recording, microphone specifications, interview and auditing protocols, etc.


4.0 Known Problems and Difficulties

The following subsections list some issues where the data being published fall short of expectations.

4.1 Some channels missing from some in-office recording sessions

A few of the interview sessions do not have complete sets of 14 audio channels in the pcm_flac subdirectories:

   20090908_110608_LDC_120333  lacks CH05
   20090728_150523_LDC_120218  lacks CH08
   20091123_095708_HRM_120721  lacks CH08
   20090821_090546_LDC_120274  lacks CH11 and CH12

4.2 Lack of 2-channel ulaw recordings for some in-office phone calls

There were 10 in-office multi-channel recording sessions in which the telephone-call component was not successful, either because a telephone conversation never took place (e.g. due to technical problems in the recording setup), or because there was a failure or loss of the corresponding 2-channel ulaw recording. These 10 cases have empty values in fields 5-8 of the interviews table (mx6_intvs.csv); in the interview components table (mx6_ivcomponents.csv), the call begin and end offset values are set to zero and the "call_type" field is set to "no_call".

4.3 Variable performance in transcript reading task

While reading transcript sentences aloud, speakers would sometimes stumble, restart, or misread the prompt text (adding, skipping or substituting one or more words); there were also some cases where speakers interjected a remark or question in between the sentence prompts.
The recording process would not be stopped for problems of this sort, and no attempt has been made to edit them out of the audio files. As a result, there will be some difficulty in trying to align the sentence prompt text with the corresponding audio portion in some number of sessions. (It's likely that a detailed alignment of text to the read speech will be made available as a supplemental annotation corpus.)

Corpus users should also expect some amount of variability in the margins between the time stamps supplied for the "rdtr_bgn"/"rdtr_end" offsets in the Interview Components table, and the actual start and end points of the first and last prompted sentences, respectively.

The "rdtr_end" value is especially prone to overstating the duration of the transcript reading task. There are about 400 sessions where the interval between the stated "rdtr_end" offset and the "call_bgn" offset is less than 20 seconds; for about half of these, the interval is less than 1 second (effectively zero). Yet the actual time needed after the last sentence prompt, to set up and begin the subsequent phone call, could extend up to a few minutes. As mentioned above (section 3.4), the start-times for the call components can always be taken as accurate.

4.4 Relatively low recording levels in some channels or some sessions

Some of the microphones in the recording rooms were purposefully set at distances beyond the stated performance specifications for the given mic. This was done to cause some channels to serve as "stress tests" for relevant speech technologies.

In addition, there were difficulties with other channels in balancing between conflicting goals:

   (a) always maintain a constant gain setting on each channel across all sessions;
   (b) take as much care as possible in setting levels in order to avoid clipping;
   (c) employ techniques during the phone call components to explicitly evoke both unusually loud and unusually soft speech.

Of course, there are also wide differences among speakers in terms of their intrinsic loudness, and during informal conversation, they would often cover a wide dynamic range. Taken together, these factors tended to lead to relatively lower gain settings than would have been optimal in many cases.

4.5 Resampling applied to multi-channel recordings

The 14 channels presented in the "pcm_flac" directory were originally recorded via matched pairs of 8-channel A/D converters, running from a common clock signal. But relative to the 8-kHz ulaw telephone channel (recorded via the public telephone network), and a pair of "reserved" channels recorded via a separate, 22-kHz A/D device, it was found that the nominal 16000 Hz sample rate applied to channels 01-14 was actually closer to 15899 Hz.

This was measured and confirmed manually over numerous sessions, and we also manually confirmed that by doing a digital resampling from 15899 to 16000, the alignment of channels 01-14 to both the 8-kHz ulaw and 22-kHz signals was correct. This resampling has been applied to all the pcm_flac data, and should have no more than a negligible effect on signal analysis.

-------
README prepared 2012-08-29 by David Graff