README FILE FOR:  Mixer 3 Speech
LDC_Catalog-ID:   LDC2023S02

1.0 Introduction

This release of the Mixer 3 corpus comprises 19,595 telephone conversations involving 3,875 speakers, who used, in total, up to 26 distinct languages. Residents of the continental United States and Canada were recruited and enrolled via a web site at the LDC, and were given financial incentives to complete 15 calls. For subjects who were fluent in languages other than English, additional incentives were provided to complete at least one non-English call.

The design goals of the collection project were to supply training and test data for sponsored evaluations of speech technology for both speaker recognition and language recognition. These evaluations were conducted by NIST, and the test sets have been published by the LDC as the "NIST Speaker Recognition Evaluation" (SRE) and "NIST Language Recognition Evaluation" (LRE) corpora.

The following sections summarize the methods used for call collection and auditing, and describe the various tables (found in the "docs" directory) that provide more details about the calls, subjects, and languages.

2.0 Collection

The web site for Mixer 3 recruitment presented an enrollment form where subjects provided contact information and demographic data, and filled in a selection menu for times of day when they would be available to receive calls. Speakers of languages other than English were encouraged to enroll, without limitation as to what their native language was, but all enrollees needed to be able to converse in English as well.

LDC staff did an initial vetting pass over the enrollment requests, contacting each enrollee to confirm the enrollment information. Subjects were then activated for participation, using an assigned 5-digit personal identification number (PIN), which the subject would use for verification at the beginning of each call.

Call collection was implemented on a robot-operator platform at the LDC, connected to a digital T-1 trunk line with 24 channels for accessing the public telephone network. Some channels were devoted to handling incoming calls from subjects, and the remaining channels were used by the platform for initiating outbound calls. The robot-operator application was configured to conduct dial-outs during specified time-windows of availability. It also accepted dial-ins during a larger time window.

For both outbound and inbound calls, subjects were asked to enter their PIN for verification. When this succeeded, subjects were prompted to record their first and last name; next they used their telephone key-pad to answer two questions about the type of phone they were using, and then were presented with a pre-recorded "topic of the day" and placed on hold. (Twenty-five topics were prepared, and these were rotated daily; speakers were requested, but not required, to stick to the given topic throughout the call.) As soon as any two channels were in the "on-hold" state, the application bridged the two channels to create a two-party circuit; a "welcome to both of you" announcement was played to both channels, and then recording began.

For each channel, the digital stream from the T-1 line (8-bit mu-law encoded audio at 8000 samples/second) was stored to a disk file on the platform, in addition to being passed through the circuit to the other call participant. Nearly ten minutes later, if the conversation was still in progress, a prompt was played to both parties to indicate that the recording was nearly complete; at 10 minutes, the recording ceased and the call was ended automatically. The two single-channel files for each call were uploaded to an LDC-internal file server, and combined into a 2-channel audio file with a NIST SPHERE header.
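
As a rough illustration of the audio format just described, the following Python sketch expands 8-bit mu-law codes to linear sample values and merges two single-channel streams into one 2-channel stream. It is only a sketch under stated assumptions, not the procedure used at the LDC: it assumes standard G.711 mu-law coding, headerless single-channel input, and sample-by-sample interleaving of the two channels, and it does not write a SPHERE header. The function names are made up for this example.

```python
SAMPLE_RATE = 8000        # samples per second per channel (from the text above)
MAX_CALL_SECONDS = 600    # recording stopped automatically at 10 minutes


def ulaw_to_linear(code):
    """Expand one 8-bit G.711 mu-law code to a signed linear sample value."""
    u = ~code & 0xFF                    # mu-law bytes are stored bit-inverted
    magnitude = (((u & 0x0F) << 3) + 0x84) << ((u >> 4) & 0x07)
    magnitude -= 0x84                   # remove the encoder bias
    return -magnitude if u & 0x80 else magnitude


def interleave(chan_a, chan_b):
    """Merge two single-channel byte streams into A,B,A,B,... sample order.

    Assumption: one byte per sample; stops at the end of the shorter stream.
    """
    return bytes(s for pair in zip(chan_a, chan_b) for s in pair)


def max_bytes_per_channel():
    """Upper bound on sample data per channel: one byte per sample, 10 minutes."""
    return SAMPLE_RATE * MAX_CALL_SECONDS
```

Under these assumptions, a full-length call side holds at most 8000 x 600 = 4,800,000 bytes of sample data, and the combined 2-channel stream holds twice that.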

(The recordings of subjects saying their names prior to each conversation were also uploaded for use during the auditing process; to maintain subject anonymity, these recordings are kept secure at the LDC.)

Various techniques were employed to maximize the pairings of speakers of the same language, and to enable each non-English speaker to complete at least one call in their native language. Subjects were instructed to determine, at the beginning of each conversation, if they both spoke the same non-English language, and if so, the conversation would proceed in that language. Otherwise, they would converse in English.

3.0 Auditing

As calls came in, some initial signal processing was done to check for the presence of speech, and some recordings were rejected automatically on this basis. For calls with sufficient speech, a two-stage audit procedure was carried out, summarized in the following outline:

Stage 1: Assess quality and language in the full conversation

  For each call:
   - quickly scan the waveform display from beginning to end, then listen to brief portions (15-20 seconds) from the beginning, middle and end, and mark the following:
      - was the conversation All English, Some English, or No English?
      - were there any problems (extended silences, excessive noise, change of speaker on either channel)?

Stage 2: Confirm speaker identity on each call side

  Auditor is shown a list of speaker-PINs with call sides yet to be audited, and picks the next available PIN on the list; this brings up enrollment information for that PIN (subject's full name, sex, native language, age), and the list of call sides for that PIN; each list item provides playback access (audio only, no waveform display) for the "say your name" recording and for pre-selected snippets from the call-side.

  The auditor:
   - Listens to the name recordings in succession
   - Listens to pre-selected snippets
   - Marks each list item as "ID OK" or "wrong speaker", as appropriate

For a subset of the calls that were identified as having "No English", a third stage of auditing was conducted, by auditors who were fluent in the languages involved, to verify the identity of the language being spoken, making the call side available for use in LRE.

Regarding the Stage 2 auditing: over the course of the collection, which spanned 14 months, various techniques were developed to resolve the cases of "wrong speaker" as far as possible, and auditors were often able to assign a correct PIN / subject-ID on these call sides, based on the phone number or the name that was spoken in the initial "say your name" recording. Also, in the years following the collection, as some call sides were used in NIST SRE test sets and yielded unexpected outcomes from SRE systems, some initial audit decisions were reviewed and corrected.

Three predominant issues leading to "wrong speaker" decisions and unsuitable SRE results were:

 - some individuals using some other valid PIN by mistake when dialing in (i.e. due to a slip while keying in the number), causing the call side to be wrongly attributed to someone else;
 - some individuals enrolling multiple times (using a different name each time) in order to earn more money;
 - some individuals providing their PIN to other people, who may or may not have also been enrolled, so that those others recorded some calls in place of the actual PIN assignee.

The first category was relatively easy to fix. The second category was either noticed by auditors or exposed by unusual SRE test results; as a result of repairing the speaker labeling for these cases, a few subjects are now on record as being involved in 50 to 100 calls each.
In the third category, the individuals who were actually recorded, while clearly not the person assigned to the given PIN, often couldn't be reliably identified as some other enrolled subject, and may have never enrolled; these cases account for the 267 call sides in this release where the subj_id value is "unknown".

4.0 Documentation

The following subsections describe the contents of the "docs" directory. In general, file names ending in ".tab" are tab-delimited tables, and those ending in ".csv" are comma-delimited. In the latter, there are no quotation marks or escape-characters (so simply splitting .csv lines on commas works the same as splitting .tab lines on tabs).

4.1 Interspeech_2007_Mixer_345.pdf

This is a paper describing the Mixer 3 corpus (along with Mixer 4 and 5).

4.2 mx3_subj_info.csv

This is a comma-delimited, 29-column flat table file with a header line and one line per subject. The column headings are listed below.

Note that the column inventory represents a set of demographic fields that has accumulated over the span of numerous telephone collection projects at the LDC. Several of the columns were not included in the recruitment form for Mixer 3, and are therefore mostly or completely empty in this table. In some cases, however, Mixer 3 subjects were re-enrolled for subsequent Mixer collection projects, and provided additional demographic data. For this reason, and for consistency with other Mixer corpora, we've included the full set of demographic fields despite having little or no information for several of them.

   1  subjid         (6-digit numeric starting with "1", cited in mx3_call_info.csv)
   2  sex            (M or F)
   3  yob            (year of birth)
   4  edu_years      (years of education)
   5  edu_degree     (highest degree earned)
   6  edu_deg_yr     (year when degree was received)
   7  edu_contig     (Y, N or blank: were all edu_years contiguous?)
   8  esl_age        (for non-native English speakers: age at which English was learned)
   9  ntv_lg         (3-letter language code)
  10  oth_lgs        (space-separated list of other 3-letter language codes)
  11  occup          (subject-supplied string)
  12  cntry_born
  13  state_born
  14  city_born
  15  cntry_rsd
  16  state_rsd
  17  city_rsd
  18  ethnic         (e.g. Caucasian)
  19  smoker         (Y, N or blank)
  20  ht_cm          (height in centimeters)
  21  wt_kg          (weight in kilograms)
  22  mother_born
  23  mother_raised
  24  mother_lang
  25  mother_edu
  26  father_born
  27  father_raised
  28  father_lang
  29  father_edu

4.3 mx3_call_info.csv

This is a comma-delimited, 21-column flat table file, with a header line and one line per call. The column headings are as follows:

  col#  heading
   1  call_id     (numeric, matching last field of file name, e.g. "3")
   2  call_date   (e.g. "2005-12-13_15:26:33", from first 2 fields of file name)
   3  lang        (e.g. "USE" -- see note (a) below)
   4  eng_stat    (e.g. "All_ENG" -- see note (b) below)
   5  sid_a       (numeric, maps to an entry in mx3_subj_info.csv, e.g. "100608")
   6  phid_a      (encrypted phone number, e.g. "267977cnm")
   7  ph_categ_a  ("M" for "main" or "O" for "other" -- see note (c))
   8  phtyp_a     (telephone type -- see note (d))
   9  phmic_a     (microphone type -- see note (d))
  10  cnvq_a      (conversation quality: G, A, U -- see note (e))
  11  sigq_a      (signal quality: G, A, U -- see note (e))
  12  tbug_a      ("technical problem" -- always empty)
  13  sid_b       \
  14  phid_b       \
  15  ph_categ_b    \
  16  phtyp_b        \_ like cols. 5-12, for B channel
  17  phmic_b        /
  18  cnvq_b        /
  19  sigq_b       /
  20  tbug_b      /
  21  topic_id    (numeric, 1-25)

Notes:

(a) The three-letter language codes are defined in "language_list.tab" (4.6); most of the codes are consistent with ISO 639-3 usage, with the following two exceptions:

      "INE" refers to English as spoken in India and Pakistan;
      "USE" refers to English as spoken by native speakers in the United States and Canada.

    In some cases, the two speakers in a given call had indicated in their enrollment that they spoke multiple languages other than English, and they shared more than one non-English language in common; in these cases, the shared languages are concatenated with spaces (e.g. "BEN HIN").

(b) Auditors judged each call to be one of "all English", "some English" or "no English", based on listening to three brief portions of the call. When English was not being used, auditors were not required to identify which language was being used.

(c) In compiling the call_info table, a tally was kept as to how many times each subject used a given phone number (whether on dial-in or dial-out); based on this tally, each call side has been labeled to indicate that the subject is using his/her most common phone ("M") or some other phone ("O").

(d) For "phtyp_(a/b)", the following 2-letter values are used:

      phtyp "hw" - hard-wired phone (land line with wire-connected handset)
      phtyp "ce" - cell phone
      phtyp "co" - cordless phone
      phtyp "na" - information not available

    For "phmic_(a/b)", the following 2-letter values are used:

      phmic "hh" - hand-held
      phmic "sp" - speaker-phone
      phmic "hs" - headset
      phmic "eb" - earbud
      phmic "na" - information not available

(e) Conversation and signal quality were judged subjectively by auditors, with three possible responses:

      "G": "good" -- signal was clean, speech was fluent and active
      "A": "acceptable" -- signal was noisy but coherent, speech was awkward
      "U": "unacceptable" -- signal impeded intelligibility, no real conversation

4.4 calls_per_lang.tab

This is a three-column, tab-delimited table with one row per language, covering all the language codes that show up in column 3 of mx3_call_info.csv:

  col#  heading
   1  n_calls   (numeric, number of calls)
   2  lang      (one or more 3-letter language codes, space-separated)
   3  eng_stat  (English status: one of "All_ENG", "Some_ENG", "No_ENG")

4.5 calls_per_subj.tab

This is a two-column, tab-delimited table with one row per subj_id:

  col#  heading
   1  n_calls   (numeric, number of calls)
   2  subj_id   (numeric or "unknown"; numbers map to col.1 of mx3_subj_info.csv)

4.6 language_list.tab

This is a two-column, tab-delimited table with one row per language, covering all the language codes in the other tables.

  col#  heading
   1  abbrev         (3-letter code as used in other tables)
   2  language_name  (full language name)

--------
README.txt
created by David Graff, Feb. 12, 2021
updated by David Graff, Apr. 29, 2022
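
Example: reading the tables

Because the tables described in section 4 contain no quotation marks or escape characters, they can be read with plain string splitting. The Python sketch below shows one possible way to do this and to tally call sides per subject from the sid_a and sid_b columns of mx3_call_info.csv. The function names are made up for this example, and the sample rows are invented: they carry only a few of the 21 columns, and the subject IDs other than 100608 are hypothetical.

```python
from collections import Counter


def read_table(lines, delimiter):
    """Parse a header line plus data lines into a list of dicts.

    Plain str.split is enough here because the tables are stated to
    contain no quotation marks or escape characters. Empty fields
    (such as the always-empty tbug columns) are kept as empty strings.
    """
    rows = [line.rstrip("\r\n").split(delimiter) for line in lines if line.strip()]
    header, body = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in body]


def calls_per_subject(call_rows):
    """Count call sides per subject ID over the A and B channels."""
    tally = Counter()
    for row in call_rows:
        tally[row["sid_a"]] += 1
        tally[row["sid_b"]] += 1
    return tally


# Invented excerpt for illustration only (not actual corpus data).
SAMPLE = [
    "call_id,lang,eng_stat,sid_a,sid_b\n",
    "3,USE,All_ENG,100608,100001\n",
    "4,USE,All_ENG,100608,100002\n",
    "5,BEN HIN,No_ENG,100001,100002\n",
]
```

With a real file, the same functions could be applied to the lines of docs/mx3_call_info.csv using "," as the delimiter, or to a ".tab" file using "\t", provided the file begins with a header line.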