README FILE FOR: Mixer 3 Speech
LDC_Catalog-ID:  LDC2023S02


1.0 Introduction

This release of the Mixer 3 corpus comprises 19,595 telephone conversations
involving 3,875 speakers, who used, in total, up to 26 distinct languages.
Residents of the continental United States and Canada were recruited and
enrolled via a web site at the LDC, and were given financial incentives to
complete 15 calls.  For subjects who were fluent in languages other than
English, additional incentives were provided to complete at least one
non-English call.

The design goals of the collection project were to supply training and test
data for sponsored evaluations of speech technology for both speaker
recognition and language recognition.  These evaluations were conducted by
NIST, and the test sets have been published by the LDC as the "NIST
Speaker Recognition Evaluation" (SRE) and "NIST Language Recognition
Evaluation" (LRE) corpora.

The following sections summarize the methods used for call collection and
auditing, and describe the various tables (found in the "docs" directory) that
provide more details about the calls, subjects, and languages.


2.0 Collection

The web site for Mixer 3 recruitment presented an enrollment form where
subjects provided contact information and demographic data, and filled in a
selection menu for times of day when they would be available to receive calls.
Speakers of languages other than English were encouraged to enroll, without
limitation as to what their native language was, but all enrollees needed to
be able to converse in English as well.

LDC staff did an initial vetting pass over the enrollment requests, contacting
each enrollee to confirm the enrollment information.  Subjects were then
activated for participation, using an assigned 5-digit personal identification
number (PIN), which the subject would use for verification at the beginning of
each call.

Call collection was implemented on a robot-operator platform at the LDC,
connected to a digital T-1 trunk line with 24 channels for accessing the
public telephone network.  Some channels were devoted to handling incoming
calls from subjects, and the remaining channels were used by the platform for
initiating outbound calls.  The robot-operator application was configured to
conduct dial-outs during specified time-windows of availability.  It also
accepted dial-ins during a larger time window.

For both outbound and inbound calls, subjects were asked to enter their PIN
for verification.  When this succeeded, subjects were prompted to record their
first and last name; next they used their telephone key-pad to answer two
questions about the type of phone they were using, and then were presented
with a pre-recorded "topic of the day" and placed on hold. (Twenty-five topics
were prepared, and these were rotated daily; speakers were requested, but not
required, to stick to the given topic throughout the call.)

As soon as any two channels were in the "on-hold" state, the application
bridged the two channels to create a two-party circuit; a "welcome to both of
you" announcement was played to both channels, and then recording began.  For
each channel, the digital stream from the T-1 line (8-bit mu-law encoded audio
at 8000 samples/second) was stored to a disk file on the platform, in addition
to being passed through the circuit to the other call participant.  Nearly ten
minutes later, if the conversation was still in progress, a prompt was played
to both parties to indicate that the recording was nearly complete; at 10
minutes, the recording ceased and the call was ended automatically.
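
For readers who want to work with the raw samples, the following is a minimal
sketch of expanding the 8-bit mu-law codes described above into linear PCM
values, along with the storage arithmetic for a full-length call.  It uses the
standard G.711 mu-law expansion and is not the tooling used at the LDC; the
function and constant names are illustrative only.

```python
# Sketch only: standard G.711 mu-law expansion, not LDC's own tooling.
SAMPLE_RATE = 8000           # samples per second, as stated above
MAX_CALL_SECONDS = 10 * 60   # recording ceased at 10 minutes


def ulaw_to_linear(code):
    """Expand one 8-bit mu-law code (0-255) to a signed linear PCM value."""
    u = ~code & 0xFF               # mu-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_channel(raw):
    """Decode a bytes object of mu-law codes into a list of PCM integers."""
    return [ulaw_to_linear(code) for code in raw]


# One byte per sample, so a full 10-minute call side needs at most:
max_bytes_per_channel = SAMPLE_RATE * MAX_CALL_SECONDS
```

Under these figures, a single call side that runs the full 10 minutes occupies
4,800,000 bytes before any header or compression is applied.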

The two single-channel files for each call were uploaded to an LDC-internal
file server, and combined into a 2-channel audio file with a NIST SPHERE
header.  (The recordings of subjects saying their names prior to each
conversation were also uploaded for use during the auditing process; to
maintain subject anonymity, these recordings are kept secure at the LDC.)
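
As a rough illustration of the combining step, the sketch below interleaves
two single-channel byte streams into one 2-channel stream and splits such a
stream back into its A and B sides.  It assumes one byte per sample, samples
stored in alternating A, B order, and a header that has already been removed;
whether the released files are stored uncompressed in exactly this layout is
an assumption here, and the function names are hypothetical.

```python
def interleave(chan_a, chan_b):
    """Combine two equal-length single-channel streams as A, B, A, B, ..."""
    if len(chan_a) != len(chan_b):
        raise ValueError("channels must contain the same number of samples")
    combined = bytearray(2 * len(chan_a))
    combined[0::2] = chan_a   # even positions hold channel A
    combined[1::2] = chan_b   # odd positions hold channel B
    return bytes(combined)


def split_channels(samples):
    """Split interleaved 2-channel sample data back into (A, B)."""
    return samples[0::2], samples[1::2]
```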

Various techniques were employed to maximize the pairings of speakers of the
same language, and to enable each non-English speaker to complete at least one
call in their native language.  Subjects were instructed to determine, at the
beginning of each conversation, if they both spoke the same non-English
language, and if so, the conversation would proceed in that language.
Otherwise, they would converse in English.


3.0 Auditing

As calls came in, some initial signal processing was done to check for the
presence of speech, and some recordings were rejected automatically on this
basis.  For calls with sufficient speech, a two-stage audit procedure was
carried out, summarized in the following outline:

  Stage 1: Assess quality and language in the full conversation

    For each call:
    - quickly scan the waveform display from beginning to end, then listen to
      brief portions (15-20 seconds) from the beginning, middle and end, and
      mark the following:
    - was the conversation All English, Some English, or No English?
    - were there any problems (extended silences, excessive noise, change of
      speaker on either channel)?

  Stage 2: Confirm speaker identity on each call side

    Auditor is shown a list of speaker-PINs with call sides yet to be audited,
    and picks the next available PIN on the list; this brings up enrollment
    information for that PIN (subject's full name, sex, native language, age),
    and the list of call sides for that PIN; each list item provides playback
    access (audio only, no waveform display) for the "say your name" recording
    and for pre-selected snippets from the call-side.  The auditor:
    - Listens to the name recordings in succession
    - Listens to pre-selected snippets
    - Marks each list item as "ID OK" or "wrong speaker", as appropriate

For a subset of the calls that were identified as having "No English", a third
stage of auditing was conducted, by auditors who were fluent in the languages
involved, to verify the identity of the language being spoken, making the call
side available for use in LRE.

Regarding the Stage 2 auditing: over the course of the collection, which
spanned 14 months, various techniques were developed to resolve the cases of
"wrong speaker" as far as possible, and auditors were often able to assign a
correct PIN / subject-ID on these call sides, based on the phone number or the
name that was spoken in the initial "say your name" recording.  Also, in the
years following the collection, as some call sides were used in NIST SRE test
sets and yielded unexpected outcomes from SRE systems, some initial audit
decisions were reviewed and corrected.

Three predominant issues leading to "wrong speaker" decisions and unsuitable
SRE results were:

 - some individuals using some other valid PIN by mistake when dialing in
   (i.e. due to a slip while keying in the number), causing the call side
   to be wrongly attributed to someone else;

 - some individuals enrolling multiple times (using a different name each
   time) in order to earn more money;

 - some individuals providing their PIN to other people, who may or may not
   have also been enrolled, so that those others recorded some calls in place
   of the actual PIN assignee.

The first category was relatively easy to fix.  The second category was either
noticed by auditors or exposed by unusual SRE test results; as a result of
repairing the speaker labeling for these cases, a few subjects are now on
record as being involved in 50 to 100 calls each.  In the third category, the
individuals who were actually recorded, while clearly not the person assigned
to the given PIN, often couldn't be reliably identified as some other enrolled
subject, and may have never enrolled; these cases account for the 267 call
sides in this release where the subj_id value is "unknown".


4.0 Documentation

The following subsections describe the contents of the "docs" directory.  In
general, file names ending in ".tab" are tab-delimited tables, and those
ending in ".csv" are comma-delimited.  In the latter, there are no quotation
marks or escape-characters (so simply splitting .csv lines on commas works the
same as splitting .tab lines on tabs).
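
A minimal sketch of that splitting rule follows.  The helper names are
illustrative, and the sketch assumes that every table begins with a header
line.

```python
def split_row(line, filename):
    """Split one table line into fields, choosing the delimiter by file name."""
    delimiter = "\t" if filename.endswith(".tab") else ","
    return line.rstrip("\r\n").split(delimiter)


def read_table(lines, filename):
    """Return one dict per data row, keyed by the names in the header line."""
    rows = [split_row(line, filename) for line in lines if line.strip()]
    header, data = rows[0], rows[1:]
    return [dict(zip(header, fields)) for fields in data]
```

Any iterable of lines can be passed in, for example an open file handle for
one of the tables in the "docs" directory.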

4.1 Interspeech_2007_Mixer_345.pdf

This is a paper describing the Mixer 3 corpus (along with Mixer 4 and 5).

4.2 mx3_subj_info.csv

This is a comma-delimited, 29-column flat table file, with a header line and
one line per subject.  The column headings are listed below.  Note that the
column inventory represents a set of demographic fields that has accumulated
over the span of numerous telephone collection projects at the LDC.  Several
of the columns were not included in the recruitment form for Mixer 3, and are
therefore mostly or completely empty in this table.  In some cases, however,
Mixer 3 subjects were re-enrolled for subsequent Mixer collection projects,
and provided additional demographic data.  For this reason, and for
consistency with other Mixer corpora, we've included the full set of
demographic fields despite having little or no information for several of
them.

  1	subjid (6-digit numeric starting with "1", cited in mx3_call_info.csv)
  2	sex (M or F)
  3	yob (year of birth)
  4	edu_years (years of education)
  5	edu_degree (highest degree earned)
  6	edu_deg_yr (year when degree was received)
  7	edu_contig (Y, N or blank: were all edu_years contiguous?)
  8	esl_age	(for non-native English speakers: age at which English was learned)
  9	ntv_lg (3-letter language code)
  10	oth_lgs	(space-separated list of other 3-letter language codes)
  11	occup (subject-supplied string)
  12	cntry_born
  13	state_born
  14	city_born
  15	cntry_rsd
  16	state_rsd
  17	city_rsd
  18	ethnic (e.g. Caucasian)
  19	smoker (Y, N or blank)
  20	ht_cm (height in centimeters)
  21	wt_kg (weight in kilograms)
  22	mother_born
  23	mother_raised	
  24	mother_lang	
  25	mother_edu	
  26	father_born	
  27	father_raised	
  28	father_lang	
  29	father_edu	
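
Because several of these columns may be empty, code that reads the table
should tolerate blank fields.  The sketch below picks out a few columns by
position (using the numbering above) and converts them to convenient types;
the function name and the choice of fields are illustrative only.

```python
def parse_subject(fields):
    """Convert one 29-field subject row (already split on commas) to a dict."""
    if len(fields) != 29:
        raise ValueError("expected 29 fields, got %d" % len(fields))
    return {
        "subj_id": fields[0],                            # column 1
        "sex": fields[1],                                # column 2
        "yob": int(fields[2]) if fields[2] else None,    # column 3, may be blank
        "ntv_lg": fields[8],                             # column 9
        "oth_lgs": fields[9].split(),                    # column 10, space-separated
    }
```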


4.3 mx3_call_info.csv

This is a comma-delimited, 21-column flat table file, with a header line and
one line per call.  The column headings are as follows:

 col#   heading
  1	call_id (numeric, matching last field of file name, e.g. "3")
  2	call_date (e.g. "2005-12-13_15:26:33", from first 2 fields of file name)
  3	lang (e.g. "USE" -- see note (a) below)
  4	eng_stat (e.g. "All_ENG" -- see note (b) below)
  5	sid_a (numeric, maps to an entry in mx3_subj_info.csv, e.g. "100608")
  6	phid_a (encrypted phone number, e.g. "267977cnm")
  7	ph_categ_a ("M" for "main" or "O" for "other" -- see note (c))
  8	phtyp_a	(telephone type -- see note (d))
  9	phmic_a	(microphone type -- see note (d))
  10	cnvq_a	(conversation quality: G, A, U -- see note (e))
  11	sigq_a	(signal quality: G, A, U -- see note (e))
  12	tbug_a	("technical problem" -- always empty)
  13	sid_b     \
  14	phid_b     \
  15	ph_categ_b  \
  16	phtyp_b	     \_ like cols. 5-12, for B channel
  17	phmic_b	     /
  18	cnvq_b	    /
  19	sigq_b	   /
  20	tbug_b	  /
  21	topic_id (numeric, 1-25)

Notes:

(a) The three-letter language codes are defined in "language_list.tab" (4.6);
    most of the codes are consistent with ISO 639-3 usage, with the following
    two exceptions: "INE" refers to English as spoken in India and Pakistan;
    "USE" refers to English as spoken by native speakers in the United States
    and Canada.  In some cases, the two speakers in a given call had indicated
    in their enrollment that they spoke multiple languages other than English,
    and they shared more than one non-English language in common; in these
    cases, the shared languages are concatenated with spaces (e.g. "BEN HIN").

(b) Auditors judged each call to be one of "all English", "some English" or
    "no English", based on listening to three brief portions of the call.
    When English was not being used, auditors were not required to identify
    which language was being used.

(c) In compiling the call_info table, a tally was kept as to how many times
    each subject used a given phone number (whether on dial-in or dial-out);
    based on this tally, each call side has been labeled to indicate that the
    subject is using his/her most common phone ("M") or some other phone ("O").

(d) For "phtyp_(a/b)", the following 2-letter values are used:
      phtyp "hw" - hard-wired phone (land line with wire-connected handset)
      phtyp "ce" - cell phone
      phtyp "co" - cordless phone
      phtyp "na" - information not available
    For "phmic_(a/b)", the following 2-letter values are used:
      phmic "hh" - hand-held
      phmic "sp" - speaker-phone
      phmic "hs" - headset
      phmic "eb" - earbud
      phmic "na" - information not available

(e) Conversation and signal quality were judged subjectively by auditors,
    with three possible responses:
      "G": "good" -- signal was clean, speech was fluent and active
      "A": "acceptable" -- signal was noisy but coherent, speech was awkward
      "U": "unacceptable" -- signal impeded intelligibility, no real conversation

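The column layout and notes above can be sketched as a small row parser that
groups columns 5-12 and 13-20 into per-channel records, converts call_date to
a timestamp, and splits the lang field on spaces.  The names used here
(SIDE_FIELDS, parse_call) are illustrative, and the B-channel values in the
usage example are made up.

```python
from datetime import datetime

# Per-side column names, without the "_a" / "_b" suffix.
SIDE_FIELDS = ["sid", "phid", "ph_categ", "phtyp", "phmic",
               "cnvq", "sigq", "tbug"]


def parse_call(fields):
    """Convert one 21-field call row (already split on commas) to a dict."""
    if len(fields) != 21:
        raise ValueError("expected 21 fields, got %d" % len(fields))
    return {
        "call_id": fields[0],
        "call_date": datetime.strptime(fields[1], "%Y-%m-%d_%H:%M:%S"),
        "langs": fields[2].split(),     # e.g. "BEN HIN" gives two codes
        "eng_stat": fields[3],
        "A": dict(zip(SIDE_FIELDS, fields[4:12])),    # columns 5-12
        "B": dict(zip(SIDE_FIELDS, fields[12:20])),   # columns 13-20
        "topic_id": fields[20],
    }


# Usage with a hypothetical row (B-channel values are invented):
example = ("3,2005-12-13_15:26:33,USE,All_ENG,"
           "100608,267977cnm,M,hw,hh,G,G,,"
           "100609,111111abc,O,ce,hs,A,G,,7")
call = parse_call(example.split(","))
```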

4.4 calls_per_lang.tab

This is a three-column, tab-delimited table with one row per language, covering
all the language codes that show up in column 3 of mx3_call_info.csv:

 col#	heading
  1	n_calls (numeric, number of calls)
  2	lang (one or more 3-letter language codes, space-separated)
  3	eng_stat (English status: one of "All_ENG", "Some_ENG", "No_ENG")
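
A tally of this kind can be recomputed from the call table.  The sketch below
counts calls for each combination of columns 3 and 4 of mx3_call_info.csv; it
assumes that each row of calls_per_lang.tab corresponds to one such
combination, which is an inference from the column list rather than a
documented fact.  The function name is illustrative.

```python
from collections import Counter


def calls_per_lang(call_rows):
    """Count calls per (lang, eng_stat) pair.

    call_rows: field lists from mx3_call_info.csv, header line excluded.
    """
    return Counter((row[2], row[3]) for row in call_rows)
```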


4.5 calls_per_subj.tab

This is a two-column, tab-delimited table with one row per subj_id:

 col#	heading
  1	n_calls (numeric, number of calls)
  2	subj_id (numeric or "unknown"; numbers map to col.1 of mx3_subj_info.csv)
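
A comparable count can be derived from mx3_call_info.csv by tallying the
subject IDs in columns 5 (sid_a) and 13 (sid_b).  This is only a sketch: it
counts one per call side, so a call with the same ID on both channels (for
example "unknown" on each side) contributes two, and whether the released
table handles that case the same way is not known.  The function name is
illustrative.

```python
from collections import Counter


def calls_per_subj(call_rows):
    """Count call sides per subject ID.

    call_rows: field lists from mx3_call_info.csv, header line excluded.
    """
    counts = Counter()
    for row in call_rows:
        counts[row[4]] += 1    # column 5: sid_a
        counts[row[12]] += 1   # column 13: sid_b
    return counts
```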


4.6 language_list.tab

This is a two-column, tab-delimited table with one row per language, covering
all the language codes in the other tables.

 col#	heading
  1	abbrev (3-letter code as used in other tables)
  2	language_name (full language name)
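
The sketch below loads this table into a lookup and expands a "lang" value
from the other tables into full names.  It assumes the file begins with a
header line; the function names and the sample rows in the usage example are
illustrative and are not copied from the released file.

```python
def load_language_names(lines):
    """Build an abbrev -> language_name dict from language_list.tab lines."""
    rows = iter(lines)
    next(rows)                      # skip the assumed header line
    names = {}
    for line in rows:
        if not line.strip():
            continue
        abbrev, name = line.rstrip("\r\n").split("\t", 1)
        names[abbrev] = name
    return names


def describe_lang(lang_field, names):
    """Expand a space-separated code list; unknown codes are left as-is."""
    return [names.get(code, code) for code in lang_field.split()]


# Usage with illustrative rows:
sample = ["abbrev\tlanguage_name\n", "BEN\tBengali\n", "HIN\tHindi\n"]
lookup = load_language_names(sample)
```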


--------
README.txt created by David Graff, Feb. 12, 2021
           updated by David Graff, Apr. 29, 2022