README FILE FOR: Mixer 4 and 5 Speech
LDC Catalog ID: LDC2020S03

1.0 Introduction

The Mixer-4 and Mixer-5 Speech Collections comprise recordings made via the
public telephone network (a total of 2568 calls) and recordings made with
multiple microphones in office-room settings (a total of 2152 sessions). The
offices used for recording sessions were located at the Linguistic Data
Consortium (LDC, hosted by the University of Pennsylvania, Philadelphia PA)
and at the International Computer Science Institute (ICSI, affiliated with
the University of California at Berkeley).

The Mixer-4 and Mixer-5 collections were conducted simultaneously, as a
collaborative, carefully coordinated activity at both recording sites. The
respective protocols of the two collections share some features and differ
in others, as follows:

- Both Mixer-4 and Mixer-5 involved collection of unscripted telephone
  conversations via a "robot operator" recording platform, which both dialed
  out automatically to, and handled incoming calls from, the set of speakers
  recruited and paid to participate in conversational telephone speech (CTS).
  Telephone calls were not differentiated by project: in a given call, a
  Mixer-4 participant could be connected with a Mixer-5 participant.

- Some Mixer-4 speakers and all Mixer-5 speakers were recruited from the
  Philadelphia PA and Berkeley CA regions, in order to participate in
  multi-channel recording sessions conducted at the LDC and ICSI. Other
  Mixer-4 speakers, who would engage only in CTS recordings, were recruited
  from across the continental U.S.

- In Mixer-4, local-area speakers were asked to come to ICSI or the LDC so
  that some of their CTS calls could be conducted in an office equipped with
  multiple microphones. As a result, some of the telephone calls in the
  Mixer-4 collection have multi-channel audio data, including the 8-kHz
  telephone channel captured via the robot operator and 14 channels of
  16-kHz audio recorded via fixed microphones in the recording room. There
  are 252 such calls.

- In Mixer-5, speakers were asked to come to ICSI or the LDC to record a
  series of up to six 30-minute sessions per speaker. During these sessions,
  the speaker was guided by an interviewer through a sequence of activities,
  described in more detail below. Each session included a CTS call in one of
  three vocal-effort conditions: (a) normal, (b) high vocal effort, (c) low
  vocal effort. These phone calls were NOT routed through the LDC's robot
  operator, so the collection has only the 14 channels from room
  microphones, recorded at 16 kHz. There are 1900 such sessions.

Nearly all speakers have English as their native language.

The telephone audio portion of the corpus is similar to earlier Switchboard
collections: recruited speakers are connected through a robot operator to
carry on casual conversations lasting up to 10 minutes, usually about a
daily topic that is announced by the robot operator at the start of the
call. (Speakers were not required to discuss the announced topic.) The raw
digital audio content for each call side was captured as a separate channel;
each full conversation is presented as a 2-channel interleaved audio file
with a NIST SPHERE header, 8000 samples/second and u-law sample encoding.
Each speaker was asked to complete up to 15 calls, and most completed at
least 10.
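For reference, the NIST SPHERE header is plain text and can be inspected
directly. The following is a minimal Python sketch (not part of the corpus;
the path in the usage note is a hypothetical placeholder) that reads the
header fields of a .sph file, assuming the usual SPHERE layout of a
"NIST_1A" magic line, a header-size line, and "name type value" lines
terminated by "end_head":

    # Minimal sketch: inspect the plain-text NIST SPHERE header of a .sph file.
    # Assumes the standard layout: "NIST_1A", header size (typically 1024),
    # then "name -type value" lines terminated by "end_head".
    def read_sphere_header(path):
        with open(path, "rb") as f:
            f.readline()                          # b"NIST_1A"
            size = int(f.readline().strip())      # header size in bytes
            f.seek(0)
            text = f.read(size).decode("ascii", errors="replace")
        fields = {}
        for line in text.splitlines()[2:]:
            if line.strip() == "end_head":
                break
            parts = line.split(None, 2)           # e.g. "sample_rate -i 8000"
            if len(parts) == 3:
                fields[parts[0]] = parts[2]
        return fields

    # e.g. read_sphere_header("data/ulaw_sphere/SOME_FILE.sph")["sample_rate"]
    # would return "8000" for the telephone audio described above.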
The multi-microphone portion involves 14 distinct microphones, set up
identically at each recording site in terms of distance from the recruited
speaker. The 14 channels were recorded synchronously into separate
single-channel files, using 16-bit PCM sample encoding at 16000
samples/second. Each multi-channel session was guided by an LDC staff
person, who used specialized prompting and recording software to manage the
session. For Mixer-5, each session contained a mixture of distinct
activities:

1. Repeating questions
2. Informal conversation
3. Transcript reading
4. Telephone call (with varied vocal-effort conditions)

The recordings in this corpus were used in the NIST Speaker Recognition
Evaluation (SRE) test sets for 2008. Researchers interested in applying
those benchmark test sets should consult the NIST Evaluation Plan (available
at http://nist.gov/itl/iad/mig/sre.cfm) for guidelines on allowable training
data for those tests.

2.0 Directory Structure and Data Files

As described in the introduction, there are two types of audio data, and
these are kept in separate directories, as follows:

  data/
    ulaw_sphere/   - contains 2568 8-kHz 2-channel NIST SPHERE files
    mc_cts_flac/   - Mixer-4 14-channel CTS recordings, 1 channel per directory:
      CH01/        - 252 16-kHz 1-channel flac/ms-wav files
      ...          -   (one directory for each channel)
      CH14/
    mc_ivs_flac/   - Mixer-5 14-channel interview recordings:
      CH01/        - 1900 16-kHz 1-channel flac/ms-wav files
      ...          -   (one directory for each channel)
      CH14/

All audio files have names that indicate the date and time when the
recording began, along with other identifying information, as follows:

  ulaw_sphere/:        {yyyymmdd}_{hrmnsc}_{callid}.sph
  mc_cts_flac/CH{nn}/: {yyyymmdd}_{hrmnsc}_{loc}_{subjid}_{callid}_{A|B}_CH{nn}.flac
  mc_ivs_flac/CH{nn}/: {yyyymmdd}_{hrmnsc}_{loc}_{subjid}.flac

where:

  yyyymmdd  is the year, month and day of the recording
  hrmnsc    is the hour, minute and second when recording began
  loc       is either "PHL" or "BER", indicating where the recording was done
  subjid    is a numeric identifier assigned to the speaker
  callid    is a unique, incremental number assigned to each call
  A|B       is the "call-side" (channel 1 or 2, respectively) in the ulaw data
  nn        is a two-digit microphone channel identifier (01-14)

(A small Python sketch for parsing these file names appears at the end of
this section.)

Note that the "hrmnsc" values are local to the recording location; in
particular, for Mixer-4 "mc_cts_flac" files that were recorded in Berkeley,
the "hrmnsc" values are offset by approximately 3 hours relative to the
corresponding "ulaw_sphere" 2-channel telephone recording (which was
captured simultaneously by the LDC's robot operator).

When the flac files are uncompressed, they become ms-wav/RIFF files (flac
compression does not presently support the SPHERE file format). The
telephone audio is presented in SPHERE format because (a) this is consistent
with other telephone audio releases from the LDC, and (b) flac does not
support u-law sample encoding. The current release of the open-source "sox"
utility is able to handle both formats as input; other utilities are
available for both the flac and SPHERE formats.
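As a convenience, here is a minimal Python sketch of how the multi-channel
CTS file names described above could be decomposed into their component
fields. The helper and the example file name are hypothetical and only the
naming pattern documented above is assumed:

    # Minimal sketch: split a mc_cts_flac file name into its component fields,
    # following the pattern documented in section 2.0. The example name in the
    # usage note is hypothetical.
    import re

    MC_CTS_PATTERN = re.compile(
        r"(?P<date>\d{8})_(?P<time>\d{6})_(?P<loc>PHL|BER)_"
        r"(?P<subjid>\d+)_(?P<callid>\d+)_(?P<side>[AB])_CH(?P<chan>\d{2})\.flac$")

    def parse_mc_cts_name(name):
        m = MC_CTS_PATTERN.search(name)
        return m.groupdict() if m else None

    # e.g. parse_mc_cts_name("20070115_103000_PHL_12345_6789_A_CH01.flac")
    #   -> {'date': '20070115', 'time': '103000', 'loc': 'PHL',
    #       'subjid': '12345', 'callid': '6789', 'side': 'A', 'chan': '01'}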
3.0 Related Documentation

The "docs" directory contains the following files; their contents are
explained in the subsections below:

  1 mx45_call_info.csv  -- details on CTS calls
  2 mx45_spkr_info.csv  -- details on speakers
  3 mx4_mc_cts_info.csv -- Mixer-4 multi-channel sessions
  4 mx5_mc_ivs_info.csv -- Mixer-5 multi-channel sessions
  5 updated_NIST_KEYS   -- modified versions of SRE08 "answer keys"
  6 iv_seg_tdf          -- files with time stamps for IV session events

All the "*.csv" files are comma-delimited, plain-text "flat files". Care has
been taken to ensure that no field values contain commas as part of the
field data, so quotation marks are never used around field values, and
nothing is done to mark or "escape" special characters (such as
apostrophes). Consecutive commas on a line, or a comma at the end of a line,
indicate empty or null field values (which are fairly common in the "spkr"
table).

3.1 Calls table (mx45_call_info.csv)

Each row in this table provides the available information about a 2-channel
telephone conversation. A couple of fields ("lang", "eng_stat") are actually
irrelevant to Mixer-4/5, because this was an English-only collection, but
the field inventory is fixed in order to provide consistency with other
telephone corpus releases from the LDC.

   1 call_id    - numeric identifier, relates to audio file name
   2 call_date  - relates to audio file name
   3 lang       - "USE" (U.S. English) if both speakers are native, "ENG" otherwise
   4 eng_stat   - always has the value "All_ENG"
   5 sid_a      - subjid of the speaker on channel A
   6 phid_a     - telephone ID on channel A
   7 ph_categ_a - one of: M (main phone), O (other phone)
   8 phtyp_a    - one of: 1 (cell phone), 2 (cordless), 3 (standard)
   9 phmic_a    - one of: 1 (speaker-phone), 2 (headset), 3 (earbud), 4 (handheld)
  10 cnvq_a     - audit judgment of conversation quality (Good, Acceptable, Unsuitable)
  11 sigq_a     - audit judgment of signal quality (Good, Acceptable, Unsuitable)
  12 tbug_a     - Y or N: auditor found a technical problem on channel A
  13-20         - same as 5-12, applied to channel B
  21 topic      - numeric ID of the topic announced to the callers

3.2 Subjects table (mx45_spkr_info.csv)

Each row in this table provides demographic information about one of the
speakers in the collection. LDC staffers who led the interview sessions and
were involved in many of the phone calls conducted during those sessions are
also included in this table.

   1 subjid        - numeric identifier, links to calls and interviews
   2 sex           - M or F
   3 yob           - year of birth
   4 edu_years     - years of formal education
   5 edu_degree    - highest education degree earned
   6 edu_deg_yr    - year in which highest degree was earned
   7 edu_contig    - Y or N: were all edu_years spent contiguously?
   8 esl_age       - for ESL speakers, age when English was learned
   9 ntv_lg        - native language (ISO 639-3 code)
  10 oth_lgs       - other languages (ISO 639-3 codes, '/'-separated)
  11 occup         - occupation
  12 cntry_born    - country where born
  13 state_born    - state where born
  14 city_born     - city where born
  15 cntry_rsd     - country where raised
  16 state_rsd     - state where raised
  17 city_rsd      - city where raised
  18 ethnic        - ethnicity
  19 smoker        - Y or N
  20 ht_cm         - height in centimeters
  21 wt_kg         - weight in kilograms
  22 mother_born   - country (state city) where mother was born
  23 mother_raised - country (state city) where mother was raised
  24 mother_lang   - mother's native language
  25 mother_edu    - mother's years of formal education
  26 father_born   - country (state city) where father was born
  27 father_raised - country (state city) where father was raised
  28 father_lang   - father's native language
  29 father_edu    - father's years of formal education
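As an illustration, here is a minimal Python sketch for loading one of these
comma-delimited tables, using the field order given above for the speaker
table. Whether the files include a header row is an assumption to verify
against the actual data before use:

    # Minimal sketch: load mx45_spkr_info.csv using the field order listed in
    # section 3.2. Assumes the file has no header row (verify before use);
    # empty strings correspond to the null/empty field values described above.
    import csv

    SPKR_FIELDS = [
        "subjid", "sex", "yob", "edu_years", "edu_degree", "edu_deg_yr",
        "edu_contig", "esl_age", "ntv_lg", "oth_lgs", "occup",
        "cntry_born", "state_born", "city_born",
        "cntry_rsd", "state_rsd", "city_rsd",
        "ethnic", "smoker", "ht_cm", "wt_kg",
        "mother_born", "mother_raised", "mother_lang", "mother_edu",
        "father_born", "father_raised", "father_lang", "father_edu",
    ]

    speakers = {}
    with open("docs/mx45_spkr_info.csv", newline="") as f:
        for row in csv.reader(f):
            rec = dict(zip(SPKR_FIELDS, row))
            speakers[rec["subjid"]] = rec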
3.3 Mixer-4 multi-channel CTS sessions table (mx4_mc_cts_info.csv)

Each row in this table provides information about a single in-office
multi-channel recording session for a Mixer-4 telephone call. Fields 2 and 5
establish the relative timing of the beginnings of the two sets of audio
data: sessions varied in terms of which device began recording first (the
"ulaw" telephone robot operator or the "mc" 14-channel microphone system),
and in terms of how much time elapsed until the other device began
recording. Whichever device began recording later has an "_ofs" value of
0.0; the other device has a value greater than zero, indicating how many
seconds you must seek into that device's recording in order to reach the
point where the later-starting recording begins. For example, if the
microphone system started 3.2 seconds before the robot operator, then
ulaw_ofs is 0.0 and mc_ofs is 3.2, and skipping the first 3.2 seconds of
each microphone channel aligns it with the start of the telephone recording.

  1 mc_cts_flac_id  - common portion of multi-channel file name
  2 mc_ofs          - alignment offset (in seconds) relative to ulaw data
  3 ulaw_sphere_id  - file-ID of 2-channel telephone audio data
  4 ulaw_ch         - A or B (call-side corresponding to the MC data)
  5 ulaw_ofs        - alignment offset (in seconds) relative to MC data

3.4 Mixer-5 multi-channel interview sessions table (mx5_mc_ivs_info.csv)

Each row in this table provides information about a single in-office
multi-channel recording session for a Mixer-5 interview.

  1 common_audio_fileid
  2 session_num  - ranges from 1 to 6
  3 intervwr_id  - subject_id of the staff person asking questions
  4 session_dur

3.5 Updated NIST 2008 answer key data

This directory contains modified versions of three table files that were
originally circulated by NIST following the SRE 2008 evaluation. This is
only a subset of the complete NIST distribution of results and answer keys;
in particular, it consists of the three tables that happen to contain
references to the original LDC data file names, as delivered to NIST in
preparation for creating the SRE 2008 test set. In preparing the data for
general release, we have modified the file names of the multi-channel
session data to make them more consistent and informative. Changes to the
NIST tables were simply a matter of replacing the file name strings as
needed.

3.6 Time-stamped information about Mixer-5 interview sessions

For most of the Mixer-5 interview sessions, there is a "tdf" file (a
tab-delimited format used by the LDC's "xtrans" audio transcription tool)
with time-stamped entries for the major component portions of the recording
session: the initial set of "repeating questions", the "informal interview",
the reading aloud of sentences, and the telephone call in one or another
"vocal effort" condition. In the case of the sentence readings, each
sentence prompt is shown together with the time stamp for when that sentence
was presented to the speaker. While most of the time stamps are reasonably
accurate, they all derive from the software used to present sentence prompts
and to guide the interviewer through the session. As such, there may be
mistakes in the alignment of sentences to time stamps in some cases, and the
sentence text may not match what the speaker actually said (e.g. if they
stumbled or erred in the reading).

4.0 Known Problems and Difficulties

The following subsections describe some issues where the published data fall
short of expectations.

4.1 Some channels missing from some in-office recording sessions

A few of the multi-channel recording sessions (two in the Mixer-4 CTS set
and three in the Mixer-5 interview set) do not have complete sets of 14
audio channels, due to the occasional failure of one or another microphone
during the affected sessions.
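For sessions affected by this, the gaps can be detected by checking the
channel directories directly. The following is a minimal Python sketch (the
session file-ID in the usage note is a hypothetical placeholder) that
reports which of the 14 channels are absent for a given Mixer-5 interview
session, assuming only the directory layout and naming pattern documented in
section 2.0:

    # Minimal sketch: report which of the 14 microphone channel files are
    # missing for a Mixer-5 interview session. Note that the Mixer-4 CTS
    # file names additionally carry a _CH{nn} suffix (see section 2.0).
    import os

    def missing_channels(fileid, root="data/mc_ivs_flac"):
        absent = []
        for n in range(1, 15):
            ch = "CH%02d" % n
            path = os.path.join(root, ch, fileid + ".flac")
            if not os.path.exists(path):
                absent.append(ch)
        return absent

    # e.g. missing_channels("20070115_103000_PHL_12345") -> [] when all
    # 14 channels were recorded for that (hypothetical) session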
4.2 Significant variability in recording levels on some in-office mics

There were unexpected difficulties in monitoring recording levels on all
channels, in both the Berkeley and Philadelphia office setups, across the
entire calendar duration of the collection project. As a result, a given
channel may show considerable variation in signal level from session to
session.

Some of the microphones in the recording rooms were deliberately placed at
distances beyond the stated performance specifications of the given mic;
this was done so that some channels would serve as "stress tests" for
relevant speech technologies. In addition, there were difficulties on other
channels in balancing conflicting goals: (a) maintain a constant gain
setting on each channel across all sessions; (b) take as much care as
possible in setting levels in order to avoid clipping; (c) employ techniques
during the Mixer-5 phone-call components to explicitly evoke both unusually
loud and unusually soft speech. Of course, there are also wide differences
among speakers in terms of their intrinsic loudness, and during informal
conversation they would often cover a wide dynamic range. Taken together,
these factors tended to lead to lower gain settings than would have been
optimal in many cases.

4.3 Variable performance in transcript reading task

While reading transcript sentences aloud, speakers would sometimes stumble,
restart, or misread the prompt text (adding, skipping or substituting one or
more words); there were also some cases where speakers interjected a remark
or question between the sentence prompts. The recording process was not
stopped for problems of this sort, and no attempt has been made to edit them
out of the audio files. As a result, there will be some difficulty in
aligning the sentence prompt text with the corresponding audio in some
number of sessions. (It is likely that a detailed alignment of text to the
read speech will be made available as a supplemental annotation corpus.)

4.4 Resampling applied to Mixer-4 multi-channel recordings

The 14 channels presented in the "mc_cts_flac" and "mc_ivs_flac" directories
were originally recorded via matched pairs of 8-channel A/D converters
running from a common clock signal. But relative to the 8-kHz ulaw telephone
channel (recorded via the public telephone network), it was found that the
nominal 16000 Hz sample rate applied to channels 01-14 was actually closer
to 15899 Hz. Because the Mixer-4 multi-channel recordings are intended for
use with the ulaw telephone channel data, the discrepancy in sample rates
was measured and confirmed manually over numerous sessions, and we also
confirmed manually that digital resampling from 15899 Hz to 16000 Hz brings
channels 01-14 into correct alignment with the 8-kHz ulaw signal.

This resampling has been applied _only_ to the 14 flac channels of the
Mixer-4 CTS session data. The 14 channels of Mixer-5 interview audio have
been left as-is, because there were no other simultaneous recordings of
these sessions at other sample rates. Note that by leaving the Mixer-5 flac
data as-is, any reference to time stamps used by NIST for evaluation test
segments from these recordings remains correct and appropriate.

However, NIST SRE 2008 test segments drawn from Mixer-4 multi-channel CTS
sessions were originally based on speech-detection time stamps that were
computed from the ulaw telephone data (at the 8000 Hz sample rate). The
sample-rate discrepancy in the multi-channel data (assumed to be 16000 Hz
but actually 15899 Hz) was not recognized until after the evaluation, when
it became evident that short segments extracted near the beginning of a
recording session scored reasonably well, but system performance degraded as
segments were drawn closer to the end of the 10-minute recording, where the
office microphone channels were off by nearly 4 seconds relative to the ulaw
telephone channel. Now that the 14 channels of office-mic data have been
resampled for all Mixer-4 CTS sessions, using the NIST segment
specifications for SRE08 should yield noticeably better scores with the
present version of the data.
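To illustrate the size of the drift that motivated the resampling, here is a
small Python sketch (not part of the corpus) computing the misalignment that
accumulates when audio actually sampled at 15899 Hz is treated as if it were
16000 Hz:

    # Minimal sketch: how far out of alignment the un-resampled office-mic
    # audio drifts, relative to the telephone channel, at a given time into
    # the recording.
    NOMINAL_HZ = 16000.0
    ACTUAL_HZ = 15899.0

    def drift_seconds(t_seconds):
        # Treating ACTUAL_HZ samples per second as if they were NOMINAL_HZ
        # makes events appear earlier by this amount at time t.
        return t_seconds * (NOMINAL_HZ - ACTUAL_HZ) / NOMINAL_HZ

    # At the end of a 10-minute call:
    # drift_seconds(600) -> about 3.79 seconds, i.e. the "nearly 4 seconds"
    # of misalignment noted above.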
4.5 Some speakers completed interview sessions but no phone calls

In order to provide as much useful data for as many speakers as possible,
this release includes 60 Mixer-5 speakers who completed one or more
interview sessions but did not complete any telephone calls.

-------
README prepared 2017-04-21 by David Graff