README FILE FOR:  Mixer 7 Spanish Speech
LDC Catalog ID:   LDC2023S04


1.0 Introduction

The Mixer-7 Spanish Speech Corpus comprises recordings made via the public
telephone network (total of 2583 calls) and multiple microphones in
office-room settings (total of 678 sessions conducted by LDC staff).  All
recruited speakers and some LDC staff have Spanish as their native language
(214 distinct speakers, including LDC staff).  Recordings took place between
December 2010 and March 2012.

The telephone audio portion of the corpus is similar to earlier Mixer
collections: recruited speakers are connected through a robot operator to
carry on casual conversations lasting up to 10 minutes, usually about a daily
topic that is announced by the robot operator at the start of the call.  The
raw digital audio content for each call side is captured as a separate
channel, and each full conversation is presented as a 2-channel interleaved
audio file, with 8000 samples/second and u-law sample encoding.  Each speaker
was asked to complete 15 calls.
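
As a minimal sketch of what "2-channel interleaved" u-law audio means in
practice, the Python below splits a raw sample payload into its two call
sides and expands each 8-bit u-law byte to a 16-bit linear value using the
standard G.711 expansion.  It assumes that the SPHERE header has already been
removed and that the channel A sample comes first in each pair; the function
names are illustrative and are not part of any LDC tool.

```python
def ulaw_to_linear(byte):
    """Expand one 8-bit u-law code to a signed 16-bit linear sample (G.711)."""
    u = ~byte & 0xFF                 # u-law codes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_call_sides(payload):
    """Split interleaved 2-channel u-law bytes into two lists of linear samples.

    Assumption: `payload` is the raw sample data (header removed), one byte
    per sample, ordered A0 B0 A1 B1 ...
    """
    side_a = [ulaw_to_linear(b) for b in payload[0::2]]
    side_b = [ulaw_to_linear(b) for b in payload[1::2]]
    return side_a, side_b


# Four bytes = two sample pairs (A, B, A, B).
side_a, side_b = split_call_sides(bytes([0xFF, 0x00, 0xFF, 0x80]))
```

At 8000 samples/second, a full 10-minute call holds 4,800,000 samples per
side, so the interleaved payload is about 9.6 million bytes.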

The multi-microphone portion involves 14 distinct microphones set up
identically in two distinct office rooms at the Linguistic Data Consortium.
The same speakers who were taking part in the telephone collection were also
brought in to record up to 4 sessions on distinct days, with each session
lasting up to 90 minutes, typically producing 75 minutes of speech of various
types.  The 14 channels were recorded synchronously into separate
single-channel files, using 16-bit PCM sample encoding at 16000
samples/second.

Each multi-channel session was guided by an LDC staff person, who used
specialized prompting and recording software to manage the session.
Activities recorded in each session consisted of seven components:

 1. Repeating questions - usually less than 1 minute
 2. Informal conversation, "near" condition - 15 minutes
 3. Telephone call, low or high vocal effort condition - 10 minutes
 4. Transcript reading - 15 minutes
 5. Telephone call, cell or speaker phone condition - 10 minutes
 6. Informal conversation, "far" condition - 15 minutes
 7. Telephone call, varied condition - 10 minutes

More information about the session protocol is provided in the file
"mx7_collection_doc.pdf".

The recordings in this corpus were used in the NIST Speaker Recognition
Evaluation (SRE) test set for 2012.  Researchers interested in applying
that benchmark test set should consult the corresponding NIST Evaluation
Plan (available at http://nist.gov/itl/iad/mig/sre.cfm) for guidelines on
allowable training data for that test.


2.0 Directory Structure and Data Files

As described in the introduction, there are two types of audio data,
and these are kept in separate directories, as follows:

  data/

    ulaw_sphere/ - contains 2583 8-KHz 2-channel NIST SPHERE files

    pcm_flac/ - contains 14 directories, 1 microphone / directory:

      CH01/ - up to 4159 16-KHz 1-channel flac/ms-wav files
      ...   -   for each channel
      CH14/

All audio files have names that indicate the date and time when the
recording began, along with other identifying information, as follows:

  ulaw_sphere/:
    {yyyymmdd}_{hrmnsc}_{callid}.sph

  pcm_flac/CH{nn}/:
    {yyyymmdd}_{hrmnsc}_{room}_{comp}_{subjid}_CH{nn}.flac

 where:

  yyyymmdd is the year, month and day of recording

  hrmnsc is the hour, minute and second when recording began

  callid is a unique, incremental number assigned to each call

  room is either "LDC" or "HRM", indicating which office was used

  comp is an abbreviated label for the interview component (see below)

  subjid is a numeric identifier assigned to the speaker

  nn is a two-digit microphone channel identifier (01-14)

When the flac files are uncompressed, they become ms-wav/RIFF files
(flac compression does not presently support SPHERE file format).  

For each 14-channel set of original full-length interview recordings, the
various component segments have been extracted into separate audio files,
leaving out the transition periods between components.  As a result, within
each CH{nn} directory, there is a set of up to seven files with the same
"{yyyymmdd}_{hrmnsc}_{room}" and "{subjid}_CH{nn}" portions in their file
names, and different strings for the "{comp}" portion, as follows:

  ivfr      -- interview, "far" condition
  ivnr      -- interview, "near" condition
  phce[123] -- phone call, "cell" condition
  phhv[12]  -- phone call, "high vocal effort" condition
  phlv[123] -- phone call, "low vocal effort" condition
  phsp[12]  -- phone call, "speaker-phone" condition
  rdtr      -- transcript reading
  rptq      -- repeating questions

The digits appearing next to the various "ph.." labels reflect the relative
order of the phone call component within the full session.
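
The naming conventions above can be checked mechanically.  The following
Python is a sketch only: the regular expressions are written from the
patterns described in this section, the function name is hypothetical, and
the ".sph" name used in the comment is a made-up example.  The digit after a
"ph.." label is accepted as 1-3 for every phone condition, which is slightly
more permissive than the list above.

```python
import re

# ulaw_sphere/: {yyyymmdd}_{hrmnsc}_{callid}.sph
SPH_NAME = re.compile(r"^(?P<date>\d{8})_(?P<time>\d{6})_(?P<callid>\d+)\.sph$")

# pcm_flac/CH{nn}/: {yyyymmdd}_{hrmnsc}_{room}_{comp}_{subjid}_CH{nn}.flac
FLAC_NAME = re.compile(
    r"^(?P<date>\d{8})_(?P<time>\d{6})_(?P<room>LDC|HRM)_"
    r"(?P<comp>ivfr|ivnr|rdtr|rptq|ph(?:ce|hv|lv|sp)[1-3])_"
    r"(?P<subjid>\d+)_CH(?P<chan>\d{2})\.flac$"
)


def parse_audio_name(name):
    """Return a dict of file-name fields, or None if no pattern matches."""
    for kind, pattern in (("phone", SPH_NAME), ("mic", FLAC_NAME)):
        match = pattern.match(name)
        if match:
            return {"kind": kind, **match.groupdict()}
    return None


# Microphone file for channel 05 of a "repeating questions" component.
info = parse_audio_name("20101201_150149_HRM_rptq_121169_CH05.flac")
# A hypothetical telephone file name would be "20110105_101500_1234.sph".
```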

The telephone audio is presented in SPHERE format because (a) this is
consistent with other telephone audio releases from the LDC, and (b)
flac does not support ulaw sample encoding.  The current release of
the open-source "sox" utility is able to handle both formats as input;
other utilities are available for both flac and SPHERE formats.


3.0  Related Documentation

The "docs" directory contains the following files (along with this
"readme.txt" file); their various contents are explained in the
subsections below:

   1 mx7spa_subjs.csv
   2 mx7spa_calls.csv
   3 mx7spa_ivcomponents.csv
   4 mx7_transcript_sentences.txt
   5 mx7_collection_doc.pdf

All the "*.csv" files are comma-delimited, plain-text "flat files".
Care has been taken to ensure that no field values contain commas as
part of the field data, so quotation marks are never used around field
values, and nothing is done to mark or "escape" certain characters
(such as apostrophes).  Consecutive commas on a line of data indicate
empty or null field values (which tends to be fairly common in the
subjects table).
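
Because the tables use no quoting or escaping, they can be read with very
simple tools.  The sketch below uses Python's standard csv module with
quoting disabled and maps empty fields to None.  It assumes that each table
begins with a header row; the three-column sample is invented for
illustration and is not taken from the corpus.

```python
import csv
import io


def read_flat_csv(text):
    """Parse comma-delimited text whose first line is a header row.

    Quoting is disabled because the tables never quote field values;
    empty fields are returned as None.
    """
    reader = csv.reader(io.StringIO(text), quoting=csv.QUOTE_NONE)
    header = next(reader)
    return [
        {name: (value if value != "" else None)
         for name, value in zip(header, row)}
        for row in reader
    ]


# Hypothetical three-column excerpt, for illustration only.
sample = "subjid,sex,yob\n000001,F,\n000002,,1980\n"
rows = read_flat_csv(sample)
```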


3.1 Subjects table (mx7spa_subjs.csv)

Each row in this table provides demographic information about one of
the speakers in the collection.  LDC staffers who led the interview
sessions and were involved in many of the phone calls that were
conducted during those sessions are included in this table.

   1  subjid - numeric identifier, links to calls and interviews
   2  sex - M or F
   3  yob - year of birth
   4  edu_years - years of formal education
   5  edu_degree - highest education degree earned
   6  edu_deg_yr - year in which highest degree was earned
   7  edu_contig - Y or N: were all edu_years spent contiguously?
   8  esl_age - for ESL speakers, age when English was learned
   9  ntv_lg - native language (ISO 639-3 code)
  10  oth_lgs - other languages (ISO 639-3 codes, '/'-separated)
  11  occup - occupation
  12  cntry_born - country where born
  13  state_born - state where born
  14  city_born - city where born
  15  cntry_rsd - country where raised
  16  state_rsd - state where raised
  17  city_rsd - city where raised
  18  ethnic - ethnicity
  19  smoker - Y or N
  20  ht_cm - height in centimeters
  21  wt_kg - weight in kilograms
  22  mother_born - country (state city) where mother was born
  23  mother_raised - country (state city) where mother was raised
  24  mother_lang - mother's native language
  25  mother_edu - mother's years of formal education
  26  father_born - country (state city) where father was born
  27  father_raised - country (state city) where father was raised
  28  father_lang - father's native language
  29  father_edu - father's years of formal education


3.2 Calls table (mx7spa_calls.csv)

Each row in this table provides the available information about a
2-channel telephone conversation.  The field inventory, including the
"lang" and "eng_stat" fields, is fixed in order to provide consistency
with other telephone corpus releases from the LDC (in the English-only
Mixer-6 collection, for example, those two fields were irrelevant).

   1  call_id - numeric identifier, links to audio file name
   2  call_date - links to audio file name
   3  lang - language in which the conversation was conducted
   4  eng_stat - one of: AllENG, SomeENG, NoENG
   5  sid_a - subjid of the speaker on channel A (may be marked by "*"; see 4.5 below)
   6  phid_a - telephone ID on channel A
   7  ph_categ_a - one of: M (main phone), O (other phone)
   8  phtyp_a - one of: 1 (cell phone), 2 (cordless), 3 (standard)
   9  phmic_a - one of: 1 (spkr-phone), 2 (headset), 3 (earbud), 4 (handheld)
  10  cnvq_a - audit judgment of conversation quality (Good,Acceptable,Unsuitable)
  11  sigq_a - audit judgment of signal quality (Good,Acceptable,Unsuitable)
   12  tbug_a - Y or N: auditor found a technical problem on channel A
  13-20 - same as 5-12, applied to channel B
  21  topic - numeric ID of the topic announced to the callers
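
For readers who want the full 21-column inventory spelled out, the Python
sketch below expands the "13-20" shorthand.  The "_b" field names are an
assumption, formed by analogy with the "_a" names listed above; the actual
header row of the table should be treated as authoritative.

```python
# Per-side fields, in the order given for columns 5-12.
SIDE_FIELDS = ["sid", "phid", "ph_categ", "phtyp", "phmic", "cnvq", "sigq", "tbug"]

CALLS_COLUMNS = (
    ["call_id", "call_date", "lang", "eng_stat"]   # columns 1-4
    + [f"{name}_a" for name in SIDE_FIELDS]        # columns 5-12
    + [f"{name}_b" for name in SIDE_FIELDS]        # columns 13-20 (names assumed)
    + ["topic"]                                    # column 21
)

# Codes for the phone-type and microphone-type fields, as listed above.
PHTYP = {"1": "cell phone", "2": "cordless", "3": "standard"}
PHMIC = {"1": "spkr-phone", "2": "headset", "3": "earbud", "4": "handheld"}


def column_number(name):
    """Return the 1-based column number of a field in the calls table."""
    return CALLS_COLUMNS.index(name) + 1
```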


3.3 Interview components table (mx7spa_ivcomponents.csv)

Each row in this table provides time stamps for partitioning a single
in-office multi-channel recording session into its distinct
components.

The time stamps provided for the call components are all based on
careful alignment between the 10-minute ulaw telephone audio and the
full-length session recording, so when the call portions of the session
audio files are extracted using the stated time stamps, the resulting
excerpt should align within a few milliseconds with the corresponding
ulaw channel.

The time stamps for the other components (rptq, ivnr, ivfr, rdtr) are
based on session log files (or have been set by manual annotation, in
about 50 cases where log files were not available).  The time stamps
from log files are prone to have varying margins of accuracy relative
to the actual start and end of the given component.  (See the discussion
in section 4.2 below about the transcript reading component.)

   1  comp_type - one of: rptq, ivnr, ivfr, phce, phsp, phhv, phlv, rdtr
   2  date - 8-digit year-month-day (YYYYMMDD)
   3  iv_bgntime - 6-digit hour-minute-second (HRMNSC)
   4  place - recording-room, one of: HRM, LDC
   5  comp_id - same as comp_type, incl. one-digit ID# for ph.. types (e.g. "phce1")
   6  subj_id - six-digit ID# references col.1 in mx7spa_subjs.csv
   7  duration - in seconds
   8  interlocutor_id - six-digit ID# references col.1 in mx7spa_subjs.csv
   9  language - ENG, SPA or MXD
  10  call_id - for ph.. types, 4-digit ID# references col.1 in mx7spa_calls.csv
  11  call_chan - for ph.. types, "A" or "B"
  12  ivchans_missing - semi-colon-delimited list of missing channels, if any

Columns 2-6 can be concatenated with underscore characters to form the file-ID
(minus the channel specification) for the corresponding set of pcm_flac audio
files -- e.g. taking the first row following the table header:

  rptq,20101201,150149,HRM,rptq,121169,23.363,121114,ENG,na,na,na,na,all_present

columns 2-6 can be joined as follows to form the file-ID:

       20101201_150149_HRM_rptq_121169

Since column 12 in that row shows "all_present", we can locate 14 channel
files for this interview component (repeating questions for subj_id 121169) as
follows:

   ./data/pcm_flac/CH*/20101201_150149_HRM_rptq_121169*

In cases where one or more channels are missing (due to problems in the
recording session), the missing channels are listed in column 12; if more than
one channel is missing for a given component, the missing channels are
conjoined with semi-colons in ascending order (e.g. "CH08;CH11;CH13").
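
The lookup described above can be sketched in Python as follows.  This is an
illustration rather than an official tool: the function name and the
"data_root" default are hypothetical, and the missing-channel value is read
from the last field on the line, which is where "all_present" appears in the
example row.

```python
def component_files(row, data_root="./data/pcm_flac"):
    """Return (file_id, paths) for one comma-delimited ivcomponents row.

    Assumption: the missing-channel list is the last field on the line.
    """
    fields = row.strip().split(",")
    file_id = "_".join(fields[1:6])                # columns 2-6
    missing_field = fields[-1]
    if missing_field == "all_present":
        missing = set()
    else:
        missing = set(missing_field.split(";"))    # e.g. "CH08;CH11;CH13"
    channels = [f"CH{n:02d}" for n in range(1, 15)]
    paths = [
        f"{data_root}/{ch}/{file_id}_{ch}.flac"
        for ch in channels
        if ch not in missing
    ]
    return file_id, paths


row = "rptq,20101201,150149,HRM,rptq,121169,23.363,121114,ENG,na,na,na,na,all_present"
file_id, paths = component_files(row)
```

A row whose last field is "CH08;CH11;CH13" would yield 11 paths instead of
14, with those three channel directories skipped.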


3.5 List of Transcript sentences (mx7_transcript_sentences.txt)

This is simply a list of the 335 sentence prompts displayed to speakers in
order to be read aloud.  All the sentences are in English, and have been drawn
from transcripts of spontaneous conversations in earlier LDC data collections.

In every session, the transcript-reading component always presented this list
in the order shown (one sentence at a time), and always started at the first
sentence.  If the speaker got to the end of the list quickly, with time
remaining in the session schedule for transcript reading, the list was simply
presented again, starting over at the first sentence.  (So, some sessions may
contain more than 335 sentence readings in this component, and in this case,
sentences at the start of the list will have been read twice.)


3.6 Corpus collection specifications (mx7_collection_doc.pdf)

This pdf file provides more detailed information about the Mixer 7 collection
project: the procedures for recruiting and recording, microphone
specifications, interview and auditing protocols, etc.


4.0 Known Problems and Difficulties

The following subsections list some issues where the data being published fall
short of expectations.


4.1 Some channels missing from some in-office recording sessions

As detailed in the mx7spa_ivcomponents.csv table (col.12, "ivchans_missing"),
there were a number of sessions where one or more channels failed to record as
intended.  In some cases a channel failure occurred mid-session, such that all
channels were recorded for initial components in the session, and some were
missing for later components.  The following list summarizes how many session
components are lacking a given channel (how many rows of the "ivcomponents"
table contain the given channel in col.12):

   4075 all_present (components with no missing channels)
     17 CH01
     10 CH02
      7 CH03
      7 CH04
     39 CH05
     20 CH06
      8 CH07
     23 CH08
     27 CH09
      8 CH10
     25 CH11
     25 CH12
     22 CH13
      9 CH14

There was also variation in the number and distribution of failed channels in
a given session; the following list summarizes the patterns of co-occurrences
in channel failures:

     10 CH01;CH02
      7 CH01;CH05;CH07;CH09;CH10
      7 CH03;CH05;CH08;CH11;CH12
      7 CH04
     10 CH05
      7 CH05;CH06;CH09
      1 CH05;CH07;CH10;CH11;CH12
      7 CH05;CH09;CH12
      6 CH06;CH08;CH09;CH13
      7 CH06;CH11
      3 CH08;CH11;CH12;CH13;CH14
      7 CH08;CH11;CH13
      7 CH12
      6 CH13;CH14
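
The two lists above are consistent with each other, which can be confirmed
with a short calculation.  The Python sketch below copies the co-occurrence
counts from this section and sums them per channel; the variable names are
illustrative only.

```python
from collections import Counter

# (number of components, channels missing together), copied from the list above.
PATTERNS = [
    (10, "CH01;CH02"),
    (7, "CH01;CH05;CH07;CH09;CH10"),
    (7, "CH03;CH05;CH08;CH11;CH12"),
    (7, "CH04"),
    (10, "CH05"),
    (7, "CH05;CH06;CH09"),
    (1, "CH05;CH07;CH10;CH11;CH12"),
    (7, "CH05;CH09;CH12"),
    (6, "CH06;CH08;CH09;CH13"),
    (7, "CH06;CH11"),
    (3, "CH08;CH11;CH12;CH13;CH14"),
    (7, "CH08;CH11;CH13"),
    (7, "CH12"),
    (6, "CH13;CH14"),
]

# Sum the pattern counts for every channel named in each pattern.
per_channel = Counter()
for count, pattern in PATTERNS:
    for channel in pattern.split(";"):
        per_channel[channel] += count

# Components that lack at least one channel.
affected_components = sum(count for count, _ in PATTERNS)
```

The per-channel sums reproduce the first list (for example, 39 for CH05 and
17 for CH01), and the patterns account for 92 components with at least one
missing channel.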


4.2 Variable performance in transcript reading task

While reading transcript sentences aloud, speakers would sometimes stumble,
restart, or misread the prompt text (adding, skipping or substituting one or
more words); there were also some cases where speakers interjected a remark or
question in between the sentence prompts.  Another issue was a relatively low
level of English reading proficiency among some native Spanish speakers,
causing the transcript reading task to be difficult and error-prone.  The
recording process would not be stopped for problems of this sort, and no
attempt has been made to edit them out of the audio files.  As a result, there
will be some difficulty in trying to align the sentence prompt text with the
corresponding audio portion in some number of sessions.


4.3 Relatively low recording levels in some channels or some sessions

Some of the microphones in the recording rooms were purposefully set at
distances beyond the stated performance specifications for the given mic.
This was done to cause some channels to serve as "stress tests" for relevant
speech technologies.  In addition, there were difficulties with other channels
in balancing between conflicting goals: (a) always maintain a constant gain
setting on each channel across all sessions; (b) take as much care as possible
in setting levels in order to avoid clipping; (c) employ techniques during the
phone call components to explicitly evoke both unusually loud and unusually
soft speech.  Of course, there are also wide differences among speakers in
terms of their intrinsic loudness, and during informal conversation, they
would often cover a wide dynamic range.  Taken together, these factors tended
to lead to relatively lower gain settings than would have been optimal in many
cases.


4.4 Resampling applied to multi-channel recordings

The 14 channels presented in the "pcm_flac" directory were originally recorded
via matched pairs of 8-channel A/D converters, running from a common clock
signal.  But relative to the 8KHz ulaw telephone channel (recorded via the
public telephone network), and a pair of "reserved" channels recorded via a
separate, 22KHz A/D device, it was found that the nominal 16000Hz sample rate
applied to channels 01-14 was actually closer to 15899 Hz.  This was measured
and confirmed manually over numerous sessions, and we also manually confirmed
that by doing a digital resampling from 15899 to 16000, the alignment of
channels 01-14 to both the 8KHz ulaw and 22KHz signals was correct.  This
resampling has been applied to all the pcm_flac data, and should have no more
than a negligible effect on signal analysis.
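
To give a sense of the size of this mismatch, the Python sketch below works
out how far an uncorrected recording would drift from the telephone audio.
It is arithmetic only, under the assumption that the true rate was exactly
15899 Hz; it does not reproduce the resampling procedure that was actually
applied, and the names are illustrative.

```python
from fractions import Fraction

NOMINAL_HZ = 16000
MEASURED_HZ = 15899


def nominal_duration(true_seconds):
    """Seconds a recording appears to last if samples captured at the
    measured rate are interpreted at the nominal rate without resampling."""
    samples = true_seconds * MEASURED_HZ
    return Fraction(samples, NOMINAL_HZ)


# Resampling ratio needed to restore the nominal rate.
ratio = Fraction(NOMINAL_HZ, MEASURED_HZ)

# Drift accumulated over a 10-minute (600-second) phone call component.
drift = 600 - nominal_duration(600)
```

Under that assumption, the drift over a 10-minute call is 3.7875 seconds,
which is why the correction was needed before the microphone channels could
be aligned with the telephone audio.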


4.5 Some call-side speaker-IDs not fully confirmed by audits

As indicated above in the description of the mx7spa_calls table, some rows
have an asterisk attached to the A- or B-side subj_id value.  These are cases
where the given call side was not audited at all, or the audit judgment was
uncertain as to actual speaker-ID.  This affects either side A or side B (never
both) in 29 calls.


-------
README prepared 2022-04-08 by David Graff <graff@ldc.upenn.edu>