README FILE for the REMIX Telephone Collection

Linguistic Data Consortium (LDC)

Authors:
  Preston Cabe (cabep@ldc.upenn.edu),
  David Graff (graff@ldc.upenn.edu),
  Karen Jones (karj@ldc.upenn.edu),
  Stephanie Strassel (strassel@ldc.upenn.edu),
  Kevin Walker (walkerk@ldc.upenn.edu)


1. Introduction

This corpus contains conversational telephone speech (CTS) that was
collected at the LDC between February and April 2012 under the
project title "REMIX".  The collection was created primarily to
support the NIST 2012 Speaker Recognition Evaluation (SRE12).

The participants in this collection were English speakers who were
selected on the basis of having completed both telephone calls and
multi-microphone interview sessions in a previous Mixer collection
project at the LDC (Mixer 4, 5, 6 or 7).

The data in this release include a subset of calls made in "noisy"
environments.  See the participant instructions in
docs/ParticipantInstructions_v6.pdf.


2. Summary of corpus content

  358 unique speakers 
 1917 calls / audio files,  3834 call sides
   39 topics actually used

Genre -  conversational telephone speech
Language - English


3. Annotation

Data were subject to both a quality audit and a speaker ID audit.

The main goals of the quality audit were to determine (a) that English was
being spoken, (b) the sex of the speaker, (c) whether the call was noisy or
non-noisy, (d) whether the signal was clear, and (e) whether there was more
than one speaker on the line.  Instructions provided to auditors are found in
docs/Quality_Auditing_Instructions_3.0.pdf.

The main goals of the speaker ID audit were to determine (a) that each speaker
in the REMIX study was correctly identified as being the same person as a given
speaker in a previous Mixer study and (b) that each REMIX call side associated
with a given speaker's PIN was spoken by that speaker.  Instructions provided to
SID auditors are found in docs/SID_Auditing_Instructions_v1.0.pdf.


4. Data organization

4.1 Audio data

The "data" directory contains the audio recordings of the 1917 calls,
presented as 2-channel, 8-bit, mu-law encoded sample data recorded at
8000 samples/second, with a NIST SPHERE-format header on each file.
(The sample data were captured digitally from the public telephone
network via a Verizon T-1 circuit.)  The audio file names are
structured as follows:

  {date}_{time}_{callID}.sph

where "date" and "time" identify when the call recording began,
expressed as year-month-day ("yyyymmdd") and hour-minute-second
("hhmmss").
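
The naming pattern above can be split into its parts with a few lines
of Python.  This is only a minimal sketch and not part of the corpus
tooling: the function name parse_audio_name and the example file name
are hypothetical, and the call ID is assumed to contain no underscore.

```python
import re
from datetime import datetime

# Pattern for {date}_{time}_{callID}.sph.
# Assumption: the call ID itself contains no underscore.
NAME_RE = re.compile(r"^(\d{8})_(\d{6})_([^_]+)\.sph$")

def parse_audio_name(name):
    """Return (recording start time, call ID string) for one audio file name."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError(f"unexpected audio file name: {name!r}")
    date, time, call_id = m.groups()
    # yyyymmdd + hhmmss, as described in this section
    started = datetime.strptime(date + time, "%Y%m%d%H%M%S")
    return started, call_id

# Made-up file name, for illustration only.
started, call_id = parse_audio_name("20120301_143005_12345.sph")
print(started.isoformat(), call_id)
```

The call ID is returned as a string so that it can be compared directly
with the values read from the tables in the "docs" directory.
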

4.2 Documentation

The "docs" directory contains the three sets of instructions mentioned
in sections 1 and 3 above, along with three tables, presented as
plain-text data files, with one row of tab-delimited table data per
line.  The tables provide detailed information about the recorded
calls, the speakers, and the topics that were presented for discussion
during the calls.

The first line of each table file provides the column headings for the
subsequent rows of data.  The columns are described in detail below
for each table.
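
Because each table has a heading line followed by tab-delimited rows,
it can be read with the Python standard library alone.  The sketch
below is illustrative: the helper name read_table is hypothetical, the
sample row is made up, and disabling quote handling is an assumption
about how the files were written.

```python
import csv
import io

def read_table(stream):
    """Read a tab-delimited table whose first line holds the column headings.

    Returns a list of dicts, one per data row, keyed by column heading.
    """
    # Assumption: fields are not quoted, so quote characters are kept literally.
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)

# Tiny in-memory stand-in for a table file; the row content is made up.
sample = io.StringIO("id\ttitle\ttext\n1\tpets\tDo you have a pet?\n")
rows = read_table(sample)
print(rows[0]["title"])
```

For a file on disk, the same function could be called as
read_table(open("docs/remix_topics.tsv", newline="", encoding="utf-8")),
where the text encoding is likewise an assumption.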


  remix_calls.tsv fields:

     1  callid -- numeric ID for the call
     2  fileid -- full file name for call audio (incl. recording date)
     3  subjid_a -- numeric ID of speaker on channel A
     4  subjid_b -- numeric ID of speaker on channel B
     5  phoneid_a -- encrypted phone number for channel A
     6  phoneid_b
     7  phone_type_a -- caller input regarding type of telephone
     8  phone_type_b
     9  phone_set_a -- caller input regarding type of microphone
    10  phone_set_b
    11  subjid_ok_a -- auditor decision about speaker ID
    12  subjid_ok_b
    13  caller_a_asserts_noise -- caller input regarding noise
    14  caller_b_asserts_noise
    15  auditor_a_heard_noise -- auditor perception about noise
    16  auditor_b_heard_noise
    17  mostly_speech_a -- auditor perception of speech quantity
    18  mostly_speech_b
    19  in_english_a -- auditor decision about language used
    20  in_english_b
    21  one_speaker_a -- auditor decision about no. of voices heard
    22  one_speaker_b
    23  topic_id -- numeric ID of announced topic (matches id in topics)
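
Since most columns in the calls table come in channel A / channel B
pairs, it can be convenient to turn each call row into two call-side
records.  The following is a sketch under stated assumptions: the
function name call_sides is hypothetical, the row is assumed to be a
dict keyed by the column headings listed above, and the sample values
are placeholders rather than real corpus data.

```python
def call_sides(call_row):
    """Split one remix_calls.tsv row (a dict) into two per-side dicts."""
    sides = []
    for ch in ("a", "b"):
        side = {
            "callid": call_row["callid"],
            "fileid": call_row["fileid"],
            "topic_id": call_row["topic_id"],
            "channel": ch.upper(),
        }
        for key, value in call_row.items():
            if key.endswith("_" + ch):
                # e.g. subjid_a -> subjid, phone_type_b -> phone_type
                side[key[:-2]] = value
            elif f"_{ch}_" in key:
                # e.g. caller_a_asserts_noise -> caller_asserts_noise
                side[key.replace(f"_{ch}_", "_")] = value
        sides.append(side)
    return sides

# Partial row with placeholder values, for illustration only.
row = {
    "callid": "7", "fileid": "placeholder", "topic_id": "3",
    "subjid_a": "1001", "subjid_b": "1002",
    "caller_a_asserts_noise": "value-a", "caller_b_asserts_noise": "value-b",
}
for side in call_sides(row):
    print(side["channel"], side["subjid"], side["caller_asserts_noise"])
```

Applied to every row of the calls table, this yields two records per
call, which is how 1917 calls correspond to 3834 call sides.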

  remix_subjects.tsv fields:

     1  subjid -- numeric ID of speaker (matches subjid_a/b in calls) 
     2  sex -- M or F
     3  yob -- year of birth
     4  edu_years -- years of education
     5  edu_degree -- highest education degree obtained
     6  edu_deg_yr -- year when last degree was awarded
     7  edu_contig -- was education contiguous?
     8  esl_age -- age when non-native English speaker learned English
     9  ntv_lg -- native language
    10  oth_lgs -- other languages spoken
    11  occup -- occupation
    12  cntry_born -- location where subject was born
    13  state_born
    14  city_born
    15  cntry_rsd -- location where subject grew up
    16  state_rsd
    17  city_rsd
    18  ethnic -- ethnicity
    19  smoker -- yes or no
    20  ht_cm -- height
    21  wt_kg -- weight
    22  mother_born -- parents' demographics
    23  mother_raised
    24  mother_lang
    25  mother_edu
    26  father_born
    27  father_raised
    28  father_lang
    29  father_edu
    30  total_sides -- call counts from previous collections
    31  deliv_sides -- call counts used in previous evals

The last two fields in the subjects table contain structured
information: the collection or evaluation cycle is presented as a
short label (e.g. "mx6", "SRE08"), followed by a colon and the number
of calls in that cycle.  When a subject has been recorded in more than
one collection, or used in more than one evaluation, the distinct
label:count pairs are separated by semicolons within the one
tab-delimited field (e.g. "mx7:25;mx6:16").
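
The structured format of these two fields can be unpacked as sketched
below.  The function name parse_cycle_counts is hypothetical, and
treating an empty field as "no entries" is an assumption, not
something this document specifies.

```python
def parse_cycle_counts(field):
    """Turn e.g. 'mx7:25;mx6:16' into {'mx7': 25, 'mx6': 16}."""
    counts = {}
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue  # assumption: an empty field means no entries
        # Split on the last colon: label on the left, count on the right.
        label, _, number = item.rpartition(":")
        counts[label] = int(number)
    return counts

print(parse_cycle_counts("mx7:25;mx6:16"))  # {'mx7': 25, 'mx6': 16}
```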

  remix_topics.tsv fields:

     1  id -- numeric ID (matches topic_id in calls)
     2  title -- keyword(s) for topic
     3  text -- full topic description presented to callers
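
Because topic_id in the calls table matches id in the topics table,
the two can be combined to see how often each topic was announced.
This is a minimal sketch with made-up rows: the function name
topic_usage is hypothetical, and rows are assumed to be dicts keyed by
the column headings.

```python
from collections import Counter

def topic_usage(call_rows, topic_rows):
    """Return (topic id, title, number of calls) tuples, most used first."""
    titles = {t["id"]: t["title"] for t in topic_rows}
    counts = Counter(c["topic_id"] for c in call_rows)
    # "?" marks a topic_id with no matching row in the topics table.
    return [(tid, titles.get(tid, "?"), n) for tid, n in counts.most_common()]

# Made-up rows, for illustration only.
topics = [{"id": "1", "title": "pets"}, {"id": "2", "title": "travel"}]
calls = [{"topic_id": "2"}, {"topic_id": "1"}, {"topic_id": "2"}]
for tid, title, n in topic_usage(calls, topics):
    print(tid, title, n)
```

The length of the returned list is the number of distinct topics that
occur in the calls table.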



9. Copyright Information

(c) copyright 2012, Trustees of the University of Pennsylvania