README FILE FOR GREYBEARD CORPUS AUDIO DATA AND TABLES
======================================================

The Greybeard Corpus was collected at the LDC in October and November
of 2008.  The goal was to record new telephone conversations among
subjects who had participated in one or more previous telephone
collections, dating as far back as Switchboard-1 (1991).

A total of 172 subjects were enrolled, all of whom had participated in
one of these older collections:

  Switchboard-1  1991-1992:   2 subjects
  Switchboard-2  1996-1997:  16 subjects
  Mixer 1 and 2  2003-2005: 103 subjects
  Mixer 3 (beginning) 2006:  51 subjects

(In these numbers, some subjects have been counted two or three times,
because they participated in two or three previous collections; more
details are provided below in section C under "TABLE DATA".)

Data collected in Mixer 3 from Jan. 2007 onward was left out, to
ensure a minimum temporal separation between the data taken from that
collection and the data recorded under Greybeard.


AUDIO DATA
==========

The corpus presents both the complete set of calls recorded during the
Greybeard collection itself, and all calls from the legacy collections
that involve the 172 Greybeard speakers.

Under the "data" directory, the call data from each collection project
is separated into subdirectories by project abbreviation:

  swb1/ :   36 calls
  swb2/ :  362 calls
  mx1/  : 2356 calls
  mx3/  :  828 calls
  gb1/  : 1098 calls
  ------------------
          4680 total


All audio data files have file names of the form:

  gb1_NNNN_YYYYMMDD.flac

where NNNN is a four-digit (zero-padded) sequence-ID number, and the
YYYYMMDD field represents the date when the call was recorded.

Note: The 4-digit sequence-ID field was intended to reflect the
overall chronological ordering of each call within the Greybeard
corpus as a whole.  However, in the "mx1" directory, 87 files have
4-digit numbers that do not match their chronological sequence.  The
date fields of these file names are correct and can be used to order
the files chronologically, and the 4-digit ID numbers still serve to
relate the file names to the call_info_mx1.csv table data, as
described below.
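
As a minimal sketch of the ordering advice above, the following Python
splits a file name into its fields and sorts on the date field rather
than the sequence ID.  The function names and the sample file names are
hypothetical, and the prefix is matched loosely as an assumption, since
only the overall gb1_NNNN_YYYYMMDD.flac form is documented here.

```python
import re
from datetime import date, datetime

# Pattern for names such as gb1_NNNN_YYYYMMDD.flac; the prefix is matched
# loosely because only the overall form is documented (an assumption).
NAME_RE = re.compile(r"^([a-z0-9]+)_(\d{4})_(\d{8})\.flac$")

def parse_name(file_name):
    """Return (prefix, sequence_id, recording_date) for one audio file name."""
    m = NAME_RE.match(file_name)
    if m is None:
        raise ValueError("unexpected file name: " + file_name)
    prefix, seq, ymd = m.groups()
    return prefix, int(seq), datetime.strptime(ymd, "%Y%m%d").date()

def chronological(file_names):
    """Sort by the date field first; the sequence ID only breaks ties."""
    return sorted(file_names, key=lambda n: (parse_name(n)[2], parse_name(n)[1]))

# Hypothetical names, invented for illustration only.
names = ["gb1_0003_20081101.flac", "gb1_0001_20081105.flac",
         "gb1_0002_20081020.flac"]
ordered = chronological(names)
```

The sequence ID is kept in the parsed result because it is still the
key that relates a file to its row in the call_info tables.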

All audio files are 2-channel, 8 kHz, 16-bit PCM sample data, in
FLAC-compressed form (http://flac.sourceforge.net); when uncompressed,
they have MS-WAV/RIFF headers.
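
Once a file has been decompressed with a FLAC decoder, the stated
format can be checked with Python's standard "wave" module.  This is
only a sketch: check_wav_format is a hypothetical helper name, and the
self-test uses a synthetic silent WAV built in memory, not corpus data.

```python
import io
import wave

def check_wav_format(wav_file):
    """Return True if a decoded WAV is 2-channel, 8000 Hz, 16-bit PCM."""
    with wave.open(wav_file, "rb") as w:
        found = (w.getnchannels(), w.getframerate(), w.getsampwidth())
    return found == (2, 8000, 2)  # sample width is in bytes, so 2 = 16 bits

# Self-test on one second of synthetic 2-channel silence (not corpus data).
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(2)
    w.setsampwidth(2)
    w.setframerate(8000)
    w.writeframes(b"\x00" * (8000 * 2 * 2))
buf.seek(0)
ok = check_wav_format(buf)
```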


COLLECTION PROCESS
==================

Speakers who had participated in previous LDC telephone collections
were asked to make and/or receive telephone calls via the LDC's robot
operator.  Most were asked to complete 12 calls, but some were asked
to do an extended collection of up to 24 calls.

The robot operator used a dedicated T-1 circuit (provided by Verizon);
half of the 24 lines on the circuit were assigned to handle incoming
calls, and the other half were reserved for making outbound calls.
During a designated period of time each day, the system would query
the project database for the phone numbers and personal identification
numbers (PINs) of participants who were available to be called; it
also accepted incoming calls at any time.

Each time a connection was established on an inbound or outbound line,
the system would ask the caller or callee for a PIN to verify that
person's identity.  If a caller entered a known PIN, or a callee entered
the PIN expected for the phone number that was dialed, the person was
asked to say their full name, and this was recorded and stored in a
separate audio file; the person would then be put on hold until
another connection on the circuit was available for conversation.
Once this happened, the two lines were bridged so that the two people
could converse, and the system announced a recorded topic description
for the pair to discuss.

The audio from each T-1 line was captured digitally by the system and
stored in a separate file as raw mu-law sample data.  As the
recordings were uploaded from the robot operator to network disk
storage each night, automated processes would reformat the audio into
a 2-channel SPHERE-format file for each conversation, and queue the
recordings for manual audit, so that speaker identification could be
verified, and other aspects of the recording could be checked and
rated.

In the auditing process, LDC staff would focus on one speaker at a
time, with access to all the speaker's call sides and the associated
full-name recordings, as well as full identifying information from the
database: first and last name, gender, and age.  Confirmation of
speaker-ID was based on comparing different recordings attributed
(according to PIN) to the same speaker.  Auditors gave impressionistic
judgments on overall audio quality, presence of background noise and
cross-channel echo, and any other technical difficulty with the call,
in addition to confirming the speaker-ID on each channel.  These
auditor decisions are provided in the "call_info" tables, described in
the next section.

Upon conclusion of the study, participants were paid on a per-call
basis, plus a bonus payment if they completed the target quantity of
calls.

To provide a more compact and convenient audio format for general
release, each 2-channel recording was converted from SPHERE to MS-WAV
file format, and the mu-law samples were converted to 16-bit PCM
(linear signed-integer); this in turn was compressed using FLAC.
Those familiar with mu-law encoding may be interested to know that in
converting mu-law to PCM, the "two mu-law zeros" (i.e. "positive 0"
and "negative 0", 0xff and 0x7f) have been collapsed in the
conventional way (both appear as the 16-bit integer value "0").
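
The collapse of the two zeros can be seen in a short Python sketch of
the widely used G.711 reference decoding.  This illustrates the
conventional mapping only; it is an assumption that the conversion tool
used for this release behaves the same way for every code value, and
ulaw_to_linear is a hypothetical function name.

```python
BIAS = 0x84  # 132, the bias used by the common G.711 reference decoder

def ulaw_to_linear(code):
    """Decode one 8-bit mu-law code to a signed 16-bit linear sample."""
    u = ~code & 0xFF              # mu-law bytes are stored bit-inverted
    t = ((u & 0x0F) << 3) + BIAS  # 4-bit mantissa plus bias
    t <<= (u & 0x70) >> 4         # scale by the 3-bit segment number
    return BIAS - t if u & 0x80 else t - BIAS

# "Positive 0" (0xff) and "negative 0" (0x7f) decode to the same value.
zeros = (ulaw_to_linear(0xFF), ulaw_to_linear(0x7F))
```

Because both codes map to 0, the original sign of a mu-law zero cannot
be recovered from the released PCM data.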


TABLE DATA
==========

The docs directory contains five types of tabulations for the
Greybeard corpus, described in sections A - E below:


A. - call_info_{corpus}.csv

The "corpus" field of each table file name matches one of the
subdirectories for audio files under the "data" directory.  Each comma-delimited
call_info table begins with a line of column heading labels, followed
by one row per call drawn from the given collection project.  The set
of 22 columns is identical across all tables, as follows:

 1 call_id     -- 4-digit sequence number used in audio file names
 2 corpus      -- collection name (matches path within data directory)
 3 call_date   -- YYYY-MM-DD_hh:mm:ss (time at which call was recorded)
 4 lang        -- USE if both speakers are native US; ENG otherwise
 5 eng_stat    -- always "All_ENG" (no non-English calls)

Channel-A properties:
 6 sid_a       -- subj_id number of speaker on channel A
 7 phid_a      -- phone-number identifier for channel A
 8 ph_categ_a  -- "M" for "Main phone", "O" for "Other phone"
 9 phtyp_a     -- type of phone: 1 (cell), 2 (cordless), 3 (land-line)
10 phmic_a     -- mic: 1 (spkr-phone), 2 (headset), 3 (earbud), 4 (handheld)
11 cnvq_a   -- conversation quality: G (good), A (acceptable), U (unusable)
12 sigq_a   -- signal quality: same rating codes as cnvq_a
13 tbug_a   -- technical problem in recording: Y (yes) or N (no)

Channel-B properties (same as A properties):
14 sid_b
15 phid_b
16 ph_categ_b
17 phtyp_b
18 phmic_b
19 cnvq_b
20 sigq_b
21 tbug_b

22 legacy_file -- full file name of this call in legacy corpus

The "legacy_file" field is intended to allow the older calls being
included in this corpus to be related to their original releases in
other corpora.
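
As a hedged illustration of how these columns might be used, the Python
sketch below reads a call_info table and keeps only calls rated "G" on
both quality measures for both channels, with no technical problem
flagged.  It assumes the header labels in the file match the column
names listed above.  The function name clean_calls and every value in
the two sample rows are invented for illustration, including the empty
legacy_file field.

```python
import csv
import io
from datetime import datetime

COLUMNS = ["call_id", "corpus", "call_date", "lang", "eng_stat",
           "sid_a", "phid_a", "ph_categ_a", "phtyp_a", "phmic_a",
           "cnvq_a", "sigq_a", "tbug_a",
           "sid_b", "phid_b", "ph_categ_b", "phtyp_b", "phmic_b",
           "cnvq_b", "sigq_b", "tbug_b",
           "legacy_file"]

def clean_calls(csv_file):
    """Yield (call_id, datetime) for calls rated G throughout, with no tbug."""
    for row in csv.DictReader(csv_file):
        sides_ok = all(row[k + s] == "G"
                       for k in ("cnvq_", "sigq_") for s in "ab")
        no_bug = row["tbug_a"] == "N" and row["tbug_b"] == "N"
        if sides_ok and no_bug:
            when = datetime.strptime(row["call_date"], "%Y-%m-%d_%H:%M:%S")
            yield row["call_id"], when

# Two hypothetical rows; every value below is invented for illustration.
sample = io.StringIO(
    ",".join(COLUMNS) + "\n"
    "0001,gb1,2008-10-20_14:05:33,USE,All_ENG,"
    "101,7,M,3,4,G,G,N,102,8,M,1,4,G,G,N,\n"
    "0002,gb1,2008-10-21_09:12:00,USE,All_ENG,"
    "103,9,M,1,2,G,A,N,104,5,O,3,4,G,G,Y,\n"
)
good = list(clean_calls(sample))
```

The call_id is kept as a string so that its zero padding still matches
the 4-digit field in the audio file names.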


B. - subjids_{target,nontarget}.txt

These are simply plain-text lists of the numeric subj_ids used to
refer to individual speakers in the Greybeard corpus.  The "target"
set comprises the speakers recorded during the Greybeard collection,
and the "nontarget" set consists of speakers from previous collections
who happened to be recorded in a call with a Greybeard speaker.

Note that the subj_id numbers used here for Switchboard participants
are different from those used in earlier publications of Switchboard
corpora.  Likewise, previous uses of data from the Mixer-1 collection
had a different set of speaker-ID numbers (as found in the NIST
"answer key" files for SRE cycles that used audio from Mixer 1).

In preparing for the Greybeard collection, we integrated the subject
pools from the Switchboard and Mixer-1 collections into the same
database table containing Mixer-3 subjects, thereby assigning new ID
numbers to the legacy subjects as needed, to maintain a single numeric
series.


C. - subj_info_{target,nontarget}.csv

These two tables present the available demographic information for
both the target Greybeard subjects and the non-target subjects that
appear in the legacy corpus calls.  Both tables are comma-delimited,
with column headings in the first line, and share the first 19 fields
listed below; the "target" table then adds two more columns (21 in
all), while the "nontarget" table adds one (20 in all):

 1 subjid    -- numeric ID that relates to the subj_ids in other tables
 2 sex       -- M or F
 3 yob       -- year of birth
 4 edu       -- number of years spent in school
 5 esl_age   -- number of years speaking English (for non-natives)
 6 ntv_lg    -- native language
 7 oth_lgs   -- other languages known by the subject
 8 occup     -- occupation
 9 cntry_born -- country where born
10 state_born -- state where born
11 city_born  -- city where born
12 cntry_rsd  -- country where raised
13 state_rsd  -- state where raised
14 city_rsd   -- city where raised
15 ethnic     -- ethnicity
16 smoker     -- Y or N
17 ht_cm      -- height in centimeters
18 wt_kg      -- weight in kilograms
19 presence   -- list of corpora containing this speaker (see below)

   -- for target speakers:
20 gb_calls   -- number of calls recorded in Greybeard collection
21 prv_calls  -- number of calls drawn from legacy collections

   -- for nontarget speakers:
20 n_calls    -- number of legacy calls containing this speaker

The "presence" field is a list of one or more labels for the
collections where the given speaker is present in one or more calls.
For speakers appearing in multiple collections, the "presence" field
has multiple labels connected by "+" (plus sign).  Below is a tally of
the different values found in this field, counting target and
nontarget speakers separately:

For target speakers:
    99 IN_GB1+IN_MX1
    47 IN_GB1+IN_MX3
    15 IN_GB1+IN_SWB2
     2 IN_GB1+IN_SWB1
     3 IN_GB1+IN_MX1+IN_MX3
     1 IN_GB1+IN_MX1+IN_MX3+IN_SWB2
     5 IN_GB1

For nontarget speakers:
  1053 IN_MX1
   519 IN_MX3
   311 IN_SWB2
    35 IN_SWB1
     1 IN_MX1+IN_MX3

The tally for target speakers shows that 5 Greybeard participants were
found (after the fact) to have been recruited by mistake: although
these people (shown with just "IN_GB1" in the "presence" column) had
been enrolled in a previous collection, they had not recorded any
usable calls prior to enrolling in Greybeard.  As a result, only 167
of the 172 Greybeard speakers actually have legacy calls.
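
The target-speaker tally can be cross-checked against the per-collection
subject counts given at the top of this file by splitting each
"presence" value on "+".  The Python below is a small sketch of that
arithmetic; the counts are copied from the tally above, and the names
TARGET_PRESENCE and per_collection are hypothetical.

```python
from collections import Counter

# Tally of "presence" values for target speakers, copied from the list above.
TARGET_PRESENCE = {
    "IN_GB1+IN_MX1": 99,
    "IN_GB1+IN_MX3": 47,
    "IN_GB1+IN_SWB2": 15,
    "IN_GB1+IN_SWB1": 2,
    "IN_GB1+IN_MX1+IN_MX3": 3,
    "IN_GB1+IN_MX1+IN_MX3+IN_SWB2": 1,
    "IN_GB1": 5,
}

def per_collection(tally):
    """Count speakers per collection label by splitting each value on '+'."""
    counts = Counter()
    for presence, n in tally.items():
        for label in presence.split("+"):
            counts[label] += n
    return counts

counts = per_collection(TARGET_PRESENCE)
# Speakers with at least one legacy collection besides Greybeard itself.
with_legacy = sum(n for p, n in TARGET_PRESENCE.items() if p != "IN_GB1")
```

This gives 172 speakers for IN_GB1, 103 for IN_MX1, 51 for IN_MX3, 16
for IN_SWB2 and 2 for IN_SWB1, with 167 speakers having legacy calls.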


D. - trial_monthdelta_histogram.tsv

For this table, the full set of legacy and Greybeard-epoch calls in
the corpus was pooled together for each subj_id, and tabulated with
respect to how many "model/test" identification trials could be
performed that have a given temporal distance between the model and
test data, for trials where the model and test samples are drawn from
the same speaker.

For example, suppose one subject had done exactly four call sides
(combining both legacy and Greybeard collections), as follows:

   "callid"  "date"
      a      1996-05
      b      1996-06
      c      2008-11
      d      2008-11

This set would yield 6 "trials" (a-b, a-c, a-d, b-c, b-d, c-d); one
trial would have an elapsed time delta of 0 (c-d), one would have a
delta of 1 (a-b), two would have a delta of 150 months (a-c, a-d) and
two would have a delta of 149 months (b-c, b-d).  So a "trial
month-delta histogram" table for just this one subject would look like
this:

    N       Delta_Months
    1       0
    1       1
    2       149
    2       150

In order to build a comprehensive histogram across all subjects, all
trials with a given time difference (in months) for all subjects are
summed together into one bin for that time difference.  Using all the
call_info tables for the Greybeard and various legacy collection
projects, we converted the YYYY-MM portions of all call dates for each
target speaker to a uniform "months-since-the-epoch" value, and used
that value to sum up the number of distinct target trials (pairings of
two call sides containing the same speaker) as a function of the
"delta months" in each trial.

As indicated in the example above, each row of the histogram gives the
number of possible target trials, followed by a tab character and the
number of months separating the model and test samples for that set of
trials.
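
The tabulation described in this section can be sketched in a few lines
of Python.  This is an illustration under stated assumptions, not the
script that produced the released table: the function names are
hypothetical, and the "epoch" is arbitrarily taken as year 0, which has
no effect on the differences.

```python
from collections import Counter
from itertools import combinations

def months(ym):
    """Convert 'YYYY-MM' to a month count (year * 12 + month)."""
    year, month = ym.split("-")
    return int(year) * 12 + int(month)

def delta_histogram(dates_by_subject):
    """Map delta-months to the number of same-speaker trials, over all subjects."""
    hist = Counter()
    for dates in dates_by_subject.values():
        # Every unordered pair of call sides from one speaker is one trial.
        for a, b in combinations(dates, 2):
            hist[abs(months(a) - months(b))] += 1
    return hist

# The single-subject example from this section; the subject key is arbitrary.
hist = delta_histogram({"s1": ["1996-05", "1996-06", "2008-11", "2008-11"]})
for delta in sorted(hist):
    print(f"{hist[delta]}\t{delta}")
```

For the four-call example this reproduces the rows shown earlier: one
trial each at 0 and 1 months, and two trials each at 149 and 150.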


E. - file.tbl

This is a whitespace-delimited table with one row per audio data file,
providing the MD5 checksum, byte count, modification date and path/
name of the file.
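
A file can be checked against its row in this table along the following
lines.  The Python below is a sketch that assumes the MD5 checksum is
the first field, the byte count the second, and the path the last; the
exact layout of the modification-date field is not relied upon.  The
function names are hypothetical, and the self-test uses a temporary
file and an invented table row rather than corpus data.

```python
import hashlib
import os
import tempfile

def parse_tbl_line(line):
    """Split one row, assuming MD5 first, byte count second, path last."""
    fields = line.split()
    return fields[0], int(fields[1]), fields[-1]

def verify(md5_hex, n_bytes, path):
    """Return True if the file at path matches the listed size and MD5."""
    if os.path.getsize(path) != n_bytes:
        return False
    digest = hashlib.md5()
    with open(path, "rb") as f:
        # Read in 1 MiB chunks so large audio files are not loaded at once.
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest() == md5_hex

# Self-test on a temporary file standing in for an audio file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"greybeard")
# Invented row; the date format shown here is a guess, not the real one.
line = f"{hashlib.md5(b'greybeard').hexdigest()} 9 2013-03-07 {tmp.name}"
matches = verify(*parse_tbl_line(line))
os.remove(tmp.name)
```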


===========
David Graff
graff@ldc.upenn.edu
March 7, 2013