README FILE FOR GREYBEARD CORPUS AUDIO DATA AND TABLES
======================================================

The Greybeard Corpus was collected at the LDC in October and November
of 2008. The goal was to record new telephone conversations among
subjects who had participated in one or more previous telephone
collections, dating as far back as Switchboard-1 (1991).

A total of 172 subjects were enrolled, all of whom had participated in
at least one of these older collections:

    Switchboard-1        1991-1992:    2 subjects
    Switchboard-2        1996-1997:   16 subjects
    Mixer 1 and 2        2003-2005:  103 subjects
    Mixer 3 (beginning)  2006:        51 subjects

(In these numbers, some subjects have been counted two or three times,
because they participated in two or three previous collections; more
details are provided below in section C under "TABLE DATA".)

Data collected in Mixer 3 from Jan. 2007 onward was left out, to ensure
a minimum temporal separation between the data taken from that
collection and the data recorded under Greybeard.


AUDIO DATA
==========

The corpus presents both the complete set of calls recorded during the
Greybeard collection itself, and all calls from the legacy collections
that involve the 172 Greybeard speakers. Under the "data" directory,
the call data from each collection project is separated into
subdirectories by project abbreviation:

    swb1/ :   36 calls
    swb2/ :  362 calls
    mx1/  : 2356 calls
    mx3/  :  828 calls
    gb1/  : 1098 calls
    ------------------
            4680 total

All audio data files have file names of the form:

    gb1_NNNN_YYYYMMDD.flac

where NNNN is a four-digit (zero-padded) sequence-ID number, and the
YYYYMMDD field represents the date when the call was recorded.

Note: The 4-digit sequence-ID field was intended to reflect the overall
chronological ordering of each call within the Greybeard corpus as a
whole. However, in the "mx1" directory, 87 files have 4-digit numbers
that do not match their chronological sequence.
The date fields of these file names are correct and can be used to
order the files chronologically, and the 4-digit ID numbers still serve
to relate the file names to the call_info_mx1.csv table data, as
described below.

All audio files are 2-channel, 8 kHz, 16-bit PCM sample data, in
FLAC-compressed form (http://flac.sourceforge.net); when uncompressed,
they have MS-WAV/RIFF headers.


COLLECTION PROCESS
==================

Speakers who had participated in previous LDC telephone collections
were asked to make and/or receive telephone calls via the LDC's robot
operator. Most were asked to complete 12 calls, but some were asked to
do an extended collection of up to 24 calls.

The robot operator used a dedicated T-1 circuit (provided by Verizon);
half of the 24 lines on the circuit were assigned to handle incoming
calls, and the other half were reserved for making outbound calls.
During a designated period of time each day, the system would query the
project database for the phone numbers and personal identification
numbers (PINs) of participants who were available to be called; it also
accepted incoming calls at any time.

Each time a connection was established on an inbound or outbound line,
the system would prompt the caller or callee for a PIN to verify that
person's identity. If a caller entered a known PIN, or a callee entered
the PIN expected for the phone number that was dialed, the person was
asked to say their full name, and this was recorded and stored in a
separate audio file; the person would then be put on hold until another
connection on the circuit was available for conversation. Once this
happened, the two lines were bridged so that the two people could
converse, and the system announced a recorded topic description for the
pair to discuss.

The audio from each T-1 line was captured digitally by the system and
stored in a separate file as raw mu-law sample data.
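
For readers who want to work with raw mu-law data of the kind captured
from the T-1 lines, the expansion to 16-bit linear PCM can be sketched
as below. This is the textbook G.711 decoding, not necessarily the tool
that was used at the LDC; the function and table names are illustrative
only.

```python
def mulaw_to_pcm16(code):
    """Expand one 8-bit mu-law code (0-255) to a signed linear sample.

    Textbook G.711 expansion; an assumption about the general method,
    not a description of the software used for this corpus.
    """
    code = ~code & 0xFF            # mu-law bytes are stored bit-inverted
    sign = code & 0x80             # after inversion, set bit = negative
    exponent = (code >> 4) & 0x07  # 3-bit segment number
    mantissa = code & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


# Lookup table covering every possible mu-law byte value.
PCM_TABLE = [mulaw_to_pcm16(c) for c in range(256)]
```

With this decoding, the codes 0xff and 0x7f both map to the sample
value 0, and the extreme codes 0x80 and 0x00 map to +32124 and -32124.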

As the recordings were uploaded from the robot operator to network disk
storage each night, automated processes would reformat the audio into a
2-channel SPHERE-format file for each conversation, and queue the
recordings for manual audit, so that speaker identification could be
verified, and other aspects of the recording could be checked and
rated.

In the auditing process, LDC staff would focus on one speaker at a
time, with access to all of the speaker's call sides and the associated
full-name recordings, as well as full identifying information from the
database: first and last name, gender, and age. Confirmation of
speaker-ID was based on comparing different recordings attributed
(according to PIN) to the same speaker.

Auditors gave impressionistic judgments on overall audio quality,
presence of background noise and cross-channel echo, and any other
technical difficulty with the call, in addition to confirming the
speaker-ID on each channel. These auditor decisions are provided in
the "call_info" tables, described in the next section.

Upon conclusion of the study, participants were paid on a per-call
basis, plus a bonus payment if they completed the target quantity of
calls.

To provide a more compact and convenient audio format for general
release, each 2-channel recording was converted from SPHERE to MS-WAV
file format, and the mu-law samples were converted to 16-bit PCM
(linear signed-integer); this in turn was compressed using FLAC.

Those familiar with mu-law encoding may be interested to know that in
converting mu-law to PCM, the "two mu-law zeros" (i.e. "positive 0"
and "negative 0", 0xff and 0x7f) have been collapsed in the
conventional way (both appear as the 16-bit integer value "0").


TABLE DATA
==========

The docs directory contains five types of tabulations for the Greybeard
corpus, described in sections A - E below:

A. - call_info_{corpus}.csv

The "corpus" field of the table file names matches the subdirectories
for audio files under the "data" directory. Each comma-delimited
call_info table begins with a line of column heading labels, followed
by one row per call drawn from the given collection project. The set
of 22 columns is identical across all tables, as follows:

     1  call_id     -- 4-digit sequence number used in audio file names
     2  corpus      -- collection name (matches path within data directory)
     3  call_date   -- YYYY-MM-DD_hh:mm:ss (time at which call was recorded)
     4  lang        -- USE if both speakers are native US; ENG otherwise
     5  eng_stat    -- always "All_ENG" (no non-English calls)

  Channel-A properties:
     6  sid_a       -- subj_id number of speaker on channel A
     7  phid_a      -- phone-number identifier for channel A
     8  ph_categ_a  -- "M" for "Main phone", "O" for "Other phone"
     9  phtyp_a     -- type of phone: 1 (cell), 2 (cordless), 3 (land-line)
    10  phmic_a     -- mic: 1 (spkr-phone), 2 (headset), 3 (earbud), 4 (handheld)
    11  cnvq_a      -- conversation quality: G (good), A (acceptable), U (unusable)
    12  sigq_a      -- signal quality: same as cnvq_a
    13  tbug_a      -- technical problem in recording: Y (yes) or N (no)

  Channel-B properties (same as A properties):
    14  sid_b
    15  phid_b
    16  ph_categ_b
    17  phtyp_b
    18  phmic_b
    19  cnvq_b
    20  sigq_b
    21  tbug_b

    22  legacy_file -- full file name of this call in legacy corpus

The "legacy_file" field is intended to allow the older calls being
included in this corpus to be related to their original releases in
other corpora.

B. - subjids_{target,nontarget}.txt

These are simply plain-text lists of the numeric subj_ids used to refer
to individual speakers in the Greybeard corpus. The "target" set
comprises the speakers recorded during the Greybeard collection, and
the "nontarget" set consists of speakers from previous collections who
happened to be recorded in a call with a Greybeard speaker.
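
As an illustration of how the subj_id values relate to the call_info
tables of section A, the following sketch lists every call side spoken
by one subject. It assumes that the heading labels in the first line
of each table are spelled exactly like the column names listed above;
the sample rows and subj_id values are made up, and only five of the
22 columns are shown.

```python
import csv
import io
from datetime import datetime

# Made-up rows for illustration only; real tables carry all 22 columns.
SAMPLE = """call_id,corpus,call_date,sid_a,sid_b
0001,swb2,1996-05-14_10:02:33,1001,1002
0002,gb1,2008-11-03_18:45:10,1003,1001
"""


def call_sides(table_file, subj_id):
    """Yield (call_id, corpus, channel, datetime) for each side of subj_id."""
    for row in csv.DictReader(table_file):
        # call_date is documented as YYYY-MM-DD_hh:mm:ss
        when = datetime.strptime(row["call_date"], "%Y-%m-%d_%H:%M:%S")
        for channel in ("a", "b"):
            if row["sid_" + channel] == str(subj_id):
                yield row["call_id"], row["corpus"], channel.upper(), when


# With a real table, pass an open file instead of the StringIO object,
# e.g. open("call_info_gb1.csv", newline="").
sides = list(call_sides(io.StringIO(SAMPLE), 1001))
```

Reading call_id as a string keeps the zero padding, so the value can be
compared directly with the NNNN field of the audio file names.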

Note that the subj_id numbers used here for Switchboard participants
are different from those used in earlier publications of Switchboard
corpora. Likewise, previous uses of data from the Mixer-1 collection
had a different set of speaker-ID numbers (as found in the NIST "answer
key" files for SRE cycles that used audio from Mixer 1). In preparing
for the Greybeard collection, we integrated the subject pools from the
Switchboard and Mixer-1 collections into the same database table
containing Mixer-3 subjects, thereby assigning new ID numbers to the
legacy subjects as needed, to maintain a single numeric series.

C. - subj_info_{target,nontarget}.csv

These two tables present the available demographic information for
both the target Greybeard subjects and the non-target subjects that
appear in the legacy corpus calls. Both tables are comma-delimited,
with column headings in the first line. The first 19 fields are
identical in both tables; the "target" table has 21 columns, while the
"nontarget" table has 20:

     1  subjid      -- numeric ID that relates to the subj_ids in other tables
     2  sex         -- M or F
     3  yob         -- year of birth
     4  edu         -- number of years spent in school
     5  esl_age     -- number of years speaking English (for non-natives)
     6  ntv_lg      -- native language
     7  oth_lgs     -- other languages known by the subject
     8  occup       -- occupation
     9  cntry_born  -- country where born
    10  state_born  -- state where born
    11  city_born   -- city where born
    12  cntry_rsd   -- country where raised
    13  state_rsd   -- state where raised
    14  city_rsd    -- city where raised
    15  ethnic      -- ethnicity
    16  smoker      -- Y or N
    17  ht_cm       -- height in centimeters
    18  wt_kg       -- weight in kilograms
    19  presence    -- list of corpora containing this speaker (see below)

  -- for target speakers:
    20  gb_calls    -- number of calls recorded in Greybeard collection
    21  prv_calls   -- number of calls drawn from legacy collections

  -- for nontarget speakers:
    20  n_calls     -- number of legacy calls containing this speaker

The "presence" field is a list of one or more
labels for the collections where the given speaker is present in one
or more calls. For speakers appearing in multiple collections, the
"presence" field has multiple labels connected by "+" (plus sign).
Below is a tally of the different values found in this field, counting
target and nontarget speakers separately:

  For target speakers:
      99  IN_GB1+IN_MX1
      47  IN_GB1+IN_MX3
      15  IN_GB1+IN_SWB2
       2  IN_GB1+IN_SWB1
       3  IN_GB1+IN_MX1+IN_MX3
       1  IN_GB1+IN_MX1+IN_MX3+IN_SWB2
       5  IN_GB1

  For nontarget speakers:
    1053  IN_MX1
     519  IN_MX3
     311  IN_SWB2
      35  IN_SWB1
       1  IN_MX1+IN_MX3

The tally for target speakers shows that 5 Greybeard participants were
found (after the fact) to have been recruited by mistake: although
these people (shown with just "IN_GB1" in the "presence" column) had
been enrolled in a previous collection, they had not recorded any
usable calls prior to enrolling in Greybeard. As a result, only 167 of
the 172 Greybeard speakers actually have legacy calls.

D. - trial_monthdelta_histogram.tsv

For this table, the full set of legacy and Greybeard-epoch calls in the
corpus was pooled together for each subj_id, and tabulated with respect
to how many "model/test" identification trials could be performed that
have a given temporal distance between the model and test data, for
trials where the model and test samples are drawn from the same
speaker.

For example, suppose one subject had done exactly four call sides
(combining both legacy and Greybeard collections), as follows:

    "callid"  "date"
       a      1996-05
       b      1996-06
       c      2008-11
       d      2008-11

This set would yield 6 "trials" (a-b, a-c, a-d, b-c, b-d, c-d); one
trial would have an elapsed time delta of 0 months (c-d), one would
have a delta of 1 month (a-b), two would have a delta of 150 months
(a-c, a-d) and two would have a delta of 149 months (b-c, b-d).
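
The pairing and month arithmetic in the four-call-side example above
can be checked with a short sketch. This is only an illustration of
the counting scheme as described here, not the script that produced
the released table; the function names and the "subj" key are
hypothetical.

```python
from collections import Counter
from itertools import combinations


def months_since_epoch(yyyy_mm):
    """Convert a 'YYYY-MM' string to a single running month count."""
    year, month = map(int, yyyy_mm.split("-"))
    return year * 12 + month


def month_delta_histogram(dates_by_subject):
    """Count same-speaker call-side pairs by their separation in months."""
    hist = Counter()
    for dates in dates_by_subject.values():
        months = [months_since_epoch(d) for d in dates]
        # every unordered pair of call sides from one speaker is one trial
        for m1, m2 in combinations(months, 2):
            hist[abs(m1 - m2)] += 1
    return hist


# The four call sides (a, b, c, d) of the example subject.
example = {"subj": ["1996-05", "1996-06", "2008-11", "2008-11"]}
hist = month_delta_histogram(example)
for delta in sorted(hist):
    print(f"{hist[delta]}\t{delta}")   # count, tab, delta in months
```

Passing several subjects in the same dictionary sums their trials into
shared bins, since pairs are only ever formed within one subject's
list of dates.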
So a "trial month-delta histogram" table for just this one subject
would look like this:

    N   Delta_Months
    1   0
    1   1
    2   149
    2   150

In order to build a comprehensive histogram across all subjects, all
trials with a given time difference (in months) for all subjects are
summed together into one bin for that time difference.

Using all the call_info tables for the Greybeard and various legacy
collection projects, we converted the YYYY-MM portions of all call
dates for each target speaker to a uniform "months-since-the-epoch"
value, and used that value to sum up the number of distinct target
trials (pairings of two call sides containing the same speaker) as a
function of the "delta months" in each trial.

As indicated in the example above, the histogram shows the total number
of possible target trials, followed by a tab character and the number
of months separating the model and test samples for that set of trials.

E. - file.tbl

This is a whitespace-delimited table with one row per audio data file,
providing the MD5 checksum, byte count, modification date and path/name
of the file.

===========
David Graff
graff@ldc.upenn.edu
March 7, 2013