README FILE for the REMIX Telephone Collection

Linguistic Data Consortium (LDC)

Authors: Preston Cabe (cabep@ldc.upenn.edu),
         David Graff (graff@ldc.upenn.edu),
         Karen Jones (karj@ldc.upenn.edu),
         Stephanie Strassel (strassel@ldc.upenn.edu),
         Kevin Walker (walkerk@ldc.upenn.edu)


1. Introduction

This corpus contains conversational telephone speech (CTS) that was
collected at the LDC between February and April 2012, under the project
title "REMIX". The collection was created primarily to support the NIST
2012 Speaker Recognition Evaluation (SRE12).

The participants in this collection were English speakers who were
selected on the basis of having completed both telephone calls and
multi-microphone interview sessions in a previous Mixer collection
project at the LDC (Mixer 4, 5, 6 or 7).

The data in this release include a subset of calls made in "noisy"
environments. See the participant instructions in
docs/ParticipantInstructions_v6.pdf.


2. Summary of corpus content

   358 unique speakers
   1917 calls / audio files, 3834 call sides
   39 topics actually used
   Genre - conversational telephone speech
   Language - English


3. Annotation

Data were subject to both a quality audit and a speaker ID audit.

The main goals of the quality audit were to determine (a) that English
was being spoken, (b) the sex of the speaker, (c) whether the call was
noisy or non-noisy, (d) whether the signal was clear, and (e) whether
there was more than one speaker on the line. Instructions provided to
auditors are found in docs/Quality_Auditing_Instructions_3.0.pdf.

The main goals of the speaker ID audit were to determine (a) that each
speaker in the REMIX study was correctly identified as being the same
person as a given speaker in a previous Mixer study and (b) that each
REMIX call side associated with a given speaker's PIN was spoken by
that speaker. Instructions provided to SID auditors are found in
docs/SID_Auditing_Instructions_v1.0.pdf.


4. Data organization

4.1 Audio data

The "data" directory contains the audio recordings of the 1917 calls,
presented as 2-channel, 8-bit, mu-law encoded sample data recorded at
8000 samples/second, with a NIST SPHERE-format header on each file.
(The sample data were captured digitally from the public telephone
network via a Verizon T-1 circuit.)

The audio file names are structured as follows:

   {date}_{time}_{callID}.sph

where "date" and "time" identify when the call recording began,
expressed as year-month-day ("yyyymmdd") and hour-minute-second
("hhmmss").

4.2 Documentation

The "docs" directory contains the three sets of instructions mentioned
in sections 1 and 3 above, along with three tables, presented as
plain-text data files, with one row of tab-delimited table data per
line. The tables provide detailed information about the recorded calls,
the speakers, and the topics that were presented for discussion during
the calls.

The first line of each table file provides the column headings for the
subsequent rows of data. The columns are described in detail below for
each table.

remix_calls.tsv fields:

    1  callid                  -- numeric ID for the call
    2  fileid                  -- full file name for call audio (incl. recording date)
    3  subjid_a                -- numeric ID of speaker on channel A
    4  subjid_b                -- numeric ID of speaker on channel B
    5  phoneid_a               -- encrypted phone number for channel A
    6  phoneid_b
    7  phone_type_a            -- caller input regarding type of telephone
    8  phone_type_b
    9  phone_set_a             -- caller input regarding type of microphone
   10  phone_set_b
   11  subjid_ok_a             -- auditor decision about speaker ID
   12  subjid_ok_b
   13  caller_a_asserts_noise  -- caller input regarding noise
   14  caller_b_asserts_noise
   15  auditor_a_heard_noise   -- auditor perception about noise
   16  auditor_b_heard_noise
   17  mostly_speech_a         -- auditor perception of speech quantity
   18  mostly_speech_b
   19  in_english_a            -- auditor decision about language used
   20  in_english_b
   21  one_speaker_a           -- auditor decision about no. of voices heard
   22  one_speaker_b
   23  topic_id                -- numeric ID of announced topic (matches id in topics)

remix_subjects.tsv fields:

    1  subjid         -- numeric ID of speaker (matches subjid_a/b in calls)
    2  sex            -- M or F
    3  yob            -- year of birth
    4  edu_years      -- years of education
    5  edu_degree     -- highest education degree obtained
    6  edu_deg_yr     -- year when last degree was awarded
    7  edu_contig     -- was education contiguous?
    8  esl_age        -- age when non-native English speaker learned English
    9  ntv_lg         -- native language
   10  oth_lgs        -- other languages spoken
   11  occup          -- occupation
   12  cntry_born     -- location where subject was born
   13  state_born
   14  city_born
   15  cntry_rsd      -- location where subject grew up
   16  state_rsd
   17  city_rsd
   18  ethnic         -- ethnicity
   19  smoker         -- yes or no
   20  ht_cm          -- height
   21  wt_kg          -- weight
   22  mother_born    -- parents' demographics
   23  mother_raised
   24  mother_lang
   25  mother_edu
   26  father_born
   27  father_raised
   28  father_lang
   29  father_edu
   30  total_sides    -- call counts from previous collections
   31  deliv_sides    -- call counts used in previous evals

The last two fields in the subjects table contain structured
information: the collection or evaluation cycle is presented as a short
label (e.g. "mx6", "SRE08"), followed by a colon and the number of
calls in that cycle. When a subject has been recorded in more than one
collection, or used in more than one evaluation, the distinct labels
and counts are separated by semicolons within the one tab-delimited
field (e.g. "mx7:25;mx6:16").

remix_topics.tsv fields:

    1  id     -- numeric ID (matches topic_id in calls)
    2  title  -- keyword(s) for topic
    3  text   -- full topic description presented to callers


9. Copyright Information

(c) copyright 2012, Trustees of the University of Pennsylvania
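
Usage sketch (illustrative only, not part of the original release):
the Python fragment below shows one possible way to read the
conventions described in section 4, namely the
{date}_{time}_{callID}.sph file-name pattern, the tab-delimited tables
with a heading line, and the "label:count;label:count" structure of
the total_sides and deliv_sides fields. The function names, the sample
file name and the sample table row are hypothetical. Treating an empty
count field as "no earlier calls" is an assumption, as is the absence
of quoting in the table files.

```python
import csv
from datetime import datetime


def parse_audio_name(file_name):
    """Split '{date}_{time}_{callID}.sph' into a start time and a call ID."""
    # Drop the .sph extension if it is present.
    stem = file_name[:-4] if file_name.endswith(".sph") else file_name
    date_part, time_part, call_id = stem.split("_", 2)
    # date is yyyymmdd and time is hhmmss, per section 4.1.
    started = datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")
    return {"started": started, "callid": call_id}


def parse_cycle_counts(field):
    """Turn a field such as 'mx7:25;mx6:16' into {'mx7': 25, 'mx6': 16}."""
    counts = {}
    for item in field.split(";"):
        item = item.strip()
        if not item:
            continue  # assumption: an empty field means no earlier calls
        label, _, number = item.partition(":")
        counts[label] = int(number)
    return counts


def read_table(lines):
    """Read a tab-delimited table whose first line holds the column names."""
    # Assumption: fields are never quoted, so quoting is switched off.
    reader = csv.DictReader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


# Hypothetical file name and table row, made up for this sketch.
info = parse_audio_name("20120215_143000_1234.sph")
rows = read_table(["subjid\tsex\ttotal_sides\n",
                   "1001\tF\tmx7:25;mx6:16\n"])
counts = parse_cycle_counts(rows[0]["total_sides"])
```

To read an actual table, an open file object can be passed in place of
the list of lines, e.g. read_table(open("docs/remix_subjects.tsv",
newline="")). The character encoding of the table files is not stated
in this README, so it may need to be given explicitly.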