AnnoDIFP Session Audio and Transcripts
LDC2025S06
December 16, 2024
Linguistic Data Consortium


1. Overview
===========

AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) was
created by the Linguistic Data Consortium (LDC), Florida Institute of
Technology (FIT), and University of New Haven (UNH) to support development of
algorithms for prediction of personality traits. It consists of audio
recordings from both in-person interviews and conversational telephone speech
(CTS) collections, paired with scores from two self-reported personality
assessments: HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short
Dark Triad (SD3).

This release contains audio data and transcripts from the in-person session
component of AnnoDIFP, comprising 438.34 hours of session recordings from 366
participants.

More information about the corpus design, collection, protocol, processing,
and annotation is provided in the file "docs/annodifp_collection_doc.pdf".


2. Directory Structure
======================

- data/<PARTICIPANT-ID>/flac/ -- FLAC from the four microphones for
  participant PARTICIPANT-ID
- data/<PARTICIPANT-ID>/transcripts/ -- transcripts for participant
  PARTICIPANT-ID
- data/<PARTICIPANT-ID>/sections/ -- section segmentation for participant
  PARTICIPANT-ID
- docs/annodifp_collection_doc.pdf -- detailed description of corpus design,
  collection, protocols, processing, and annotation
- docs/scores.tbl -- ground truth scores for participants
- docs/file.tbl -- listing of md5 checksums, sizes, dates, and file names
- README.txt -- this file


3. File naming convention
=========================

Data files (FLAC, transcripts, section segmentation) are named according to
the following convention:

    <PARTICIPANT-ID>_<MIC>[_<PART>]<EXT>

where:

- PARTICIPANT-ID -- the anonymized participant id for the subject of the
  recording
- MIC -- name of the microphone that the recording or annotation corresponds
  to; one of:
    - ceiling -- distant microphone on ceiling in room containing participant
    - desk -- distant microphone on desk in room containing participant
    - intrlav -- lavalier worn by interviewer (different room)
    - partlav -- lavalier worn by participant
- PART -- a letter ("A", "B", ...) indicating order of the recording within a
  multipart session; only used when the recording platform was restarted,
  resulting in multiple sets of files (see Section 6.4)
- EXT -- file extension; either .flac or .tsv

Participant ids are 5-character alphanumeric sequences whose first character
indicates the enrollment site:

- 2 -- UNH
- 3 -- FIT
- 7 -- LDC


4. File formats
===============

4.1 Audio

All audio is provided in the form of 16 kHz, 16-bit mono-channel FLAC files.
The audio was downsampled to 16 kHz from the original sample rate of 48 kHz
using SoX (https://sourceforge.net/projects/sox/).

4.2 Transcripts

Each session (or session part, in the case of a multipart session) is
accompanied by a transcript produced automatically using the Rev.ai
(https://www.rev.ai/) speech-to-text service.
These transcripts are stored as tab-delimited files in which each row contains
one transcribed utterance with the following columns:

- utterance_id -- unique identifier for utterance
- audio_file -- basename of source audio file that transcription was performed
  from
- channel -- channel (1-indexed) on that file
- speaker_id -- unique identifier for speaker; three speaker types are
  supported:
    - participant -- speaker id for participants is always the same as their
      participant id
    - interviewer -- the speaker id for the interviewer from a session is
      always "interviewer"
    - anonymous -- other speakers recognized by the diarization are assigned
      anonymous speaker ids of the form "speaker01", "speaker02", ...
- onset -- onset in seconds from beginning of recording
- offset -- offset in seconds from beginning of recording
- transcript -- human-readable transcript
- tokens -- whitespace-delimited list of ASR tokens

4.3 Section segmentation

Each session (or session part, in the case of a multipart session) is
accompanied by a file indicating the onset and offset of each task in the
session. These are tab-delimited files containing one row per task with the
following columns:

- audio_file -- basename of source audio file that annotation was performed
  from
- channel -- channel (1-indexed) on that file
- onset -- onset in seconds from beginning of recording
- offset -- offset in seconds from beginning of recording
- label -- name of task; one of:
    - Interview
    - YouTube
    - MapTask
    - JobScenario


5. Personality scores
=====================

5.1 Overview

Ground truth personality trait data for participants are provided for the six
dimensions of HEXACO-PI-R 100 and three dimensions of Short Dark Triad (SD3).
Additionally, a derived dimension, agreeableness*, is reported for each
participant; it is the average of their HEXACO Honesty-Humility score and
their reversed SD3 Machiavellianism score.
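
The derivation of agreeableness* described above can be sketched in Python as
follows. This is only an illustration, not the procedure used to produce the
released scores: it ASSUMES that both dimension scores lie on a 1-5 response
scale, so that reversing a score x gives 1 + 5 - x. The scale bounds and the
function names are hypothetical and are not specified in this README; see
"docs/annodifp_collection_doc.pdf" for the actual scoring details.

```python
def reverse_score(score, scale_min=1.0, scale_max=5.0):
    """Reverse a score on a bounded scale.

    ASSUMPTION: the 1-5 default bounds are illustrative only; this README
    does not state the scale used for the released scores.
    """
    return scale_min + scale_max - score


def agreeableness_star(honesty_humility, machiavellianism,
                       scale_min=1.0, scale_max=5.0):
    """Average of honesty-humility and reversed Machiavellianism."""
    reversed_mach = reverse_score(machiavellianism, scale_min, scale_max)
    return (honesty_humility + reversed_mach) / 2.0
```

Under these assumptions, a participant with a honesty_humility score of 4.0
and a machiavellianism score of 2.0 would receive an agreeableness* of 4.0.
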
Two categorical variables are also provided:

- emotionality_cat -- each participant is classed as being low, mid, or high
  emotionality by comparing their HEXACO emotionality score to the score
  distribution of the entire group of Phase 1 participants as follows:
    - low -- score is more than one standard deviation below the group mean
    - mid -- score is within one standard deviation of the group mean
    - high -- score is more than one standard deviation above the group mean
- agreeableness_cat -- defined identically to emotionality_cat, but using
  agreeableness*

5.2 Scores table (docs/scores.tbl)

Scores are stored under "docs/scores.tbl", which is a tab-delimited file
containing scores for one participant per row, each row containing the
following 13 columns:

- participant_id -- the anonymized participant id
- honesty_humility -- HEXACO-PI-R 100 honesty/humility
- emotionality -- HEXACO-PI-R 100 emotionality
- extraversion -- HEXACO-PI-R 100 extraversion
- agreeableness -- HEXACO-PI-R 100 agreeableness
- conscientiousness -- HEXACO-PI-R 100 conscientiousness
- openness -- HEXACO-PI-R 100 openness
- machiavellianism -- Short Dark Triad (SD3) Machiavellianism
- narcissism -- Short Dark Triad (SD3) narcissism
- psychopathy -- Short Dark Triad (SD3) psychopathy
- agreeableness-star -- agreeableness*; i.e., the average of
  "honesty_humility" and reversed "machiavellianism"
- emotionality_cat -- categorical emotionality variable
- agreeableness_cat -- categorical agreeableness* variable

**NOTE** that this file contains scores for ALL participants who participated
in either the in-person session or CTS collections.


6. Known issues
===============

6.1 Audio missing for participants who did not consent to future use

When participants enrolled in the study, the consent form allowed them to opt
out of future use of their audio data. Out of the 386 people who participated
in the study, 65 opted out of future use of their data.
For these participants, transcripts and section segmentation are provided, but
all audio data is omitted.

6.2 Channels missing due to technical issues

Two of the in-person sessions contained in this release do not have complete
sets of 4 audio channels in the "flac/" subdirectories:

- 7PKKF -- lacks "intrlav" channel
- 7XIVG -- lacks "ceiling" channel

In the case of participant 7XIVG, speech-to-text was performed using the desk
channel.

6.3 Entire sessions missing due to technical issues

While 386 people participated in the study, technical issues during recording
resulted in loss of all audio from recording devices for 20 participants:

- 2DFYC
- 2GGYE
- 2JKHA
- 2REXP
- 3AQER
- 3ETKK
- 3ETON
- 3GTIQ
- 3JWWK
- 3OZKB
- 3RGPP
- 3VZWY
- 3XYWL
- 3XZVZ
- 7AJMQ
- 7NXPD
- 7PRLC
- 7QJWY
- 7ZZZW
- 7ZZZX

Due to a lack of usable audio, these participants are not included in the
present release. NOTE, though, that these participants may have data in the
corresponding CTS collection.

6.4 Sessions split across multiple "parts"

There were 9 sessions during which the recording had to be restarted due to
technical issues. These restarts resulted in two sets of channel recordings
for each session: one set from before the restart and one from after.
When such pairs of FLAC/transcript/section segmentation files occur, the file
from BEFORE the restart is suffixed with "_A" and the file from AFTER is
suffixed with "_B"; e.g.:

- 7ZPKH_ceiling_A.flac
- 7ZPKH_ceiling_B.flac

The affected participants are:

- 2OUIA
- 3JNBA
- 3ZIKF
- 7BMAK
- 7IOTN
- 7PXTS
- 7UDES
- 7UJST
- 7ZPKH

6.5 Partial sessions

During 7 sessions, technical issues truncated the recording, resulting in one
or more missing tasks:

- 2BKKP -- missing Interview, YouTube, and MapTask
- 3HOUR -- missing Interview and YouTube
- 3MJDT -- missing Interview and YouTube
- 7DKCG -- missing JobScenario
- 7DNDF -- missing Interview and YouTube
- 7PKKF -- missing Interview and YouTube
- 7XIVG -- missing JobScenario

Additionally, one further session is missing the YouTube task, which the
interviewer skipped because the participant had difficulty devising a video
idea:

- 2HBZY

6.6 Variable quality of synchronization for intrlav channel

In some cases, synchronizing the "intrlav" channel with the other channels is
difficult due to crosstalk. While interviewers were instructed to wear
headphones, sometimes they failed to do so and instead listened to the
participant via speakers hooked up to their computer. When this occurs, there
is significant crosstalk from the participant, often at a delay, which
interferes with the synchronization procedure.


7. Files table (docs/file.tbl)
==============================

Expected sizes, modification times, and MD5 checksums for all files within the
"data/" directory are recorded in "docs/file.tbl". This is a tab-delimited
table containing one file per line, each line having the following 4 fields:

- checksum -- MD5 checksum of file
- size -- size of file in bytes
- datetime -- last modification date in YYYY-MM-DD_HH:MM:SS format
- path -- path to file relative to root of release directory


8. Contacts
===========

If you have questions about this data release, please contact the following
LDC personnel:

Neville Ryant