AnnoDIFP Session Audio and Transcripts

                               LDC2025S06
                            December 16, 2024

                      Linguistic Data Consortium


1. Overview
===========
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality)
was created by the Linguistic Data Consortium (LDC), Florida Institute of
Technology (FIT), and University of New Haven (UNH) to support development
of algorithms for prediction of personality traits. It consists of audio
recordings from both in-person interviews and conversational telephone
speech collections paired with scores from two self-reported personality
assessments – HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and
Short Dark Triad (SD3).

This release contains audio data and transcripts from the in-person session
component of AnnoDIFP, comprising 438.34 hours of session recordings from 366
participants.

More information about the corpus design, collection, protocol, processing,
and annotation is provided in the file "docs/annodifp_collection_doc.pdf".


2. Directory Structure
======================
- data/<PARTICIPANT-ID>/flac/  --  FLAC from the four microphones for
  participant PARTICIPANT-ID
- data/<PARTICIPANT-ID>/transcripts/  --  transcripts for participant
  PARTICIPANT-ID
- data/<PARTICIPANT-ID>/sections/  --  section segmentation for participant
  PARTICIPANT-ID
- docs/annodifp_collection_doc.pdf  --  detailed description of corpus
  design, collection, protocols, processing, and annotation
- docs/scores.tbl   --  ground truth scores for participants
- docs/file.tbl   --  listing of md5 checksums, sizes, dates, and file names
- README.txt  --  this file


3. File naming convention
=========================
Data files (FLAC, transcripts, section segmentation) are named according to the
following convention:

    <PARTICIPANT-ID>_<MIC>[_<PART>]<EXT>

where:

- PARTICIPANT-ID  --  the anonymized participant id for the subject of the
  recording
- MIC  --  name of the mic that the recording or annotation corresponds to;
  one of:
  - ceiling  --  distant microphone on ceiling in room containing participant
  - desk  --  distant microphone on desk in room containing participant
  - intrlav  --  lavalier worn by interviewer (different room)
  - partlav  --  lavalier worn by participant
- PART  --  a letter ("A", "B", ...) indicating order of the recording within
  a multipart session; only used when the recording platform was restarted,
  resulting in multiple sets of files (see Section 6.4)
- EXT  --  file extension; either .flac or .tsv

Participant ids are 5-character alphanumeric sequences whose first character
indicates the enrollment site:

- 2 -- UNH
- 3 -- FIT
- 7 -- LDC
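
As an illustration only, the naming convention above can be decoded with a
regular expression. The sketch below is not part of the release; the names
FILENAME_RE, SITES, and parse_name are hypothetical, and the pattern assumes
that PART is a single uppercase letter and that participant ids use only ASCII
letters and digits.

```python
import re

# Pattern for <PARTICIPANT-ID>_<MIC>[_<PART>]<EXT> as described above.
# Assumption: PART is one uppercase letter; ids are 5 ASCII alphanumerics.
FILENAME_RE = re.compile(
    r"^(?P<participant_id>[0-9A-Za-z]{5})"
    r"_(?P<mic>ceiling|desk|intrlav|partlav)"
    r"(?:_(?P<part>[A-Z]))?"
    r"(?P<ext>\.flac|\.tsv)$"
)

# First character of the participant id -> enrollment site.
SITES = {"2": "UNH", "3": "FIT", "7": "LDC"}


def parse_name(filename):
    """Split a data file name into its fields; 'part' is None if absent."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not a recognized data file name: {filename!r}")
    fields = match.groupdict()
    # Unknown site codes map to None rather than raising.
    fields["site"] = SITES.get(fields["participant_id"][0])
    return fields


if __name__ == "__main__":
    print(parse_name("7ZPKH_ceiling_A.flac"))
```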


4. File formats
===============
4.1 Audio

All audio is provided in the form of 16 kHz, 16-bit mono-channel FLAC files.
The audio was downsampled to 16 kHz from the original sample rate of 48 kHz using
SoX (https://sourceforge.net/projects/sox/).


4.2 Transcripts

Each session (or session part in the case of a multipart session) is accompanied
by a transcript produced automatically using the Rev.ai
(https://www.rev.ai/) speech-to-text service. These transcripts are
stored as tab-delimited files in which each row contains one transcribed
utterance with the following columns:

- utterance_id  --  unique identifier for utterance
- audio_file  --  basename of source audio file that transcription was
  performed from
- channel  --  channel (1-indexed) on that file
- speaker_id  --  unique identifier for speaker; three speaker types are
  supported:
  - participant  -- speaker id for participants is always the same as their
    participant id
  - interviewer  --  the speaker id for the interviewer from a session is
    always "interviewer"
  - anonymous  --  other speakers recognized by the diarization are assigned
    anonymous speaker ids of the form "speaker01", "speaker02", ...
- onset  --  onset in seconds from beginning of recording
- offset  --  offset in seconds from beginning of recording
- transcript   --  human-readable transcript
- tokens  --  whitespace-delimited list of ASR tokens
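
A minimal sketch of reading one of these transcript files with the Python
standard library follows. It is illustrative rather than an official loader:
the function name read_transcript is hypothetical, and because this README
does not say whether the files begin with a header row, that is exposed as the
has_header argument (an assumption to check against the actual files).

```python
import csv

# Column order as listed above.
TRANSCRIPT_COLUMNS = [
    "utterance_id", "audio_file", "channel", "speaker_id",
    "onset", "offset", "transcript", "tokens",
]


def read_transcript(path, has_header=False):
    """Return a list of dicts, one per utterance row.

    Assumption: set has_header=True if the first line holds column names.
    """
    rows = []
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE: treat quote characters in transcripts as plain text.
        reader = csv.reader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        if has_header:
            next(reader, None)
        for fields in reader:
            if not fields:
                continue  # skip blank lines
            row = dict(zip(TRANSCRIPT_COLUMNS, fields))
            row["channel"] = int(row["channel"])   # 1-indexed channel
            row["onset"] = float(row["onset"])     # seconds
            row["offset"] = float(row["offset"])   # seconds
            rows.append(row)
    return rows
```

Rows spoken by the participant can then be selected by comparing "speaker_id"
with the participant id, since the two are always identical.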


4.3 Section segmentation

Each session (or session part in the case of a multipart session) is
accompanied by a file indicating the onset and offset of each task in the
session. These are tab-delimited files containing one row per task with the
following columns:

- audio_file  --  basename of source audio file that annotation was
  performed from
- channel  --  channel (1-indexed) on that file
- onset  --  onset in seconds from beginning of recording
- offset  --  offset in seconds from beginning of recording
- label  --  name of task; one of:
  - Interview
  - YouTube
  - MapTask
  - JobScenario


5. Personality scores
=====================
5.1 Overview

Ground truth personality trait data for participants are provided for the six
dimensions of HEXACO-PI-R 100 and three dimensions of Short Dark Triad (SD3).
Additionally, a derived dimension, agreeableness*, is reported for each
participant that is the average of their HEXACO Honesty-Humility score and
reversed SD3 Machiavellianism score.

Two categorical variables are also provided:

- emotionality_cat  --  each participant is classed as being low, mid, or high
  emotionality by comparing their HEXACO emotionality score to the score
  distribution of the entire group of Phase 1 participants as follows:
  - low   --  score is more than one standard deviation below the group mean
  - mid   --  score is within one standard deviation of the group mean
  - high  --  score is more than one standard deviation above the group mean
- agreeableness_cat  --  defined identically to emotionality_cat, but using
  agreeableness*
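
The derived score and the categorical variables described above can be
sketched as follows. This is an illustration under stated assumptions, not the
procedure actually used to build the scores table: it assumes both instruments
are scored on a 1-5 scale (so a reversed score is 6 minus the score) and uses
the population standard deviation. The function names are hypothetical.

```python
from statistics import mean, pstdev


def agreeableness_star(honesty_humility, machiavellianism,
                       scale_min=1.0, scale_max=5.0):
    """Average of Honesty-Humility and reversed Machiavellianism.

    Assumption: scores lie on [scale_min, scale_max], so reversing a score
    x gives scale_min + scale_max - x.
    """
    reversed_mach = scale_min + scale_max - machiavellianism
    return (honesty_humility + reversed_mach) / 2


def categorize(score, group_scores):
    """Class a score as "low", "mid", or "high" relative to a group.

    Assumption: population standard deviation (pstdev); the sample
    standard deviation would shift the thresholds slightly.
    """
    mu = mean(group_scores)
    sd = pstdev(group_scores)
    if score < mu - sd:
        return "low"
    if score > mu + sd:
        return "high"
    return "mid"
```

Scores exactly one standard deviation from the mean are classed as "mid" here,
matching "within one standard deviation" above.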


5.2 Scores table (docs/scores.tbl)

Scores are stored under "docs/scores.tbl", which is a tab-delimited file
containing scores for one participant per row, each row containing the
following 13 columns:

- participant_id  --  the anonymized participant id
- honesty_humility  --  HEXACO-PI-R 100 honesty/humility
- emotionality  --  HEXACO-PI-R 100 emotionality
- extraversion  --  HEXACO-PI-R 100 extraversion
- agreeableness  --  HEXACO-PI-R 100 agreeableness
- conscientiousness  --  HEXACO-PI-R 100 conscientiousness
- openness  --  HEXACO-PI-R 100 openness
- machiavellianism  --  Short Dark Triad (SD3) Machiavellianism
- narcissism  --  Short Dark Triad (SD3) narcissism
- psychopathy  --  Short Dark Triad (SD3) psychopathy
- agreeableness-star  --  agreeableness*; i.e., the average of
  "honesty_humility" and reversed "machiavellianism"
- emotionality_cat  --  categorical emotionality variable
- agreeableness_cat  --  categorical agreeableness* variable

**NOTE** that this file contains scores for ALL participants who participated
in either the in-person session or CTS collections.
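
Because the table covers participants from both collections, it may be useful
to restrict it to participants that have a directory under "data/" in this
release. The sketch below is one way to do that; it assumes the table starts
with a header row using the column names listed above (not confirmed by this
README), and the function names are hypothetical.

```python
import csv
from pathlib import Path


def load_scores(path):
    """Read the scores table into {participant_id: row dict}.

    Assumption: the first line is a header containing "participant_id".
    Values are left as strings.
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        return {row["participant_id"]: row for row in reader}


def scores_for_release(scores, data_dir):
    """Keep only participants that have a subdirectory under data_dir."""
    present = {p.name for p in Path(data_dir).iterdir() if p.is_dir()}
    return {pid: row for pid, row in scores.items() if pid in present}
```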


6. Known issues
===============
6.1 Audio missing for participants who did not consent to future use

When participants enrolled in the study, the consent form allowed them to opt
out of future use of their audio data. Out of the 386 people who participated
in the study, 65 opted out of future use of their data.

For these participants, transcripts and section segmentation are provided, but
all audio data is omitted.


6.2 Channels missing due to technical issues

Two of the in-person sessions contained in this release do not have complete
sets of 4 audio channels in the "flac/" subdirectories:

- 7PKKF  --  lacks "intrlav" channel
- 7XIVG  --  lacks "ceiling" channel

In the case of participant 7XIVG, speech-to-text was performed using the desk
channel.


6.3 Entire sessions missing due to technical issues

While 386 people participated in the study, technical issues
during recording resulted in loss of all audio from recording devices for
20 participants:

- 2DFYC
- 2GGYE
- 2JKHA
- 2REXP
- 3AQER
- 3ETKK
- 3ETON
- 3GTIQ
- 3JWWK
- 3OZKB
- 3RGPP
- 3VZWY
- 3XYWL
- 3XZVZ
- 7AJMQ
- 7NXPD
- 7PRLC
- 7QJWY
- 7ZZZW
- 7ZZZX

Due to a lack of usable audio, these participants are not included in the
present release. NOTE, though, that these participants may have data found in
the corresponding CTS collection.


6.4 Sessions split across multiple "parts"

There were 9 sessions during which the recording had to be restarted due to
technical issues. These restarts resulted in two sets of each channel for each
session: one set from before the restart and one from after. When such pairs
of FLAC/transcript/section segmentation files occur, the file from BEFORE the
restart is suffixed with "_A" and the file from AFTER is suffixed with "_B";
e.g.:

- 7ZPKH_ceiling_A.flac
- 7ZPKH_ceiling_B.flac

The affected participants are:

- 2OUIA
- 3JNBA
- 3ZIKF
- 7BMAK
- 7IOTN
- 7PXTS
- 7UDES
- 7UJST
- 7ZPKH


6.5 Partial sessions

During 7 sessions, technical issues resulted in a truncated recording,
resulting in one or more missing tasks:

- 2BKKP  --  missing Interview, YouTube, and MapTask
- 3HOUR  --  missing Interview and YouTube
- 3MJDT  --  missing Interview and YouTube
- 7DKCG  --  missing JobScenario
- 7DNDF  --  missing Interview and YouTube
- 7PKKF  --  missing Interview and YouTube
- 7XIVG  --  missing JobScenario

Additionally, one further session is missing the YouTube task, which was
skipped by the interviewer due to the participant expressing difficulty at
devising a video idea:

- 2HBZY


6.6 Variable quality of synchronization for intrlav channel

In some cases, synchronizing the "intrlav" channel with the other channels is
difficult due to crosstalk. While interviewers were instructed to wear
headphones, sometimes they failed to do so and instead listened to the
participant via speakers hooked up to their computer. When this occurs, there
is significant crosstalk from the participant, often at a delay, which
interferes with the synchronization procedure.


7. Files table (docs/file.tbl)
==============================
Expected sizes, modification times, and MD5 checksums for all files within the
"data/" directory are recorded in "docs/file.tbl". This is a tab-delimited table
containing one file per line, each line having the following 4 fields:

- checksum  --  MD5 checksum of file
- size  --  size of file in bytes
- datetime  --  last modification date in YYYY-MM-DD_HH:MM:SS format
- path  --  path to file relative to root of release directory
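
The table can be used to check a downloaded copy of the release. The following
is a minimal sketch, not an official verification tool: the function names are
hypothetical, and since this README does not say whether the table has a
header row, any line whose size field is not an integer is skipped as an
assumed header.

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(root, table="docs/file.tbl"):
    """Return a list of (path, problem) pairs; empty if all files match."""
    root = Path(root)
    problems = []
    with open(root / table, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\r\n")
            if not line:
                continue
            # Fields: checksum, size, datetime, path (tab-delimited).
            checksum, size, _datetime, rel_path = line.split("\t", 3)
            if not size.isdigit():
                continue  # assumed header row
            target = root / rel_path
            if not target.is_file():
                problems.append((rel_path, "missing"))
            elif target.stat().st_size != int(size):
                problems.append((rel_path, "size mismatch"))
            elif md5sum(target) != checksum.lower():
                problems.append((rel_path, "checksum mismatch"))
    return problems
```

Modification times are not compared, since they commonly change when files are
copied or extracted.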


8. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant
    <nryant@ldc.upenn.edu>