AnnoDIFP CTS Audio and Transcripts

                              LDC2025S10
                            March 4, 2025

                      Linguistic Data Consortium


1. Overview
===========
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality)
was created by the Linguistic Data Consortium (LDC), Florida Institute of
Technology (FIT), and University of New Haven (UNH) to support development
of algorithms for prediction of personality traits. It consists of audio
recordings from both in-person interviews and conversational telephone
speech collections paired with scores from two self-reported personality
assessments – HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and
Short Dark Triad (SD3).

This release contains audio data and transcripts from the conversational
telephone speech (CTS) collection for the AnnoDIFP project, comprising 1,179
calls from 327 participants (total call duration: 242.52 hours).

More information about the corpus design, collection, protocol, processing,
and annotation is provided in the file "docs/annodifp_collection_doc.pdf".


2. Directory Structure
======================
- data/<PARTICIPANT-ID>/flac/  --  FLAC from all callsides that participant
  PARTICIPANT-ID was on
- data/<PARTICIPANT-ID>/transcripts/  --  transcripts for all participant
  PARTICIPANT-ID callsides
- docs/annodifp_collection_doc.pdf  --  detailed description of corpus
  design, collection, protocols, processing, and annotation
- docs/scores.tbl   --  ground truth scores for participants
- docs/calls.tbl  --  mapping between calls and callsides
- docs/file.tbl   --  listing of md5 checksums, sizes, dates, and file names
- README.txt  --  this file


3. File naming convention
=========================
Data files (FLAC, transcripts) are stored separately for each callside and named
according to the following convention:

    <DATE>_<TIME>_<PARTICIPANT-ID>_cts<EXT>

where:

- DATE  --  the date that the call was made in YYYYMMDD format
- TIME  --  the time that the call was made in HHMMSS<TZ> format, where
  TZ is the timezone
- PARTICIPANT-ID  --  the anonymized participant id for the subject present on
  the callside
- EXT  --  file extension; either .flac or .tsv

Participant ids are 5-character alphanumeric sequences whose first character
indicates the enrollment site:

- 2 -- UNH
- 3 -- FIT
- 7 -- LDC
- 9 -- participant ids beginning with "9" were assigned to study staff
  who participated in the CTS collection to ensure actual participants could
  always be paired with a conversational partner
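
The naming convention above can be unpacked with a regular expression. The
following is a minimal sketch, not part of the release: it assumes the date
has eight digits (YYYYMMDD), and it treats the timezone as free-form text
because its exact format is not specified here. The file name in the usage
note is hypothetical.

```python
import re
from typing import NamedTuple

# Assumed pattern: 8-digit date, 6-digit time, a free-form timezone suffix,
# a 5-character alphanumeric participant id, "_cts", then the extension.
CALLSIDE_RE = re.compile(
    r"^(?P<date>\d{8})_(?P<time>\d{6})(?P<tz>.*)"
    r"_(?P<pid>[0-9A-Za-z]{5})_cts(?P<ext>\.flac|\.tsv)$"
)

# First character of the participant id -> enrollment site (from the list above).
SITES = {"2": "UNH", "3": "FIT", "7": "LDC", "9": "staff"}


class Callside(NamedTuple):
    date: str
    time: str
    tz: str
    participant_id: str
    site: str
    ext: str


def parse_callside_name(name: str) -> Callside:
    """Split a callside file name into its components."""
    m = CALLSIDE_RE.match(name)
    if m is None:
        raise ValueError(f"unrecognized callside file name: {name!r}")
    pid = m["pid"]
    site = SITES.get(pid[0], "unknown")
    return Callside(m["date"], m["time"], m["tz"], pid, site, m["ext"])
```

For example, parse_callside_name("20230415_143000EST_7ab12_cts.flac") would
yield participant id "7ab12" and site "LDC".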
  

4. File formats
===============
4.1 Audio

All audio is provided in the form of 16 kHz, 16-bit mono-channel FLAC files.
The audio was upsampled to 16 kHz from the original sample rate of 8 kHz using
SoX (https://sourceforge.net/projects/sox/).


4.2 Transcripts

Each FLAC file is accompanied by a transcript produced automatically using the
Rev.ai (https://www.rev.ai/) speech-to-text service. These transcripts are
stored as tab-delimited files in which each row contains one transcribed
utterance with the following columns:

- utterance_id  --  unique identifier for utterance
- audio_file  --  basename of source audio file that transcription was
  performed from
- channel  --  channel (1-indexed) on that file
- speaker_id  --  unique identifier for speaker; two speaker types are
  supported:
  - participant  -- speaker id for participants is always the same as their
    participant id
  - anonymous  --  other speakers recognized by the diarization are assigned
    anonymous speaker ids of the form "speaker01", "speaker02", ...
- onset  --  onset in seconds from beginning of recording
- offset  --  offset in seconds from beginning of recording
- transcript   --  human-readable transcript
- tokens  --  whitespace-delimited list of ASR tokens
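
A transcript can be read with Python's standard csv module. The sketch below
is illustrative only. Whether the files carry a header row is not stated
here, so the reader supplies the column names itself and skips a header line
if it finds one. Quoting is disabled on the assumption that fields are not
quoted.

```python
import csv
import io
from typing import Iterator, TextIO

TRANSCRIPT_COLUMNS = [
    "utterance_id", "audio_file", "channel", "speaker_id",
    "onset", "offset", "transcript", "tokens",
]


def read_transcript(handle: TextIO) -> Iterator[dict]:
    """Yield one dict per utterance from an open transcript file."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if not row or row[0] == "utterance_id":
            continue  # skip blank lines and a header row, if one is present
        # Pad short rows so that every column is present.
        row = row + [""] * (len(TRANSCRIPT_COLUMNS) - len(row))
        rec = dict(zip(TRANSCRIPT_COLUMNS, row))
        rec["channel"] = int(rec["channel"])
        rec["onset"] = float(rec["onset"])
        rec["offset"] = float(rec["offset"])
        rec["tokens"] = rec["tokens"].split()
        yield rec


def is_anonymous(speaker_id: str) -> bool:
    """True for diarization-assigned ids such as "speaker01"."""
    return speaker_id.startswith("speaker")
```

With a file opened as open(path, encoding="utf-8", newline=""), the
participant's own utterances are those records for which
is_anonymous(rec["speaker_id"]) is false.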


5. Personality scores
=====================
5.1 Overview

Ground truth personality trait data for participants are provided for the six
dimensions of HEXACO-PI-R 100 and three dimensions of Short Dark Triad (SD3).
Additionally, a derived dimension, agreeableness*, is reported for each
participant that is the average of their HEXACO Honesty-Humility score and
reversed SD3 Machiavellianism score.
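
As an illustration of the derived dimension, the sketch below averages a
Honesty-Humility score with a reversed Machiavellianism score. The score
range is not stated in this file, so the 1-5 range used as the default is an
assumption, as is the usual reversal formula (minimum + maximum - score).
See "docs/annodifp_collection_doc.pdf" for the definition actually used.

```python
def reverse_score(score: float, scale_min: float = 1.0,
                  scale_max: float = 5.0) -> float:
    """Reverse a score on a bounded scale (the 1-5 default is an assumption)."""
    return scale_min + scale_max - score


def agreeableness_star(honesty_humility: float, machiavellianism: float,
                       scale_min: float = 1.0, scale_max: float = 5.0) -> float:
    """Average of Honesty-Humility and reversed Machiavellianism."""
    reversed_mach = reverse_score(machiavellianism, scale_min, scale_max)
    return (honesty_humility + reversed_mach) / 2.0
```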

Two categorical variables are also provided:

- emotionality_cat  --  each participant is classed as being low, mid, or high
  emotionality by comparing their HEXACO emotionality score to the score
  distribution of the entire group of Phase 1 participants as follows:
  - low   --  score is more than one standard deviation below the group mean
  - mid   --  score is within one standard deviation of the group mean
  - high  --  score is more than one standard deviation above the group mean
- agreeableness_cat  --  defined identically to emotionality_cat, but using
  agreeableness*
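
The thresholding rule above can be sketched as follows. This is a minimal
illustration rather than the code used to produce the release: whether the
sample or the population standard deviation was used is not stated, and the
sample version is assumed here. The reference group must be the Phase 1
participants, which this sketch takes as an input.

```python
import statistics
from typing import Sequence


def categorize(score: float, group_scores: Sequence[float]) -> str:
    """Class a score as low/mid/high relative to the group mean +/- 1 SD."""
    mean = statistics.fmean(group_scores)
    sd = statistics.stdev(group_scores)  # sample SD; an assumption
    if score < mean - sd:
        return "low"
    if score > mean + sd:
        return "high"
    return "mid"
```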


5.2 Scores table (docs/scores.tbl)

Scores are stored under "docs/scores.tbl", which is a tab-delimited file
containing scores for one participant per row, each row containing the
following 13 columns:

- participant_id  --  the anonymized participant id
- honesty_humility  --  HEXACO-PI-R 100 honesty/humility
- emotionality  --  HEXACO-PI-R 100 emotionality
- extraversion  --  HEXACO-PI-R 100 extraversion
- agreeableness  --  HEXACO-PI-R 100 agreeableness
- conscientiousness  --  HEXACO-PI-R 100 conscientiousness
- openness  --  HEXACO-PI-R 100 openness
- machiavellianism  --  Short Dark Triad (SD3) Machiavellianism
- narcissism  --  Short Dark Triad (SD3) narcissism
- psychopathy  --  Short Dark Triad (SD3) psychopathy
- agreeableness-star  --  agreeableness*; i.e., the average of
  "honesty_humility" and reversed "machiavellianism"
- emotionality_cat  --  categorical emotionality variable
- agreeableness_cat  --  categorical agreeableness* variable

**NOTE** that this file contains scores for ALL participants who participated
in either the in-person session or CTS collections. It does **NOT** include
data for the study staff (participant ids beginning with "9"; see Section 3)
who entered the call pool to ensure actual participants could make calls.


6. Calls table (docs/calls.tbl)
===============================
The mapping between calls and callsides is mediated by "docs/calls.tbl", which
is a tab-delimited table containing one call per row, each row containing the
following 7 columns:

- call_id  --  a 4-digit call id
- call_date  --  the date on which the call was made in YYYYMMDD format
- call_time  --  the time that the call was made in HHMMSS<TZ> format, where
  TZ is the timezone
- a_pid  --  participant id for participant on side A of the call
- a_fid  --  file id (i.e., basename minus extension) for files (FLAC/TSV)
  corresponding to side A of the call
- b_pid  --  participant id for participant on side B of the call
- b_fid  --  file id (i.e., basename minus extension) for files (FLAC/TSV)
  corresponding to side B of the call
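
The following sketch shows one way to go from a row of the calls table to
the files for both callsides, using the layout from Section 2. It is
illustrative only. The seventh column is taken to be "b_fid" (the file id
for side B), which is an assumption, and a header row is skipped if one is
present.

```python
import csv
import io
from typing import TextIO

# Column names; "b_fid" as the seventh column is an assumption.
CALL_COLUMNS = ["call_id", "call_date", "call_time",
                "a_pid", "a_fid", "b_pid", "b_fid"]


def load_calls(handle: TextIO) -> list:
    """Return one dict per call."""
    calls = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        if not row or row[0] == "call_id":
            continue  # skip blank lines and a header row, if one is present
        calls.append(dict(zip(CALL_COLUMNS, row)))
    return calls


def callside_paths(call: dict, ext: str = ".flac") -> dict:
    """Map side ("a"/"b") to the release-relative path of its file."""
    subdir = "flac" if ext == ".flac" else "transcripts"
    return {
        side: f"data/{call[side + '_pid']}/{subdir}/{call[side + '_fid']}{ext}"
        for side in ("a", "b")
    }
```

Audio paths produced this way may not exist for every callside; see
Section 7.1.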


7. Known issues
===============
7.1 Audio missing for participants who did not consent to future use

When participants enrolled in the study, the consent form allowed them to opt
out of future use of their audio data. Out of the 386 people who participated
in the study, 65 opted out of future use of their data.

For these participants, transcripts are provided, but all audio data is omitted.
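
Code that pairs transcripts with audio should therefore tolerate missing
FLAC files. A minimal sketch, assuming the directory layout from Section 2
and that a transcript and its audio share the same base name:

```python
from pathlib import Path


def transcripts_without_audio(root) -> list:
    """List transcript paths whose matching FLAC file is absent."""
    missing = []
    for tsv in sorted(Path(root).glob("data/*/transcripts/*.tsv")):
        # data/<PID>/transcripts/<name>.tsv -> data/<PID>/flac/<name>.flac
        flac = tsv.parent.parent / "flac" / (tsv.stem + ".flac")
        if not flac.is_file():
            missing.append(tsv)
    return missing
```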


7.2 Call durations < 10 minutes

While the target call duration was 10 minutes, in some cases (n=83) participants
finished calls early (sometimes very early). We include all calls with a
duration >= 3 minutes; as a result, 61 calls were filtered out.


7.3 Participants with > 8 calls

During collection, a bug in the platform allowed participants to continue
placing calls after reaching the 8-call target. As a result, the corpus contains
78 participants with > 8 calls; the mean number of calls for this group is 9.8.
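
Users who need a balanced number of calls per participant can count calls
from the participant ids on each side of every call. The sketch below is
illustrative only: it takes (side A id, side B id) pairs as input, and
whether study staff should be counted is left as a parameter because the
figures above do not say.

```python
from collections import Counter
from typing import Iterable, Tuple


def calls_per_participant(pairs: Iterable[Tuple[str, str]],
                          include_staff: bool = False) -> Counter:
    """Count calls per participant id from (a_pid, b_pid) pairs."""
    counts = Counter()
    for a_pid, b_pid in pairs:
        for pid in (a_pid, b_pid):
            # Ids beginning with "9" belong to study staff (Section 3).
            if include_staff or not pid.startswith("9"):
                counts[pid] += 1
    return counts


def over_target(counts: Counter, target: int = 8) -> dict:
    """Participants with more than `target` calls."""
    return {pid: n for pid, n in counts.items() if n > target}
```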


8. Files table (docs/file.tbl)
==============================
Expected sizes, modification times, and MD5 checksums for all files within the
"data/" directory are recorded in "docs/file.tbl". This is a tab-delimited table
containing one file per line, each line having the following 4 fields:

- checksum  --  MD5 checksum of file
- size  --  size of file in bytes
- datetime  --  last modification date in YYYY-MM-DD_HH:MM:SS format
- path  --  path to file relative to root of release directory
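
The table can be used to check a downloaded copy of the release. The
following is a minimal sketch, not an official verification tool: it checks
existence, size, and MD5 checksum for each listed path, and it skips any line
that does not have four fields with a 32-character checksum (for example, a
header line, if one is present).

```python
import hashlib
from pathlib import Path


def md5sum(path, chunk_size: int = 1 << 20) -> str:
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_release(root, table: str = "docs/file.tbl") -> list:
    """Return (path, problem) pairs for files that fail verification."""
    root = Path(root)
    problems = []
    with open(root / table, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4 or len(fields[0]) != 32:
                continue  # not a data row
            checksum, size, _datetime, rel_path = fields
            target = root / rel_path
            if not target.is_file():
                problems.append((rel_path, "missing"))
            elif target.stat().st_size != int(size):
                problems.append((rel_path, "size mismatch"))
            elif md5sum(target) != checksum.lower():
                problems.append((rel_path, "checksum mismatch"))
    return problems
```

An empty list from verify_release(release_root) means that every listed file
was found with the expected size and checksum.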


9. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant
    <nryant@ldc.upenn.edu>