AnnoDIFP Session Audio and Transcripts

Item Name:	AnnoDIFP Session Audio and Transcripts
Author(s):	Christopher Cieri, James Fiumara, Kevin Walker, Mark Liberman, Neville Ryant
LDC Catalog No.:	LDC2025S06
ISLRN:	831-339-304-772-0
DOI:	https://doi.org/10.35111/kbj5-9864
Release Date:	July 15, 2025
Member Year(s):	2025
DCMI Type(s):	Sound, Text
Sample Type:	16-bit FLAC
Sample Rate:	16000
Data Source(s):	microphone conversation, microphone speech
Project(s):	AnnoDIFP
Application(s):	machine learning, psycholinguistics
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2025S06 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Cieri, Christopher, et al. AnnoDIFP Session Audio and Transcripts LDC2025S06. Web Download. Philadelphia: Linguistic Data Consortium, 2025.
Related Works:	relatesTo LDC2025S10 AnnoDIFP CTS Audio and Transcripts; isProcessedBy rev.ai https://www.rev.ai/

Introduction

AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) Session Audio and Transcripts was developed by the Linguistic Data Consortium (LDC), the Florida Institute of Technology (FIT), and the University of New Haven (UNH) to support algorithm development for predicting personality traits. It contains 438.34 hours of English audio and transcripts from in-person interviews of 366 participants paired with scores from two self-reported personality assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short Dark Triad (SD3).

Survey and behavioral data were collected in three phases. Phase 1 consisted of online questionnaires. Selected participants were then invited to Phase 2a, in which behavioral and linguistic data were collected in a laboratory setting. In Phase 2b, participants took part in a telephone speech collection by calling other participants. This release covers the activities in Phase 2a.

Data

In-person interviews were recorded at LDC, FIT and UNH. In each session, the participant and interviewer sat in separate sound-isolated rooms with communication between them supplied by audio/video hardware. Sessions consisted of the following tasks: rapport building, a YouTube task, a map task, and a business task. Further details on collection methodology and session tasks are contained in the documentation accompanying this release.

A total of 386 participants took part in Phase 2a. This corpus contains audio data and transcripts from 301 participants and transcripts only from 65 participants, for 366 participants in all. Recordings for the remaining 20 participants were not usable.

Each session (or session part in the case of multipart sessions) is accompanied by a transcript produced automatically using the Rev.ai speech-to-text service.

Speech data is presented as 16 kHz, 16-bit mono-channel FLAC-compressed MS-WAV files. Text data is UTF-8 encoded.
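
As a rough illustration of the audio format described above, the following sketch reads the STREAMINFO block at the start of a FLAC stream and checks it against the stated parameters (16 kHz, 16-bit, mono). It relies only on the Python standard library and the public FLAC header layout. The function names and the example file path are hypothetical and are not part of the release, and the sketch assumes the file begins directly with the FLAC marker (no prepended tag data).

```python
import struct

# Values taken from the corpus description: 16 kHz, 16-bit, mono.
EXPECTED = {"sample_rate": 16000, "channels": 1, "bits_per_sample": 16}


def parse_flac_streaminfo(header: bytes) -> dict:
    """Decode STREAMINFO from the first 42 bytes of a FLAC stream."""
    if len(header) < 42:
        raise ValueError("need at least 42 bytes of header")
    if header[:4] != b"fLaC":
        raise ValueError("missing fLaC stream marker")
    # Low 7 bits of byte 4 give the metadata block type; 0 is STREAMINFO.
    if header[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 18-25 pack: sample rate (20 bits), channels - 1 (3 bits),
    # bits per sample - 1 (5 bits), total samples (36 bits).
    (packed,) = struct.unpack(">Q", header[18:26])
    sample_rate = packed >> 44
    total_samples = packed & ((1 << 36) - 1)
    return {
        "sample_rate": sample_rate,
        "channels": ((packed >> 41) & 0x07) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": total_samples,
        "duration_seconds": total_samples / sample_rate if sample_rate else 0.0,
    }


def matches_expected(info: dict) -> bool:
    """Return True if the header agrees with the documented format."""
    return all(info[key] == value for key, value in EXPECTED.items())


def read_flac_header(path: str) -> bytes:
    """Read just enough bytes from a file to parse STREAMINFO."""
    with open(path, "rb") as handle:
        return handle.read(42)
```

For a hypothetical file `session.flac`, `matches_expected(parse_flac_streaminfo(read_flac_header("session.flac")))` would be expected to return `True` if the file follows the format stated above. A full audio library would be needed to decode the samples themselves; this sketch only inspects the header.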
