AnnoDIFP CTS Audio and Transcripts

Item Name: AnnoDIFP CTS Audio and Transcripts
Author(s): Christopher Cieri, James Fiumara, Kevin Walker, Neville Ryant, Mark Liberman
LDC Catalog No.: LDC2025S10
ISLRN: 374-951-679-757-4
DOI: https://doi.org/10.35111/5m3m-m396
Release Date: November 17, 2025
Member Year(s): 2025
DCMI Type(s): Sound, Text
Sample Type: 16-bit FLAC
Sample Rate: 16000 Hz
Data Source(s): telephone conversations, telephone speech
Project(s): AnnoDIFP
Application(s): machine learning
Language(s): English
Language ID(s): eng
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2025S10 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Cieri, Christopher, et al. AnnoDIFP CTS Audio and Transcripts LDC2025S10. Web Download. Philadelphia: Linguistic Data Consortium, 2025.

Introduction

AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) CTS (Conversational Telephone Speech) Audio and Transcripts was developed by the Linguistic Data Consortium (LDC), the Florida Institute of Technology, and the University of New Haven to support algorithm development for predicting personality traits. It contains 242.52 hours of English audio and transcripts from 1,179 telephone calls involving 327 participants, paired with scores from two self-reported personality assessments: the HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and the Short Dark Triad (SD3).

Survey and behavioral data were collected in three phases. Phase 1 consisted of online questionnaires. Selected participants were then invited to Phase 2a, in which behavioral and linguistic data were collected in a laboratory setting. In Phase 2b, participants took part in a telephone speech collection. This release covers the activities in Phase 2b. The data collected in Phase 2a is contained in AnnoDIFP Session Audio and Transcripts (LDC2025S06).

Data

Telephone calls were collected using LDC's robot-operator platform. The operator called participants every 24 hours during their indicated availability and paired them with another participant to speak on a prompted topic for 10 minutes. Further details on collection methodology are contained in the documentation accompanying this release.

Phase 2b had 327 participants in total. This corpus contains audio and transcripts for 277 of them and transcripts only for the remaining 50.

Speech data is presented as 16 kHz, 16-bit mono-channel FLAC-compressed MS-WAV files.
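Users who want to confirm that a downloaded file matches the format described above can inspect its FLAC header. The following is a minimal sketch, not part of the release: it assumes each audio file is a native FLAC stream (beginning with the "fLaC" marker, with no prepended tags), and the names parse_streaminfo, read_flac_info and matches_documented_format are hypothetical helpers introduced here for illustration. It uses only the Python standard library and reads the STREAMINFO block defined by the FLAC format.

```python
# Documented format for this release: 16 kHz, 16-bit, mono.
EXPECTED = {"sample_rate": 16000, "channels": 1, "bits_per_sample": 16}


def parse_streaminfo(header: bytes) -> dict:
    """Parse the first 42 bytes of a native FLAC stream (marker + STREAMINFO)."""
    if len(header) < 42 or header[:4] != b"fLaC":
        raise ValueError("not a native FLAC stream")
    # Byte 4: 1-bit last-block flag followed by a 7-bit block type (0 = STREAMINFO).
    if header[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 8-17 hold block and frame sizes. Bytes 18-25 pack, in order:
    # 20-bit sample rate, 3-bit (channels - 1), 5-bit (bits per sample - 1),
    # 36-bit total sample count.
    packed = int.from_bytes(header[18:26], "big")
    sample_rate = packed >> 44
    total_samples = packed & ((1 << 36) - 1)
    return {
        "sample_rate": sample_rate,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": total_samples,
        # A total of 0 means "unknown" in the FLAC format.
        "duration_seconds": total_samples / sample_rate if sample_rate else None,
    }


def read_flac_info(path: str) -> dict:
    """Read only the header bytes of a file and parse them."""
    with open(path, "rb") as f:
        return parse_streaminfo(f.read(42))


def matches_documented_format(info: dict) -> bool:
    """True if the header agrees with the 16 kHz, 16-bit, mono description."""
    return all(info[key] == value for key, value in EXPECTED.items())
```

For example, matches_documented_format(read_flac_info(path)) would be expected to return True for a file from this release, where path is whatever local location the file was downloaded to.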

Transcripts were produced automatically using the Rev.ai speech-to-text service. Text data is UTF-8 encoded.

Updates

No updates at this time.
