AnnoDIFP CTS Audio and Transcripts
| Item Name: | AnnoDIFP CTS Audio and Transcripts |
| Author(s): | Christopher Cieri, James Fiumara, Kevin Walker, Neville Ryant, Mark Liberman |
| LDC Catalog No.: | LDC2025S10 |
| ISLRN: | 374-951-679-757-4 |
| DOI: | https://doi.org/10.35111/5m3m-m396 |
| Release Date: | November 17, 2025 |
| Member Year(s): | 2025 |
| DCMI Type(s): | Sound, Text |
| Sample Type: | 16-bit FLAC |
| Sample Rate: | 16000 Hz |
| Data Source(s): | telephone conversations, telephone speech |
| Project(s): | AnnoDIFP |
| Application(s): | machine learning |
| Language(s): | English |
| Language ID(s): | eng |
| License(s): | LDC User Agreement for Non-Members |
| Online Documentation: | LDC2025S10 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Cieri, Christopher, et al. AnnoDIFP CTS Audio and Transcripts LDC2025S10. Web Download. Philadelphia: Linguistic Data Consortium, 2025. |
| Related Works: | View |
Introduction
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) CTS (Conversational Telephone Speech) Audio and Transcripts was developed by the Linguistic Data Consortium (LDC), the Florida Institute of Technology and the University of New Haven to support the development of algorithms that predict personality traits. It contains 242.52 hours of English audio and transcripts from 1,179 telephone calls involving 327 participants. The speech data is paired with participants' scores from two self-reported personality assessments: the HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and the Short Dark Triad (SD3).
Survey and behavioral data were collected in three phases. Phase 1 consisted of online questionnaires. Selected participants were then invited to Phase 2a, in which behavioral and linguistic data were collected in a laboratory setting. In Phase 2b, participants took part in a telephone speech collection. This release covers the activities in Phase 2b. The data collected in Phase 2a is contained in AnnoDIFP Session Audio and Transcripts (LDC2025S06).
Data
Telephone calls were collected using LDC's robot-operator platform. The operator called participants every 24 hours during their indicated availability and paired each of them with another participant to speak on a prompted topic for 10 minutes. Further details on collection methodology are contained in the documentation accompanying this release.
Phase 2b had a total of 327 participants. This corpus contains audio and transcripts for 277 of them and transcripts only for the remaining 50.
Speech data is presented as 16 kHz, 16-bit mono-channel FLAC-compressed MS-WAV files.
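The stated audio format can be checked directly from a file's header. The following is a minimal sketch, not part of the release: it assumes the audio is delivered as standard FLAC streams that begin with the `fLaC` marker followed by a STREAMINFO block, and it uses only the Python standard library. The names `StreamInfo`, `parse_flac_streaminfo`, `read_streaminfo` and `matches_release_format` are hypothetical helpers introduced for illustration.

```python
from typing import NamedTuple

# Format stated in the corpus description: 16 kHz, 16-bit, mono.
EXPECTED_SAMPLE_RATE = 16000
EXPECTED_BITS_PER_SAMPLE = 16
EXPECTED_CHANNELS = 1


class StreamInfo(NamedTuple):
    sample_rate: int
    channels: int
    bits_per_sample: int
    total_samples: int  # 0 means "unknown" in the FLAC format


def parse_flac_streaminfo(data: bytes) -> StreamInfo:
    """Parse the STREAMINFO block from the first 42 bytes of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream (missing 'fLaC' marker)")
    if len(data) < 42:
        raise ValueError("need at least 42 bytes to read STREAMINFO")
    block_type = data[4] & 0x7F            # low 7 bits: metadata block type
    block_len = int.from_bytes(data[5:8], "big")
    if block_type != 0 or block_len < 34:
        raise ValueError("first metadata block is not STREAMINFO")
    body = data[8:42]
    # Bytes 10..17 of the block pack four fields into 64 bits:
    # 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), 36-bit total sample count.
    packed = int.from_bytes(body[10:18], "big")
    return StreamInfo(
        sample_rate=packed >> 44,
        channels=((packed >> 41) & 0x7) + 1,
        bits_per_sample=((packed >> 36) & 0x1F) + 1,
        total_samples=packed & ((1 << 36) - 1),
    )


def read_streaminfo(path: str) -> StreamInfo:
    """Read STREAMINFO from a FLAC file on disk (path is caller-supplied)."""
    with open(path, "rb") as f:
        return parse_flac_streaminfo(f.read(42))


def matches_release_format(info: StreamInfo) -> bool:
    """True if the header agrees with the format described for this corpus."""
    return (
        info.sample_rate == EXPECTED_SAMPLE_RATE
        and info.channels == EXPECTED_CHANNELS
        and info.bits_per_sample == EXPECTED_BITS_PER_SAMPLE
    )
```

Calling `matches_release_format(read_streaminfo(path))` on each audio file would flag any file whose header disagrees with the description above. When `total_samples` is nonzero, dividing it by `sample_rate` gives the duration in seconds.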
Transcripts were produced automatically using the Rev.ai speech-to-text service. Text data is UTF-8 encoded.
Updates
No updates at this time.