Authors: Alexandra Canavan, David Graff, and George Zipperlen
Data Type: speech
Sample Rate: 8000 Hz
Sampling Format: 2-channel ulaw
Data Source(s): telephone conversations
Project(s): EARS, GALE, Hub5-LVCSR
Application(s): speech recognition
Language(s): English
Language ID(s): eng
The CALLHOME English corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of English.


All calls, which lasted up to 30 minutes, originated in North America; 90 of the 120 calls were placed to various locations overseas, while the remaining 30 were placed within North America. Most participants called family members or close friends.

This corpus contains speech data files ONLY, along with the minimal amount of documentation needed to describe the contents and format of the speech files and the software packages needed to uncompress the speech data. The transcripts and documentation (LDC97T14) are available separately, as is an associated lexicon (LDC97L20).


The "shorten" and "sphere" directories have been removed.

The sphere directory contained NIST "SPeech HEader REsources" (SPHERE): C-language source code libraries and utilities for manipulating NIST SPHERE-format waveform files.
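
For readers without the SPHERE libraries at hand, the following is a minimal sketch of reading a SPHERE header in Python. It assumes the commonly documented layout: an ASCII header that begins with the line "NIST_1A", followed by a line giving the header size in bytes, then "name type value" lines terminated by "end_head". The function name parse_sphere_header and the sample header below are illustrative only; the sample is synthetic, built from the metadata listed above, and is not copied from any file in this corpus.

```python
def parse_sphere_header(blob: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file (sketch)."""
    # The first 16 bytes hold the magic string and the header size, one per line.
    magic, size_line = blob[:16].decode("ascii").split("\n")[:2]
    if magic != "NIST_1A":
        raise ValueError("not a SPHERE header")
    header_size = int(size_line.strip())

    fields = {}
    text = blob[16:header_size].decode("ascii", errors="replace")
    for line in text.split("\n"):
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # string field, e.g. "-s4"
            fields[name] = value
    return fields


# Synthetic header for illustration (an assumption, not taken from the corpus).
sample_header = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 2\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
).ljust(1024, b" ")

header = parse_sphere_header(sample_header)
print(header)
```

In practice the header of an actual corpus file may carry additional fields (for example, ones describing compression), so the dictionary returned here should be inspected rather than assumed to match the sample.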

The shorten directory contained files for Tony Robinson's "shorten" software for speech compression.

A more recent version of the SPHERE utilities is now available on the NIST web site; additional utilities for converting from SPHERE to other waveform file formats are also available on the LDC web site.
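
As an illustration of what such a conversion involves for 2-channel ulaw data, the sketch below expands 8-bit mu-law codes to 16-bit linear samples using the standard G.711 expansion and separates the two channels. It assumes the sample payload has already been uncompressed and stripped of its header, and that the two channels are interleaved sample by sample; both are assumptions to verify against the corpus documentation. The function names ulaw_to_linear and split_channels are hypothetical and are not part of the NIST or LDC utilities.

```python
BIAS = 0x84  # bias added during G.711 mu-law encoding (132)


def ulaw_to_linear(code: int) -> int:
    """Expand one 8-bit mu-law code to a signed linear sample."""
    u = ~code & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + BIAS) << exponent) - BIAS
    return -magnitude if sign else magnitude


def split_channels(payload: bytes):
    """Decode interleaved 2-channel mu-law bytes into two sample lists."""
    channel_a = [ulaw_to_linear(b) for b in payload[0::2]]
    channel_b = [ulaw_to_linear(b) for b in payload[1::2]]
    return channel_a, channel_b


# Four bytes = two sample frames, one byte per channel per frame.
a, b = split_channels(bytes([0xFF, 0x80, 0x00, 0xFF]))
print(a, b)
```

At 8000 Hz, each second of conversation corresponds to 8000 bytes per channel, so a two-channel payload grows by 16000 bytes per second before expansion to 16-bit samples.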

