CALLHOME Spanish Speech

Item Name: CALLHOME Spanish Speech
Authors: Alexandra Canavan and George Zipperlen
LDC Catalog No.: LDC96S35
ISBN: 1-58563-083-7
Data Type: speech
Sample Rate: 8000 Hz
Sampling Format: 2-channel ulaw
Data Source(s): telephone conversations
Project(s): Hub5-LVCSR
Application(s): speech recognition
Language(s): Spanish
Language ID(s): SPN
Distribution: 1 DVD
Member fee: $0 for 1996, 1997 members
Non-member Fee: US $1500.00
Reduced-License Fee: US $750.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Alexandra Canavan and George Zipperlen
CALLHOME Spanish Speech
Linguistic Data Consortium, Philadelphia

The CALLHOME Spanish corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Spanish.

All calls, which lasted up to 30 minutes, originated in North America and were placed to international locations. Most participants called family members or close friends.

This corpus contains speech data files ONLY, along with the minimal amount of documentation needed to describe the contents and format of the speech files and the software packages needed to uncompress the speech data. The transcripts and documentation (LDC96T17) are available separately, as is an associated lexicon (LDC96L16).
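The Sampling Format field above lists the audio as 2-channel ulaw at 8000 Hz. As a rough illustration of what that means once a file has been decompressed and its header stripped, the sketch below expands mu-law bytes to linear samples using the standard G.711 expansion and separates the two channels. It assumes the channels are interleaved byte by byte; the function names are hypothetical and are not part of any LDC or NIST tool.

```python
def ulaw_byte_to_pcm16(byte):
    """Expand one G.711 mu-law byte to a signed linear sample (16-bit range)."""
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80               # top bit carries the sign
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def split_channels(raw):
    """Decode 2-channel mu-law bytes, ASSUMING byte-wise interleaving."""
    chan_a = [ulaw_byte_to_pcm16(x) for x in raw[0::2]]
    chan_b = [ulaw_byte_to_pcm16(x) for x in raw[1::2]]
    return chan_a, chan_b
```

At 8000 samples per second, the duration of one channel in seconds is simply the length of its sample list divided by 8000.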


The "shorten" and "sphere" directories have been removed.

The sphere directory contained NIST "SPeech HEader REsources" (SPHERE): C-language source code libraries and utilities for manipulating NIST SPHERE-format waveform files.

The shorten directory contained files for Tony Robinson's "shorten" software for speech compression.

A more recent version of the SPHERE utilities is now available on the NIST web site; additional utilities for converting from SPHERE to other waveform file formats are also available on the LDC web site.
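For readers who only need to inspect a file's metadata, a SPHERE header can be read without the NIST utilities. The sketch below assumes the usual SPHERE layout: a first line reading NIST_1A, a second line giving the header size in bytes, then one "name type value" field per line until end_head. The function name and the sample header are hypothetical; this is an illustration, not a replacement for the NIST library, and it does not decompress shorten-encoded sample data.

```python
def parse_sphere_header(data):
    """Return (header_size, fields) parsed from the leading bytes of a SPHERE file."""
    parts = data.split(b"\n", 2)
    if len(parts) < 3 or parts[0] != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(parts[1].strip())          # e.g. 1024
    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:            # skip the two fixed lines
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):     # blank line or comment
            continue
        name, type_code, value = line.split(None, 2)
        if type_code == "-i":                    # integer field
            fields[name] = int(value)
        elif type_code == "-r":                  # real-valued field
            fields[name] = float(value)
        else:                                    # "-sN": string of N characters
            fields[name] = value
    return header_size, fields


# Synthetic header for illustration only; not copied from a corpus file.
sample = (b"NIST_1A\n   1024\nchannel_count -i 2\nsample_rate -i 8000\n"
          b"sample_coding -s4 ulaw\nend_head\n").ljust(1024, b" ")
size, fields = parse_sphere_header(sample)
```

The sample data begins at byte offset header_size, so a caller would seek to that offset before reading audio.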

10.10.2003: It has been brought to our attention that 16 sphere files (from both the train and devtest directories) were corrupted; the problem becomes apparent when trying to decompress the files with the w_decode utility. Corrected versions of these files are now available on a third CD-ROM, which contains the 16 speech files and a readme.txt listing the contents of the disc. If you purchased the corpus, please write to request the CD. New orders will receive the two CDs and the third disc with the corrected files.

Content Copyright

Portions © 1996 Trustees of the University of Pennsylvania