CSLU: Multilanguage Telephone Speech Version 1.2

Item Name: CSLU: Multilanguage Telephone Speech Version 1.2
Author(s): Yeshwant Muthusamy, Ronald Cole, Beatrice Oshika
LDC Catalog No.: LDC2006S35
ISBN: 1-58563-390-9
ISLRN: 871-936-811-171-7
Release Date: June 15, 2006
Member Year(s): 2006
DCMI Type(s): Sound
Sample Type: pcm
Sample Rate: 8000
Data Source(s): telephone speech
Application(s): machine translation, language identification
Language(s): Vietnamese, Tamil, Spanish, Iranian Persian, Korean, Japanese, Hindi, French, English, German, Mandarin Chinese
Language ID(s): vie, tam, spa, pes, kor, jpn, hin, fra, eng, deu, cmn
License(s): CSLU Agreement
Online Documentation: LDC2006S35 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Muthusamy, Yeshwant, Ronald Cole, and Beatrice Oshika. CSLU: Multilanguage Telephone Speech Version 1.2 LDC2006S35. Web Download. Philadelphia: Linguistic Data Consortium, 2006.


The Multilanguage Telephone Speech corpus consists of telephone speech in 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances (e.g., days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2,052 speakers, for a total of about 38.5 hours of speech. Time-aligned phonetic transcriptions for 619 of the utterances are also included.


Each subject called the CSLU data collection system by dialing a toll-free number. An analog telephone line was connected to a Gradient Technologies box, which recorded the data from incoming calls. The sampling rate was 8 kHz, and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
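
The recording parameters above (8 kHz sampling rate, 16-bit linear samples) fix the relationship between the size of the audio data and its duration: each second of speech occupies 8000 x 2 = 16,000 bytes. The following Python sketch illustrates that arithmetic and decodes a buffer of samples. It is not part of the corpus documentation. It assumes headerless sample data and little-endian byte order, neither of which is stated above, and the function names are hypothetical; any file header present in the distributed files would need to be skipped first.

```python
import struct

SAMPLE_RATE_HZ = 8000    # stated in the corpus description
BYTES_PER_SAMPLE = 2     # 16-bit linear samples


def duration_seconds(num_audio_bytes):
    """Duration of a buffer of 8 kHz, 16-bit mono audio, in seconds."""
    return num_audio_bytes / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)


def decode_samples(raw, byteorder="<"):
    """Decode raw bytes into signed 16-bit integers.

    byteorder "<" (little-endian) is an assumption; pass ">" for big-endian.
    A trailing odd byte, if any, is ignored.
    """
    count = len(raw) // BYTES_PER_SAMPLE
    return list(struct.unpack(f"{byteorder}{count}h", raw[:count * BYTES_PER_SAMPLE]))


if __name__ == "__main__":
    # Synthetic one-second buffer of silence, not corpus data.
    silence = bytes(SAMPLE_RATE_HZ * BYTES_PER_SAMPLE)
    print(duration_seconds(len(silence)))
    print(len(decode_samples(silence)))
```

If decoded samples look like noise rather than speech, the byte-order assumption is the first thing to check.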


For an example of the data in this corpus, please listen to these audio samples in Tamil and English.
