CSLU: Multilanguage Telephone Speech Version 1.2
|Item Name:||CSLU: Multilanguage Telephone Speech Version 1.2|
|Author(s):||Yeshwant Muthusamy, Ronald Allan Cole, Beatrice Oshika|
|LDC Catalog No.:||LDC2006S35|
|Release Date:||June 15, 2006|
|Data Source(s):||telephone speech|
|Application(s):||machine translation, language identification|
|Language(s):||Vietnamese, Tamil, Spanish, Iranian Persian, Korean, Japanese, Hindi, French, English, German, Mandarin Chinese|
|Language ID(s):||vie, tam, spa, pes, kor, jpn, hin, fra, eng, deu, cmn|
|Online Documentation:||LDC2006S35 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Muthusamy, Yeshwant, Ronald Cole, and Beatrice Oshika. CSLU: Multilanguage Telephone Speech Version 1.2 LDC2006S35. Web Download. Philadelphia: Linguistic Data Consortium, 2006.|
The Multilanguage Telephone Speech corpus consists of telephone speech in 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances (e.g., days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2,052 speakers, for a total of about 38.5 hours of speech. Time-aligned phonetic transcriptions for 619 of the utterances are also included.
Each subject called the CSLU data collection system by dialing a toll-free number. An analog telephone line was connected to a Gradient Technologies box, which recorded the data from incoming calls. The sampling rate was 8 kHz, and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
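The figures above (8 kHz sampling, 16-bit linear samples, about 38.5 hours from about 2,052 speakers) allow a rough estimate of data rate and corpus size. The following is a minimal sketch of that arithmetic, not part of the corpus documentation. It assumes single-channel audio and ignores any file headers, neither of which is stated in the description, and the function names are hypothetical.

```python
# Back-of-the-envelope sizing from the figures quoted in the description.
# Assumptions (not stated in the source): mono audio, no per-file header bytes.

SAMPLE_RATE_HZ = 8_000    # 8 kHz sampling rate
BYTES_PER_SAMPLE = 2      # 16-bit linear samples
TOTAL_HOURS = 38.5        # approximate total speech in the release
SPEAKERS = 2_052          # approximate number of speakers


def bytes_per_second(channels: int = 1) -> int:
    """Raw audio data rate in bytes per second."""
    return SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * channels


def duration_seconds(num_audio_bytes: int, channels: int = 1) -> float:
    """Duration of an utterance given its raw audio payload size in bytes."""
    return num_audio_bytes / bytes_per_second(channels)


def estimated_corpus_bytes(hours: float = TOTAL_HOURS) -> int:
    """Approximate size of the raw audio for the whole corpus."""
    return int(hours * 3600 * bytes_per_second())


def mean_seconds_per_speaker(hours: float = TOTAL_HOURS,
                             speakers: int = SPEAKERS) -> float:
    """Average amount of recorded speech per speaker, in seconds."""
    return hours * 3600 / speakers


if __name__ == "__main__":
    print(f"Data rate: {bytes_per_second()} bytes/s")
    print(f"Estimated raw audio: {estimated_corpus_bytes() / 1e9:.2f} GB")
    print(f"Mean speech per speaker: {mean_seconds_per_speaker():.1f} s")
```

Under these assumptions the raw audio comes to roughly 2.2 GB, with a little over a minute of speech per speaker on average. The size on disk will differ if the files carry headers or are compressed.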