CSLU: Multilanguage Telephone Speech Version 1.2
Item Name: | CSLU: Multilanguage Telephone Speech Version 1.2 |
Author(s): | Yeshwant Muthusamy, Ronald Allan Cole, Beatrice Oshika |
LDC Catalog No.: | LDC2006S35 |
ISBN: | 1-58563-390-9 |
ISLRN: | 871-936-811-171-7 |
DOI: | https://doi.org/10.35111/j0p6-f049 |
Release Date: | June 15, 2006 |
Member Year(s): | 2006 |
DCMI Type(s): | Sound, Text |
Sample Type: | pcm |
Sample Rate: | 8000 |
Data Source(s): | telephone speech |
Application(s): | language identification, machine translation |
Language(s): | Vietnamese, Tamil, Spanish, Iranian Persian, Korean, Japanese, Hindi, French, English, German, Mandarin Chinese |
Language ID(s): | vie, tam, spa, pes, kor, jpn, hin, fra, eng, deu, cmn |
License(s): | CSLU Agreement |
Online Documentation: | LDC2006S35 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Muthusamy, Yeshwant, Ronald Cole, and Beatrice Oshika. CSLU: Multilanguage Telephone Speech Version 1.2 LDC2006S35. Web Download. Philadelphia: Linguistic Data Consortium, 2006. |
Introduction
CSLU: Multilanguage Telephone Speech Version 1.2 was developed by the Center for Spoken Language Understanding (CSLU) and consists of approximately 38.5 hours of telephone speech, about eight hours of which have time-aligned phonetic transcripts, in 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances (e.g., days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2,052 speakers, 12,152 speech files, and 619 phonetic transcripts.
This corpus was collected and developed in 1992.
Data
Each subject called the CSLU data collection system by dialing a toll-free number. Most subjects were respondents to postings on Usenet newsgroups. Subjects were asked to contribute their voices to science to help with the research.
Participating subjects responded to prompts that were designed to elicit vocabulary of three types:
fixed and useful -- language spoken, days of the week, numbers
domain-specific -- short open-ended questions
unrestricted -- monologue on subject of choice
An analog telephone line was connected to a Gradient Technologies box, which recorded the data from incoming calls. The sampling rate was 8 kHz, and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
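The recording parameters above (8 kHz sampling, 16-bit linear samples) fix the relationship between audio size and duration. The following is a minimal sketch of that arithmetic, not a reader for the corpus files. It assumes single-channel audio and counts only sample bytes, so any file header would have to be excluded first. The function names are illustrative and do not come from the corpus documentation.

```python
SAMPLE_RATE_HZ = 8000    # from the corpus description
BYTES_PER_SAMPLE = 2     # 16-bit linear samples
CHANNELS = 1             # assumption: mono telephone audio


def duration_seconds(num_audio_bytes: int) -> float:
    """Duration of the audio given the number of sample bytes (header excluded)."""
    bytes_per_second = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS
    return num_audio_bytes / bytes_per_second


def audio_bytes(hours: float) -> int:
    """Number of sample bytes needed to store the given number of hours."""
    return round(hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


if __name__ == "__main__":
    print(duration_seconds(160_000))  # 10.0 seconds
    print(audio_bytes(38.5))          # 2217600000 bytes
```

Under these assumptions, one second of speech occupies 16,000 bytes, so the roughly 38.5 hours of speech in the corpus correspond to about 2.2 GB of uncompressed sample data.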
Samples
For an example of the data in this corpus, please listen to these audio samples in Korean (WAV), Tamil (WAV) and English (WAV).
Updates
None at this time.