CSLU: Multilanguage Telephone Speech Version 1.2

Item Name: CSLU: Multilanguage Telephone Speech Version 1.2
Author(s): Yeshwant Muthusamy, Ronald Allan Cole, Beatrice Oshika
LDC Catalog No.: LDC2006S35
ISBN: 1-58563-390-9
ISLRN: 871-936-811-171-7
DOI: https://doi.org/10.35111/j0p6-f049
Release Date: June 15, 2006
Member Year(s): 2006
DCMI Type(s): Sound, Text
Sample Type: pcm
Sample Rate: 8000
Data Source(s): telephone speech
Application(s): language identification, machine translation
Language(s): Vietnamese, Tamil, Spanish, Iranian Persian, Korean, Japanese, Hindi, French, English, German, Mandarin Chinese
Language ID(s): vie, tam, spa, pes, kor, jpn, hin, fra, eng, deu, cmn
License(s): CSLU Agreement
Online Documentation: LDC2006S35 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Muthusamy, Yeshwant, Ronald Cole, and Beatrice Oshika. CSLU: Multilanguage Telephone Speech Version 1.2 LDC2006S35. Web Download. Philadelphia: Linguistic Data Consortium, 2006.

Introduction

CSLU: Multilanguage Telephone Speech Version 1.2 was developed by the Center for Spoken Language Understanding (CSLU) and consists of approximately 38.5 hours of telephone speech, about eight hours of which have time-aligned phonetic transcripts, in 11 languages: English, Farsi, French, German, Hindi, Japanese, Korean, Mandarin, Spanish, Tamil, and Vietnamese. The corpus contains fixed-vocabulary utterances (e.g., days of the week) as well as fluent continuous speech. The current release includes recorded utterances from about 2,052 speakers, 12,152 speech files, and 619 phonetic transcripts.

This corpus was collected and developed in 1992.

Data

Each subject called the CSLU data collection system by dialing a toll-free number. Most subjects were respondents to postings on Usenet newsgroups. Subjects were asked to contribute their voices to science to help with the research.

Participating subjects responded to prompts that were designed to elicit vocabulary of three types:

fixed and useful -- language spoken, days of the week, numbers
domain-specific -- short open-ended questions
unrestricted -- monologue on subject of choice

An analog telephone line was connected to a Gradient Technologies box. Data from incoming calls were recorded by the Gradient box. The sampling rate was 8 kHz and the files were stored in 16-bit linear format on a UNIX file system. Each utterance was recorded as a separate file.
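
The recording parameters above (8 kHz sampling, 16-bit linear samples) fix the relationship between audio duration and data size. The following is a minimal sketch of that arithmetic, not a description of the corpus file layout: it assumes single-channel audio, and the function names and the header_bytes parameter are hypothetical, since the on-disk header format of the distributed files is not stated here.

```python
SAMPLE_RATE_HZ = 8000   # sampling rate stated in the corpus description
BYTES_PER_SAMPLE = 2    # 16-bit linear samples
CHANNELS = 1            # assumption: single-channel telephone audio


def duration_seconds(num_bytes: int, header_bytes: int = 0) -> float:
    """Estimate the duration of one utterance file from its size in bytes.

    header_bytes is a hypothetical parameter for any file header that
    precedes the samples; leave it at 0 for headerless raw audio.
    """
    payload = num_bytes - header_bytes
    if payload < 0:
        raise ValueError("header_bytes exceeds file size")
    return payload / (SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)


def payload_bytes(hours: float) -> int:
    """Estimate the number of sample bytes needed for a given duration."""
    return round(hours * 3600 * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS)
```

Under these assumptions, one second of speech occupies 16,000 bytes, and payload_bytes(38.5) gives roughly 2.2 GB of sample data for the whole corpus, excluding any headers.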

Samples

For an example of the data in this corpus, please listen to these audio samples in Tamil (WAV) and English (WAV).

Updates

None at this time.
