
Asian Spoken Language Sampler

Item Name:	Asian Spoken Language Sampler
Author(s):	Linguistic Data Consortium
LDC Catalog No.:	LDC2010S07
ISBN:	1-58563-559-6
ISLRN:	042-211-152-679-3
DOI:	https://doi.org/10.35111/e3jx-tv33
Member Year(s):	2010
Data Source(s):	telephone speech, microphone speech
Language(s):	Yue Chinese, Vietnamese, Urdu, Tamil, Russian, Korean, Japanese, Hindi, Persian, Mandarin Chinese, North Levantine Arabic, South Levantine Arabic, Gulf Arabic, Dari, Iranian Persian
Language ID(s):	yue, vie, urd, tam, rus, kor, jpn, hin, fas, cmn, apc, ajp, afb, prs, pes
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Linguistic Data Consortium. Asian Spoken Language Sampler LDC2010S07. Web Download. Philadelphia: Linguistic Data Consortium, 2010.
Related Works:
isSimilarWith:
LDC2008S08	LDC Spoken Language Sampler
LDC2013S06	LDC Spoken Language Sampler - Second Release
LDC2015S09	LDC Spoken Language Sampler - Third Release
LDC2017S16	LDC Spoken Language Sampler - Fourth Release
LDC2019S17	LDC Spoken Language Sampler - Fifth Release
relatesTo:
LDC96S34	CALLHOME Mandarin Chinese Speech
LDC96S37	CALLHOME Japanese Speech
LDC96S50	CALLFRIEND Farsi
LDC96S59	CALLFRIEND Tamil
LDC96S60	CALLFRIEND Vietnamese
LDC96S64-5	JEIDA/JCSD-Channel 0 Mono Syllables
LDC96T18	CALLHOME Japanese Transcripts
LDC2003S03	Korean Telephone Conversations Speech
LDC2003T08	Korean Telephone Conversations Transcripts
LDC2005S11	TDT4 Multilingual Broadcast News Speech Corpus
LDC2006S34	Russian through Switched Telephone Network (RuSTeN)
LDC2006S36	West Point Korean Speech
LDC2006S43	Gulf Arabic Conversational Telephone Speech
LDC2006T15	Gulf Arabic Conversational Telephone Speech, Transcripts
LDC2007S09	Mandarin Affective Speech
LDC2007S02	Fisher Levantine Arabic Conversational Telephone Speech
LDC2007S03	ARL Urdu Speech Database, Training Data
LDC2007T04	Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
LDC2008S05	2005 NIST Language Recognition Evaluation
LDC2009S04	2007 NIST Language Recognition Evaluation Test Set

Introduction

The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected, readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. With the support of its members, LDC is able to provide critical services to the language research community. These services include: maintaining the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating intellectual property agreements with data providers and maintaining relations with other like-minded groups around the world.

Resources available from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, a searchable catalog is available at http://www.ldc.upenn.edu/Catalog/.

Data

The Asian Spoken Language Sampler provides speech and transcript samples drawn from a range of corpora and is designed to illustrate the variety and breadth of the speech-related resources available from LDC's Catalog. Further information about each data set can be obtained by clicking the links in the table below. The sample files provided in this release have been modified in various ways relative to the original data as published by LDC:

most excerpts are truncated to be much shorter than the original files; excerpt duration is typically one minute and thirty seconds
signal amplitude has been adjusted where necessary to normalize playback volume
some corpora are published in compressed form, but all samples here are uncompressed
LDC frequently uses NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities.
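
Because the samples are described above as uncompressed MS-WAV (RIFF) files, their headers can be checked with Python's standard `wave` module. The following is a minimal sketch, assuming the files are plain PCM WAV (the module does not read other encodings); the function name `describe_wav` and the example file name are hypothetical and not part of the sampler.

```python
import wave


def describe_wav(path):
    """Return basic header fields of a PCM RIFF/WAV file.

    Assumption: the file is uncompressed PCM, as the sampler notes suggest.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate,
        }


if __name__ == "__main__":
    # Hypothetical file name; substitute a file extracted from the sampler.
    print(describe_wav("sample.wav"))
```

A reported duration near 90 seconds would be consistent with the typical excerpt length stated above.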

2005 NIST Language Recognition Evaluation	The goal of the NIST Language Recognition Evaluation is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field.
2007 NIST Language Recognition Evaluation Test Set	The most significant differences between previous NIST evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions.
ARL Urdu Speech Database, Training Data	The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India.
CALLFRIEND Farsi	A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
CALLFRIEND Tamil	A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil.
CALLFRIEND Vietnamese	A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Vietnamese.
CALLHOME Japanese	A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
CALLHOME Mandarin Chinese Speech	The Callhome Mandarin Chinese corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese.
JEIDA/JCSD-Channel 0 Mono Syllables	This collection consists of high-fidelity recordings of 150 native speakers of Japanese. Each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones.
Korean Telephone Conversations Speech and Transcripts	This publication consists of 100 telephone conversations, 49 of which were published in 1996 as Callfriend Korean; the remaining 51 are previously unexposed calls. All 100 conversations have been transcribed.
Mandarin Affective Speech	Mandarin Affective Speech is a database of emotional speech consisting of audio recordings and corresponding transcripts collected in 2005 at the Advanced Computing and System Laboratory, Zhejiang University. The speech database was recorded by prompting speakers to express different emotional states in response to stimuli.
Russian through Switched Telephone Network (RuSTeN)	The purpose of the project was to develop software for automatic identification of speakers based on voice samples acquired through telephone channels.
TDT4 Multilingual Broadcast News Speech Corpus	This release contains the complete set of American English, Modern Standard Arabic and Mandarin Chinese broadcast news audio used in the 2002 and 2003 Topic Detection and Tracking technology evaluations.
West Point Korean Speech	West Point Korean Speech is a database of digital recordings of spoken Korean. The prompt scripts were created from 20,000 distinct sentences, along with a subset of prompts designed to elicit free response answers to questions for use in domain-specific translation systems.
Fisher Levantine Arabic	A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
Gulf Arabic Conversational Telephone Speech	Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.

How to Obtain the Sampler

The Asian Spoken Language Sampler may be downloaded freely. The sampler is distributed as a gzip-compressed (GNU zip) tar file, which most compression utilities will readily extract.
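
For readers who prefer a script to a desktop utility, a gzip-compressed tar file can be unpacked with Python's standard `tarfile` module. This is a sketch under stated assumptions: the archive file name and the function name `extract_sampler` are hypothetical, and the path check is a general precaution rather than something the sampler requires.

```python
import tarfile
from pathlib import Path


def extract_sampler(archive_path, dest_dir):
    """Extract a .tar.gz archive into dest_dir and list the extracted files."""
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as tar:
        # Refuse members that would be written outside the destination.
        for member in tar.getmembers():
            target = (dest / member.name).resolve()
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        # Use the stricter "data" extraction filter where Python provides it.
        if hasattr(tarfile, "data_filter"):
            tar.extractall(dest, filter="data")
        else:
            tar.extractall(dest)
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )


if __name__ == "__main__":
    # Hypothetical archive name; use the name of the downloaded file.
    for name in extract_sampler("asian_sampler.tgz", "asian_sampler"):
        print(name)
```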

Download

28 MB
