Asian Spoken Language Sampler


Item Name: Asian Spoken Language Sampler
Authors: Linguistic Data Consortium
LDC Catalog No.: LDC2010S07
ISBN: 1-58563-559-6
Data Source(s): microphone speech, telephone speech
Language(s): Cantonese, Farsi, Gulf Arabic, Hindi, Japanese, Korean, Levantine Arabic, Mandarin Chinese, Russian, Tamil, Urdu, Vietnamese
Language ID(s): afb, ajp, apc, cmn, fas, hin, jpn, kor, rus, tam, urd, vie, yue
Distribution: Web Download
Member fee: $0 for 2010 members
Non-member Fee: US $0.00
Reduced-License Fee: US $0.00
Extra-Copy Fee: N/A
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Linguistic Data Consortium. 2010. Asian Spoken Language Sampler. Linguistic Data Consortium, Philadelphia.

Introduction

The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected, readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. With the support of its members, LDC is able to provide critical services to the language research community. These services include: maintaining the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating intellectual property agreements with data providers and maintaining relations with other like-minded groups around the world.

Resources available from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, a searchable catalog is available at http://www.ldc.upenn.edu/Catalog/.

Data

The Asian Spoken Language Sampler provides speech and transcript samples from a range of corpora and is designed to illustrate the variety and breadth of the speech-related resources available from LDC's Catalog. Further information about each data set can be obtained by clicking the links in the table below. The sample files provided in this release have been modified in various ways relative to the original data as published by LDC:

  • most excerpts are truncated to be much shorter than the original files; excerpt duration is typically one minute and thirty seconds
  • signal amplitude has been adjusted where necessary to normalize playback volume
  • some corpora are published in compressed form, but all samples here are uncompressed
  • LDC frequently uses NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities.
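
Because the samples are distributed as uncompressed RIFF/WAV files, their basic properties can be checked with a few lines of Python. The sketch below is only an illustration, not an LDC tool. It assumes each sample is linear PCM (the standard-library wave module rejects other encodings), and the function name describe_wav is hypothetical.

```python
import sys
import wave


def describe_wav(path):
    """Return basic header properties of a PCM RIFF/WAV file.

    Assumption: the file is linear PCM; wave.open raises wave.Error
    for encodings it does not support.
    """
    with wave.open(str(path), "rb") as wav:
        frame_count = wav.getnframes()
        sample_rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": sample_rate,
            "duration_s": frame_count / sample_rate,
        }


if __name__ == "__main__":
    # Usage: pass one or more WAV paths on the command line.
    for sample_path in sys.argv[1:]:
        print(sample_path, describe_wav(sample_path))
```

A reported duration close to 90 seconds would match the typical excerpt length described above.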
2005 NIST Language Recognition Evaluation: The goal of the NIST Language Recognition Evaluation is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field.
2007 NIST Language Recognition Evaluation Test Set: The most significant differences between previous NIST evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions.
ARL Urdu Speech Database, Training Data: The ARL Urdu Speech Database is a collection of recorded speech from 200 adult native Urdu speakers from Pakistan and Northern India.
CALLFRIEND Farsi: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
CALLFRIEND Tamil: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil.
CALLFRIEND Vietnamese: A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Vietnamese.
CALLHOME Japanese: A corpus of 120 unscripted telephone conversations between native Japanese speakers, with associated transcripts.
CALLHOME Mandarin Chinese Speech: The CALLHOME Mandarin Chinese corpus of telephone speech consists of 120 unscripted telephone conversations between native speakers of Mandarin Chinese.
JEIDA/JCSD-Channel 0 Mono Syllables: This collection consists of high-fidelity recordings of 150 native speakers of Japanese. Each speaker produces four repetitions of 323 short prompts, including city names, control words, monosyllabic words, isolated digits and strings of four digits. Each reading session was recorded with two microphones.
Korean Telephone Conversations Speech and Transcripts: This publication consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. All 100 conversations have been transcribed.
Mandarin Affective Speech: Mandarin Affective Speech is a database of emotional speech consisting of audio recordings and corresponding transcripts collected in 2005 at the Advanced Computing and System Laboratory, Zhejiang University. The speech database was recorded by eliciting speakers to express different emotional states in response to stimuli.
Russian through Switched Telephone Network (RuSTeN): The purpose of the project was to develop software for automatic identification of speakers based on voice samples acquired through telephone channels.
TDT4 Multilingual Broadcast News Speech Corpus: This release contains the complete set of American English, Modern Standard Arabic and Mandarin Chinese broadcast news audio used in the 2002 and 2003 Topic Detection and Tracking technology evaluations.
West Point Korean Speech: West Point Korean Speech is a database of digital recordings of spoken Korean. The prompt scripts were created from 20,000 distinct sentences, along with a subset of prompts designed to elicit free response answers to questions for use in domain-specific translation systems.
Fisher Levantine Arabic: A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
Gulf Arabic Conversational Telephone Speech: Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.

How to Obtain the Sampler

The Asian Spoken Language Sampler may be downloaded freely. The sampler is a gzip-compressed (GNU zip) tar file. Most compression utilities will readily extract the sampler.

Download (28 MB)
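
Once downloaded, the gzip-compressed tar archive can also be unpacked programmatically. The following is a minimal sketch using Python's standard tarfile module. The function name extract_sampler is hypothetical, and the archive path must be supplied by the caller, since the actual file name is not given here.

```python
import tarfile
from pathlib import Path


def extract_sampler(archive_path, dest_dir):
    """Unpack a gzip-compressed tar archive and return its member names.

    Both arguments are supplied by the caller; no particular archive
    file name is assumed.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, "r:gz") as archive:
        names = archive.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: refuse unsafe members such as
            # absolute paths or links pointing outside dest.
            archive.extractall(dest, filter="data")
        else:
            # Older Python versions without extraction filters.
            archive.extractall(dest)
    return names
```

The returned list of member names gives a quick inventory of the extracted sample files.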

Content Copyright

Portions © 2010 Trustees of the University of Pennsylvania