
LDC Spoken Language Sampler

Item Name:	LDC Spoken Language Sampler
Author(s):	Anthony Castelletto, et al.
LDC Catalog No.:	LDC2008S08
ISBN:	1-58563-495-6
ISLRN:	857-539-187-188-1
DOI:	https://doi.org/10.35111/jawx-3z48
Release Date:	November 18, 2008
Member Year(s):	2008
DCMI Type(s):	Sound
Data Source(s):	telephone speech, microphone speech, meeting speech, field recordings
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Castelletto, Anthony, et al. LDC Spoken Language Sampler LDC2008S08. Web Download. Philadelphia: Linguistic Data Consortium, 2008.
Related Works:
hasContinuation:
LDC2013S06	LDC Spoken Language Sampler - Second Release
LDC2015S09	LDC Spoken Language Sampler - Third Release
LDC2017S16	LDC Spoken Language Sampler - Fourth Release
LDC2019S17	LDC Spoken Language Sampler - Fifth Release
LDC2023S07	LDC Spoken Language Sampler - Sixth Release
isSimilarWith:
LDC2010S07	Asian Spoken Language Sampler
relatesTo:
LDC96S35	CALLHOME Spanish Speech
LDC96S37	CALLHOME Japanese Speech
LDC96S50	CALLFRIEND Farsi
LDC96S59	CALLFRIEND Tamil
LDC96T17	CALLHOME Spanish Transcripts
LDC96T18	CALLHOME Japanese Transcripts
LDC2003S02	Grassfields Bantu Fieldwork: Dschang Tone Paradigms
LDC2003S03	Korean Telephone Conversations Speech
LDC2003S05	West Point Russian Speech
LDC2003T08	Korean Telephone Conversations Transcripts
LDC2004S09	NIST Meeting Pilot Corpus Speech
LDC2004T13	NIST Meeting Pilot Corpus Transcripts and Metadata
LDC2005L01	Mawukakan Lexicon
LDC2006S43	Gulf Arabic Conversational Telephone Speech
LDC2006T15	Gulf Arabic Conversational Telephone Speech, Transcripts
LDC2007S15	Nationwide Speech Project
LDC2007S02	Fisher Levantine Arabic Conversational Telephone Speech
LDC2007S18	CSLU: Kids' Speech Version 1.1
LDC2007T04	Fisher Levantine Arabic Conversational Telephone Speech, Transcripts
LDC2008L01	An English Dictionary of the Tamil Verb

Introduction

The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. As of 2008, LDC is a growing consortium of more than 100 companies, universities, and government members, and it has distributed over 50,000 corpora to a global audience. With the support of its members, LDC is able to provide critical services to the language research community. These services include maintaining the data archives, producing and distributing data via media (DVD-ROM or CD-ROM) or web downloads, negotiating intellectual property agreements with potential information providers and would-be members, and maintaining relations with other like-minded groups around the world.

Resources available from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials.

Data

The LDC Spoken Language Sampler provides speech, transcript and lexicon samples and is designed to illustrate the variety and breadth of the resources available from the LDC Publication Catalog. Note the following about the samples:

Most excerpts are truncated to be much shorter than the original files, typically to one minute and thirty seconds of speech.
Signal amplitude has been adjusted where necessary to normalize playback volume.
Some corpora are published in compressed form, but all samples here are uncompressed.
LDC typically uses the NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities.
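
Because the samples are distributed as RIFF/WAV files, their basic properties can be checked with standard tools. The following is a minimal sketch using Python's standard-library wave module. It assumes the file holds plain PCM audio, which is the only encoding that module reads; a sample stored in another encoding would raise wave.Error. The function name wav_summary is hypothetical and is not part of the sampler.

```python
import wave


def wav_summary(path):
    """Return basic header information for a PCM WAV file.

    Assumption: the file is uncompressed PCM; the wave module
    raises wave.Error for encodings it does not support.
    """
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration is the number of frames divided by frames per second.
            "duration_seconds": frames / rate,
        }
```

Calling wav_summary on an extracted sample and reading duration_seconds is a quick way to confirm that an excerpt falls within the stated sample lengths.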

The sampler includes samples from the following corpora and lexicons. Audio samples range from 30 seconds to 90 seconds and are accompanied by transcripts.

An English Dictionary of the Tamil Verb	This dictionary contains translations for over 6,000 English verbs and defines over 9,000 Tamil verbs. Entries include the English word, the Tamil equivalent in transliteration and Tamil script, and audio examples in Spoken Tamil pronunciation.
CALLFRIEND Farsi	A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
CALLFRIEND Tamil	A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil.
CALLHOME Japanese	A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
CALLHOME Spanish	A corpus of 120 unscripted telephone conversations between native Spanish speakers and a corpus of associated transcripts.
CSLU Kids Speech	Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1,100 children from kindergarten through Grade 10.
Fisher Levantine Arabic	A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
Grassfields Bantu Fieldwork: Dschang Tone Paradigms	Tone paradigms from Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon.
Gulf Arabic Conversational Telephone Speech	Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
Korean Telephone Speech	Collection of 100 telephone conversations between native Korean speakers and their transcriptions.
Mawukakan Lexicon	The first publication of an ongoing project aiming to build an electronic dictionary of four Mandekan languages (Eastern Manding languages of the Mande Group of the Niger-Congo family).
Nationwide Speech Project	A database of speech representing current regional accents and dialects of the United States.
NIST Pilot Meeting Speech	Collects speech and transcriptions from topical discussions in meeting settings including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
West Point Russian Speech	Utterances of sentences in Russian from 1,891 native and non-native speakers.

How to Obtain

The LDC Spoken Language Sampler may be downloaded freely. The sampler is distributed as a gzip-compressed (GNU zip) tar file, which most compression utilities can extract.
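
For readers who prefer to unpack the download from a script, here is a minimal sketch using Python's standard-library tarfile module. The archive name and destination directory are supplied by the caller, since the actual download file name is not stated here, and the function name extract_sampler is hypothetical. As a precaution, the sketch refuses to extract any member whose path would land outside the destination directory.

```python
import tarfile
from pathlib import Path


def extract_sampler(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive into dest_dir.

    Returns the sorted relative paths of all files found under dest_dir
    after extraction.
    """
    dest = Path(dest_dir).resolve()
    dest.mkdir(parents=True, exist_ok=True)
    with tarfile.open(archive_path, mode="r:gz") as archive:
        members = archive.getmembers()
        for member in members:
            target = (dest / member.name).resolve()
            # Reject entries that would escape the destination directory.
            if target != dest and dest not in target.parents:
                raise ValueError(f"unsafe path in archive: {member.name}")
        archive.extractall(dest, members=members)
    return sorted(
        p.relative_to(dest).as_posix() for p in dest.rglob("*") if p.is_file()
    )
```

The returned list gives a quick inventory of what was unpacked, for example to locate the audio samples and their transcripts.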

Download

74 MB
