LDC Spoken Language Sampler - Second Release
|Linguistic Data Consortium
|LDC Catalog No.: LDC2013S06
|August 23, 2013
|Subscription & Standard Members, and Non-Members
|Linguistic Data Consortium. LDC Spoken Language Sampler - Second Release LDC2013S06. Web Download. Philadelphia: Linguistic Data Consortium, 2013.
LDC (Linguistic Data Consortium) Spoken Language Sampler - Second Release contains samples from 20 different corpora published by LDC between 1996 and 2013.
LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. With the support of its members, LDC is able to provide critical services to the language research community. These services include maintaining the data archives, producing and distributing data via media or web downloads, negotiating intellectual property agreements with potential information providers, and maintaining relations with like-minded groups around the world.
Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.
This sampler is available as a free download.
The LDC Spoken Language Sampler - Second Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:
- Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.
- Signal amplitude has been adjusted where necessary to normalize playback volume.
- Some corpora are published in compressed form, but all samples here are uncompressed.
- Some text files are presented as images to ensure foreign character sets display properly.
- Some publications use the NIST SPHERE file format for audio data, but the audio files in this sampler are in MS-WAV (RIFF) format for compatibility with typical browser audio utilities. FLAC files have also been expanded into WAV form.
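The sample properties listed above (short duration, normalized amplitude, uncompressed WAV) can be checked or reproduced with a few lines of Python. The following is a minimal sketch using only the standard library. It is not LDC's processing pipeline, which is not described here. The function names `wav_duration_seconds` and `peak_normalize`, the 0.9 target peak, and the assumption of 16-bit PCM samples are all illustrative choices.

```python
import array
import wave


def wav_duration_seconds(path):
    """Return the duration of a RIFF/WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def peak_normalize(samples, target_peak=0.9, full_scale=32767):
    """Scale 16-bit PCM samples so the loudest one sits at target_peak of full scale.

    Assumption: samples are signed 16-bit integers. Peak normalization is one
    simple way to even out playback volume; it is not necessarily the method
    used to prepare the sampler.
    """
    peak = max((abs(s) for s in samples), default=0)
    if peak == 0:
        return array.array("h", samples)  # silence: nothing to scale
    gain = target_peak * full_scale / peak
    scaled = (int(round(s * gain)) for s in samples)
    # Clamp to guard against rounding past the 16-bit range.
    return array.array("h", (max(-32768, min(32767, v)) for v in scaled))
```

Calling `wav_duration_seconds` on a downloaded excerpt should return a value of roughly 90 to 120 seconds for most files, given the truncation described above.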
The link for the catalog number takes you to the catalog entry.
|Greybeard consists of approximately 590 hours of English telephone conversation speech collected in October and November 2008 by LDC. The goal was to record new telephone conversations among subjects who had participated in one or more previous LDC telephone collections, from Switchboard-1 (1991) through the Mixer studies (2006).
|GALE Phase 2 Chinese Broadcast Conversation Speech
|GALE Phase 2 Chinese Broadcast Conversation Speech consists of approximately 120 hours of Chinese speech from current events programming featuring interviews, call-in programs and roundtable discussions.
|Turkish Broadcast News Speech and Transcripts
|Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts.
|USC-SFI MALACH Interviews and Transcripts English
|USC-SFI MALACH Interviews and Transcripts English contains approximately 375 hours of interviews from 784 survivors of the Holocaust along with transcripts and other documentation.
|Malto Speech and Transcripts
|Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh.
|Digital Archive of Southern Speech
|Digital Archive of Southern Speech contains approximately 370 hours of American English speech data from 30 female speakers and 34 male speakers, along with associated metadata about the speakers and the recordings and maps in .jpeg format relating to the recording locations in the southern United States.
|TORGO Database of Dysarthric Articulation
|TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
|2008 NIST Speaker Recognition Evaluation Test Set
|2008 NIST Speaker Recognition Evaluation Test Set contains 942 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as test data in the 2008 NIST Speaker Recognition Evaluation.
|Asian Elephant Vocalizations
|Asian Elephant Vocalizations consists of 57.5 hours of audio recordings of vocalizations by Asian elephants (Elephas maximus) in the Uda Walawe National Park, Sri Lanka, of which 31.25 hours have been annotated.
|Fisher Spanish Speech
|Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.
|CSLU Kids Speech
|Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10.
|Nationwide Speech Project
|A database of speech representing regional accents and dialects of the United States.
|Fisher Levantine Arabic
|A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
|Gulf Arabic Conversational Telephone Speech
|Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
|NIST Meeting Pilot Corpus Speech
|Collects speech and transcriptions from topical discussions in meeting settings including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
|West Point Russian Speech
|Utterances of sentences in Russian from 1,891 native and non-native speakers.
|Korean Telephone Speech
|Collection of 100 telephone conversations between native Korean speakers and their transcriptions.
|Grassfields Bantu Fieldwork: Dschang Tone Paradigms
|Tone paradigms from Yemba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon.
|A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
|A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.