LDC (Linguistic Data Consortium) Spoken Language Sampler - Second Release, LDC catalog number LDC2013S06 and ISBN 1-58563-653-3, contains samples from 20 different corpora published by LDC between 1996 and 2013.
The Linguistic Data Consortium at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. With the support of its members, LDC is able to provide critical services to the language research community. These services include maintaining the data archives, producing and distributing data via media or web downloads, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.
The LDC Spoken Language Sampler - Second Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC.
The link for the catalog number takes you to the catalog entry.
|LDC2013S05 ||Greybeard ||Greybeard comprises approximately 590 hours of English telephone conversation speech collected in October and November 2008 by LDC. The goal was to record new telephone conversations among subjects who had participated in one or more previous LDC telephone collections, from Switchboard-1 (1991) through the Mixer studies (2006). |
|LDC2013S04 ||GALE Phase 2 Chinese Broadcast Conversation Speech ||GALE Phase 2 Chinese Broadcast Conversation Speech comprises approximately 120 hours of Chinese speech from current events programming featuring interviews, call-in programs and roundtable discussions. |
| LDC2012S06 ||Turkish Broadcast News Speech and Transcripts ||Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts. |
|LDC2012S05 ||USC-SFI MALACH Interviews and Transcripts English ||USC-SFI MALACH Interviews and Transcripts English contains approximately 375 hours of interviews from 784 survivors of the Holocaust along with transcripts and other documentation. |
|LDC2012S04 ||Malto Speech and Transcripts ||Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh. |
|LDC2012S03 ||Digital Archive of Southern Speech ||Digital Archive of Southern Speech contains approximately 370 hours of American English speech data from 30 female speakers and 34 male speakers, along with associated metadata about the speakers and the recordings and maps in .jpeg format relating to the recording locations in the southern United States. |
|LDC2012S02 ||TORGO Database of Dysarthric Articulation ||TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group. |
|LDC2011S08 ||2008 NIST Speaker Recognition Evaluation Test Set ||2008 NIST Speaker Recognition Evaluation Test Set contains 942 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as test data in the 2008 NIST Speaker Recognition Evaluation. |
| LDC2010S05 ||Asian Elephant Vocalizations ||Asian Elephant Vocalizations consists of 57.5 hours of audio recordings of vocalizations by Asian elephants (Elephas maximus) in the Uda Walawe National Park, Sri Lanka, of which 31.25 hours have been annotated. |
| LDC2010S01 ||Fisher Spanish Speech ||Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers. |
| LDC2007S18 ||CSLU Kids Speech ||Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10. |
| LDC2007S15 ||Nationwide Speech Project ||A database of speech representing regional accents and dialects of the United States. |
| LDC2007S02 ||Fisher Levantine Arabic ||A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities. |
| LDC2006S43 ||Gulf Arabic Conversational Telephone Speech ||Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions. |
| LDC2004S09 ||NIST Meeting Pilot Corpus Speech ||Collects speech and transcriptions from topical discussions in meeting settings including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place. |
| LDC2003S05 ||West Point Russian Speech ||Utterances of sentences in Russian from 1,891 native and non-native speakers. |
| LDC2003S03 ||Korean Telephone Speech ||Collection of 100 telephone conversations between native Korean speakers and their transcriptions. |
| LDC2003S02 ||Grassfields Bantu Fieldwork: Dschang Tone Paradigms ||Tone paradigms from Yemba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon. |
| LDC96S50 ||CALLFRIEND Farsi ||A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi. |
| LDC96S37 ||CALLHOME Japanese ||A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts. |