LDC Spoken Language Sampler - Fifth Release

Item Name: LDC Spoken Language Sampler - Fifth Release
Author(s): Linguistic Data Consortium
LDC Catalog No.: LDC2019S17
ISBN: 1-58563-899-4
DOI: https://doi.org/10.35111/mdhj-5p59
Release Date: September 02, 2019
Member Year(s): 2019
DCMI Type(s): Sound, Text
Online Documentation: LDC2019S17 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Linguistic Data Consortium. LDC Spoken Language Sampler - Fifth Release LDC2019S17. Web Download. Philadelphia: Linguistic Data Consortium, 2019.


LDC (Linguistic Data Consortium) Spoken Language Sampler - Fifth Release contains samples from 19 corpora published by LDC between 1996 and 2019.

LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community, including maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.

Resources available from LDC include speech, text, video and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.

The sampler is available as a free download.


The LDC Spoken Language Sampler - Fifth Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:

  • Most excerpts are truncated to be much shorter than the original files, typically about 2 minutes. Samples shorter than this typically represent the entirety of a single file.
  • Signal amplitude has been adjusted where necessary to normalize playback volume.
  • Some corpora are published in compressed form, but all samples here are uncompressed.
  • Some text files are presented as images to ensure foreign character sets display properly.

In the table below, each catalog number links to the catalog entry for that corpus.

Catalog No. | Corpus | Description
LDC2018S06 | 2011 NIST Language Recognition Evaluation Test Set | 2011 NIST Language Recognition Evaluation Test Set contains selected training data and the evaluation test set for the 2011 NIST Language Recognition Evaluation. It consists of approximately 204 hours of conversational telephone speech and broadcast audio collected by the Linguistic Data Consortium (LDC) in the following 24 languages and dialects: Arabic (Iraqi), Arabic (Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari, English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Punjabi, Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian and Urdu.
LDC2018S14 | AISHELL-1 | AISHELL-1 contains approximately 520 hours of Chinese Mandarin speech from 400 speakers recorded simultaneously on three different devices with associated transcripts. The goal of the collection was to support speech recognition system development in domains such as smart homes, autonomous driving, entertainment, finance, and science and technology.
LDC2018S15 | Avatar Education Portuguese | Avatar Education Portuguese contains approximately 80 minutes of Brazilian Portuguese microphone speech with phonetic and orthographic transcriptions. The data was developed for Avatar Education, an animated virtual assistant designed to enhance communication and interaction in educational contexts, such as online learning.
LDC96S60 | CALLFRIEND Vietnamese | CALLFRIEND Vietnamese consists of approximately 60 unscripted telephone conversations between native speakers of Vietnamese. Each conversation lasted between 5 and 30 minutes. The corpus also includes documentation describing speaker information (sex, age, education, callee telephone number) and call information (channel quality, number of speakers).
LDC2019S07 | CIEMPIESS Experimentation | CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) Experimentation was developed at the National Autonomous University of Mexico (UNAM) and consists of approximately 22 hours of Mexican Spanish broadcast and read speech with associated transcripts. The goal of this work was to create acoustic models for automatic speech recognition.
LDC97S63 | The CMU Kids Corpus | The CMU Kids Corpus was developed in 1995-1996 and is a database of sentences read aloud by 76 children, totaling 5,180 utterances. This data set was designed as a training set of children's speech for the SPHINX II automatic speech recognizer in the LISTEN project at Carnegie Mellon University.
LDC2008S01 | CSLU: Portland Cellular Telephone Speech Version 1.3 | Created by the Center for Spoken Language Understanding (CSLU) at Oregon Health and Science University, CSLU: Portland Cellular Telephone Speech Version 1.3 is a collection of cellular telephone speech (7,571 utterances) and corresponding orthographic and phonetic transcriptions.
LDC2018S01 | DIRHA English WSJ Audio | DIRHA English WSJ Audio is comprised of approximately 85 hours of real and simulated read speech by six native American English speakers. It was developed as part of the Distant-Speech Interaction for Robust Home Applications (DIRHA) Project, which addressed natural spontaneous speech interaction with distant microphones in a domestic environment.
LDC2019S14 | The DKU-JNU-EMA Electromagnetic Articulography Database | The DKU-JNU-EMA Electromagnetic Articulography Database was developed by Duke Kunshan University and Jinan University and contains approximately 10 hours of articulography and speech data in Mandarin, Cantonese, Hakka, and Teochew Chinese from two to seven native speakers for each dialect.
LDC2002S28 | Emotional Prosody Speech and Transcripts | Emotional Prosody Speech and Transcripts was developed by LDC and contains audio recordings and corresponding transcripts, designed to support research in emotional prosody and collected over an eight-month period in 2000-2001. The recordings consist of professional actors reading a series of semantically neutral utterances (dates and numbers) spanning 14 distinct emotional categories.
LDC2019S09 | First DIHARD Challenge Development - Eight Sources | First DIHARD Challenge Development - Eight Sources was developed by LDC and contains approximately 17 hours of English and Chinese speech data along with corresponding annotations used in support of the First DIHARD Challenge. This release, when combined with First DIHARD Challenge Development - SEEDLingS (LDC2019S10), contains the development set audio data and annotation (diarization, segmentation) as well as the official scoring tool.
LDC2017S19 | IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e | IARPA Babel Zulu Language Pack IARPA-babel206b-v0.1e was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 211 hours of Zulu conversational and scripted telephone speech collected in 2012 and 2013 along with corresponding transcripts.
LDC2004S02 | ICSI Meeting Speech | ICSI Meeting Speech contains approximately 72 hours of speech from 53 unique speakers in 75 meetings collected at Berkeley’s International Computer Science Institute (ICSI) in 2000-2002. The recordings were made during regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. The speech files range in length from 17 to 103 minutes, but in general are less than one hour each.
LDC2012S04 | Malto Speech and Transcripts | Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females), accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto.
LDC2018S08 | Multi-Language Conversational Telephone Speech 2011 -- Central European | Multi-Language Conversational Telephone Speech 2011 -- Central European was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 44 hours of telephone speech in two distinct language varieties of Central Europe: Czech and Slovak. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects. Portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation.
LDC2006S13 | N4 NATO Native and Non-Native Speech | N4 NATO Native and Non-Native Speech corpus was developed by the NATO research group on Speech and Language Technology in order to provide a military-oriented database for multilingual and non-native speech processing studies. It consists of 115 native and non-native speakers using NATO English procedure between ships and reading from a text, "The North Wind and the Sun," in both English and the speaker's native language.
LDC2018S10 | RATS Language Identification | RATS Language Identification was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 5,400 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Language Identification (LID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.
LDC2012S06 | Turkish Broadcast News Speech and Transcripts | Turkish Broadcast News Speech and Transcripts was developed by Boğaziçi University, Istanbul, Turkey and contains approximately 130 hours of Voice of America (VOA) Turkish radio broadcasts and corresponding transcripts. This is part of a larger corpus of Turkish broadcast news data collected and transcribed with the goal of facilitating research in Turkish automatic speech recognition and its applications. The VOA material was collected between December 2006 and June 2009 using a PC and TV/radio card setup. The data collected during the period 2006-2008 was recorded from analog FM radio; the 2009 broadcasts were recorded from digital satellite transmissions.
LDC2017S17 | Vehicle City Voices Corpus – Part I | Vehicle City Voices Corpus – Part I was developed at the University of Michigan-Flint, and is an ongoing oral history project and survey of English language variation in Flint, Michigan. It contains approximately 16 hours of speech with corresponding transcripts from 21 interviews of Flint residents conducted between 2012 and 2015. The corpus was designed to provide high-quality recordings for acoustic analysis and to examine narrative structure and discursive construction of individual and collective identity in urban spaces.
