LDC Spoken Language Sampler - Fourth Release

Item Name: LDC Spoken Language Sampler - Fourth Release
Author(s): Linguistic Data Consortium
LDC Catalog No.: LDC2017S16
ISBN: 1-58563-811-0
DOI: https://doi.org/10.35111/94k3-cf05
Member Year(s): 2017
DCMI Type(s): Sound, Text
Online Documentation: LDC2017S16 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Linguistic Data Consortium. LDC Spoken Language Sampler - Fourth Release LDC2017S16. Web Download. Philadelphia: Linguistic Data Consortium, 2017.


LDC (Linguistic Data Consortium) Spoken Language Sampler - Fourth Release, LDC catalog number LDC2017S16 and ISBN 1-58563-811-0, contains samples from 18 different corpora published by LDC between 1996 and 2017.

LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community, including maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers, and maintaining relations with other like-minded groups around the world.

Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.

The sampler is available as a free download.


The LDC Spoken Language Sampler - Fourth Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:

  • Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.
  • Signal amplitude has been adjusted where necessary to normalize playback volume.
  • Some corpora are published in compressed form, but all samples here are uncompressed.
  • Some text files are presented as images to ensure foreign character sets display properly.
  • In some publications, NIST SPHERE file format is used for audio data, but the audio files in this sampler are in MS-WAV (RIFF) format for compatibility with typical browser audio utilities. FLAC files have likewise been expanded to WAV.
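
The list above describes truncation, amplitude adjustment and delivery as uncompressed WAV. As a rough illustration only, the following Python sketch applies the same kinds of steps to a WAV file using the standard library. It is not LDC's actual procedure: the function name `make_excerpt`, the 120-second default, the 0.9 peak target and the restriction to 16-bit PCM input are all assumptions made for this example.

```python
import array
import sys
import wave


def make_excerpt(src_path, dst_path, max_seconds=120.0, target_peak=0.9):
    """Truncate a 16-bit PCM WAV file and peak-normalize the excerpt.

    Hypothetical helper for illustration; returns the excerpt duration
    in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        # Keep at most max_seconds of audio from the start of the file.
        n_frames = min(params.nframes, int(max_seconds * params.framerate))
        raw = src.readframes(n_frames)

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Scale so the loudest sample sits at target_peak of full scale.
    peak = max((abs(s) for s in samples), default=0)
    if peak > 0:
        gain = target_peak * 32767 / peak
        samples = array.array(
            "h", (max(-32768, min(32767, round(s * gain))) for s in samples)
        )

    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(samples.tobytes())
    return n_frames / params.framerate
```

Peak normalization is only one simple way to even out playback volume; the sampler's documentation does not say which method was used. Files in NIST SPHERE or FLAC format would need to be converted to WAV with a separate tool before a sketch like this could read them.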

The link for the catalog number takes you to the catalog entry, and the link for the title takes you to further documentation for that corpus.

LDC2017S06: 2010 NIST Speaker Recognition Evaluation Test Set
2010 NIST Speaker Recognition Evaluation Test Set was developed by LDC and NIST (National Institute of Standards and Technology). It contains 2,255 hours of American English telephone speech and interview speech recorded over a microphone channel used as test data in the NIST-sponsored 2010 Speaker Recognition Evaluation (SRE).

LDC2015S10: Arabic Learner Corpus
Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by Arabic learners collected in Saudi Arabia in 2012 and 2013. The corpus includes 282,732 words in 1,585 materials, produced by 942 students from 67 nationalities studying at pre-university and university levels. The average length of an essay is 178 words.

LDC2015S12: Articulation Index LSCP
Articulation Index LSCP was developed by researchers at Laboratoire de Sciences Cognitives et Psycholinguistique (LSCP), Ecole Normale Supérieure. It revises and enhances a subset of Articulation Index (AIC) (LDC2005S22), a corpus of persons speaking English syllables. Changes include the addition of forced alignment to sound files, time alignment of syllable utterances and format conversions.

LDC2014S01: CALLFRIEND Farsi Second Edition Speech
CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The CALLFRIEND project supported the development of language identification technology. Each CALLFRIEND corpus consists of unscripted telephone conversations lasting between 5 and 30 minutes.

LDC2016S04: CHM150
CHM150 (Corpus Hecho en México 150) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 1.63 hours of Mexican Spanish speech, associated transcripts, and speaker metadata. The goal of this work was to support spoken term detection and forensic speaker identification.

LDC2007S18: CSLU: Kids' Speech Version 1.1
CSLU: Kids' Speech Version 1.1 is a collection of spontaneous and prompted speech from 1100 children between Kindergarten and Grade 10 in the Forest Grove School District in Oregon. Approximately 100 children at each grade level read around 60 items from a total list of 319 phonetically-balanced but simple words, sentences or digit strings. Each utterance of spontaneous speech begins with a recitation of the alphabet and contains a monologue of about one minute in length. This release consists of 1017 files containing approximately 8-10 minutes of speech per speaker. Corresponding word-level transcriptions are also included.

LDC2016S12: IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a
IARPA Babel Georgian Language Pack IARPA-babel404b-v1.0a was developed by Appen for the IARPA (Intelligence Advanced Research Projects Activity) Babel program. It contains approximately 190 hours of Georgian conversational and scripted telephone speech collected in 2014-2015 along with corresponding transcripts.

LDC2003S07: Korean Telephone Conversations Complete (S), (T), (L)
The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unexposed calls. Korean Telephone Conversations Transcripts consists of 100 text files, totaling approximately 190K words and 25K unique words. All files are in Korean orthography: orthographic Korean characters are in Hangul, encoded in the KSC5601 (Wansung) system. The complete set of Korean Telephone Conversations also includes a transcript corpus (LDC2003T08) and a lexicon (LDC2003L02).

LDC2012S04: Malto Speech and Transcripts
Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females), accompanying transcripts, English translations and glosses for 6 hours of the collection. Speakers were asked to talk about themselves, their lives, rituals and folklore; elicitation interviews were then conducted. The goal of the work was to present the current state and dialectal variation of Malto.

LDC2015S04: Mandarin-English Code-Switching in South-East Asia
Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia and includes approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.

LDC2017S11: Metalogue Multi-Issue Bargaining Dialogue
Metalogue Multi-Issue Bargaining Dialogue was developed by the Metalogue Consortium under the European Community's Seventh Framework Programme for Research and Technological Development. This release consists of approximately 2.5 hours of semantically annotated English dialogue data that includes speech and transcripts.

LDC2016S11: Multi-Language Conversational Telephone Speech 2011 -- Slavic Group
Multi-Language Conversational Telephone Speech 2011 -- Slavic Group was developed by LDC and is comprised of approximately 60 hours of telephone speech in Polish, Russian and Ukrainian. The data was collected to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects. Portions of these telephone calls were used in the NIST 2011 Language Recognition Evaluation.

LDC2017S09: Multi-Language Conversational Telephone Speech 2011 -- Turkish
Multi-Language Conversational Telephone Speech 2011 -- Turkish was developed by LDC and is comprised of approximately 18 hours of telephone speech in Turkish. The data was collected primarily to support research and technology evaluation in automatic language identification, specifically language pair discrimination for closely related languages/dialects.

LDC2004S09: NIST Meeting Pilot Corpus Speech
The audio data included in this corpus was collected in the NIST Meeting Data Collection Laboratory for the NIST Automatic Meeting Recognition Project. The corresponding transcripts are available as the NIST Meeting Pilot Corpus Transcripts and Metadata (LDC2004T13), while the video files will be published later as NIST Meeting Pilot Corpus Video. For more information regarding the data collection conditions, meeting scenarios, transcripts, speaker information, recording logs, errata, and other ancillary data for the corpus, please consult the NIST project website for this corpus.

LDC2017S04: Noisy TIMIT Speech
Noisy TIMIT Speech was developed by the Florida Institute of Technology and contains approximately 322 hours of speech from the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) modified with different additive noise levels. Only the audio has been modified; the original arrangement of the TIMIT corpus is still as described by the TIMIT documentation.

LDC2015S08: The Walking Around Corpus
The Walking Around Corpus was developed by Stony Brook University and is comprised of approximately 33 hours of navigational telephone dialogues from 72 speakers (36 speaker pairs). Participants were Stony Brook University students who identified themselves as native English speakers.

LDC2012S02: TORGO Database of Dysarthric Articulation
TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.

LDC2014S04: USC-SFI MALACH Interviews and Transcripts Czech
USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.
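
The Korean Telephone Conversations entry above states that its transcripts are encoded in KSC5601 (Wansung). A minimal Python sketch for converting such text to UTF-8 follows. It assumes the files use the EUC-KR byte encoding of KSC5601, which Python exposes as the `euc_kr` codec; the corpus documentation should be checked before relying on this. The function name `ksc5601_to_utf8` is hypothetical.

```python
def ksc5601_to_utf8(raw: bytes) -> bytes:
    """Re-encode KSC5601 (Wansung) bytes as UTF-8.

    Assumption: the input uses the EUC-KR byte layout of KSC5601.
    Decoding is strict, so unexpected bytes raise UnicodeDecodeError
    instead of being silently replaced.
    """
    text = raw.decode("euc_kr")
    return text.encode("utf-8")


# Usage with a short Hangul string standing in for transcript text.
legacy_bytes = "한국어 전화 대화".encode("euc_kr")
print(ksc5601_to_utf8(legacy_bytes).decode("utf-8"))
```

Strict decoding is a deliberate choice here: if a file turns out to use a different encoding, a loud failure is easier to diagnose than corrupted Hangul.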
