LDC Spoken Language Sampler - Sixth Release

Item Name:	LDC Spoken Language Sampler - Sixth Release
Author(s):	Linguistic Data Consortium
LDC Catalog No.:	LDC2023S07
DOI:	https://doi.org/10.35111/k7w8-4403
Release Date:	August 15, 2023
Member Year(s):	2023
DCMI Type(s):	Sound, Text
Online Documentation:	LDC2023S07 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Linguistic Data Consortium. LDC Spoken Language Sampler - Sixth Release LDC2023S07. Web Download. Philadelphia: Linguistic Data Consortium, 2023.
Related Works:
isContinuationOf:
	LDC2008S08	LDC Spoken Language Sampler
	LDC2013S06	LDC Spoken Language Sampler - Second Release
	LDC2015S09	LDC Spoken Language Sampler - Third Release
	LDC2017S16	LDC Spoken Language Sampler - Fourth Release
	LDC2019S17	LDC Spoken Language Sampler - Fifth Release
relatesTo:
	LDC2020S08	CALLFRIEND American English-Southern Dialect Second Edition
	LDC2020S12	Global TIMIT Mandarin Chinese-Guanzhong Dialect
	LDC2021S01	Althingi Parliamentary Speech
	LDC2021S02	Columbia Games Corpus
	LDC2021S04	The SSNCE Database of Tamil Dysarthric Speech
	LDC2021S05	MyST Children's Conversational Speech
	LDC2021S06	Ethnobotanical Research and Language Documentation of Nahuatl
	LDC2021S07	Wikipedia Spanish Speech and Transcripts
	LDC2021S08	RATS Speaker Identification
	LDC2021S09	UCLA Speaker Variability Database
	LDC2022S01	2017 NIST OpenSAT Pilot - SSSF
	LDC2022S03	Spoken Digits in Hindi and Indian English
	LDC2022S10	2017 NIST Language Recognition Evaluation Training and Development Sets
	LDC2022S05	Samrómur Icelandic Speech 1.0
	LDC2022S08	MASRI Synthetic
	LDC2022S13	Global TIMIT Thai
	LDC2023S03	2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge
	LDC2023S01	AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts
	LDC2023S04	Mixer 7 Spanish Speech
	LDC2023S02	Mixer 3 Speech

Introduction

LDC (Linguistic Data Consortium) Spoken Language Sampler - Sixth Release (LDC2023S07) contains samples from 20 different corpora published by LDC between 2020 and 2023.

LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community, including maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers, and maintaining relations with other like-minded groups around the world.

Resources available from LDC include speech, text, video and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.

The sampler is available as a free download.

Data

The LDC Spoken Language Sampler - Sixth Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts. Most have been truncated to about two minutes, much shorter than the original files. Samples shorter than two minutes typically represent the entirety of a single file.

In the table below, the link for the catalog number takes you to the catalog entry, and the link for the title takes you to further documentation for that corpus.

LDC2022S10	2017 NIST Language Recognition Evaluation Training and Development Sets	2017 NIST Language Recognition Evaluation Training and Development Sets contains training and development material for the 2017 NIST Language Recognition Evaluation. It consists of approximately 2,100 hours of conversational telephone speech, broadcast conversation, broadcast narrow band speech, and speech from video in the following 14 languages, dialects, and varieties: Arabic (Iraqi, Levantine, Maghrebi, Egyptian), English (British, American), Polish, Russian, Portuguese (Brazilian), Spanish (Caribbean, European, Latin American Continental), and Chinese (Mandarin, Min Nan).
LDC2022S01	2017 NIST OpenSAT Pilot - SSSF	2017 NIST OpenSAT Pilot - SSSF was developed by NIST (National Institute of Standards and Technology) and contains approximately one hour of operational speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition (ASR), and keyword search (KWS) tasks of the 2017 OpenSAT Pilot evaluation. The source audio consists of radio and telephone dispatches during the Sofa Super Store fire (Charleston, South Carolina) in June 2007 (SSSF), which claimed the lives of nine firefighters.
LDC2023S03	2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge	2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains approximately 635 hours of Tunisian Arabic telephone recordings for development and test, answer keys, enrollment, trial files and documentation from the CTS Challenge portion of the NIST-sponsored 2019 Speaker Recognition Evaluation (SRE).
LDC2023S01	AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts	AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 156 hours of Ukrainian conversational telephone speech (CTS) and broadcast news audio (BN) with 1.2 million words of corresponding orthographic transcripts.
LDC2021S01	Althingi Parliamentary Speech	Althingi Parliamentary Speech consists of approximately 542 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary and two language models. Speeches date from 2005-2016.
LDC2020S08	CALLFRIEND American English-Southern Dialect Second Edition	CALLFRIEND American English-Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Southern dialects of American English. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata.
LDC2021S02	Columbia Games Corpus	Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation along with corresponding orthographic transcripts and annotation. The recordings feature two subjects playing a series of computer games that require verbal communication to achieve joint goals of identifying and moving images on the screen to reach a combined number of points.
LDC2021S06	Ethnobotanical Research and Language Documentation of Nahuatl	Ethnobotanical Research and Language Documentation of Nahuatl consists of approximately 190 hours of field recordings collected in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico. The corpus contains audio and video recordings of native Nahuatl speakers during the collection of particular plants; partial transcripts (Nahuatl and Spanish); a Highland Puebla Nahuat dictionary; botanical and ethnobotanical data; and speaker metadata.
LDC2020S12	Global TIMIT Mandarin Chinese-Guanzhong Dialect	Global TIMIT Mandarin Chinese-Guanzhong Dialect was developed by LDC and Xi'an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shaanxi province.
LDC2022S13	Global TIMIT Thai	Global TIMIT Thai was developed by the Linguistic Data Consortium and consists of approximately 12 hours of read speech and time-aligned transcripts in Standard Thai.
LDC2022S08	MASRI Synthetic	MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and consists of approximately 99 hours of synthesized Maltese speech.
LDC2023S02	Mixer 3 Speech	Mixer 3 Speech was developed by the Linguistic Data Consortium (LDC) and comprises 3,200 hours of audio recordings of conversational telephone speech involving 3,875 speakers and 26 distinct languages. This material was collected by LDC from 2005-2007 as part of the Mixer project, and recordings in this corpus were used in NIST Speaker Recognition Evaluation (SRE) and NIST Language Recognition Evaluation (LRE) corpora, including 2006 SRE and 2007 LRE.
LDC2023S04	Mixer 7 Spanish Speech	Mixer 7 Spanish Speech was developed by the Linguistic Data Consortium (LDC) and contains 9,600 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 191 distinct native Spanish speakers. This material was collected by LDC in 2011 and 2012 as part of the Mixer project. The recordings in this corpus were used in the 2012 NIST Speaker Recognition Evaluation test set.
LDC2021S05	MyST Children's Conversational Speech	MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It is comprised of approximately 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary.
LDC2021S08	RATS Speaker Identification	RATS Speaker Identification was developed by LDC and is comprised of approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotations of speech segments. The audio was retransmitted over eight channels, making 17,000 hours of total audio. The corpus was created to provide training and development sets for the Speaker Identification (SID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.
LDC2022S05	Samrómur Icelandic Speech 1.0	Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,000 utterances.
LDC2022S03	Spoken Digits in Hindi and Indian English	Spoken Digits in Hindi and Indian English was developed by the Birla Institute of Technology and Science Pilani. It contains approximately two hours of speech comprised of spoken digits from one to ten in Hindi and English with regional accents from across India.
LDC2021S04	The SSNCE Database of Tamil Dysarthric Speech	The SSNCE Database of Tamil Dysarthric Speech was developed by the Speech Lab, SSN College of Engineering, India, in collaboration with the Indian National Institute of Empowerment of Persons with Multiple Disabilities (NIEPMD) and contains approximately eight hours of Tamil speech data, time-aligned transcripts and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers).
LDC2021S09	UCLA Speaker Variability Database	UCLA Speaker Variability Database was developed by UCLA Speech Processing and Auditory Perception Laboratory and is comprised of approximately 34 hours of English speech and orthographic transcripts.
LDC2021S07	Wikipedia Spanish Speech and Transcripts	Wikipedia Spanish Speech and Transcripts consists of approximately 25 hours of Spanish read speech and transcripts. The read text was taken from the Spanish version of WikiProject Spoken Wikipedia, referred to as Wikipedia Grabada. The transcripts were developed for this release.
