LDC Spoken Language Sampler - Sixth Release
Item Name: | LDC Spoken Language Sampler - Sixth Release |
Author(s): | Linguistic Data Consortium |
LDC Catalog No.: | LDC2023S07 |
DOI: | https://doi.org/10.35111/k7w8-4403 |
Release Date: | August 15, 2023 |
Member Year(s): | 2023 |
DCMI Type(s): | Sound, Text |
Online Documentation: | LDC2023S07 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Linguistic Data Consortium. LDC Spoken Language Sampler - Sixth Release LDC2023S07. Web Download. Philadelphia: Linguistic Data Consortium, 2023. |
Introduction
LDC (Linguistic Data Consortium) Spoken Language Sampler - Sixth Release (LDC2023S07) contains samples from 20 different corpora published by LDC between 2020 and 2023.
LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.
Resources available from LDC include speech, text, video and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.
The sampler is available as a free download.
Data
The LDC Spoken Language Sampler - Sixth Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts. Most are truncated to about two minutes, much shorter than the original files. Samples shorter than two minutes typically represent the entirety of a single file.
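The process used to produce the excerpts is not described here. As a rough illustration only, the sketch below shows one way a recording could be cut to a two-minute sample with Python's standard `wave` module, assuming the source is an uncompressed PCM WAV file. The function name `truncate_wav` and its parameters are hypothetical and are not part of any LDC tooling.

```python
import wave


def truncate_wav(src_path, dst_path, max_seconds=120.0):
    """Copy at most max_seconds of audio from src_path to dst_path.

    Assumes an uncompressed PCM WAV input. Returns the duration, in
    seconds, of the audio actually written.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        # Keep the whole file if it is already shorter than the limit.
        frames_to_keep = min(
            src.getnframes(), int(max_seconds * src.getframerate())
        )
        audio = src.readframes(frames_to_keep)

    with wave.open(dst_path, "wb") as dst:
        # Reuse channel count, sample width and sample rate from the source;
        # the frame count in the header is corrected when the data is written.
        dst.setparams(params)
        dst.writeframes(audio)

    return frames_to_keep / params.framerate
```

A call such as `truncate_wav("full_recording.wav", "sample.wav")` (both file names are placeholders) would write the first two minutes, or the whole file if it is shorter, which mirrors the behavior described above for samples under two minutes.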
In the table below, the catalog number links to the catalog entry, and the title links to further documentation for that corpus.
LDC2022S10 | 2017 NIST Language Recognition Evaluation Training and Development Sets | 2017 NIST Language Recognition Evaluation Training and Development Sets contains training and development material for the 2017 NIST Language Recognition Evaluation. It consists of approximately 2,100 hours of conversational telephone speech, broadcast conversation, broadcast narrow band speech, and speech from video in the following 14 languages, dialects, and varieties: Arabic (Iraqi, Levantine, Maghrebi, Egyptian), English (British, American), Polish, Russian, Portuguese (Brazilian), Spanish (Caribbean, European, Latin American Continental), and Chinese (Mandarin, Min Nan). |
LDC2022S01 | 2017 NIST OpenSAT Pilot - SSSF | 2017 NIST OpenSAT Pilot - SSSF was developed by NIST (National Institute of Standards and Technology) and contains approximately one hour of operational speech data, transcripts and annotation files used in the speech activity detection, automatic speech recognition (ASR), and keyword search (KWS) tasks of the 2017 OpenSAT Pilot evaluation. The source audio consists of radio and telephone dispatches during the Sofa Super Store fire (SSSF) in Charleston, South Carolina, in June 2007, which claimed the lives of nine firefighters. |
LDC2023S03 | 2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge | 2019 NIST Speaker Recognition Evaluation Test Set -- CTS Challenge was developed by the Linguistic Data Consortium (LDC) and NIST (National Institute of Standards and Technology). It contains approximately 635 hours of Tunisian Arabic telephone recordings for development and test, answer keys, enrollment, trial files and documentation from the CTS Challenge portion of the NIST-sponsored 2019 Speaker Recognition Evaluation (SRE). |
LDC2023S01 | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts | AIDA Ukrainian Broadcast and Telephone Speech Audio and Transcripts was developed by the Linguistic Data Consortium (LDC) and is comprised of approximately 156 hours of Ukrainian conversational telephone speech (CTS) and broadcast news audio (BN) with 1.2 million words of corresponding orthographic transcripts. |
LDC2021S01 | Althingi Parliamentary Speech | Althingi Parliamentary Speech consists of approximately 542 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary and two language models. Speeches date from 2005-2016. |
LDC2020S08 | CALLFRIEND American English-Southern Dialect Second Edition | CALLFRIEND American English-Southern Dialect Second Edition was developed by LDC and consists of approximately 26 hours of unscripted telephone conversations between native speakers of Southern dialects of American English. This second edition updates the audio files to wav format, simplifies the directory structure and adds documentation and metadata. |
LDC2021S02 | Columbia Games Corpus | Columbia Games Corpus was developed by the Spoken Language Group, Columbia University and the Department of Linguistics, Northwestern University. It consists of approximately 10 hours of spontaneous English conversation along with corresponding orthographic transcripts and annotation. Each recording features two subjects playing a series of computer games that require verbal communication to achieve the joint goals of identifying and moving images on the screen to reach a combined number of points. |
LDC2021S06 | Ethnobotanical Research and Language Documentation of Nahuatl | Ethnobotanical Research and Language Documentation of Nahuatl consists of approximately 190 hours of field recordings collected in the Sierra Nororiental and Sierra Norte regions of Puebla, Mexico. The corpus contains audio and video recordings of native Nahuatl speakers during the collection of particular plants; partial transcripts (Nahuatl and Spanish); a Highland Puebla Nahuat dictionary; botanical and ethnobotanical data; and speaker metadata. |
LDC2020S12 | Global TIMIT Mandarin Chinese-Guanzhong Dialect | Global TIMIT Mandarin Chinese-Guanzhong Dialect was developed by LDC and Xi'an Jiaotong University and consists of approximately five hours of read speech and transcripts in the Guanzhong dialect of Mandarin Chinese as spoken in Shaanxi province. |
LDC2022S13 | Global TIMIT Thai | Global TIMIT Thai was developed by the Linguistic Data Consortium and consists of approximately 12 hours of read speech and time-aligned transcripts in Standard Thai. |
LDC2022S08 | MASRI Synthetic | MASRI (Maltese Automatic Speech Recognition I) Synthetic was developed by the MASRI team at the University of Malta and consists of approximately 99 hours of synthesized Maltese speech. |
LDC2023S02 | Mixer 3 Speech | Mixer 3 Speech was developed by the Linguistic Data Consortium (LDC) and comprises 3,200 hours of audio recordings of conversational telephone speech involving 3,875 speakers and 26 distinct languages. This material was collected by LDC from 2005-2007 as part of the Mixer project, and recordings in this corpus were used in NIST Speaker Recognition Evaluation (SRE) and NIST Language Recognition Evaluation (LRE) corpora, including 2006 SRE and 2007 LRE. |
LDC2023S04 | Mixer 7 Spanish Speech | Mixer 7 Spanish Speech was developed by the Linguistic Data Consortium (LDC) and contains 9,600 hours of audio recordings of interviews, transcript readings and conversational telephone speech involving 191 distinct native Spanish speakers. This material was collected by LDC in 2011 and 2012 as part of the Mixer project. The recordings in this corpus were used in the 2012 NIST Speaker Recognition Evaluation test set. |
LDC2021S05 | MyST Children's Conversational Speech | MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It is comprised of approximately 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary. |
LDC2021S08 | RATS Speaker Identification | RATS Speaker Identification was developed by LDC and is comprised of approximately 1,900 hours of Levantine Arabic, Farsi, Dari, Pashto and Urdu conversational telephone speech with annotations of speech segments. The audio was retransmitted over eight channels, making 17,000 hours of total audio. The corpus was created to provide training and development sets for the Speaker Identification (SID) task in the DARPA RATS (Robust Automatic Transcription of Speech) program. |
LDC2022S05 | Samrómur Icelandic Speech 1.0 | Samrómur Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University in cooperation with Almannarómur, Center for Language Technology. The corpus contains 145 hours of Icelandic prompted speech from 8,392 speakers representing 100,000 utterances. |
LDC2022S03 | Spoken Digits in Hindi and Indian English | Spoken Digits in Hindi and Indian English was developed by the Birla Institute of Technology and Science Pilani. It contains approximately two hours of speech comprised of spoken digits from one to ten in Hindi and English with regional accents from across India. |
LDC2021S04 | The SSNCE Database of Tamil Dysarthric Speech | The SSNCE Database of Tamil Dysarthric Speech was developed by the Speech Lab, SSN College of Engineering, India, in collaboration with the Indian National Institute of Empowerment of Persons with Multiple Disabilities (NIEPMD) and contains approximately eight hours of Tamil speech data, time-aligned transcripts and metadata collected from 30 speakers (20 dysarthric speakers and 10 non-dysarthric speakers). |
LDC2021S09 | UCLA Speaker Variability Database | UCLA Speaker Variability Database was developed by UCLA Speech Processing and Auditory Perception Laboratory and is comprised of approximately 34 hours of English speech and orthographic transcripts. |
LDC2021S07 | Wikipedia Spanish Speech and Transcripts | Wikipedia Spanish Speech and Transcripts consists of approximately 25 hours of Spanish read speech and transcripts. The read text was taken from the Spanish version of WikiProject Spoken Wikipedia, referred to as Wikipedia Grabada. The transcripts were developed for this release. |