LDC Spoken Language Sampler - Third Release
|Linguistic Data Consortium
|LDC Catalog No.: LDC2015S09
|Subscription & Standard Members, and Non-Members
|Linguistic Data Consortium. LDC Spoken Language Sampler - Third Release LDC2015S09. Web Download. Philadelphia: Linguistic Data Consortium, 2015.
LDC (Linguistic Data Consortium) Spoken Language Sampler - Third Release contains samples from 20 different corpora published by LDC between 1996 and 2015.
LDC distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily-available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. With the support of its members, LDC provides critical services to the language research community that include: maintaining the LDC data archives, producing and distributing data via media or web download, negotiating intellectual property agreements with potential information providers and maintaining relations with other like-minded groups around the world.
Resources available from LDC include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, browse the Catalog.
The sampler is available as a free download.
The LDC Spoken Language Sampler - Third Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the speech-related resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:
- Most excerpts are truncated to be much shorter than the original files, typically between 1.5 and 2 minutes.
- Signal amplitude has been adjusted where necessary to normalize playback volume.
- Some corpora are published in compressed form, but all samples here are uncompressed.
- Some text files are presented as images to ensure foreign character sets display properly.
- Some publications use the NIST SPHERE file format for audio data, but the audio files in this sampler are in MS-WAV (RIFF) format for compatibility with typical browser audio utilities. FLAC-compressed files have likewise been expanded to WAV.
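The amplitude normalization mentioned above can be sketched in a few lines using Python's standard-library wave module. This is a minimal illustration, not LDC's actual processing: the file names and the 0.9 target peak are placeholder assumptions, and the sketch assumes 16-bit PCM, the usual sample layout for WAV speech data of this kind.

```python
import array
import wave

def peak_normalize(src="sample.wav", dst="normalized.wav", target=0.9):
    """Scale a 16-bit PCM WAV file so its loudest sample sits at
    `target` times full scale (placeholder names; illustrative only)."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        assert params.sampwidth == 2, "sketch assumes 16-bit PCM"
        samples = array.array("h", reader.readframes(params.nframes))

    # Find the peak magnitude; guard against silence to avoid dividing by zero.
    peak = max(max((abs(s) for s in samples), default=0), 1)
    gain = target * 32767 / peak

    # Apply the gain, clamping to the signed 16-bit range.
    scaled = array.array(
        "h",
        (int(max(-32768, min(32767, s * gain))) for s in samples),
    )

    with wave.open(dst, "wb") as writer:
        writer.setparams(params)
        writer.writeframes(scaled.tobytes())
```

The same approach extends to the other adjustments described in the list: truncation is a slice of the sample array, and format conversion (SPHERE or FLAC to WAV) is typically handled by external tools such as sox or ffmpeg rather than code like this.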
|2009 NIST Language Recognition Evaluation Test Set
|The 2009 evaluation contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu and Vietnamese.
|CALLFRIEND Farsi Second Edition Speech
|CALLFRIEND Farsi Second Edition Speech was developed by LDC and consists of approximately 42 hours of telephone conversation (100 recordings) among native Farsi speakers. The CALLFRIEND project supported the development of language identification technology. Each CALLFRIEND corpus consists of unscripted telephone conversations lasting between 5 and 30 minutes.
|A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.
|CSC Deceptive Speech
|CSC Deceptive Speech was developed by Columbia University, SRI International and University of Colorado Boulder. It consists of 32 hours of audio interviews from 32 native speakers of Standard American English (16 male, 16 female) recruited from the Columbia University student population and the community. The purpose of the study was to distinguish deceptive speech from non-deceptive speech by applying machine learning techniques to features extracted from the corpus.
|CSLU Kids' Speech
|Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from Kindergarten through Grade 10.
|Fisher Spanish Speech
|Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.
|King Saud University Arabic Speech Database
|King Saud University Arabic Speech Database contains 590 hours of recorded Arabic speech from 269 male and female Saudi and non-Saudi speakers. The utterances include read and spontaneous speech recorded in quiet and noisy environments. The recordings were collected with a variety of microphones and a mobile phone and average between 16 and 19 minutes in length.
|Korean Telephone Conversations Complete
|The Korean telephone conversations were originally recorded as part of the CALLFRIEND project. Korean Telephone Conversations Speech consists of 100 telephone conversations, 49 of which were published in 1996 as CALLFRIEND Korean, while the remaining 51 are previously unreleased calls. Korean Telephone Conversations Transcripts (LDC2003T08) consists of 100 text files totaling approximately 190K words, of which roughly 25K are unique. All files use Korean orthography: Hangul characters encoded in the KSC-5601 (Wansung) system. The complete data set also includes a lexicon (LDC2003L02).
|Malto Speech and Transcripts
|Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh.
|Mandarin Chinese Phonetic Segmentation and Tone
|Mandarin Chinese Phonetic Segmentation and Tone was developed by LDC and contains 7,849 Mandarin Chinese "utterances" and their phonetic segmentation and tone labels, separated into training and test sets. The utterances were derived from 1997 Mandarin Broadcast News Speech and Transcripts (HUB4-NE) (LDC98S73 and LDC98T24, respectively). That collection consists of approximately 30 hours of Chinese broadcast news recordings from Voice of America, China Central TV and KAZN-AM, a commercial radio station based in Los Angeles, CA. This corpus was developed to investigate the use of phone boundary models for forced alignment in Mandarin Chinese.
|Mandarin-English Code-Switching in South-East Asia
|Mandarin-English Code-Switching in South-East Asia was developed by Nanyang Technological University and Universiti Sains Malaysia and includes approximately 192 hours of Mandarin-English code-switching speech from 156 speakers with associated transcripts.
|Mixer 6 Speech
|Mixer 6 Speech was developed by LDC and comprises 15,863 hours of telephone speech, interviews and transcript readings from 594 distinct native English speakers. This material was collected by LDC in 2009 and 2010 as part of the Mixer project, specifically phase 6, which focused on native speakers of American English local to the Philadelphia area.
|Multi-Channel WSJ Audio
|Multi-Channel WSJ Audio was developed by the Centre for Speech Technology Research at The University of Edinburgh and contains approximately 100 hours of recorded speech from 45 British English speakers. Participants read Wall Street Journal texts published in 1987-1989 in three recording scenarios: a single stationary speaker, two stationary overlapping speakers and one single moving speaker.
|NIST Meeting Pilot Corpus Speech
|This data set contains speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
|RATS Speech Activity Detection
|RATS Speech Activity Detection was developed by LDC and comprises approximately 3,000 hours of Levantine Arabic, English, Farsi, Pashto, and Urdu conversational telephone speech with automatic and manual annotation of speech segments. The corpus was created to provide training, development and initial test sets for the Speech Activity Detection (SAD) task in the DARPA RATS (Robust Automatic Transcription of Speech) program.
|The Subglottal Resonances Database
|The Subglottal Resonances Database was developed by Washington University and University of California Los Angeles and consists of 45 hours of simultaneous microphone and subglottal accelerometer recordings of 25 adult male and 25 adult female speakers of American English between 22 and 25 years of age.
|TORGO Database of Dysarthric Articulation
|TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
|Turkish Broadcast News Speech and Transcripts
|Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts.
|United Nations Proceedings Speech
|United Nations Proceedings Speech was developed by the United Nations (UN) and contains approximately 8,500 hours of recorded proceedings in the six official UN languages: Arabic, Chinese, English, French, Russian and Spanish. The data was recorded in 2009-2012 from sessions 64-66 of the General Assembly and First Committee (Disarmament and International Security), and meetings 6434-6763 of the Security Council.
|USC-SFI MALACH Interviews and Transcripts Czech
|USC-SFI MALACH Interviews and Transcripts Czech was developed by The University of Southern California Shoah Foundation Institute (USC-SFI) and the University of West Bohemia as part of the MALACH (Multilingual Access to Large Spoken ArCHives) Project. It contains approximately 229 hours of interviews from 420 interviewees along with transcripts and other documentation.