LDC Spoken Language Sampler - 2nd Release


Item Name: LDC Spoken Language Sampler - 2nd Release
Authors: Linguistic Data Consortium
LDC Catalog No.: LDC2013S06
ISBN: 1-58563-653-3
Release Date: Aug 23, 2013
Data Type: speech
Distribution: 1 CD, Web Download
Member fee: $0 for 2013 members
Non-member Fee: US $0.00
Reduced-License Fee: N/A
Extra-Copy Fee: US $
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Linguistic Data Consortium. 2013. LDC Spoken Language Sampler - 2nd Release. Linguistic Data Consortium, Philadelphia.

Introduction

LDC (Linguistic Data Consortium) Spoken Language Sampler - Second Release, LDC catalog number LDC2013S06 and ISBN 1-58563-653-3, contains samples from 20 different corpora published by LDC between 1996 and 2013.

The Linguistic Data Consortium at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of selected readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and sharing of resources. With the support of its members, LDC is able to provide critical services to the language research community. These services include maintaining the data archives, producing and distributing data via media or web downloads, negotiating intellectual property agreements with potential information providers, and maintaining relations with other like-minded groups around the world.

Resources available from LDC (http://www.ldc.upenn.edu) include speech, text, video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials. For a complete view of LDC's publications, a searchable catalog is available at http://www.ldc.upenn.edu/Catalog/.

This sampler is available as a free download from LDC.

Data

The LDC Spoken Language Sampler - Second Release provides speech and transcript samples and is designed to illustrate the variety and breadth of the resources available from the LDC Catalog. The sound files included in this release are excerpts that have been modified in various ways relative to the original data as published by LDC:

  • Most excerpts are truncated to be much shorter than the original files, typically between one and a half and two minutes.
  • Signal amplitude has been adjusted where necessary to normalize playback volume.
  • Some corpora are published in compressed form, but all samples here are uncompressed.
  • Some text files are presented as images to ensure foreign character sets display properly.
  • In some publications, LDC has used the NIST SPHERE file format for audio data, but the audio files in this sampler are in MS-WAV/audio (RIFF) file format for compatibility with typical browser audio utilities. FLAC files have been expanded into WAV form as well.
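
As a rough illustration of the kind of processing listed above (truncating an uncompressed WAV file and adjusting its amplitude), the following Python sketch uses only the standard library. It is not LDC's actual procedure: the function name make_excerpt, the 120-second limit and the 0.9 peak target are assumptions chosen for illustration, and the sketch handles 16-bit PCM audio only.

```python
import array
import sys
import wave


def make_excerpt(src_path, dst_path, max_seconds=120.0, target_peak=0.9):
    """Truncate a 16-bit PCM WAV file and peak-normalize it (illustrative only).

    Returns the duration of the written excerpt in seconds.
    """
    with wave.open(src_path, "rb") as src:
        params = src.getparams()
        if params.sampwidth != 2:
            raise ValueError("this sketch handles 16-bit PCM only")
        # Keep at most max_seconds of audio, counted in frames.
        n_frames = min(params.nframes, int(max_seconds * params.framerate))
        raw = src.readframes(n_frames)

    samples = array.array("h")
    samples.frombytes(raw)
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    # Scale so the loudest sample reaches target_peak of full scale.
    peak = max((abs(s) for s in samples), default=0)
    if peak > 0:
        gain = target_peak * 32767 / peak
        samples = array.array(
            "h", (max(-32768, min(32767, round(s * gain))) for s in samples)
        )

    if sys.byteorder == "big":
        samples.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(params.nchannels)
        dst.setsampwidth(params.sampwidth)
        dst.setframerate(params.framerate)
        dst.writeframes(samples.tobytes())
    return n_frames / params.framerate
```

For example, make_excerpt("full.wav", "sample.wav") would write at most the first two minutes of a hypothetical file full.wav to sample.wav. Peak normalization is only one simple way to even out playback volume; the sampler's own amplitude adjustment may have been done differently.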

The link for the catalog number takes you to the catalog entry.

Catalog No. | Title | Description
LDC2013S05 | Greybeard | Greybeard comprises approximately 590 hours of English telephone conversation speech collected in October and November 2008 by LDC. The goal was to record new telephone conversations among subjects who had participated in one or more previous LDC telephone collections, from Switchboard-1 (1991) through the Mixer studies (2006).
LDC2013S04 | GALE Phase 2 Chinese Broadcast Conversation Speech | GALE Phase 2 Chinese Broadcast Conversation Speech comprises approximately 120 hours of Chinese speech from current events programming featuring interviews, call-in programs and roundtable discussions.
LDC2012S06 | Turkish Broadcast News Speech and Transcripts | Turkish Broadcast News Speech and Transcripts contains approximately 130 hours of Voice of America Turkish radio broadcasts and corresponding transcripts.
LDC2012S05 | USC-SFI MALACH Interviews and Transcripts English | USC-SFI MALACH Interviews and Transcripts English contains approximately 375 hours of interviews from 784 survivors of the Holocaust along with transcripts and other documentation.
LDC2012S04 | Malto Speech and Transcripts | Malto Speech and Transcripts contains approximately 8 hours of Malto speech data collected between 2005 and 2009 from 27 speakers (22 males, 5 females). Also included are accompanying transcripts, English translations and glosses for 6 hours of the collection. Malto is principally spoken in northeastern India and Bangladesh.
LDC2012S03 | Digital Archive of Southern Speech | Digital Archive of Southern Speech contains approximately 370 hours of American English speech data from 30 female speakers and 34 male speakers, along with associated metadata about the speakers and the recordings and maps in .jpeg format relating to the recording locations in the southern United States.
LDC2012S02 | TORGO Database of Dysarthric Articulation | TORGO contains approximately 23 hours of English speech data, accompanying transcripts and documentation from 8 speakers (5 males, 3 females) with cerebral palsy or amyotrophic lateral sclerosis and from 7 speakers (4 males, 3 females) from a non-dysarthric control group.
LDC2011S08 | 2008 NIST Speaker Recognition Evaluation Test Set | 2008 NIST Speaker Recognition Evaluation Test Set contains 942 hours of multilingual telephone speech and English interview speech along with transcripts and other materials used as test data in the 2008 NIST Speaker Recognition Evaluation.
LDC2010S05 | Asian Elephant Vocalizations | Asian Elephant Vocalizations consists of 57.5 hours of audio recordings of vocalizations by Asian elephants (Elephas maximus) in the Uda Walawe National Park, Sri Lanka, of which 31.25 hours have been annotated.
LDC2010S01 | Fisher Spanish Speech | Fisher Spanish Speech consists of audio files covering roughly 163 hours of telephone speech from 136 native Caribbean Spanish and non-Caribbean Spanish speakers.
LDC2007S18 | CSLU Kids Speech | Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1,100 children from Kindergarten through Grade 10.
LDC2007S15 | Nationwide Speech Project | A database of speech representing regional accents and dialects of the United States.
LDC2007S02 | Fisher Levantine Arabic | A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities.
LDC2006S43 | Gulf Arabic Conversational Telephone Speech | Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions.
LDC2004S09 | NIST Meeting Pilot Corpus Speech | Collects speech and transcriptions from topical discussions in meeting settings, including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place.
LDC2003S05 | West Point Russian Speech | Utterances of sentences in Russian from 1,891 native and non-native speakers.
LDC2003S03 | Korean Telephone Speech | Collection of 100 telephone conversations between native Korean speakers and their transcriptions.
LDC2003S02 | Grassfields Bantu Fieldwork: Dschang Tone Paradigms | Tone paradigms from Yemba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon.
LDC96S50 | CALLFRIEND Farsi | A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi.
LDC96S37 | CALLHOME Japanese | A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts.

Content Copyright

Portions © 2013 Trustees of the University of Pennsylvania