The Linguistic Data Consortium (LDC) at the University of Pennsylvania distributes a wide and growing assortment of resources for researchers, engineers and educators whose work is concerned with human languages. Historically, most linguistic resources were not generally available to interested researchers but were restricted to single laboratories or to a limited number of users. Inspired by the success of a few readily available and well-known data sets, such as the Brown University text corpus, LDC was founded in 1992 to provide a new mechanism for large-scale corpus development and resource sharing. As of 2008, LDC is a growing consortium of more than 100 companies, universities, and government members, and it has distributed over 50,000 corpora to a global audience. With the support of its members, LDC is able to provide critical services to the language research community. These services include maintaining the data archives, producing and distributing data on physical media (DVD-ROM or CD-ROM) or as web downloads, negotiating intellectual property agreements with potential information providers and would-be members, and maintaining relations with like-minded groups around the world.
Resources available from LDC (http://www.ldc.upenn.edu) include speech, text and video data and lexicons in multiple languages, as well as software tools to facilitate the use of corpus materials.
The LDC Spoken Language Sampler provides a selection of speech, transcript and lexicon samples and is designed to illustrate the variety and breadth of the resources available from the LDC Publication Catalog.
The samples differ from the published corpora in a few respects:
- most excerpts are truncated to be much shorter than the original files, typically one minute and thirty seconds of speech
- signal amplitude has been adjusted where necessary to normalize playback volume
- some corpora are published in compressed form, but all samples here are uncompressed
- LDC typically uses the NIST SPHERE file format for audio data, but the audio files in this sampler have been converted to Microsoft WAV (RIFF) format for compatibility with typical browser audio utilities
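Because the sampler audio is distributed as WAV (RIFF) rather than NIST SPHERE, it can be inspected with general-purpose tools. The following is a minimal sketch using Python's standard `wave` module. It assumes the files hold linear PCM audio, which is the encoding that module reads; the function name `describe_wav` and the file name in the usage note are illustrative and do not come from the sampler.

```python
import wave


def describe_wav(path):
    """Return basic header fields for a PCM WAV (RIFF) file.

    Assumption: the file holds linear PCM audio; the standard-library
    wave module raises wave.Error for encodings it cannot read.
    """
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            # Duration is the frame count divided by frames per second.
            "duration_s": frames / rate if rate else 0.0,
        }
```

Calling `describe_wav("sample.wav")` on one of the extracted files (hypothetical name) would report its channel count, sample width, sample rate and duration, which is a quick way to confirm that an excerpt has the expected length.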
The sampler includes samples from the following corpora and lexicons. Audio samples range from 30 seconds to 90 seconds and are accompanied by transcripts.
|An English Dictionary of the Tamil Verb ||This dictionary contains translations for over 6000 English verbs and defines over 9000 Tamil verbs. Entries include the English word, the Tamil equivalent in transliteration and in Tamil script, and audio examples of Spoken Tamil pronunciation. |
|CALLFRIEND Farsi ||A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Farsi. |
|CALLFRIEND Tamil ||A corpus of 60 unscripted telephone calls between friends and acquaintances speaking in their native language, Tamil. |
|CALLHOME Japanese ||A corpus of 120 unscripted telephone conversations between native Japanese speakers and a corpus of associated transcripts. |
|CALLHOME Spanish ||A corpus of 120 unscripted telephone conversations between native Spanish speakers and a corpus of associated transcripts. |
|CSLU Kids Speech ||Developed at Oregon State University's Center for Spoken Language Understanding, this corpus is a collection of spontaneous and prompted speech from 1100 children from kindergarten through Grade 10. |
|Fisher Levantine Arabic ||A collection of 279 Levantine Arabic telephone conversations and transcripts from speakers of several nationalities. |
|Grassfields Bantu Fieldwork: Dschang Tone Paradigms ||Tone paradigms from Yémba (Bamileke Dschang), a Bamileke (Grassfields Bantu) language spoken by 300,000+ people in Southwestern Cameroon. |
|Gulf Arabic Conversational Telephone Speech ||Contains 975 telephone conversations from speakers across the Persian Gulf region and their transcriptions. |
|Korean Telephone Speech ||Collection of 100 telephone conversations between native Korean speakers and their transcriptions. |
|Mawukakan Lexicon ||The first publication of an ongoing project aiming to build an electronic dictionary of four Mandekan languages (Eastern Manding languages of the Mande group of the Niger-Congo family). |
|Nationwide Speech Project ||A database of speech representing current regional accents and dialects of the United States. |
|NIST Pilot Meeting Speech ||Collects speech and transcriptions from topical discussions in meeting settings including complete descriptive metadata and detailed descriptions of the physical environment in which the discussions took place. |
|West Point Russian Speech ||Utterances of sentences in Russian from 1,891 native and non-native speakers. |
How to Obtain
The LDC Spoken Language Sampler may be downloaded freely. The sampler is a gzip-compressed (GNU zip) tar archive, which most compression utilities will readily extract.
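For readers who prefer to script the extraction, a gzip-compressed tar archive can also be unpacked with Python's standard `tarfile` module. This is a sketch only: the function name `extract_sampler` and the archive and directory names in the usage note are hypothetical, since the actual download file name is not given here.

```python
import tarfile


def extract_sampler(archive_path, dest_dir):
    """Extract a gzip-compressed tar archive and return its member names."""
    with tarfile.open(archive_path, "r:gz") as tar:
        names = tar.getnames()
        # Use the safer "data" extraction filter where this Python
        # version provides it; older versions lack the parameter.
        kwargs = {"filter": "data"} if hasattr(tarfile, "data_filter") else {}
        tar.extractall(dest_dir, **kwargs)
    return names
```

For example, `extract_sampler("sampler.tar.gz", "ldc_sampler")` would unpack the archive into the `ldc_sampler` directory and return the list of paths it contained.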
Portions © 2008 Trustees of the University of Pennsylvania