The Audiovisual Database of Spoken American English, Linguistic Data Consortium (LDC) catalog number LDC2009V01 and ISBN 1-58563-496-4, was developed at Butler University, Indianapolis, IN, in 2007 for use by a variety of researchers to evaluate speech production and speech recognition. It contains approximately seven hours of audiovisual recordings of fourteen American English speakers producing syllables, word lists and sentences used in both academic and clinical settings.
All talkers were from the North Midland dialect region -- roughly defined as Indianapolis and north within the state of Indiana -- and had lived in that region for the majority of the time from birth to 18 years of age. Each participant read 238 different words and 166 different sentences. The words and sentences were drawn from the following sources:
- Central Institute for the Deaf (CID) Everyday Sentences (Lists A-J)
- Northwestern University Auditory Test No. 6 (Lists I-IV)
- Vowels in /hVd/ context (separate words)
- Texas Instruments/Massachusetts Institute of Technology (TIMIT) sentences
The CID Everyday Sentences were created in the 1950s from a sample developed by the Armed Forces National Research Committee on Hearing and Bio-Acoustics. They are considered to represent everyday American speech and have the following characteristics: the vocabulary is appropriate to adults; the words appear with high frequency in one or more of the well-known word counts of the English language; proper names and proper nouns are not used; common non-slang idioms and contractions are used freely; phonetic loading and "tongue-twisting" are avoided; redundancy is high; the level of abstraction is low; and grammatical structure varies freely.
Northwestern University Auditory Test No. 6 is a phonemically balanced set of monosyllabic English words used clinically to test speech perception in adults with hearing loss.
The /hVd/ vowel list was created to elicit all of the vowel sounds of American English.
The TIMIT sentences are a subset (34 sentences) of the 2342 phonetically rich sentences read by speakers in the TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1). TIMIT was designed to provide speech data for the acquisition of acoustic-phonetic knowledge and for the development and evaluation of automatic speech recognition systems. TIMIT speakers were from eight dialect regions of the United States.
The Audiovisual Database of Spoken American English will be of interest in various disciplines: to linguists for studies of phonetics, phonology, and prosody of American English; to speech scientists for investigations of motor speech production and auditory-visual speech perception; to engineers and computer scientists for investigations of machine audio-visual speech recognition (AVSR); and to speech and hearing scientists for clinical purposes, such as the examination and improvement of speech perception by listeners with hearing loss.
Participants were recorded individually during a single session. A participant first completed a statement of informed consent and a questionnaire to gather biographical data and then was asked by the experimenter to mark his or her Indiana hometown on a state map. The experimenter and participant then moved to a small, sound-treated studio where the participant was seated in front of three navy blue baffles. A laptop computer was elevated to eye-level on a speaker stand and placed approximately 50-60 cm in front of the participant. Prompts were presented to the participant in a Microsoft PowerPoint presentation. The experimenter was seated directly next to the participant, but outside the camera angle, and advanced the PowerPoint slides at a comfortable pace.
Participants were recorded with a Panasonic DVC-80 digital video camera to miniDV digital video cassette tapes. All participants wore a Sennheiser MKE-2060 directional/cardioid lapel microphone throughout the recordings.
Each speaker produced a total of 94 segmented files, which were converted from Final Cut Express to QuickTime (.mov) files and then saved in the appropriately marked folder. If a speaker mispronounced a sentence or word during the recording process, the mispronunciations were edited out of the segments to be archived. The remaining parts of the recording, including the correct repetition of each prompt, were then sequenced together to create a continuous and complete segment.
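Because each speaker is stated to have 94 segmented files, a downloaded copy of the corpus can be checked for completeness by counting the .mov files per speaker. The following Python sketch is illustrative only. It assumes, hypothetically, that each speaker's files sit under their own subfolder of a single corpus root directory; the actual folder names and layout of the release are not described here, and the function names are invented for this example.

```python
from pathlib import Path

# Number of segmented files per speaker, as stated in the corpus description.
EXPECTED_FILES_PER_SPEAKER = 94


def count_mov_files(corpus_root):
    """Map each immediate subfolder of corpus_root to its number of .mov files.

    Assumption: one subfolder per speaker. The search is recursive and the
    extension match is case-insensitive, so nested folders and ".MOV" count.
    """
    counts = {}
    for folder in sorted(Path(corpus_root).iterdir()):
        if folder.is_dir():
            counts[folder.name] = sum(
                1
                for path in folder.rglob("*")
                if path.is_file() and path.suffix.lower() == ".mov"
            )
    return counts


def incomplete_speakers(counts, expected=EXPECTED_FILES_PER_SPEAKER):
    """Return only the folders whose file count differs from the expected number."""
    return {name: n for name, n in counts.items() if n != expected}
```

Calling `incomplete_speakers(count_mov_files(root))` on a local copy, where `root` is the hypothetical corpus directory, would return an empty dictionary if every speaker folder holds exactly 94 .mov files.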
The fourteen participants were between 19 and 61 years of age (with a mean age of 30 years) and were native speakers of American English.
For an example of the data in this corpus, please view this video sample (QuickTime, .mov).
Portions © 2007 Butler University, © 1993, 2009 Trustees of the University of Pennsylvania