Santa Barbara Corpus of Spoken American English Part IV
Item Name: | Santa Barbara Corpus of Spoken American English Part IV |
Author(s): | John W. Du Bois, Robert Englebretson |
LDC Catalog No.: | LDC2005S25 |
ISBN: | 158563-348-8 |
ISLRN: | 659-853-066-274-9 |
DOI: | https://doi.org/10.35111/c9nh-1v54 |
Release Date: | September 20, 2005 |
Member Year(s): | 2005 |
DCMI Type(s): | Sound, Text |
Sample Type: | 2-channel pcm |
Sample Rate: | 22050 |
Data Source(s): | microphone speech |
Project(s): | EARS, GALE, Talkbank |
Application(s): | discourse analysis, prosody |
Language(s): | English |
Language ID(s): | eng |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2005S25 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Du Bois, John W., and Robert Englebretson. Santa Barbara Corpus of Spoken American English Part IV LDC2005S25. Web Download. Philadelphia: Linguistic Data Consortium, 2005. |
Introduction
Santa Barbara Corpus of Spoken American English Part IV was produced by the Linguistic Data Consortium (LDC) and contains approximately 5.5 hours of conversational and prepared English speech with associated transcripts. The corpus was collected by the Center for the Study of Discourse at the University of California, Santa Barbara (UCSB), directed by John W. Du Bois (UCSB). The authors are John W. Du Bois and Robert Englebretson. The associate editors are Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB).
The corpus is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.
For software and additional data resources, please refer to the following sites: TalkBank, International Corpus of English.
The first three parts of this collection are available here:
- Santa Barbara Corpus of Spoken American English Part I (LDC2000S85).
- Santa Barbara Corpus of Spoken American English Part II (LDC2003S06).
- Santa Barbara Corpus of Spoken American English Part III (LDC2003S10).
Data
The gender breakdown for speakers in this corpus is 33 male and 25 female. The following speaker metadata is also included: age, dialect of English, dialect state, current state, highest level of education, years of education, occupation, and ethnicity.
The audio data consists of 14 speech files in WAV format, recorded as two-channel PCM at 22050 Hz. The transcribed text contains over 58,000 words, including over 6,000 unique words. The corpus also includes transcript files in TXT format, as well as files specifying the spans in each audio file that have been filtered to remove personal identifying information.
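The audio format described above can be checked against a local copy of a file. The following is a minimal sketch using only Python's standard `wave` module. The function names `describe_wav` and `matches_catalog_format` are illustrative and not part of the corpus distribution, and the path is supplied by the user.

```python
import sys
import wave

# Values taken from the catalog entry: two-channel PCM at 22050 Hz.
EXPECTED_CHANNELS = 2
EXPECTED_RATE_HZ = 22050


def describe_wav(path):
    """Read the header of a PCM WAV file and return basic facts about it."""
    with wave.open(path, "rb") as wav:
        channels = wav.getnchannels()
        rate = wav.getframerate()
        frames = wav.getnframes()
        sample_width = wav.getsampwidth()
    return {
        "channels": channels,
        "sample_rate_hz": rate,
        "sample_width_bytes": sample_width,
        "duration_s": frames / rate,
    }


def matches_catalog_format(info):
    """True if the header agrees with the channel count and rate listed above."""
    return (
        info["channels"] == EXPECTED_CHANNELS
        and info["sample_rate_hz"] == EXPECTED_RATE_HZ
    )


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage (hypothetical path): python check_wav.py path/to/file.wav
    info = describe_wav(sys.argv[1])
    print(info, "OK" if matches_catalog_format(info) else "unexpected format")
```

Summing `duration_s` over all 14 files gives a quick comparison with the approximate total duration stated in the introduction.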
Samples
For an example of the data in this corpus, please examine this audio sample (WAV) and its transcript (TXT).
Sponsorship
The completion and release of this corpus were facilitated by funding from the TalkBank Project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.
Updates
None at this time.