Santa Barbara Corpus of Spoken American English Part IV


Item Name: Santa Barbara Corpus of Spoken American English Part IV
Authors: John W. Du Bois and Robert Englebretson
LDC Catalog No.: LDC2005S25
ISBN: 158563-348-8
Release Date: Sep 20, 2005
Data Type: speech
Sample Rate: 22050 Hz
Sampling Format: 2-channel pcm
Data Source(s): microphone speech
Project(s): EARS, GALE, Talkbank
Application(s): discourse analysis, prosody
Language(s): English
Language ID(s): eng
Distribution: 1 DVD
Member fee: $0 for 2005 members
Non-member Fee: US $200.00
Reduced-License Fee: US $200.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: John W. Du Bois and Robert Englebretson. 2005. Santa Barbara Corpus of Spoken American English Part IV. Linguistic Data Consortium, Philadelphia.

Introduction

Santa Barbara Corpus of Spoken American English Part IV was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2005S25 and its ISBN is 158563-348-8.

Santa Barbara Corpus of Spoken American English Part IV is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.

The corpus was collected by the University of California, Santa Barbara Center for the Study of Discourse. Director: John W. Du Bois (UCSB). Authors: John W. Du Bois and Robert Englebretson. Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB).

For software and additional data resources, please refer to the following sites: TalkBank, International Corpus of English.

Part I of the Santa Barbara Corpus of Spoken American English is available as LDC2000S85.

Part II of the Santa Barbara Corpus of Spoken American English is available as LDC2003S06.

Part III of the Santa Barbara Corpus of Spoken American English is available as LDC2003S10.

For the latest information on this corpus, please refer to the following sites devoted to it:

http://www.linguistics.ucsb.edu/research/sbcorpus.html
http://www.ldc.upenn.edu/Projects/SBCSAE

Data

The audio data consists of 14 WAV-format speech files, recorded as two-channel PCM at 22050 Hz. The speech files total 5.75 hours of audio (1.5 GB), and the transcribed text contains over 58,000 words, of which over 6,000 are unique.
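A received copy can be checked against the format stated above by reading each file's WAV header. The following is a minimal sketch using Python's standard wave module. The helper names are hypothetical and are not part of the corpus distribution. The bit depth is not stated in this description, so the sketch reports it rather than checking it.

```python
import wave

# Values stated in the corpus description. The bit depth is not stated,
# so it is reported rather than checked.
EXPECTED_CHANNELS = 2
EXPECTED_RATE_HZ = 22050


def describe_wav(path):
    """Return basic header facts for one WAV file (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "rate_hz": rate,
            "sample_width_bytes": wav.getsampwidth(),
            "duration_s": frames / rate,
        }


def matches_stated_format(info):
    """True if the header agrees with the two-channel, 22050 Hz description."""
    return (info["channels"] == EXPECTED_CHANNELS
            and info["rate_hz"] == EXPECTED_RATE_HZ)


def total_hours(paths):
    """Sum the durations of the given WAV files, in hours."""
    return sum(describe_wav(p)["duration_s"] for p in paths) / 3600.0
```

Calling total_hours on the paths of all 14 speech files should give a figure close to the 5.75 hours quoted above. The actual file names depend on the distribution and are not assumed here.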

Samples

For an example of this corpus, please examine this audio sample and its transcript.

Note

The cost of the first 100 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009. These copies are therefore free of charge to qualified researchers, although a $30 shipping and handling fee applies. After the first 100 copies are distributed, additional copies will be available for the production cost of $200 per DVD-ROM.

Acknowledgements

The completion and release of this corpus were facilitated by funding from the TalkBank Project. TalkBank is an interdisciplinary research project funded by a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.

Content Copyright

Portions © 2003 University of California, © 2003 Trustees of the University of Pennsylvania