Santa Barbara Corpus of Spoken American English Part III


Item Name: Santa Barbara Corpus of Spoken American English Part III
Authors: John W. Du Bois and Robert Englebretson
LDC Catalog No.: LDC2004S10
ISBN: 1-58563-308-9
Release Date: Sep 23, 2004
Data Type: speech
Sample Rate: 22050 Hz
Sampling Format: pcm
Data Source(s): microphone speech
Project(s): EARS, GALE, Talkbank
Application(s): discourse analysis, prosody
Language(s): English
Language ID(s): eng
Distribution: 1 DVD
Member fee: $0 for 2004 members
Non-member Fee: US $200.00
Reduced-License Fee: US $200.00
Extra-Copy Fee: US $200.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: John W. Du Bois and Robert Englebretson. 2004. Santa Barbara Corpus of Spoken American English Part III. Linguistic Data Consortium, Philadelphia.

Santa Barbara Corpus of Spoken American English Part III was produced by the Linguistic Data Consortium (LDC), catalog number LDC2004S10, ISBN 1-58563-308-9.

Santa Barbara Corpus of Spoken American English Part III is based on hundreds of recordings of natural speech from all over the United States, representing a wide variety of people of different regional origins, ages, occupations, and ethnic and social backgrounds. It reflects many ways that people use language in their lives: conversation, gossip, arguments, on-the-job talk, card games, city council meetings, sales pitches, classroom lectures, political speeches, bedtime stories, sermons, weddings, and more.

The corpus was collected by the University of California, Santa Barbara Center for the Study of Discourse (Director: John W. Du Bois, UCSB). Authors: John W. Du Bois and Robert Englebretson. Associate Editors: Wallace L. Chafe (UCSB), Charles Meyer (UMass, Boston), and Sandra A. Thompson (UCSB).

Santa Barbara Corpus of Spoken American English Part III is also part of the International Corpus of English (ICE) (Charles W. Meyer, Director), representing the American Component.

For software and additional data resources, please refer to the following sites: Talkbank, International Corpus of English.

Part I of the Santa Barbara Corpus of Spoken American English is available as LDC2000S85.

Part II of the Santa Barbara Corpus of Spoken American English is available as LDC2003S06.

Data

The audio data consists of 16 WAVE-format speech files, recorded in two-channel PCM at 22050 Hz. The speech files total ~6 hours of audio (1.8 GB), representing over 116K words and over 9K unique words in transcription.

The accompanying documentation files are:

segment.txt             explanation of the information in segment.tbl
segment.tbl             collection information about the recordings
segment_summaries.txt   brief summaries of audio scenarios
speaker.txt             explanation of the information in speaker.tbl
speaker.tbl             speaker ethnographic, demographic information
table.txt               description of file names and informal titles
annotations.txt         list of conventions and prosodic annotations

The transcripts are in the following format:

.trn format structure

2.660 2.805 JOANNE: But,
2.805 4.685         so these slides be real interesting.
6.140 6.325 KEN:    ... Yeah.
6.325 7.710         I think it'll be real interesting

A sample transcript file may be found here.
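As a rough illustration of how the .trn lines shown above might be read programmatically, the following Python sketch parses each line into a start time, an end time, a speaker, and the text. It rests on assumptions not stated in this documentation: fields are whitespace-separated, speaker labels are upper-case and end in a colon, and a line without a label continues the previous speaker's turn. The names `Segment` and `parse_trn` are hypothetical.

```python
import re
from typing import List, NamedTuple, Optional


class Segment(NamedTuple):
    start: float            # segment start time, in seconds
    end: float              # segment end time, in seconds
    speaker: Optional[str]  # speaker label, carried over on continuation lines
    text: str               # transcribed words and annotations


# Assumed layout: start, end, optional "SPEAKER:" label, then the text.
_LINE = re.compile(
    r"^\s*(\d+(?:\.\d+)?)\s+(\d+(?:\.\d+)?)\s+(?:([A-Z][A-Z0-9_]*):\s*)?(.*)$"
)


def parse_trn(lines) -> List[Segment]:
    segments = []
    current_speaker = None
    for line in lines:
        match = _LINE.match(line)
        if not match:
            continue  # skip blank or unrecognized lines
        start, end, speaker, text = match.groups()
        if speaker:
            current_speaker = speaker
        segments.append(
            Segment(float(start), float(end), current_speaker, text.strip())
        )
    return segments


sample = """\
2.660 2.805 JOANNE: But,
2.805 4.685 so these slides be real interesting.
6.140 6.325 KEN: ... Yeah.
6.325 7.710 I think it'll be real interesting
"""
segments = parse_trn(sample.splitlines())
```

Carrying the speaker forward is a convenience for analysis only; the actual files may mark turns differently, so annotations.txt should be checked before relying on it.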

Personal names, place names, phone numbers, etc., in the transcripts have been altered to preserve the anonymity of the speakers and their acquaintances, and the audio files have been filtered to make these portions of the recordings unrecognizable. Pitch information is still recoverable from these filtered portions of the recordings, but the amplitude levels in these regions have been reduced relative to the original signal. A separate filter list file (*.flt) associated with each transcript/waveform file pair lists the beginning and ending times of the filtered regions. The file sbc040.flt is empty, indicating that there was no personal information to filter out.
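The filter lists can be used to flag transcript segments that overlap an anonymized region. The Python sketch below assumes, without confirmation from this documentation, that each non-empty line of a .flt file begins with a start time and an end time in seconds, separated by whitespace. The function names and the example times are hypothetical.

```python
def parse_flt(text):
    """Return a list of (start, end) pairs, one per filtered region."""
    regions = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # blank or malformed line
        regions.append((float(fields[0]), float(fields[1])))
    return regions


def is_filtered(start, end, regions):
    """True if the interval [start, end) overlaps any filtered region."""
    return any(start < r_end and r_start < end for r_start, r_end in regions)


# Made-up region times, for illustration only.
regions = parse_flt("12.50 13.10\n40.00 41.25\n")
touches_filtered = is_filtered(13.00, 14.00, regions)

# An empty filter list, as described for sbc040.flt, yields no regions.
no_regions = parse_flt("")
```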

The filtering was done using a digital FIR low-pass filter, with the cut-off frequency set at 400 Hz. The effect of the filter was gradually faded in and out at the beginning and end of each region over a 1,000-sample span, roughly 45 milliseconds at 22050 Hz, to avoid abrupt transitions in the resulting waveform.
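The fade length follows from the sample rate: 1,000 samples at 22050 Hz last about 45.35 ms. The Python sketch below checks that arithmetic and shows one way such a fade could be applied, by blending the original signal with an already low-pass-filtered copy. The linear ramp shape and the `crossfade` helper are assumptions made for illustration; the documentation does not specify how the fade was implemented, and the FIR filter itself is not reproduced here.

```python
SAMPLE_RATE_HZ = 22050
FADE_SAMPLES = 1000


def fade_duration_ms(n_samples=FADE_SAMPLES, rate=SAMPLE_RATE_HZ):
    """Duration of n_samples at the given sample rate, in milliseconds."""
    return 1000.0 * n_samples / rate


def crossfade(original, filtered, fade=FADE_SAMPLES):
    """Blend original into filtered and back, over `fade` samples at each end.

    `original` and `filtered` are equal-length sequences covering one
    region; `filtered` is assumed to be the low-passed version.
    """
    n = len(original)
    if len(filtered) != n:
        raise ValueError("signals must have the same length")
    fade = min(fade, n // 2)  # short regions get a shorter ramp
    out = []
    for i in range(n):
        # Weight of the filtered signal: ramps 0 -> 1 at the start of the
        # region and 1 -> 0 at the end (linear ramp, an assumption).
        w = min(1.0, i / fade, (n - 1 - i) / fade) if fade else 1.0
        out.append((1.0 - w) * original[i] + w * filtered[i])
    return out


duration = fade_duration_ms()

# Toy example: original is all ones, filtered is all zeros, 4-sample fade.
blended = crossfade([1.0] * 10, [0.0] * 10, fade=4)
```

In the toy example the output starts and ends at the original value and reaches the filtered value in the middle of the region, which is the behavior the fade is meant to produce.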

For a complete listing of the files, please see file.tbl in the docs directory.

For the latest information on this corpus, please refer to the following sites devoted to it:

http://www.linguistics.ucsb.edu/research/sbcorpus.html
http://www.ldc.upenn.edu/Projects/SBCSAE

Acknowledgements

The completion and release of this corpus were facilitated by funding from the Talkbank project. Talkbank is an interdisciplinary research project funded by a five-year grant (BCS-998009, KDI, SBE) from the National Science Foundation to Carnegie Mellon University and the University of Pennsylvania.

Produced at the LDC by Nii Martey.

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus at LDC2004S10.

Please contact Nii Martey with any questions regarding this corpus.

Note

The cost of the first 100 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, so these copies are free of charge to qualified researchers; a $30 shipping and handling fee applies. After these first 100 copies are distributed, additional copies will be available for the production cost of $200 per DVD-ROM.

Content Copyright

Portions © 2003 University of California, © 2003 Trustees of the University of Pennsylvania