Switchboard-1 Release 2


Item Name: Switchboard-1 Release 2
Authors: John J. Godfrey and Edward Holliman
LDC Catalog No.: LDC97S62
ISBN: 1-58563-121-3
Data Type: speech
Sample Rate: 8000 Hz
Sampling Format: 2-channel ulaw
Data Source(s): telephone conversations
Project(s): EARS, GALE, Hub5-LVCSR
Application(s): speaker identification, speech recognition
Language(s): English
Language ID(s): eng
Distribution: 4 DVD
Member fee: $0 for 1993, 1997 members
Non-member Fee: US $3000.00
Reduced-License Fee: US $1500.00
Extra-Copy Fee: US $800.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: John J. Godfrey and Edward Holliman
1997
Switchboard-1 Release 2
Linguistic Data Consortium, Philadelphia

Introduction

The Switchboard-1 Telephone Speech Corpus (LDC97S62) was originally collected by Texas Instruments in 1990-1, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-3. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set and all copies of the first pressing have been distributed.

Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
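The two selection constraints above can be checked mechanically against a list of calls. The following is a minimal sketch only: the (caller, callee, topic) tuple layout, the function name, and the ID values are hypothetical and do not reflect the actual format of the corpus call tables.

```python
from collections import Counter


def find_constraint_violations(calls):
    """Return (repeated_pairs, repeated_speaker_topics) for a list of calls.

    `calls` is an iterable of (caller_id, callee_id, topic_id) tuples.
    This layout is an assumption made for illustration.
    """
    pair_counts = Counter()
    speaker_topic_counts = Counter()
    for caller, callee, topic in calls:
        # Constraint (1): a pair of speakers is unordered, so use a frozenset.
        pair_counts[frozenset((caller, callee))] += 1
        # Constraint (2): both sides of the call count as speaking on the topic.
        speaker_topic_counts[(caller, topic)] += 1
        speaker_topic_counts[(callee, topic)] += 1
    repeated_pairs = sorted(
        tuple(sorted(pair)) for pair, n in pair_counts.items() if n > 1
    )
    repeated_speaker_topics = sorted(
        key for key, n in speaker_topic_counts.items() if n > 1
    )
    return repeated_pairs, repeated_speaker_topics


# Hypothetical calls: speakers 1001 and 1002 talk twice, and speaker 1003
# discusses topic 302 twice, so each constraint is violated once.
example_calls = [
    (1001, 1002, 301),
    (1002, 1003, 302),
    (1002, 1001, 303),
    (1003, 1004, 302),
]
repeated_pairs, repeated_speaker_topics = find_constraint_violations(example_calls)
```

An empty result for both lists would mean that the given calls satisfy both constraints.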

Data

In this release, assembled and published by the LDC, all known errors affecting the original publication of speech files were corrected. In addition, modifications have been made to the contents of the NIST Sphere headers of all speech files, to identify each file as being part of the new release and to make the usage of the sample_count header field consistent with standard Sphere usage. (In particular, the sample_count field should reflect the number of samples on each channel in the file. In the initial release, this field was improperly set to the total number of samples in both channels of the file; this has been corrected in the new release.)
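To illustrate why the meaning of sample_count matters, here is a minimal sketch that parses the text portion of a Sphere header and derives the per-channel duration. The header bytes below are hand-written, hypothetical values, not copied from any corpus file, and the function names are introduced only for this example. The sketch assumes the usual Sphere layout: a "NIST_1A" line, a header-size line, "name -type value" fields, and a closing "end_head" line.

```python
def parse_sphere_header(raw):
    """Parse the ASCII fields of a NIST Sphere header given as bytes."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST Sphere header")
    fields = {}
    # Line 0 is the magic string and line 1 is the header size in bytes.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, type_code, value = parts
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN" string fields are kept as text
    return fields


def per_channel_duration_seconds(fields):
    """Duration when sample_count counts samples on EACH channel."""
    return fields["sample_count"] / fields["sample_rate"]


# Hypothetical header for a two-channel, 8000 Hz ulaw file.
example_header = (
    b"NIST_1A\n"
    b"   1024\n"
    b"sample_count -i 2400000\n"
    b"sample_rate -i 8000\n"
    b"channel_count -i 2\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
)

fields = parse_sphere_header(example_header)
duration = per_channel_duration_seconds(fields)
```

With the corrected convention, 2,400,000 samples per channel at 8000 Hz corresponds to 300 seconds. Under the first release's convention the same file would have carried a sample_count of 4,800,000, and a reader applying standard Sphere usage would have computed twice the true duration.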

Since the 1997 release, the Switchboard transcripts have been carefully revised at ISIP and additional problems have been discovered and patched. Three speech files, part of the original release, were inadvertently left off the 1997 revision. After corpus users noted some problems in the original speaker attribution table, LDC audited the problem calls and corrected the attributions. The latest version of ISIP transcriptions, the ISIP update of the ICSI phonetic transcriptions, and corrected word alignments are all available at http://www.ece.msstate.edu/research/isip/projects/switchboard/. The LDC makes the transcript summaries available via http. Researchers have used SWB-1 data for various annotation projects including discourse annotation/speech acts, part-of-speech tagging and parsing, up-to-date orthographic transcriptions, and phonetic transcriptions. This summary documents which files have been used for the various annotations. In addition to the index of these file characteristics, there is also a table detailing speaker attributes.

Updates

03/26/2013: Three previously missing files (sw02289.sph, sw04361.sph, sw04379.sph) were added to this release. File tables and documentation were updated to reflect the addition of these files. Please contact ldc@ldc.upenn.edu to obtain this update. All copies of this corpus obtained after the above date already include this update.

09/29/2011: Added a file list, available through online docs, to reflect its release on DVD. Also, an updated readme reflects these changes.

11/12/2007: Updated and corrected speaker and call tables are now available online in the corpus documentation directory http://www.ldc.upenn.edu/Catalog/docs/LDC97S62/ or as a single compressed tar file at: ftp://ftp.ldc.upenn.edu/pub/ldc/public_data/swb1_corrected_tables.tar.gz

09/2008: The Switchboard Dialog Act Corpus is a version of Switchboard-1 Release 2 tagged with a shallow discourse tagset of approximately 60 basic dialog act tags and combinations. The discourse tag-set used is an augmentation of the Discourse Annotation and Markup System of Labeling (DAMSL) tag-set and is referred to as the SWBD-DAMSL labels. These annotations were created in 1997 at the University of Colorado at Boulder, with the goal of building better language models for automatic speech recognition of the Switchboard domain. To that end, the label-set incorporates both traditional sociolinguistic and discourse-theoretic rhetorical relations/adjacency-pairs as well as some more form-based models. This corpus contains labels for 1155 5-minute conversations comprising 205,000 utterances and 1.4 million words. The Switchboard Dialog Act Corpus is now available online at:

ftp://ftp.ldc.upenn.edu/pub/ldc/public_data/swb1_dialogact_annot.tar.gz

Content Copyright

Portions © 1992, 1993, 1997 Trustees of the University of Pennsylvania