Switchboard-1 Release 2
|Item Name:||Switchboard-1 Release 2|
|Author(s):||John Godfrey, Edward Holliman|
|LDC Catalog No.:||LDC97S62|
|Member Year(s):||1993, 1997|
|Sample Type:||2-channel ulaw|
|Data Source(s):||telephone conversations|
|Project(s):||Hub5-LVCSR, GALE, EARS|
|Application(s):||speech recognition, speaker identification|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC97S62 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Godfrey, John, and Edward Holliman. Switchboard-1 Release 2 LDC97S62. DVD. Philadelphia: Linguistic Data Consortium, 1993.|
The Switchboard-1 Telephone Speech Corpus (LDC97S62) was originally collected by Texas Instruments in 1990-91, under DARPA sponsorship. The first release of the corpus was published by NIST and distributed by the LDC in 1992-93. Since that release, a number of corrections have been made to the data files as presented on the original CD-ROM set, and all copies of the first pressing have been distributed.
Switchboard is a collection of about 2,400 two-sided telephone conversations among 543 speakers (302 male, 241 female) from all areas of the United States. A computer-driven robot operator system handled the calls, giving the caller appropriate recorded prompts, selecting and dialing another person (the callee) to take part in a conversation, introducing a topic for discussion and recording the speech from the two subjects into separate channels until the conversation was finished. About 70 topics were provided, of which about 50 were used frequently. Selection of topics and callees was constrained so that: (1) no two speakers would converse together more than once and (2) no one spoke more than once on a given topic.
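The two selection constraints above can be checked mechanically against a call table. The following is a minimal sketch, assuming each call is available as a hypothetical `(caller_id, callee_id, topic_id)` tuple; the identifiers and the tuple layout are illustrative and are not the format of the LDC call tables.

```python
from collections import Counter

def check_constraints(calls):
    """Return (repeated_pairs, repeated_speaker_topics) for a list of calls.

    `calls` is a list of (caller_id, callee_id, topic_id) tuples.
    This layout is an assumption made for illustration only.
    """
    # Constraint 1: no two speakers converse together more than once.
    # A frozenset makes the pair order-independent (caller vs. callee).
    pair_counts = Counter(frozenset((a, b)) for a, b, _ in calls)
    # Constraint 2: no one speaks more than once on a given topic.
    topic_counts = Counter(
        (speaker, topic) for a, b, topic in calls for speaker in (a, b)
    )
    repeated_pairs = sorted(
        tuple(sorted(pair)) for pair, n in pair_counts.items() if n > 1
    )
    repeated_topics = sorted(key for key, n in topic_counts.items() if n > 1)
    return repeated_pairs, repeated_topics

# Made-up example data: speakers 1001 and 1002 talk twice, and
# speaker 1003 discusses topic 302 twice.
calls = [
    (1001, 1002, 301),
    (1001, 1003, 302),
    (1002, 1001, 303),
    (1003, 1004, 302),
]
repeated_pairs, repeated_topics = check_constraints(calls)
```

An empty result in both lists would indicate that a call table satisfies both constraints.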
In this release, assembled and published by the LDC, all known errors affecting the original publication of speech files were corrected. In addition, modifications have been made to the contents of the NIST Sphere headers of all speech files, to identify each file as being part of the new release and to make the usage of the sample_count header field consistent with standard Sphere usage. (In particular, the sample_count field should reflect the number of samples on each channel in the file. In the initial release, this field was improperly set to the total number of samples in both channels of the file; this has been corrected in the new release.)
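The sample_count convention can be illustrated with a short sketch that reads a Sphere header and compares the field with the per-channel count implied by the size of the audio data. It assumes the usual Sphere text header layout (a `NIST_1A` line, a header-size line, `name -type value` fields, and a closing `end_head`); the helper names and the synthetic header are hypothetical and are not taken from the corpus documentation.

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST Sphere file."""
    # The first two lines are 8 bytes each: the magic string and the header size.
    magic, size_text = raw[:16].decode("ascii").split("\n")[:2]
    if magic != "NIST_1A":
        raise ValueError("not a NIST Sphere header")
    header_size = int(size_text)
    fields = {"header_size": header_size}
    text = raw[:header_size].decode("ascii", errors="replace")
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        name, ftype, value = line.split(None, 2)
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # string fields such as "-s4 ulaw"
            fields[name] = value
    return fields

def per_channel_samples(fields: dict, data_bytes: int) -> int:
    """Samples per channel implied by the size of the audio data."""
    frame_bytes = fields["channel_count"] * fields["sample_n_bytes"]
    return data_bytes // frame_bytes

def make_header(sample_count: int) -> bytes:
    """Build a synthetic 2-channel, 1-byte-per-sample header (illustration only)."""
    text = (
        "NIST_1A\n   1024\n"
        f"sample_count -i {sample_count}\n"
        "channel_count -i 2\n"
        "sample_n_bytes -i 1\n"
        "sample_coding -s4 ulaw\n"
        "end_head\n"
    )
    return text.encode("ascii").ljust(1024, b" ")

# 16000 bytes of 2-channel, 1-byte audio hold 8000 samples per channel.
data_bytes = 16000
corrected = parse_sphere_header(make_header(8000))   # per-channel count
original = parse_sphere_header(make_header(16000))   # total over both channels
is_consistent = corrected["sample_count"] == per_channel_samples(corrected, data_bytes)
is_doubled = original["sample_count"] == 2 * per_channel_samples(original, data_bytes)
```

With a real file, `data_bytes` would be the file size minus `header_size`, assuming the audio is stored uncompressed.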
Since the 1997 release, the Switchboard transcripts have been carefully revised at The Institute for Signal and Information Processing (ISIP) and additional problems have been discovered and patched. Three speech files, part of the original release, were inadvertently left off the 1997 revision. After corpus users noted some problems in the original speaker attribution table, LDC audited the problem calls and corrected the attributions. The latest version of ISIP transcriptions, the ISIP update of the ICSI phonetic transcriptions, and corrected word alignments are all available at ISIP. The LDC makes the transcript summaries available via http. Researchers have used SWB-1 data for various annotation projects including discourse annotation/speech acts, part-of-speech tagging and parsing, up-to-date orthographic transcriptions, and phonetic transcriptions. This summary documents which files have been used for the various annotations. In addition to the index of these file characteristics, there is also a table detailing speaker attributes.
03/26/2013: Three previously missing files (sw02289.sph, sw04361.sph, sw04379.sph) were added to this release. File tables and documentation were updated to reflect the addition of these files. Please contact email@example.com to obtain this update. All copies of this corpus obtained after the above date already include this update.
09/29/2011: Added a file list, available through online docs, to reflect its release on DVD. Also, an updated readme reflects these changes.
11/12/2007: Updated and corrected speaker and call tables are now available online in the corpus documentation directory at https://catalog.ldc.upenn.edu/docs/LDC97S62/
09/2008: The Switchboard Dialog Act Corpus is a version of Switchboard-1 Release 2 tagged with a shallow discourse tagset of approximately 60 basic dialog act tags and combinations. The discourse tagset used is an augmentation of the Discourse Annotation and Markup System of Labeling (DAMSL) tagset and is referred to as the SWBD-DAMSL labels. These annotations were created in 1997 at the University of Colorado at Boulder, with the goal of building better language models for automatic speech recognition of the Switchboard domain. To that end, the label set incorporates both traditional sociolinguistic and discourse-theoretic rhetorical relations/adjacency-pairs as well as some more form-based models. This corpus contains labels for 1,155 five-minute conversations comprising 205,000 utterances and 1.4 million words. The Switchboard Dialog Act Corpus is available as a free download via the online documentation folder.