Switchboard Corpus
Credit Card Conversations
Wordspotting Training Set

NIST Speech Disc 8-1.2
May, 1992

This CD-ROM contains training data for wordspotting on the Switchboard
credit card conversations. Thirty-five conversations are included. They
may be used for cross-validation and algorithm parameter determination,
as well as for ordinary training. A ten-conversation test set will be
released later.

Directory and File Structures
-----------------------------

The following directories and files are contained in the top-level
directory of this disc:

    readme.doc - this file
    swb1/      - directory containing Switchboard credit card
                 conversations and documentation

The directory "swb1" contains three subdirectories:

    doc/       - documentation
    training/  - training data
    keywords/  - keyword markings

The data for each Switchboard conversation are contained in a set of
associated files under the "training" subdirectory. All files for a
given conversation share the same basename but have different
extensions indicating the file type. All Switchboard corpus files,
other than .ref files (see below), are of the form:

    sw<CONVERSATION-ID><FILETYPE>

where

    CONVERSATION-ID ::= 1000 ... 9999 (base 10)
    FILETYPE        ::= .wav | .txt | .mrk
                        (see below for descriptions)

The thirty-five included conversations are:

    1026 1037 1038 1044 1060 1081 1083 1088
    2023 2067 2163 2301 2313 2390 2399 2409
    2536 2682 2710 2718 2764 2800 2883 2917
    2951 2987 2999 3170 3332 3409 3439 3751
    2781 3821 3855

Switchboard Filetypes
---------------------

.wav - two-channel mu-law encoded audio waveform files with standard
       NIST SPHERE headers. Each .wav file contains one conversation
       of not more than ten minutes. Each channel was intended to
       contain the audio for one speaker in the conversation (although
       crosstalk between channels is known to exist for some
       conversations). For the earlier conversations, those preceding
       3170, there was generally an initial time offset between the
       channels, and the offset varied as the conversation proceeded.
       This was due to certain peculiarities in the collection
       process, including some random losses of data. For the later
       conversations this problem was corrected. For some of the
       earlier conversations, namely those with enough crosstalk for
       the offset to be tracked, samples have been deleted from
       non-speech parts of the data to approximately correct the
       offsets. Corresponding changes have been made in the marked
       transcript files. The following conversations have been
       processed in this manner:

           1060 2023 2163 2301 2313 2390 2409
           2710 2718 2800 2883 2951 2987

       For these conversations, as well as for conversations with
       little crosstalk, a combined-channel version may be created by
       summing the two channels. It is assumed, however, that the
       standard procedure will be to process the channels separately.

.txt - text files containing interleaved transcriptions of both
       channels. The .txt files contain headers which describe various
       parameters of the conversation. See "txt_spec.doc" for more
       details.

.mrk - time-aligned word transcriptions. See "mrk_spec.doc" for more
       details.

.ref - text files, one per keyword, containing markings for all
       instances of the keyword in all included conversations. These
       are included under the "keywords" subdirectory. See
       "ref_spec.doc" for more details.

Documentation
-------------

The following files are located in the Switchboard documentation
directory, "swb1/doc":

    ccsets.doc   - characterization of the contents of the training
                   set of Switchboard credit card conversations,
                   including the keywords and variants to be used
    conv_tab.doc - mu-law to PCM conversion table
    mrk_spec.doc - .mrk file specifications
    txt_spec.doc - .txt file specifications
    ref_spec.doc - .ref file specifications
    speakers.doc - information about the speakers in the conversations
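
Example: File Names and Channel Summing
---------------------------------------

The file naming scheme and the channel-summing step described above can
be illustrated with a short Python sketch. It is not part of the disc
and rests on several assumptions: that a file name is "sw" followed by
the conversation ID and the extension; that the audio uses the standard
G.711 mu-law expansion (the table in "conv_tab.doc" is the authoritative
reference for this disc); that the two channels are stored as
alternating samples; and that the SPHERE header has already been
removed from the bytes passed in. The function names are hypothetical.

```python
# Illustrative sketch only; all function names are hypothetical.

FILETYPES = (".wav", ".txt", ".mrk")


def conversation_files(conv_id, base_dir="swb1/training"):
    """Return the assumed paths of the files for one conversation."""
    if not 1000 <= conv_id <= 9999:
        raise ValueError("conversation ID must be in 1000..9999")
    return [f"{base_dir}/sw{conv_id}{ext}" for ext in FILETYPES]


def mulaw_to_linear(code):
    """Expand one 8-bit mu-law code to a linear PCM value.

    Uses the standard G.711 expansion, assumed here to agree with
    the conversion table supplied in conv_tab.doc.
    """
    u = ~code & 0xFF                  # mu-law codes are stored inverted
    t = ((u & 0x0F) << 3) + 0x84      # mantissa plus bias
    t <<= (u & 0x70) >> 4             # scale by the segment number
    return (0x84 - t) if (u & 0x80) else (t - 0x84)


def sum_channels(payload):
    """Sum two sample-interleaved mu-law channels into one PCM list.

    `payload` holds audio bytes only (header already stripped), with
    channel samples assumed to alternate. Sums are clipped to the
    16-bit signed range.
    """
    combined = []
    for a, b in zip(payload[0::2], payload[1::2]):
        s = mulaw_to_linear(a) + mulaw_to_linear(b)
        combined.append(max(-32768, min(32767, s)))
    return combined


if __name__ == "__main__":
    print(conversation_files(1026))
    print(sum_channels(bytes([0xFF, 0x00, 0x80, 0x80])))
```

Clipping is one simple way to keep the sum within 16 bits; halving
both channels before summing would avoid clipping at the cost of level.
Summing is only meaningful for the offset-corrected conversations and
for those with little crosstalk, as noted in the .wav description.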