                             Switchboard Corpus

                         Credit Card Conversations

                         Wordspotting Training Set

                        NIST Speech Disc 8-1.2

                              May, 1992


This CD-ROM contains training data for wordspotting on the Switchboard
credit card conversations.  Thirty-five conversations are included.
They may be used for cross-validation and algorithm parameter
determination, as well as for ordinary training.  A ten-conversation
test set will be released later.


Directory and File Structures
-----------------------------
The following directories and files are contained in the top-level
directory of this disc:

   readme.doc - this file
 
        swb1/ - directory containing Switchboard credit card
                conversations and documentation

The "swb1" directory contains three subdirectories:

         doc/ - documentation

    training/ - training data

    keywords/ - keyword markings


The data for Switchboard conversations are contained in a set of
associated files under the "training" subdirectory.
All files for a given conversation share the same basename but contain
different extensions indicating file type.

All Switchboard corpus files, other than .ref files (see below), are of
the form:

    sw<CONVERSATION-ID>.<FILETYPE>

where

    CONVERSATION-ID ::= 1000 ... 9999 (base 10)

    FILETYPE ::= wav | txt | mrk  (see below for descriptions)
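
The naming rule above can be checked mechanically.  The following is a
minimal Python sketch, not part of the disc; the function names
parse_name and conversation_files are hypothetical, and the match is
assumed to be case-sensitive and lowercase, which may not hold on every
system that mounts the disc.

```python
import re

# sw<CONVERSATION-ID>.<FILETYPE>, with a four-digit ID in 1000..9999
# and one of the three per-conversation file types.
_NAME = re.compile(r"^sw([1-9][0-9]{3})\.(wav|txt|mrk)$")


def parse_name(filename):
    """Return (conversation_id, filetype), or None if the name does not fit."""
    m = _NAME.match(filename)
    if m is None:
        return None
    return int(m.group(1)), m.group(2)


def conversation_files(conversation_id):
    """List the three file names expected for one conversation."""
    return ["sw%d.%s" % (conversation_id, ext) for ext in ("wav", "txt", "mrk")]


print(parse_name("sw2023.wav"))
print(conversation_files(1026))
```

Names that do not fit the pattern, including the .ref keyword files,
yield None.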


The thirty-five included conversations are:    1026 1037 1038 1044 1060
                                               1081 1083 1088 2023 2067
                                               2163 2301 2313 2390 2399
                                               2409 2536 2682 2710 2718 
                                               2764 2800 2883 2917 2951
                                               2987 2999 3170 3332 3409
                                               3439 3751 2781 3821 3855
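
Since the training conversations may also be used for cross-validation,
one possible way to divide them is sketched below in Python.  This is an
illustration only: the round-robin split, the function name k_folds, and
the choice of k are assumptions, not a protocol prescribed by this disc.
Whole conversations are held out together so that no conversation
contributes to both sides of a split.

```python
def k_folds(conversation_ids, k):
    """Deal IDs round-robin into k folds; hold each fold out once.

    Returns a list of (training_ids, held_out_ids) pairs.
    """
    folds = [conversation_ids[i::k] for i in range(k)]
    splits = []
    for held_out in range(k):
        train = [c for j, fold in enumerate(folds) if j != held_out
                 for c in fold]
        splits.append((train, folds[held_out]))
    return splits


# The first five conversation IDs from the list above, for illustration.
ids = [1026, 1037, 1038, 1044, 1060]
for train, held_out in k_folds(ids, 5):
    print(held_out, train)
```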


Switchboard Filetypes
---------------------
.wav - two-channel mu-law encoded audio waveform files with standard
       NIST SPHERE headers.  Each .wav file contains one conversation
       of not more than ten minutes.  Each channel was intended to
       contain the audio for one speaker in the conversation (although
       crosstalk between channels is known to exist for some 
       conversations).

       For the earlier conversations, those preceding 3170, there was
       generally an initial time offset between the channels, and
       variation in the offset as the conversation proceeded.  This was
       due to certain peculiarities in the collection process including
       some random losses of data.  For the later conversations this
       problem was corrected.

       For some of these conversations, those with enough crosstalk
       for the offset to be tracked, samples have been deleted from
       non-speech parts of the data to approximately correct the
       offsets.  Corresponding changes have been made in the marked
       transcript files.  For these conversations, as well as for
       conversations with little crosstalk, a combined-channel version
       may be created by summing the two channels.  It is assumed,
       however, that the standard procedure will be to process the
       channels separately.  The following conversations have had
       their offsets corrected in this manner:

       1060 2023 2163 2301 2313 2390 2409 2710 2718 2800 2883 2951 2987
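
       The channel summing mentioned above might be sketched as
       follows in Python.  This is an illustration under stated
       assumptions, not the procedure used to prepare the disc: the
       SPHERE header is assumed to have been removed already, the two
       channels are assumed to be interleaved sample by sample, and
       the expansion is the standard G.711 mu-law formula rather than
       a lookup in the table supplied in "conv_tab.doc", which should
       be treated as authoritative.  The function names are
       hypothetical.

```python
def mulaw_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed linear sample (G.711)."""
    u = ~byte & 0xFF                      # mu-law codes are stored inverted
    t = ((u & 0x0F) << 3) + 0x84          # mantissa plus bias
    t <<= (u & 0x70) >> 4                 # apply the 3-bit exponent
    return 0x84 - t if u & 0x80 else t - 0x84


def sum_channels(interleaved):
    """Sum two interleaved mu-law channels into one linear channel.

    Samples are expanded to linear before adding, because mu-law codes
    cannot be added directly.  Sums are clipped to the 16-bit range.
    """
    combined = []
    for i in range(0, len(interleaved) - 1, 2):
        s = mulaw_to_linear(interleaved[i]) + mulaw_to_linear(interleaved[i + 1])
        combined.append(max(-32768, min(32767, s)))
    return combined


# Two sample frames: (silence, full positive) and (full negative twice).
print(sum_channels(bytes([0xFF, 0x80, 0x00, 0x00])))
```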

.txt - text files containing interleaved transcriptions of both channels.
       The .txt files contain headers which describe various
       parameters of the conversation.  See "txt_spec.doc" for more 
       details. 

.mrk - time-aligned word transcriptions.  See "mrk_spec.doc" for more
       details.

.ref - text files, one per keyword, containing markings for all
       instances of that keyword in all included conversations.  These
       are included under the "keywords" subdirectory.  See
       "ref_spec.doc" for more details.


Documentation
-------------
The following files are located in the Switchboard documentation
directory, "swb1/doc":

      ccsets.doc - characterization of the contents of the training
                   set of Switchboard credit card conversations
                   including the keywords and variants to be used.
                   
    conv_tab.doc - mu-law to PCM conversion table

    mrk_spec.doc - .mrk file specifications

    txt_spec.doc - .txt file specifications

    ref_spec.doc - .ref file specifications

    speakers.doc - information about the speakers in the conversations