NIST Speech Discs 18-1.1 and 18-2.1
April, 1994
This two-disc set of CD-ROMs contains conversation recordings, transcription files, and documentation for the SPeaker IDentification REsearch (SPIDRE) Corpus, a subset of the much larger Switchboard Corpus. SPIDRE contains orthographic transcripts (*.txt) and time-marked word transcripts (*.mrk) for 280 of the conversations in the Switchboard Corpus. There are 142 conversations on disc 18-1.1 and 138 conversations on disc 18-2.1.
To keep the corpus compact and manageable, 64 conversations that originally exceeded five minutes in length were truncated. The intended truncation point was the end of the last speaking turn completed before the five-minute mark. The transcription files (both *.mrk and *.txt) were truncated to match their corresponding waveform files.
The truncation was performed automatically, using the turn-taking information in the time-aligned (*.mrk) transcription files. Please note, however, that because the *.mrk files are not always precise in their time alignments, some waveform (*.wav) files were not truncated exactly at the turn-taking point indicated in the *.mrk files. Listening to the truncated waveforms suggests that 29% (18/63) of the conversations have been imperfectly truncated because of *.mrk file errors. These errors typically involve the addition or deletion of one word at the end of the time-aligned (*.mrk) transcription file and should not affect the use of these files for speaker identification.
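The truncation rule described above can be illustrated with a minimal sketch. It is not the tool that was used to build the corpus. It assumes the turn boundaries have already been read into a list of (speaker, start, end) tuples with times in seconds; that record layout and the function names are hypothetical and do not reflect the actual *.mrk file format.

```python
from typing import List, Optional, Tuple

# Hypothetical turn record: (speaker label, start time in s, end time in s).
# This is an assumed in-memory layout, not the *.mrk file format.
Turn = Tuple[str, float, float]


def truncation_point(turns: List[Turn], limit: float = 300.0) -> Optional[float]:
    """Return the end time of the last turn that finishes before `limit`.

    Returns None when the conversation does not exceed `limit` (no cut is
    needed) or when no turn ends before `limit`.
    """
    if not turns or max(end for _, _, end in turns) <= limit:
        return None
    ends = [end for _, _, end in turns if end < limit]
    return max(ends) if ends else None


def truncate_turns(turns: List[Turn], limit: float = 300.0) -> List[Turn]:
    """Keep only the turns that end at or before the truncation point."""
    cut = truncation_point(turns, limit)
    if cut is None:
        return list(turns)
    return [turn for turn in turns if turn[2] <= cut]


example = [
    ("A", 0.0, 120.5),
    ("B", 120.5, 290.2),
    ("A", 290.2, 310.0),  # this turn crosses the five-minute mark
    ("B", 310.0, 330.0),
]
print(truncation_point(example))  # 290.2
```

Because the cut is taken from the transcript timing, any error in the time alignment carries over directly to the waveform, which is consistent with the one-word discrepancies noted above.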
Modifications from SWITCHBOARD Corpus (SWB1)
In addition to the waveform truncation described above, one SWB1 transcript has been corrected. Speech Disc 18-2.1 of the SPIDRE corpus contains a corrected transcript file for target speaker 1181. As a result, the transcript for conversation 3169 now differs from its version in SWB1. The orthographic transcription file (sw3169.txt) was updated to include the speaker's proper personal identification number in the header field "FILENAME:".
The purpose of the SPeaker IDentification REsearch (SPIDRE) corpus is to provide a "starter kit" for research in the area of speaker identification. The data in the SPIDRE corpus has been drawn from the much larger Switchboard (SWB1) corpus in order to create a manageable data set for speaker identification research. The SPIDRE corpus data has also been selected to maximize its utility in this domain.
The SPIDRE corpus contains 280 conversations, 180 of which contain at least one speaker who has been deemed to be a "target" speaker. The remaining 100 conversations contain only "non-target" speakers. The corpus contains 45 target speakers and 287 non-target speakers. Of the 287 non-target speakers, 161 are in a non-target conversation and the remaining 126 are speaking to a target speaker in a target conversation. The specific design and selection criteria used in forming the corpus are described below.
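The counts quoted in this document can be cross-checked with a few lines of arithmetic. The sketch below uses only the figures stated in the text; the variable names are illustrative.

```python
# Figures taken from the text of this document.
target_conversations = 180
non_target_conversations = 100
target_speakers = 45
non_target_in_non_target_conv = 161
non_target_in_target_conv = 126
disc_1_conversations = 142  # disc 18-1.1
disc_2_conversations = 138  # disc 18-2.1

total_conversations = target_conversations + non_target_conversations
total_non_target_speakers = non_target_in_non_target_conv + non_target_in_target_conv
total_speakers = target_speakers + total_non_target_speakers
total_on_discs = disc_1_conversations + disc_2_conversations

print(total_conversations)        # 280
print(total_non_target_speakers)  # 287
print(total_speakers)             # 332
print(total_on_discs)             # 280
```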
SPIDRE SELECTION CRITERIA:
180 target conversations (92 on disc 1, 88 on disc 2) were selected from SWB1 according to the following criteria:
100 non-target conversations (50 on each disc) were selected from SWB1 according to the following criteria:
All SPIDRE corpus files are of the form:
sw<CONVERSATION-ID><FILETYPE>
where
CONVERSATION-ID ::= 1000 ... 9999 (base 10)
FILETYPE ::= .wav | .txt | .mrk
Further information
SPIDRE filetypes and detailed information can be found in the SWITCHBOARD Manual.
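For readers who process the discs programmatically, the file naming convention can be checked with a short sketch. It assumes a file name consists of "sw", a four-digit conversation ID between 1000 and 9999, and one of the three file type extensions, as in sw3169.txt. The helper and type names are hypothetical and are not part of any NIST tool.

```python
import re
from typing import NamedTuple, Optional


class SpidreName(NamedTuple):
    conversation_id: int  # 1000 ... 9999, base 10
    filetype: str         # ".wav", ".txt", or ".mrk"


# "sw" + four-digit ID with a non-zero leading digit + known extension.
_NAME_PATTERN = re.compile(r"sw([1-9][0-9]{3})(\.wav|\.txt|\.mrk)")


def parse_spidre_name(name: str) -> Optional[SpidreName]:
    """Split a file name into its parts; return None if it does not match."""
    match = _NAME_PATTERN.fullmatch(name)
    if match is None:
        return None
    return SpidreName(int(match.group(1)), match.group(2))


print(parse_spidre_name("sw3169.txt"))
# SpidreName(conversation_id=3169, filetype='.txt')
```

Names outside the stated ID range or with another extension, such as sw0999.wav or sw3169.doc, return None.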