SPIDRE Corpus

Recorded Telephone Conversations

NIST Speech Discs 18-1.1 and 18-2.1

April, 1994

This two-disc set of CD-ROMs contains recordings of conversations, transcription files, and documentation for the SPeaker IDentification REsearch (SPIDRE) Corpus, a subset of the much larger Switchboard Corpus. SPIDRE contains transcripts and time-marked word transcripts for 280 of the conversations in the Switchboard Corpus. There are 142 conversations on disc 18-1.1 and 138 conversations on disc 18-2.1.

To keep the corpus compact and manageable, 64 conversations that originally exceeded five minutes in length were truncated. The intended truncation point was the nearest end of a speaking turn occurring before the five-minute mark. The transcription files (both *.mrk and *.txt) were truncated to match their corresponding waveform files.

The truncation was performed automatically using turn-taking information in the time-aligned (*.mrk) transcription files. Please note, however, that because the *.mrk files are not always precise in their time alignments, some truncations of the waveform (*.wav) files were not performed exactly at the turn-taking point indicated in the *.mrk files. In listening to the truncated waveforms, it appears that 29% (18/63) of the conversations have been imperfectly truncated due to *.mrk file errors. These errors typically involve the addition or deletion of one word at the end of the time-aligned (*.mrk) transcription file, and should not affect the use of these files for the purpose of speaker identification.
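The truncation rule described above can be illustrated with a short Python sketch. It is not the tool that was used to build the corpus. The representation of a turn as a (speaker, start, end) tuple in seconds and the function names are assumptions made for illustration only; the actual *.mrk layout is described in the SWITCHBOARD Manual. The sketch also treats a turn ending exactly at the limit as acceptable.

```python
from typing import List, Optional, Tuple

# Hypothetical turn representation: (speaker, start_sec, end_sec).
# The real *.mrk format is not reproduced here.
Turn = Tuple[str, float, float]

LIMIT_SEC = 300.0  # the five-minute mark


def truncation_point(turns: List[Turn], limit: float = LIMIT_SEC) -> Optional[float]:
    """Return the latest turn ending at or before `limit`, or None if no turn ends by then."""
    ends = [end for _, _, end in turns if end <= limit]
    return max(ends) if ends else None


def truncate_turns(turns: List[Turn], limit: float = LIMIT_SEC) -> List[Turn]:
    """Keep only the turns that end at or before the chosen truncation point."""
    cut = truncation_point(turns, limit)
    if cut is None:
        return []
    return [turn for turn in turns if turn[2] <= cut]
```

Because the cut is derived from the time alignments, any imprecision in those alignments carries over to the waveform, which is consistent with the one-word discrepancies noted above.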

Modifications from SWITCHBOARD Corpus (SWB1)

In addition to the waveform truncation changes described above, a correction to the original SWB1 transcript has been made for one of the conversations. Speech Disc 18-2.1 of the SPIDRE corpus contains a corrected transcript file for target speaker 1181. Note that the transcript for conversation 3169 now differs from its version in SWB1. The orthographic transcription file (sw3169.txt) was updated to include the speaker's proper personal identification number in the header field "FILENAME:".


SPIDRE.DOC

The purpose of the SPeaker IDentification REsearch (SPIDRE) corpus is to provide a "starter kit" for research in the area of speaker identification. The data in the SPIDRE corpus has been drawn from the much larger Switchboard (SWB1) corpus in order to create a manageable data set for speaker identification research. The SPIDRE corpus data has also been selected to maximize its utility in this domain.

The SPIDRE corpus contains 280 conversations, 180 of which contain at least one speaker who has been deemed to be a "target" speaker. The remaining 100 conversations contain only "non-target" speakers. The corpus contains 45 target speakers and 287 non-target speakers. Of the 287 non-target speakers, 161 are in a non-target conversation and the remaining 126 are speaking to a target speaker in a target conversation. The specific design and selection criteria used in forming the corpus are described below.
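As a quick arithmetic check, the counts quoted above can be verified to add up. The following Python sketch uses only the figures stated in this section; the constant and function names are illustrative and are not part of the corpus.

```python
# Counts quoted in the SPIDRE description above.
TOTAL_CONVERSATIONS = 280
TARGET_CONVERSATIONS = 180
NON_TARGET_CONVERSATIONS = 100

NON_TARGET_SPEAKERS = 287
NON_TARGET_IN_NON_TARGET_CONVERSATIONS = 161
NON_TARGET_IN_TARGET_CONVERSATIONS = 126


def counts_are_consistent() -> bool:
    """Check that the conversation and non-target speaker subtotals match the totals."""
    conversations_ok = (
        TARGET_CONVERSATIONS + NON_TARGET_CONVERSATIONS == TOTAL_CONVERSATIONS
    )
    speakers_ok = (
        NON_TARGET_IN_NON_TARGET_CONVERSATIONS + NON_TARGET_IN_TARGET_CONVERSATIONS
        == NON_TARGET_SPEAKERS
    )
    return conversations_ok and speakers_ok
```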

SPIDRE SELECTION CRITERIA:

NOTES
  1. 13 of the 180 target conversations contain target speakers on both channels (8 on disc 1, 5 on disc 2); the same conversation will therefore exist under different speakers.
  2. Cases where a non-target speaker participated in more than one conversation with another non-target speaker could not be eliminated. In the cases where a non-target speaker participates in more than one conversation, the conversations were divided between the two discs in order to provide as many distinct non-target speakers, per disc, as possible. This was done to maximize the utility of the data in tests where the data on only 1 of the 2 discs is used.
  3. In order to produce a somewhat "clean" corpus with respect to channel effects, conversations that were determined by the transcribers to contain either high static or high echo were eliminated.
  4. Conversations that were listed on Switchboard's bug reports were also eliminated.

FILE FORMAT:

  All SPIDRE corpus files are of the form:

    sw<CONVERSATION-ID><FILETYPE>

  Where,

    CONVERSATION-ID ::= 1000 ... 9999 (base 10)

    FILETYPE ::= .wav | .txt | .mrk 
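  A file name of this form, such as sw3169.txt, can be checked and split into its parts with a short Python sketch. The function name is hypothetical, and the sketch assumes lowercase names with a four-digit conversation ID between 1000 and 9999; names stored in upper case on a given disc would need a case-insensitive match.

```python
import re
from typing import Optional, Tuple

# "sw", a conversation ID from 1000 to 9999, then one of the three file types.
_NAME_RE = re.compile(r"sw([1-9][0-9]{3})\.(wav|txt|mrk)")


def parse_spidre_name(name: str) -> Optional[Tuple[int, str]]:
    """Return (conversation_id, filetype) for a well-formed name, else None."""
    match = _NAME_RE.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1)), "." + match.group(2)
```

  For example, parse_spidre_name("sw3169.txt") yields the conversation ID 3169 and the file type ".txt", while a name with any other extension is rejected.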

Further information

SPIDRE filetypes and detailed information can be found in the SWITCHBOARD Manual.