SPIDRE Corpus

Recorded Telephone Conversations
NIST Speech Discs 18-1.1 and 18-2.1
April, 1994

This two-disc set of CD-ROMs contains recordings of conversations, transcription files, and documentation for the SPeaker IDentification REsearch (SPIDRE) Corpus, a subset of the much larger Switchboard Corpus. SPIDRE contains transcripts and time-marked word transcripts for 280 of the conversations in the Switchboard Corpus. There are 142 conversations on disc 18-1.1 and 138 conversations on disc 18-2.1.

To keep the corpus compact and manageable, 64 conversations that originally exceeded five minutes in length were truncated. The intended truncation point was the nearest end of a speaking turn occurring before the five-minute mark. The transcription files (both *.mrk and *.txt) were truncated to match their corresponding waveform files.

The truncation was performed automatically, using the turn-taking information in the time-aligned (*.mrk) transcription files. Please note, however, that because the *.mrk files are not always precise in their time alignments, some truncations of the waveform (*.wav) files were not performed exactly at the turn-taking point indicated in the *.mrk files. In listening to the truncated waveforms, it appears that 29% (18/63) of the conversations have been imperfectly truncated due to *.mrk file errors. These errors typically involve the addition or deletion of one word at the end of the time-aligned (*.mrk) transcription file, and should not affect the use of these files for the purpose of speaker identification.
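
The truncation rule described above can be sketched in Python. This is only an illustration, not the procedure actually used to build the corpus. The turn record (speaker, start time, end time, in seconds) is a hypothetical stand-in for the contents of a *.mrk file, whose real layout is described in the SWITCHBOARD Manual, and the choice to accept a turn ending exactly at the limit is an assumption.

```python
from typing import List, Tuple

# Hypothetical turn record: (speaker, start_seconds, end_seconds).
# This is NOT the real *.mrk layout; it only illustrates the rule.
Turn = Tuple[str, float, float]


def truncation_point(turns: List[Turn], limit: float = 300.0) -> float:
    """Return the latest turn ending that falls at or before `limit` seconds."""
    ends = [end for _, _, end in turns if end <= limit]
    if not ends:
        raise ValueError("no speaking turn ends before the limit")
    return max(ends)


def truncate_turns(turns: List[Turn], limit: float = 300.0) -> List[Turn]:
    """Keep only the turns that finish by the chosen truncation point."""
    cut = truncation_point(turns, limit)
    return [turn for turn in turns if turn[2] <= cut]


# Made-up example: the third turn runs past the five-minute mark.
example_turns = [("A", 0.0, 120.5), ("B", 120.5, 290.2), ("A", 290.2, 310.0)]
cut_at = truncation_point(example_turns)
kept = truncate_turns(example_turns)
print(cut_at, len(kept))
```

Because the cut is taken from the transcript timing rather than from the audio, any alignment error in the turn endings carries over to the waveform, which is consistent with the imperfect truncations noted above.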

Modifications from SWITCHBOARD Corpus (SWB1)

In addition to the waveform truncation changes described above, one correction has been made to an original SWB1 transcript. Speech Disc 18-2.1 of the SPIDRE corpus contains a corrected transcript file for target speaker 1181. Note that the transcript for conversation 3169 now differs from its version in SWB1. The orthographic transcription file (sw3169.txt) was updated to include the speaker's proper personal identification number in the header field "FILENAME:".

SPIDRE.DOC

The purpose of the SPeaker IDentification REsearch (SPIDRE) corpus is to provide a "starter kit" for research in the area of speaker identification. The data in the SPIDRE corpus has been drawn from the much larger Switchboard (SWB1) corpus in order to create a manageable data set for speaker identification research. The SPIDRE corpus data has also been selected to maximize its utility in this domain.

The SPIDRE corpus contains 280 conversations, 180 of which contain at least one speaker who has been deemed to be a "target" speaker. The remaining 100 conversations contain only "non-target" speakers. The corpus contains 45 target speakers and 287 non-target speakers. Of the 287 non-target speakers, 161 are in a non-target conversation and the remaining 126 are speaking to a target speaker in a target conversation. The specific design and selection criteria used in forming the corpus are described below.
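
The counts quoted in this document can be cross-checked with a few lines of Python. The sketch below uses only figures stated in this document (the per-disc figures come from the introduction and from the selection criteria); the variable names are illustrative.

```python
# Figures quoted in this document.
conversations = {"target": 180, "non_target": 100, "total": 280}
per_disc = {
    "disc1": {"target": 92, "non_target": 50, "total": 142},
    "disc2": {"target": 88, "non_target": 50, "total": 138},
}
non_target_speakers = {"in_non_target": 161, "in_target": 126, "total": 287}

# Each entry pairs a description with the result of one consistency check.
checks = {
    "target + non-target conversations = total":
        conversations["target"] + conversations["non_target"]
        == conversations["total"],
    "disc totals sum to corpus total":
        per_disc["disc1"]["total"] + per_disc["disc2"]["total"]
        == conversations["total"],
    "target conversations split across discs":
        per_disc["disc1"]["target"] + per_disc["disc2"]["target"]
        == conversations["target"],
    "each disc total = target + non-target":
        all(d["target"] + d["non_target"] == d["total"]
            for d in per_disc.values()),
    "non-target speaker split":
        non_target_speakers["in_non_target"] + non_target_speakers["in_target"]
        == non_target_speakers["total"],
}

for name, ok in checks.items():
    print(f"{name}: {'ok' if ok else 'MISMATCH'}")
```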

SPIDRE SELECTION CRITERIA:

Target Speakers

180 target conversations (92 on disc 1, 88 on disc 2) were selected from SWB1 according to the following criteria:
1. Speakers must have participated in at least 4 calls.
2. Speakers must have used at least 3 different handsets.
3. If the speaker participated in more than 4 calls, 4 calls were selected somewhat randomly so that exactly 3 handsets were represented. Thus, one of the handsets is represented in two of the conversations.

Target speakers were distributed across the two discs so as to balance the representation of gender, age, and dialect region evenly.
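
One way to pick 4 calls covering exactly 3 handsets is sketched below. This is an assumption about how such a selection could be made, not the procedure actually used: the call record (conversation ID, handset ID) and the function name are hypothetical, and "somewhat randomly" is interpreted here as a uniform random choice.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

# Hypothetical call record: (conversation_id, handset_id).
Call = Tuple[int, str]


def select_target_calls(calls: List[Call],
                        rng: random.Random) -> Optional[List[Call]]:
    """Pick 4 calls that cover exactly 3 handsets, or None if impossible."""
    by_handset: Dict[str, List[Call]] = defaultdict(list)
    for call in calls:
        by_handset[call[1]].append(call)

    # Criteria 1 and 2: at least 4 calls and at least 3 handsets.
    if len(calls) < 4 or len(by_handset) < 3:
        return None

    # One handset must supply two calls, so it needs at least two.
    repeatable = [h for h, cs in by_handset.items() if len(cs) >= 2]
    if not repeatable:
        return None

    doubled = rng.choice(repeatable)
    others = rng.sample([h for h in by_handset if h != doubled], 2)

    chosen = rng.sample(by_handset[doubled], 2)
    for handset in others:
        chosen.append(rng.choice(by_handset[handset]))
    return chosen


# Made-up speaker with five calls over four handsets.
example_calls = [(1, "a"), (2, "a"), (3, "b"), (4, "c"), (5, "d")]
picked = select_target_calls(example_calls, random.Random(0))
print(picked)
```

Note that a speaker whose calls each used a different handset satisfies criteria 1 and 2 but cannot yield 4 calls on exactly 3 handsets; the sketch returns None in that case.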

Non-Target Speakers

100 non-target conversations (50 on each disc) were selected from SWB1 according to the following criteria:
1. Speakers must not be involved in either side of the 180 target conversations in the SPIDRE corpus.
2. Each speaker must have at least 60 seconds of speech in the first 210 seconds of the conversation (balanced conversations).
3. Total length of the waveform file must be as close to 5 minutes as possible, to ensure that the conversations are representative of both speakers and that the corpus fits on 2 discs.
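
The three non-target criteria can be read as a filter followed by a ranking. The sketch below is a minimal illustration under stated assumptions: the Conversation record and its fields are hypothetical, and "as close to 5 minutes as possible" is interpreted as sorting by distance from 300 seconds. It does not model the split across the two discs.

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass
class Conversation:
    # Hypothetical record; field names are illustrative only.
    conv_id: int
    speakers: Tuple[int, int]          # speaker IDs on the two channels
    early_speech: Tuple[float, float]  # seconds of speech per speaker in first 210 s
    duration: float                    # waveform length in seconds


def pick_non_target(candidates: List[Conversation],
                    target_side_speakers: Set[int],
                    count: int = 100,
                    ideal: float = 300.0) -> List[Conversation]:
    """Apply criteria 1 and 2 as filters, then rank by criterion 3."""
    eligible = [
        c for c in candidates
        # Criterion 1: neither speaker appears in a target conversation.
        if not set(c.speakers) & target_side_speakers
        # Criterion 2: both speakers talk for at least 60 s early on.
        and min(c.early_speech) >= 60.0
    ]
    # Criterion 3: prefer waveforms closest to five minutes.
    eligible.sort(key=lambda c: abs(c.duration - ideal))
    return eligible[:count]


# Made-up candidates for illustration.
example = [
    Conversation(2001, (10, 11), (70.0, 65.0), 298.0),
    Conversation(2002, (12, 13), (59.0, 90.0), 300.0),  # fails criterion 2
    Conversation(2003, (14, 99), (80.0, 80.0), 300.0),  # fails criterion 1
    Conversation(2004, (15, 16), (80.0, 80.0), 305.0),
]
selected_ids = [c.conv_id for c in pick_non_target(example, {99}, count=2)]
print(selected_ids)
```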

NOTES

13 of the 180 target conversations contain target speakers on both channels (8 on disc 1, 5 on disc 2), so the same conversation appears under more than one target speaker.
Cases where a non-target speaker participated in more than one conversation with another non-target speaker could not be eliminated. Where a non-target speaker participates in more than one conversation, the conversations were divided between the two discs in order to provide as many distinct non-target speakers per disc as possible. This was done to maximize the utility of the data in tests where the data on only 1 of the 2 discs is used.
In order to produce a somewhat "clean" corpus with respect to channel effects, conversations that were determined by the transcribers to contain either high static or high echo were eliminated.
Conversations that were listed on Switchboard's bug reports were also eliminated.

FILE FORMAT:


  All SPIDRE corpus files are of the form:

    sw<CONVERSATION-ID>.<FILETYPE>

  Where,

    CONVERSATION-ID ::= 1000 ... 9999 (base 10)

    FILETYPE ::= wav | txt | mrk
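
A file name following this convention can be parsed with a short regular expression. The sketch below assumes names such as sw3169.txt (the transcript file cited earlier in this document): the prefix "sw", a four-digit conversation ID from 1000 to 9999, and one of the three extensions. The function name is illustrative, and lowercase names are an assumption; names on a given file system may differ in case.

```python
import re
from typing import Optional, Tuple

# "sw" + four-digit ID (1000-9999) + "." + one of the three file types.
_SPIDRE_NAME = re.compile(r"sw([1-9][0-9]{3})\.(wav|txt|mrk)")


def parse_spidre_name(name: str) -> Optional[Tuple[int, str]]:
    """Return (conversation_id, filetype), or None if the name does not match."""
    match = _SPIDRE_NAME.fullmatch(name)
    if match is None:
        return None
    return int(match.group(1)), match.group(2)


parsed = parse_spidre_name("sw3169.txt")
print(parsed)
```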

Further information

SPIDRE filetypes and detailed information can be found in the SWITCHBOARD Manual.
