KING Speaker Verification

Item Name: KING Speaker Verification
Author(s): Dr. Alan Higgins, Dave Vermilyea
Citation: Dr. Alan Higgins, and Dave Vermilyea. KING Speaker Verification LDC95S22. Web Download. Philadelphia: Linguistic Data Consortium, 1995.


The KING corpus was collected at ITT in 1987 under a US government research contract. Although other contractors have received it, it has not been officially available for public use before now. The version now available from LDC, referred to as KING-92, is based on a 1992 reprocessing of the original recordings (see below). It contains recorded speech from 51 male speakers in two versions that differ in channel characteristics: one from a telephone handset and one from a high-quality microphone. The speakers fall into two groups, of 25 and 26, who were recorded at different locations. For each speaker and channel there are ten files, corresponding to sessions of about 30 to 60 seconds each. The interval between sessions varies from a week to a month. The transcripts contain about 54k word tokens (4.8k types).

KING is designed principally for closed-set experiments in text-independent speaker identification or verification over toll-quality telephone lines, although the single-sided collection format does not permit simulation of real telephone traffic. The ten sessions allow for a variety of divisions into training and test data, with the possibility of multiple test sets. For example, one could examine the effect of the amount of training data on performance, or examine the variability of performance over several test samples (sessions) given a fixed amount of training (but see below about the "Great Divide").
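
As a rough illustration of such divisions, the following Python sketch splits the ten sessions of one speaker and channel into a training set and a list of single-session test samples. The function name and the choice to train on the earliest sessions are assumptions made for illustration; they are not part of the corpus documentation.

```python
# Ten sessions per speaker and channel, as described above.
SESSIONS = tuple(range(1, 11))


def split_sessions(n_train):
    """Return (train, test) session numbers, training on the first n_train sessions.

    Hypothetical helper: training on the earliest sessions is only one of
    many possible divisions of the ten sessions.
    """
    if not 1 <= n_train < len(SESSIONS):
        raise ValueError("n_train must leave at least one test session")
    return SESSIONS[:n_train], SESSIONS[n_train:]


# Vary the amount of training; every remaining session is a separate test sample.
for n_train in (1, 3, 5):
    train, test = split_sessions(n_train)
    print(f"train on {train}, test on each of {test}")
```

Holding n_train fixed and scoring each test session separately gives the variability over test samples mentioned above; sweeping n_train gives the effect of the amount of training.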


The collection method used in KING was to establish a call from a laboratory location at ITT (either San Diego, CA or Nutley, NJ) over long-distance lines and back to another phone at the same location. The phones used by the test subjects were equipped with an additional microphone, so two parallel recordings were made of that side of the conversation, while the interlocutor's side was not recorded. The two parties either spoke spontaneously or carried out a variety of tasks designed to elicit natural-sounding speech: interpreting a drawing, solving a problem, describing a picture, etc.

There were 25 speakers in Nutley and 26 in San Diego. Speech-to-noise ratios average about 10 dB worse for the Nutley telephone data than for San Diego; in fact, the ratio is less than 20 dB for over half the Nutley files. Users of this corpus therefore usually run separate experiments, or at least report results separately, according to site. A more subtle difference in the recordings, however, sometimes referred to as the "Great Divide," cuts across the telephone data for the San Diego speakers. This was apparently due to a minor equipment change made during the collection; it results in a slight but consistent change in the average long-term spectrum of the telephone data recorded after the fifth session. Training and testing on data from the same side of this divide gives significantly better results than training and testing across it. Since the discovery of this difference, investigators generally report results on the first and last five sessions of the San Diego telephone KING data separately, or they report within vs. across this boundary. A detailed description of the spectral differences can be found in a report by Thomas Crystal and Ned Neuburg which accompanies the CD-ROM version.
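
The within vs. across reporting convention can be sketched in Python as follows. This is a minimal illustration that assumes the divide falls between sessions 5 and 6, as stated above, and that it applies only to the San Diego telephone data; the function names and the "mixed" label for training sets that straddle the boundary are hypothetical.

```python
# Last session recorded before the equipment change (San Diego telephone data).
LAST_SESSION_BEFORE_DIVIDE = 5


def side(session):
    """Return which side of the divide a session number (1-10) falls on."""
    return "first" if session <= LAST_SESSION_BEFORE_DIVIDE else "second"


def trial_condition(train_sessions, test_session):
    """Label a trial as 'within', 'across', or 'mixed' relative to the divide.

    'mixed' is a hypothetical label for training sets that straddle the
    boundary and therefore fit neither reporting category cleanly.
    """
    train_sides = {side(s) for s in train_sessions}
    if len(train_sides) > 1:
        return "mixed"
    return "within" if side(test_session) in train_sides else "across"


print(trial_condition([1, 2, 3], 4))  # trained and tested before the change
print(trial_condition([1, 2, 3], 7))  # trained before, tested after
```

Grouping results by this label is one way to report performance separately for trials on the same side of the boundary and trials that cross it.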

Two versions of the data now exist, and a number of published papers report results based on the original KING corpus. The new CD-ROM version, called KING-92, is based on a 1992 re-issue of the data from ITT. It differs from the original corpus in a few details:

  • The original data was sampled at 10 kHz, but has now been resampled at 8 kHz;
  • Missing segments, most on the order of seconds, have been restored to the data, and the alignment between the high-quality microphone and the telephone handset data files has been corrected;
  • Originally both an orthographic and a phonetic transcription of the data, with time alignments, were part of the corpus, but there were numerous errors; only an unaligned orthographic transcription has been retained;
  • Documentation has been changed to reflect these differences, and a description of the artifactual division between sessions 1-5 and 6-10 in the San Diego telephone data is included.
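
As a small worked example of what the change in sampling rate implies, the Python sketch below computes the rate ratio, the nominal sample count of a resampled session, and the highest representable frequency at each rate. Only the two rates come from the description above; the 60-second session length is an illustrative figure taken from the upper end of the stated range, and an actual resampler may differ by a few samples at the file edges.

```python
from fractions import Fraction

ORIGINAL_RATE_HZ = 10_000  # sampling rate of the original KING data
KING92_RATE_HZ = 8_000     # sampling rate of KING-92

# Exact ratio of new to old sample counts: 8000/10000 reduces to 4/5.
RATIO = Fraction(KING92_RATE_HZ, ORIGINAL_RATE_HZ)


def resampled_length(n_samples):
    """Nominal number of samples after resampling n_samples at the original rate."""
    return int(n_samples * RATIO)


def nyquist_hz(rate_hz):
    """Highest frequency representable at a given sampling rate."""
    return rate_hz // 2


session_samples = 60 * ORIGINAL_RATE_HZ  # a 60-second session at 10 kHz
print(RATIO, resampled_length(session_samples))
print(nyquist_hz(ORIGINAL_RATE_HZ), nyquist_hz(KING92_RATE_HZ))
```

One consequence is that KING-92 cannot carry spectral content above 4 kHz, whereas the original 10 kHz data could represent frequencies up to 5 kHz, which is worth keeping in mind when comparing results across the two versions.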




