Following is an excerpt from:

    Campbell, J. P., Jr. "Features and Measures for Speaker Recognition." Ph.D. Dissertation, Oklahoma State University, 1992.

After this, we also include portions of an appendix ("Database Description") from:

    Higgins, A., J. Porter and L. Bahler. YOHO Speaker Authentication Final Report. ITT Defense Communications Division, 1989.

We would like to express our deepest gratitude to Joseph Campbell for his assistance in preparing these materials for publication with the corpus.

--------------------------------------------------------------------

YOHO DATABASE

The YOHO database was collected by ITT under a U.S. Government contract administered by the author. The signal conditioning and acquisition was designed by the author using a 4-times oversampling method to provide bandwidth and linear phase up to 3.8 kHz.

First, the analog signal is low pass filtered at approximately 5 kHz by a mild 4th order elliptic analog antialiasing filter that has negligible effect on the signal below 4 kHz. This analog antialiasing filter sufficiently limits the bandwidth to 16 kHz to prevent aliasing when it is then oversampled at 32 kHz with 12 bits of precision.

Next, the 32-kHz sampled signal is passed through a 255-tap, finite duration impulse response (FIR) digital bandpass filter. This digital filter limits the bandwidth of the signal so that it can be decimated by 4:1 to arrive at the final desired sampling frequency of 8 kHz.

Using iterative inverse- and forward-Fourier transforms, the author designed a frequency-sampling symmetric-FIR filter-design routine to determine the 255 coefficients that best approximate a secure voice terminal's input characteristics in a least mean-square magnitude-response error sense. The resulting response models the STU-III secure voice terminal's input characteristics very closely and is given in Table II-1.
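The filter-then-decimate step described above can be illustrated with a short Python sketch. This is not the author's design: the original filter is a 255-tap bandpass produced by a frequency-sampling routine to match the STU-III input response, and its coefficients are not given here. As a stand-in, the sketch assumes a Hamming-windowed-sinc lowpass with the same tap count and a 3.8 kHz cutoff; the function names (lowpass_taps, decimate, tone, rms) are hypothetical.

```python
import math

def lowpass_taps(num_taps=255, cutoff_hz=3800.0, fs_hz=32000.0):
    """Hamming-windowed sinc lowpass (an assumed stand-in design).

    The taps are symmetric about the center, which gives linear phase.
    """
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / fs_hz  # normalized cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        k = n - mid
        if k == 0:
            ideal = 2 * fc
        else:
            ideal = math.sin(2 * math.pi * fc * k) / (math.pi * k)
        window = 0.54 - 0.46 * math.cos(2 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at 0 Hz

def decimate(samples, taps, factor=4):
    """Apply the FIR filter and keep every `factor`-th output sample.

    Outputs are computed only at the retained positions, so no work is
    spent on samples that would be discarded.
    """
    out = []
    last_start = len(samples) - len(taps)
    for start in range(0, last_start + 1, factor):
        segment = samples[start:start + len(taps)]
        out.append(sum(t * x for t, x in zip(taps, segment)))
    return out

def tone(freq_hz, n, fs_hz=32000.0):
    """Unit-amplitude sine wave sampled at fs_hz."""
    return [math.sin(2 * math.pi * freq_hz * i / fs_hz) for i in range(n)]

def rms(x):
    return math.sqrt(sum(v * v for v in x) / len(x))

taps = lowpass_taps()
# A 1 kHz tone lies in the passband and should survive decimation.
passband = decimate(tone(1000.0, 3200), taps)
# A 6 kHz tone would alias to 2 kHz at an 8 kHz rate if it were not
# removed before decimation.
stopband = decimate(tone(6000.0, 3200), taps)
print("1 kHz RMS after decimation:", rms(passband))
print("6 kHz RMS after decimation:", rms(stopband))
```

Because the digital filter removes energy above roughly 4 kHz before the 4:1 decimation, components such as the 6 kHz tone do not fold back into the 0 to 4 kHz band of the 8 kHz output.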
TABLE II-1

FREQUENCY RESPONSE OF YOHO DECIMATION FILTER

    Frequency (Hz)    Response (dB)
    --------------    -------------
    0                 -25
    < 50              -21
    100               -7
    150               -2
    200               -0.2
    200 - 3600        -0.2 to +0.3 peak ripple
    3600              -0.2
    3800              -3
    4000              -25
    4400              -42
    > 5000            -50
    16,000            -57

The key to oversampling is that the analog antialiasing filter need not have steep skirts in the vicinity of the half sampling frequency, as in the Nyquist sampling methods, whereas the symmetric digital FIR filter has linear phase and can have arbitrarily flat magnitude response. The advantage of the oversampling method is that the magnitude and phase distortions near the half sampling frequency are far less than is common in traditional Nyquist sampling methods. For example, Digital Sound Corporation's -3 dB analog bandwidth for 8 kHz sampling is only 3.6 kHz, as opposed to the 3.8 kHz, -3 dB bandwidth achieved by this method. This additional 200 Hz of bandwidth is vital for listeners to be able to distinguish between sounds concentrated in high frequencies (e.g., the affricate sounds differentiating "chew" and "jew").

The YOHO database is the only large scale, scientifically controlled and collected, high-quality speech database for speaker authentication testing at high confidence levels. Table II-2 describes the YOHO database (Higgins 1990).

TABLE II-2

THE YOHO DATABASE

    * "Combination lock" phrases (e.g., 36-24-36)
    * 138 subjects: 108 males, 30 females
    * Collected over 3 month period in a real-world office environment
    * 4 enrollment sessions per subject with 24 phrases per session
    * 10 test sessions per subject with 4 phrases per session
    * Total of 1932 validated sessions
    * 8 kHz sampling with 3.8 kHz analog bandwidth
    * 1.2 gigabytes of data (when uncompressed)

In a text-dependent speaker verification scenario, phrases are prompted and the claimant is requested to say them. The syntax used in the YOHO database is "combination lock" phrases. For example, the prompt might read:

    "Say: thirty-six, twenty-four, thirty-six."
The claimant is to speak the phrase as three doublets.

REFERENCES

Campbell, J. P., Jr. "Features and Measures for Speaker Recognition." Ph.D. Dissertation, Oklahoma State University, 1992.

Higgins, A., J. Porter and L. Bahler. YOHO Speaker Authentication Final Report. ITT Defense Communications Division, 1989.

Higgins, A. "YOHO Speaker Verification." Baltimore: 1990.

Higgins, A., L. Bahler, and J. Porter. "Speaker Verification Using Randomized Phrase Prompting." Digital Signal Processing 1, no. 2 (1991): 89 - 106.

------------------------------------------------------------------------

DATABASE DESCRIPTION

1. Introduction

The YOHO Speaker Verification Database was collected while testing a prototype speaker verification system by ITT Defense Communications Division under contract with the U.S. Department of Defense. The database is the largest supervised speaker verification database known to the authors. The number of trials and the number of test subjects were determined to allow testing at the 80% confidence level to determine whether the system met specified performance requirements. The required error rates were 1% false rejection and 0.1% false acceptance.

The testing and database collection were conducted at ITTDCD's headquarters in Nutley, New Jersey. The system consisted of a Sun-3 workstation with added processor boards for real-time processing, a 19-inch monitor for prompting, and a telephone handset with a high-quality microphone. When the system was used in an enrollment or verification session, a sampled waveform file was created for each phrase-length utterance. The collection of these waveform files is contained on the database tape.

2. Equipment Setup

The system was set up in the corner of a large room (approximately 25 x 25 feet) which was mostly empty. Low level noise could be heard from the adjoining office space, from people walking through the room, and from occasional paging on the public address system.
Noise from the fan in the Sun workstation could also be heard. Subjects could stand or sit in a chair while using the system.

Prompts were displayed on the console using "gallant" point-size 19 font. Upon completion of each phrase, the system automatically prompted the next phrase, with a delay of about one second. A "beep" was produced as each phrase was prompted to attract the user's attention. These beeps can be heard at the beginning of many of the waveform files. On completion of each test session, a message "ACCEPTED" or "REJECTED" was displayed on the console. No other feedback or motivation was provided.

2.1. Data Acquisition

A handset containing an omnidirectional non-noise-canceling electret microphone is connected to an external audio amplifier. The microphone signal first passes through a 6-pole passive elliptic filter with a cutoff frequency of 4.3 kHz. It is then amplified and applied as input to an analog to digital converter (ADC). The amplified output is also returned to the handset to supply sidetone.

A 4-times oversampling scheme is implemented, which works as follows. The ADC operates at a sampling rate of 32 kHz, producing 12-bit samples. The analog lowpass filter prevents aliasing in the sampled signal without attenuating frequencies below 4 kHz. The sampled signal is passed through a 255-tap FIR bandpass filter. The upper band edge of this filter is approximately 4 kHz, or one fourth of the original Nyquist frequency. Therefore, the filtered signal can be represented without aliasing at one fourth of the original sampling rate. This is accomplished by downsampling, with a 4 to 1 ratio, to the final sampling rate of 8 kHz.

The advantage of this method is that the frequency response of the frontend is entirely controlled by the FIR decimation filter, assuming the net frequency response of the analog components is constant below 4 kHz.
This allows the frequency response of the frontend to be more precisely controlled than would be possible using a conventional analog anti-aliasing filter and ADC conversions at an 8 kHz rate. The frequency response of the decimation filter is shown in Table I.

    Frequency (Hz)    Response (dB)
    --------------    -------------
    0                 -25
    <50               -21
    100               -7
    150               -2
    200               -0.2
    200-3600          -0.2 to +0.3 peak ripple
    3600              -0.2
    3800              -3
    4000              -25
    4400              -42
    >5000             -50
    16000             -57

    Table I: Frequency Response of Decimation Filter

3. Subjects

A total of 189 subjects began the testing program. Three subjects dropped out before the enrollment sessions were complete, leaving 186 subjects who completed all the enrollment sessions. Of these subjects, 156 were male, and 30 were female. Members of several departments including engineering and program management, and support staff such as secretaries and draftsmen, were asked to participate in the test. Subjects spanned a wide range of ages, job descriptions, and educational backgrounds. Most subjects were from the New York area, although there were many exceptions, including some non-native English speakers.

[NB: Only 138 speakers have been included in the CD-ROM version published by the Linguistic Data Consortium.]

Subjects were introduced to the system by watching a 5-minute video tape which demonstrated the intended usage of the system. This tape, which was delivered to the Government, documents the user's view of the system.

A test monitor was present during all enrollment and test sessions. His primary responsibilities were to maintain a continuous flow of subjects using the system, and to perform daily tape backups of the sampled waveform files. The test monitor also provided further instruction or assistance if needed in enrollment sessions, but did not interfere in test sessions except to take note of sessions in which the subject claimed to be someone else (which was allowed), or in which the prompts were not read correctly. A total of 57 such sessions were reported.
These sessions are not present in the database.

4. Speech Material

The speech material consists of "combination-lock" phrases. An example prompt is: "35 - 72 - 41", pronounced "thirty five, seventy two, forty one". Each phrase consists of three number doublets. The doublets are chosen from a list which includes all the doublets from 21 to 99 with the following exceptions: (1) no exact decades (30, 40, etc.), (2) no double digits (22, 33, etc.), and (3) no numbers ending in "8" (28, 38, etc.). Pausing between the doublets is optional, but not encouraged. An enrollment session consists of 24 such phrases. A verification trial or session consists of 4 such phrases.

5. Sessions

Subjects were asked to participate in 14 sessions over a 3-month time interval. The first 4 sessions were enrollment sessions, which required about 3 minutes each, and the following 10 sessions were test sessions, which took about 20 seconds each. Each subject in the test completed the test sessions at his or her own rate. The nominal separation between sessions was 3 days. However, this varied to suit individuals' schedules. Table II shows the earliest, median, and latest dates of subjects' first, fifth, and tenth sessions.

    Session    Earliest    Median     Latest
    -------    --------    -------    -------
    First      3/07/89     3/27/89    5/08/89
    Fifth      3/17/89     4/21/89    5/26/89
    Tenth      3/29/89     5/03/89    5/26/89

    Table II: Test Session Dates
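The doublet rules given in Section 4 (Speech Material) can be sketched in Python as follows. The sketch applies the three exclusions literally; the function names valid_doublets and random_phrase are hypothetical, and drawing the three doublets independently (so that a doublet may repeat within a phrase) is an assumption, since the report does not say how prompts were generated.

```python
import random

def valid_doublets():
    """Two-digit numbers from 21 to 99 that pass the three stated rules."""
    doublets = []
    for n in range(21, 100):
        tens, units = divmod(n, 10)
        if units == 0:       # rule 1: no exact decades (30, 40, ...)
            continue
        if tens == units:    # rule 2: no double digits (22, 33, ...)
            continue
        if units == 8:       # rule 3: no numbers ending in 8 (28, 38, ...)
            continue
        doublets.append(n)
    return doublets

def random_phrase(rng, doublets):
    """Build one prompt of three doublets, formatted like '35-72-41'.

    Assumption: doublets are drawn independently, with repeats allowed.
    """
    return "-".join(str(d) for d in rng.choices(doublets, k=3))

doublets = valid_doublets()
rng = random.Random(0)  # fixed seed only so the sketch is repeatable
enrollment_prompts = [random_phrase(rng, doublets) for _ in range(24)]
verification_prompts = [random_phrase(rng, doublets) for _ in range(4)]
print(len(doublets), "doublets pass the rules")
print(verification_prompts)
```

Applied literally, these rules leave 57 doublets. The prompt list actually used during collection is not reproduced in this excerpt and may differ.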