The following is an excerpt from:

  Campbell, J. P., Jr. "Features and Measures for Speaker
     Recognition." Ph.D. Dissertation, Oklahoma State
     University, 1992.

After this, we also include portions of an appendix ("Database Description")
from:

  Higgins, A., J. Porter and L. Bahler. YOHO Speaker Authentication
     Final Report. ITT Defense Communications Division, 1989.

We would like to express our deepest gratitude to Joseph Campbell for
his assistance in preparing these materials for publication with the
corpus.

--------------------------------------------------------------------

                        YOHO DATABASE


The YOHO database was collected by ITT under a U.S. Government
contract administered by the author.

The signal conditioning and acquisition were designed by the author
using a 4-times oversampling method to provide bandwidth and linear
phase up to 3.8 kHz. First, the analog signal is low pass filtered at
approximately 5 kHz by a mild 4th order elliptic analog antialiasing
filter that has negligible effect on the signal below 4 kHz. This
analog antialiasing filter sufficiently limits the bandwidth to 16 kHz
to prevent aliasing when it is then oversampled at 32 kHz with 12 bits
of precision. Next, the 32-kHz sampled signal is passed through a
255-tap, finite duration impulse response (FIR) digital bandpass
filter. This digital filter limits the bandwidth of the signal so that
it can be decimated by 4:1 to arrive at the final desired sampling
frequency of 8 kHz. Using iterative inverse and forward Fourier
transforms, the author designed a frequency-sampling symmetric-FIR
filter-design routine to determine the 255 coefficients that best
approximate a secure voice terminal's input characteristics in a least
mean-square magnitude-response error sense. The resulting response
models the STU-III secure voice terminal's input characteristics very
closely and is given in Table II-1.


                         TABLE II-1

                 FREQUENCY RESPONSE OF YOHO

                      DECIMATION FILTER

		Frequency (Hz)    Response (dB)
		       0               -25
		    < 50               -21
		     100                -7
		     150                -2
		     200                -0.2
		 200 - 3600     -0.2 to +0.3 peak ripple
		    3600                -0.2
		    3800                -3
		    4000               -25
		    4400               -42
		  > 5000               -50
		  16,000               -57
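
The frequency-sampling design step described above can be sketched in
Python as follows. This is a plain, non-iterative frequency-sampling
design, not the author's routine: the target is an idealized 200-3600 Hz
brick-wall band rather than the STU-III input characteristic, and the
function and variable names are hypothetical. It only illustrates how a
set of desired magnitudes on a uniform frequency grid yields 255
symmetric (linear-phase) coefficients.

```python
import math

def frequency_sampling_fir(amplitudes):
    """Design an odd-length symmetric FIR filter by frequency sampling.

    amplitudes[k] is the desired magnitude at frequency k * fs / N for
    k = 0 .. (N - 1) / 2, where N = 2 * len(amplitudes) - 1 taps.
    """
    half = len(amplitudes) - 1
    n_taps = 2 * half + 1
    taps = []
    for n in range(n_taps):
        # Inverse DFT of a real, symmetric, linear-phase spectrum.
        acc = amplitudes[0]
        for k in range(1, half + 1):
            angle = 2.0 * math.pi * k * (n - half) / n_taps
            acc += 2.0 * amplitudes[k] * math.cos(angle)
        taps.append(acc / n_taps)
    return taps

def magnitude_response(taps, freq_hz, fs_hz):
    """Magnitude of the filter's frequency response at freq_hz."""
    re = sum(t * math.cos(2.0 * math.pi * freq_hz * n / fs_hz)
             for n, t in enumerate(taps))
    im = sum(t * math.sin(2.0 * math.pi * freq_hz * n / fs_hz)
             for n, t in enumerate(taps))
    return math.hypot(re, im)

# Assumed illustration values: 32 kHz rate and 255 taps as in the text,
# with a simplified brick-wall target (1 in band, 0 elsewhere).
FS_HZ = 32000.0
N_TAPS = 255
target = [1.0 if 200.0 <= k * FS_HZ / N_TAPS <= 3600.0 else 0.0
          for k in range((N_TAPS + 1) // 2)]
taps = frequency_sampling_fir(target)
```

The resulting response matches the target exactly at the sampled
frequencies and only approximately between them, which is one reason an
iterative refinement such as the one the text describes is useful.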


The key to oversampling is that the analog antialiasing filter need not
have steep skirts in the vicinity of the half sampling frequency, as in
the Nyquist sampling methods, whereas the symmetric digital FIR filter
has linear phase and can have arbitrarily flat magnitude response. The
advantage of the oversampling method is that the magnitude and phase
distortions near the half sampling frequency are far less than is
common in traditional Nyquist sampling methods. For example, Digital
Sound Corporation's -3 dB analog bandwidth for 8 kHz sampling is only
3.6 kHz, as opposed to the 3.8 kHz, -3 dB bandwidth achieved by this
method. This additional 200 Hz of bandwidth is vital for listeners to
be able to distinguish between sounds concentrated in high frequencies
(e.g., the affricate sounds differentiating "chew" and "jew").

The YOHO database is the only large-scale, scientifically controlled
and collected, high-quality speech database for speaker authentication
testing at high confidence levels.  Table II-2 describes the YOHO
database (Higgins 1990).


                         TABLE II-2

                      THE YOHO DATABASE

  * "Combination lock" phrases (e.g., 36-24-36)
  * 138 subjects: 108 males, 30 females
  * Collected over a 3-month period in a real-world office environment
  * 4 enrollment sessions per subject with 24 phrases per session
  * 10 test sessions per subject with 4 phrases per session
  * Total of 1932 validated sessions
  * 8 kHz sampling with 3.8 kHz analog bandwidth
  * 1.2 gigabytes of data (when uncompressed)
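
The session total in Table II-2 follows from the per-subject counts. A
small Python check, with hypothetical names, is given below. The phrase
total it returns is a nominal figure that assumes every session holds
its full complement of phrases; the table itself states only the
session total.

```python
SUBJECTS = 138
ENROLL_SESSIONS = 4
PHRASES_PER_ENROLL_SESSION = 24
TEST_SESSIONS = 10
PHRASES_PER_TEST_SESSION = 4

def corpus_totals(subjects=SUBJECTS):
    """Return (sessions, nominal phrases) implied by the per-subject counts."""
    sessions = subjects * (ENROLL_SESSIONS + TEST_SESSIONS)
    phrases = subjects * (ENROLL_SESSIONS * PHRASES_PER_ENROLL_SESSION
                          + TEST_SESSIONS * PHRASES_PER_TEST_SESSION)
    return sessions, phrases

sessions, phrases = corpus_totals()
print(sessions)  # 1932, matching the table
print(phrases)   # 18768 nominal phrase-length utterances
```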


In a text-dependent speaker verification scenario, phrases are prompted
and the claimant is requested to say them. The syntax used in the YOHO
database is "combination lock" phrases. For example, the prompt might
read: "Say: thirty-six, twenty-four, thirty-six," where the claimant is
to speak the phrase as three doublets.


                        REFERENCES


Campbell, J. P., Jr. "Features and Measures for Speaker
     Recognition." Ph.D. Dissertation, Oklahoma State
     University, 1992.

Higgins, A., J. Porter and L. Bahler. YOHO Speaker Authentication
     Final Report. ITT Defense Communications Division, 1989.

Higgins, A. "YOHO Speaker Verification." Baltimore: 1990.
 
Higgins, A., L. Bahler, and J. Porter. "Speaker Verification
     Using Randomized Phrase Prompting." Digital Signal
     Processing 1, no. 2 (1991): 89 - 106.

------------------------------------------------------------------------

			 DATABASE DESCRIPTION

1.  Introduction

     The YOHO Speaker Verification Database was collected  while  testing  a
prototype speaker verification system by ITT Defense Communications Division
under contract with the U.S. Department of Defense.  The  database  is  the
largest  supervised speaker verification database known to the authors.  The
number of trials and the number of test subjects were  determined  to  allow
testing  at  the  80%  confidence  level to determine whether the system met
specified performance requirements.  The required error rates were 1%  false
rejection and 0.1% false acceptance.
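
     To give a feel for the scale such requirements imply, the Python sketch
below computes how many independent, error-free trials are needed before  an
error  rate  above  a  given  limit  can be ruled out at a chosen confidence
level.  This is a standard binomial argument under stated  assumptions,  not
the  procedure  used in the report, which does not say how its trial  counts
were derived.  The function name is hypothetical.

```python
import math

def min_error_free_trials(max_error_rate, confidence):
    """Smallest N such that zero errors in N independent trials rules out
    a true error rate above max_error_rate at the given confidence."""
    # With true rate p, P(zero errors in N trials) = (1 - p) ** N.
    # Require that probability to be at most 1 - confidence.
    return math.ceil(math.log(1.0 - confidence)
                     / math.log(1.0 - max_error_rate))

print(min_error_free_trials(0.001, 0.80))  # 1609 trials for 0.1%
print(min_error_free_trials(0.01, 0.80))   # 161 trials for 1%
```

     Trials from the same speaker are not truly independent, so figures like
these are best read as lower bounds.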

     The testing and database collection were conducted  at  ITTDCD's  head-
quarters in Nutley, New Jersey.  The system consisted of a Sun-3 workstation
with added processor boards for real-time processing, a 19-inch monitor  for
prompting, and a telephone handset with a high-quality microphone.  When the
system was used in an enrollment or verification session, a sampled waveform
file  was created for each phrase-length utterance.  The collection of these
waveform files is contained on the database tape.


2.  Equipment Setup

     The system was set up in the corner of a large room (approximately 25 x
25  feet)  which  was mostly empty.  Low level noise could be heard from the
adjoining office space, from people walking through the room, and from occa-
sional  paging  on the public address system.  Noise from the fan in the Sun
workstation could also be heard.  Subjects could stand or  sit  in  a  chair
while  using  the system.  Prompts were displayed on the console using "gal-
lant" point-size 19  font.  Upon  completion  of  each  phrase,  the  system
automatically prompted the next phrase, with a delay of about one second.  A
"beep" was produced as each phrase was prompted to attract the user's atten-
tion.   These  beeps  can  be heard at the beginning of many of the waveform
files.

     On completion of each test session, a message "ACCEPTED" or  "REJECTED"
was displayed on the console.  No other feedback or motivation was provided.


2.1.  Data Acquisition

     A handset containing an  omnidirectional  non-noise-canceling  electret
microphone  is  connected  to  an  external audio amplifier.  The microphone
signal  first passes through a  6-pole passive elliptic filter with a cutoff
frequency of 4.3  kHz.  It is then amplified and applied as input to an ana-
log  to digital converter (ADC).   The amplified output is also returned  to
the handset to supply sidetone.

     A 4-times oversampling scheme is implemented, which works  as  follows.
The  ADC  operates  at  a sampling rate of 32 kHz, producing 12-bit samples.
The analog lowpass filter prevents aliasing in the  sampled  signal  without
attenuating frequencies below 4 kHz.  The sampled signal is passed through a
255-tap FIR bandpass filter.  The upper band edge of this filter is approxi-
mately  4  kHz, or one fourth of the original Nyquist frequency.  Therefore,
the filtered signal can be represented without aliasing at one fourth of the
original sampling rate.  This is accomplished by downsampling, with a 4 to 1
ratio, to the final sampling rate of 8 kHz.
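
     The filter-then-downsample step can be sketched in Python  as  follows.
This  is  an illustration only: it uses a Hamming-windowed-sinc lowpass with
an assumed 3.8 kHz cutoff in place of the 255-tap  bandpass  filter  of  the
actual  system,  so  it  does not reproduce the low-frequency rolloff of the
real frontend.  Function names are hypothetical.

```python
import math

def lowpass_taps(num_taps, cutoff_hz, fs_hz):
    """Hamming-windowed-sinc lowpass; symmetric taps give linear phase."""
    mid = (num_taps - 1) / 2
    fc = cutoff_hz / fs_hz  # cutoff in cycles per sample
    taps = []
    for n in range(num_taps):
        x = n - mid
        if x == 0:
            ideal = 2.0 * fc
        else:
            ideal = math.sin(2.0 * math.pi * fc * x) / (math.pi * x)
        window = 0.54 - 0.46 * math.cos(2.0 * math.pi * n / (num_taps - 1))
        taps.append(ideal * window)
    gain = sum(taps)
    return [t / gain for t in taps]  # normalize to unity gain at DC

def decimate(samples, taps, factor):
    """Apply the FIR filter and keep every factor-th output sample.

    Only the retained outputs are computed, and only where the filter
    fully overlaps the input (no zero padding at the edges).
    """
    last = len(taps) - 1
    out = []
    for start in range(0, len(samples) - len(taps) + 1, factor):
        acc = 0.0
        for k, t in enumerate(taps):
            acc += t * samples[start + last - k]
        out.append(acc)
    return out

# 32 kHz input decimated 4:1 to 8 kHz, as in the text.
FS_IN_HZ = 32000
taps_32k = lowpass_taps(255, 3800.0, FS_IN_HZ)
tone_1k = [math.sin(2.0 * math.pi * 1000.0 * n / FS_IN_HZ)
           for n in range(3200)]
tone_8k_rate = decimate(tone_1k, taps_32k, 4)
```

     A 1 kHz tone passes essentially unchanged, while a component at 6  kHz,
which  would  otherwise  alias to 2 kHz at the 8 kHz rate, is removed by the
filter before downsampling.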

     The advantage of this method is that  the  frequency  response  of  the
frontend  is  entirely controlled by the FIR decimation filter, assuming the
net frequency response of the analog components is  constant  below  4  kHz.
This allows the frequency response of the frontend to be more precisely con-
trolled than would be possible using  a  conventional  analog  anti-aliasing
filter  and ADC conversions at an 8 kHz rate.  The frequency response of the
decimation filter is shown in Table I.

                 Frequency (Hz)   Response (dB)
                              0   -25
                            <50   -21
                            100   -7
                            150   -2
                            200   -0.2
                       200-3600   -0.2 to +0.3 peak ripple
                           3600   -0.2
                           3800   -3
                           4000   -25
                           4400   -42
                          >5000   -50
                          16000   -57


	     Table I:  Frequency Response of Decimation Filter


3.  Subjects

     A total of 189 subjects began  the  testing  program.   Three  subjects
dropped  out  before the enrollment sessions were complete, leaving 186 sub-
jects who completed all the enrollment sessions.   Of  these  subjects,  156
were  male,  and  30  were female.  Members of several departments including
engineering and program management, and support staff  such  as  secretaries
and  draftsmen,  were  asked to participate in the test.  Subjects spanned a
wide range of ages, job descriptions,  and  educational  backgrounds.   Most
subjects  were  from the New York area, although there were many exceptions,
including some  non-native English speakers.   [ NB:  Only 138 speakers have
been included in the CD-ROM version published by the Linguistic Data Consor-
tium.]

     Subjects were introduced to the system by  watching  a  5-minute  video
tape  which demonstrated the intended usage of the system.  This tape, which
was delivered to the Government, documents the user's view of the system.  A
test  monitor was present during all enrollment and test sessions.  His pri-
mary responsibilities were to maintain a continuous flow of  subjects  using
the system, and to perform daily tape backups of the sampled waveform files.
The test monitor also provided further instruction or assistance  if  needed
in  enrollment  sessions,  but  did not interfere in test sessions except to
take note of sessions in which the subject claimed to be someone else (which
was  allowed),  or in which the prompts were not read correctly.  A total of
57 such sessions were reported.  These sessions are not present in the data-
base.


4.  Speech Material

     The speech material consists of "combination-lock" phrases.  An example
prompt  is:   "35  -  72  - 41", pronounced "thirty five, seventy two, forty
one".  Each phrase consists of three  number  doublets.   The  doublets  are
chosen  from  a  list which includes all the doublets from 21 to 99 with the
following exceptions: (1) no exact decades (30, 40,  etc.),  (2)  no  double
digits  (22,  33,  etc.),  and  (3) no numbers ending in "8" (28, 38, etc.).
Pausing between the doublets is optional, but not encouraged.  An enrollment
session  consists  of  24 such phrases. A verification trial or session con-
sists of 4 such phrases.
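
     The doublet rules above can be sketched in Python as  follows.   Names
are hypothetical, and the random prompt generator is an assumption  made for
illustration:  the text does not describe how the system selected  doublets,
and  repeats  within a phrase are allowed here only because the example "36-
24-36" elsewhere in this document contains one.

```python
import random

def valid_doublets():
    """Doublets from 21 to 99 that satisfy the three stated exclusions."""
    doublets = []
    for number in range(21, 100):
        tens, units = divmod(number, 10)
        if units == 0:      # (1) no exact decades
            continue
        if tens == units:   # (2) no double digits
            continue
        if units == 8:      # (3) no numbers ending in "8"
            continue
        doublets.append(number)
    return doublets

def random_phrase(rng, doublets):
    """Draw three doublets (repeats allowed) and format them as a prompt."""
    return " - ".join(str(rng.choice(doublets)) for _ in range(3))

doublets = valid_doublets()
prompt = random_phrase(random.Random(0), doublets)
```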


5.  Sessions

     Subjects were asked to participate in 14 sessions over a  3-month  time
interval.   The  first  4  sessions were enrollment sessions, which required
about 3 minutes each, and the following  10  sessions  were  test  sessions,
which  took  about 20 seconds each.

     Each subject in the test completed the test sessions  at his or her own
rate.  The nominal separation  between sessions  was 3 days.   However, this
varied to suit individuals' schedules.  Table II shows the earliest, median,
and latest dates of subjects' first, fifth, and tenth test sessions.

                   Session   Earliest   Median    Latest
                    First    3/07/89    3/27/89   5/08/89
                    Fifth    3/17/89    4/21/89   5/26/89
                    Tenth    3/29/89    5/03/89   5/26/89


		       Table II:  Test Session Dates