Final Report of the Chairman

Frontiers of Speech Processing - Robust Speech Recognition '93.

Outline of Chairman's Report

The Workshop
	Introduction
	Directory CD
	Original Idea
	The Workshop Focus
		The Focus Problem - Speech Recognition
		The Underground Problems
			Speaker Identification
			Front End Analysis without Recognition
			Realistic Degraded Speech
	Technical Highlights
   		Focus Problem - HTK and Switchboard
			Baseline System
				Pre-workshop Support
				Workshop experiences
			Parameter Modifications
				RASTA (Morgan/Hermansky)
				Variable Sampling (Bahler)
				Time/Frequency Kernels (Atlas/Pitton)
				Generalized Production Model (Bakis)
			Auxiliary tests
				BBN
				SRI		
				Segment Models
			Notes about other problems with the focus 
		Underground Problems
			Analysis without recognition
				(Nelson and Miller)
			Auditory Models
				Derrick Butler
				Kupin/Allen
				Mazin Rahim
			Sensitivity to noise (enhancement)
				Kupin/Allen
				Hermansky/
		Speaker Recognition
			King Database (Reynolds)
			Microphone Arrays (Lin and Flanagan)
		Noise and Wireline Simulator
			Kupin and Hansen
			RASTA
	The Participants
	The Workshop Schedule
	Facilities and Administration
	Final Notes
	CD Gems

	Appendix I - The Workshop Schedule


Chairman's Report - Robust Speech Recognition '93

Introduction
	
	This document is the report of the chairman for the Rutgers
summer workshop on Robust Speech Recognition.  It is an introduction
to the workshop itself, a short history, and a description of the
problems, participants, and results of the workshop.  

	Much of the effort of the workshop participants is recorded on
a single directory, available as a CD ROM from NIST, or available via
anonymous FTP from the Rutgers machine "frontier".  On this directory
you will find descriptions of work by the participants, relevant
programs, some data, and the baseline shell scripts for the Hidden
Markov Toolkit (HTK) based speech recognition systems for both
Resource Management (RM) and the Credit Card Corpus of Switchboard.
Programs are either here in their entirety, or they depend on one or
another of the licensed software used during the workshop.  Questions
may be addressed to the chairman (jrc@server.rutgers.edu), or to the
individual participants (almost all have email aliases at
server.rutgers.edu).

	A second source of information (currently virtual information)
is a planned special edition of the IEEE Journal on Speech and Audio,
scheduled for spring 1994.  Questions may be addressed to Rick Mammone
(Mammone@caip.rutgers.edu), one of the co-editors of the Journal.
Workshop problems and solutions will be reported there.
	
Directory CD

	Chances are that you are reading this on a computer, and it is
at the top level of a directory (we call it CD).  This directory has
the following structure:

                           CD
                          / | \            
                         /  |  \
                        /   |   \
                    README usera...userz

	The README file contains this chairman's report from the workshop,
as well as a virtual map of the user directories.  Below you will find
a listing of each user's ID, in alphabetical order, with a comment about
the contents of his/her entries.  Under each user's directory, you should
find a NOTES file with information about the directory structure provided
by that participant, and a README file about the general nature of the work,
with a summary of results where appropriate.  Uniformity was not enforced.
This collection of work is intended to provide an informal snapshot of the
workshop efforts.  Questions are best directed to the participants
(userx@server.rutgers.edu), but may be addressed to the chairman as well
(jrc@server.rutgers.edu).

The Original Idea

	This workshop is an outgrowth of summers at the Center for
Communications Research in Princeton (CCR-P).  Each summer a focus
problem is chosen, and many outside visitors (academics, students, and
others) are invited to join the staff temporarily to work on the focus
problem or some related topic of interest to them.  (This venue dates
from the early 1950's.)  Unfortunately these SCAMPs are often devoted
to closed topics, thus limiting the distribution of the ideas
produced, the generality of the results, and the makeup of the
participants.  It has been true, however, that the mix of
participants and the relatively unstructured interaction of SCAMPs
has provided a catalyst for new work through the generation of novel
ideas, and has provided a rich format in which to accomplish serious
research.

	Research on parameterizations of speech has always been
difficult because the only reliable way to assess a new idea is to
measure the performance of a complete system.  Researchers with
interest in "front end" issues are often not willing to spend their
resources building complete systems, and systems builders have found
no short-term benefit in manipulating the front end, as interface
issues, statistical estimation techniques, signal quality issues and
general language problems make the analysis of front-end performance
particularly difficult.  Until recently, only a few research
institutions in the United States could support a credible Speech
Recognition effort, further limiting the possible work.

	Three recent advances have mitigated the problems somewhat.
First, high capacity workstations have become quite inexpensive, and
large disk storage with associated file servers are commodities.  It
is possible to collect a sizeable computational resource quickly, and
deliver adequate and reliable computing at a site of your choosing.
It was our intention to do that.  (This workshop had 25 Sparc 10's and
a large (30 Gigabyte) file server available as system resources.)

	Second, the Linguistic Data Consortium, formed with seed
money from ARPA, has been collecting and distributing large amounts of
speech data and the associated transcripts for two years.  Thus a
substantial amount of realistic speech data and the peripheral textual
support was available for our use.  In addition, Bolt, Beranek and
Newman (BBN) had been working on the Switchboard corpus under
government contract, and they provided the workshop with corrected
transcripts and a pronouncing dictionary for a portion of the
Switchboard corpus.  Texas Instruments offered similar data.

	Finally, Steve Young and his colleagues at Cambridge, England,
have written a very flexible continuous-output-distribution Markov
Model toolkit (HTK), and have made it available through Entropic, a
Washington DC based supplier of research software.  This toolkit made
it possible to develop a high-performance, modular, flexible
speech-recognition system to serve as the focus of a workshop with a
reasonable upfront effort.  Software support was available during the
workshop both through Entropic and from Cambridge, as Dave Talkin and
Bill Byrne from Entropic and Steve Young from Cambridge were able to
participate in this summer's adventure.  In addition, Entropic provided
access to waves and ESPS during the summer.

	It had been suggested to DARPA (now ARPA) in 1992 that a
summer workshop in speech recognition was a possibility, and they
originally responded enthusiastically.  Unfortunately ARPA was unable
to fund this workshop, but did offer support through its permission
for the HLT (Human Language and Technology) contractors to charge
workshop expenses to existing ARPA research efforts.  Curt Boylls, of
NSA, and his boss Lane Livingston, however, were strongly in favor of
having a workshop, and agreed to fund one in the Fall of 1992.

	Jim Flanagan (director of the CAIP Center at Rutgers) agreed
to serve as workshop administrator and to host the research activity in
the CAIP Center.  With the chairman and NSA management, he collaborated
in the planning, organization, and implementation of the technical
agenda.  The CAIP business office provided the infrastructure for 
administrative support, laboratory space, housing, and transportation.
The CAIP Computer Support Group implemented laboratory facilities, 
workstation installations, and dedicated data networking to a large 
central server.

	The workshop started on July 6th, 1993 and ended on August
13th.  We had about 28 participants at Rutgers, of which 24 were full
time.  The conference record speaks for itself.  The workshop was a
public, cooperative venture.  No proprietary software (except licensed
software) was welcome, and it was assumed that all workshop efforts
would be made available to the public upon completion.  This CD
satisfies that expectation.

The Workshop Focus:

	The Focus Problem

	The general focus of the workshop was general-purpose speech
recognition, with particular emphasis on front end processing.  
The particular focus problem was transcription of the
Switchboard Credit Card Corpus, a 4-hour subset of the 250-hour
Switchboard database available through the LDC, consisting of 40 spontaneous
conversations about credit cards.  This data had been used during the
past year as the focus of efforts in topic spotting.  Attempts to 
do speech transcription had suggested that the problem was difficult.

The focus problem was speech recognition on this 4-hour corpus, with
emphasis on the portions which were clearly a single speaker at a time.
The conversations were filled with spontaneous speech effects, including
stutters, restarts, and interruptions.  The quality of the recordings
was generally quite good, with a few exceptional cases of intermittent
single-sample errors.

Entropic Systems produced a "baseline" recognition system 
using the HTK Toolkit, with scripts to perform training and recognition on
both the Switchboard data and on Resource Management data.  (Some ability
to port the general recognizer to different environments was thereby assured.)
BBN assisted the Entropic effort to build the HTK baseline, as did the group in 
Cambridge.  Both BBN and TI offered dictionaries and other
textual support.  Thanks to all of them.

	The Underground Problems

	It has been a tradition at CCR-P that, whatever the focus
problem, there are one or more "underground" problems at every SCAMP,
most of them posed by the SCAMP participants rather than the organizers.
This workshop was not different; there were three underground problems.


The first was Speaker Identification, a problem closely related to
Speech Recognition.  A parameterization that optimizes speaker
recognition performance might also optimize speech recognition
performance (and vice-versa), so this work was well within the
guidelines of our general focus.  Doug Reynolds of Lincoln Labs
brought his work on the King database (a well-known, difficult speaker
recognition problem) with him, and led an effort to accomplish speaker
recognition using various front ends with Gaussian mixture modeling.
He used all the front ends from the speech recognition work, and one
provided by Khaled Assaleh, of CAIP, used for an independent speaker
identification task.

	The second underground problem was that of analyzing a front
end without doing the speech recognition explicitly.  Three proponents
of this attack were Doug Nelson (NSA), Jont Allen (AT&T),
and Mazin Rahim (AT&T), each
coming at the problem from a different perspective. 
These efforts are reported below.

	The third problem was that of producing realistic degraded
speech.  John Hansen, from Duke University, had brought a collection
of environmentally recorded noises (crowds, cars, airplanes, etc.).
Joe Kupin, of IDA, had provided a wireline simulator to mimic the
effects of various communications channels and other distortions.  In
addition, we had access to the standard NIST software for measuring
signal-to-noise ratio.  Working together, Joe and John produced a tool
that takes speech, your favorite noise, your preferred channel, and
your chosen signal-to-noise ratio, and produces a speech signal with
additive noise which appears to have been operated on by the channel,
at the chosen signal-to-noise ratio.  This code
is available on the CD.
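The core additive-noise step of such a tool can be sketched as follows.  This is a minimal illustration of my own, assuming single-channel samples held in NumPy arrays; the actual Kupin/Hansen script also applies the wireline channel simulation and uses the NIST measurement software.

```python
import numpy as np

def mix_at_snr(speech, noise, snr_db):
    """Scale `noise` so that speech + noise has the requested SNR in dB.

    A minimal sketch of the additive-noise step only; the workshop tool
    also passed the result through a simulated channel.
    """
    speech = np.asarray(speech, dtype=float)
    noise = np.asarray(noise, dtype=float)

    # Tile or trim the noise to the length of the speech.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[:len(speech)]

    p_speech = np.mean(speech ** 2)
    p_noise = np.mean(noise ** 2)

    # Solve p_speech / (g^2 * p_noise) = 10^(snr_db/10) for the gain g.
    gain = np.sqrt(p_speech / (p_noise * 10.0 ** (snr_db / 10.0)))
    return speech + gain * noise
```

Measuring the SNR of the output (speech power over residual power) recovers the requested value by construction.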

	Several individuals pursued their own interests, as will be
evident further on.

Technical Highlights

Much of the workshop is detailed in the various directories on the CD.  
The following highlights are offered as a general guide.

	The Focus Problem:

	Entropic Systems was funded prior to the workshop to produce a
baseline recognizer using HTK, with support generously provided by BBN
and TI and LDC.  This recognizer produced about 21% word accuracy
prior to arrival at Rutgers, being trained and tested entirely on two
Sparc 10's provided by the workshop for that purpose.  During this
effort, a target database consisting of two training segments and a
test segment was extracted from the Credit Card corpus of the
Switchboard data, and speech was split into "turns" of uninterrupted
speech from (mostly) one talker at a time.  Turns were between 3 and
20 seconds in duration, and excluded utterances with clear
over-talking or unusual background interference.  Otherwise, these are
canonical examples of spontaneous speech.  These turns may be found in
ESPS-compatible speech files, separated by talker, in directory data
on the CD.

	The baseline system used a bigram model, and used mel Cepstra with
mean subtraction.  This system design had proven robust in ARPA work to date,
and had become a standard in the community at large.

	Note that the results reported here are accuracy (words
correct minus inserted words) using a probabilistic bigram grammar
produced from the training material only.  In the HTK case, a smoothed
grammar was produced by assigning a floor probability to any unseen
bigram, while in the BBN and SRI cases, more sophisticated smoothing
algorithms were used.  
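The floor scheme can be sketched like this.  It is a toy illustration of my own, with a hypothetical `floor` parameter; HTK's actual discounting and normalization details may differ.

```python
from collections import Counter

def train_floored_bigram(sentences, floor=1e-4):
    """Bigram probabilities with a fixed floor for unseen successors.

    A simplified illustration of floor smoothing; seen bigrams keep a
    discounted relative frequency so that each row still sums to one.
    """
    bigrams = Counter()
    predecessor_counts = Counter()
    vocab = set()
    for words in sentences:
        vocab.update(words)
        for a, b in zip(words, words[1:]):
            bigrams[(a, b)] += 1
            predecessor_counts[a] += 1

    def prob(a, b):
        if (a, b) in bigrams:
            # Discount seen mass so the floored unseen bigrams fit in.
            n_unseen = len(vocab) - len({x for (w, x) in bigrams if w == a})
            discount = 1.0 - floor * n_unseen
            return discount * bigrams[(a, b)] / predecessor_counts[a]
        return floor

    return prob
```

Any unseen word pair scores the floor probability rather than zero, so the recognizer can hypothesize word sequences absent from the training text.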

	Shortly after the workshop started, a 26% accuracy run was
accomplished by changing the trimming thresholds, fixing minor bugs,
and otherwise cleaning up the experiment.  This was the "base-line"
performance.

 	There were several attempts to improve this performance.
Larry Bahler, paying attention to the details of the phonetic models,
produced a 1% gain quickly, modifying the phonetic models to have 
short-circuit paths (the original models were limited to have at least
three frames each; his models allowed single frame phonemes).
He was later able to produce about 28% accuracy
with large (64-mixture) models and monophone phonetics.

	The RASTA/PLP (Hermansky and Morgan) 
folks were able to produce performance about the same as the baseline,
except they noted that simple models (produced while training the
complicated models in a bootstrap mode) performed substantially worse
than the Mel Cepstral front ends.  It is assumed that the large
dependence of the RASTA front end on past signals produces a
context-dependence not captured in the simpler models, but this seems
to point out a problem we did not anticipate - some front ends cannot be
optimized using simple models.

	Smoothing the bigram model from the training text produced an
additional percentage or two, and the final best result from the
workshop was about 29% accuracy, using the standard Mel Cepstral
front end with first and second derivatives, and energy.
The system performed less well with spectral front ends, or without the
standard sentence-by-sentence Cepstral mean subtraction (essentially
blind deconvolution).  No alternative front ends to date have bettered
this result in a fair test.  This is difficult data.
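The sentence-by-sentence cepstral mean subtraction mentioned above is simple enough to sketch: a fixed channel multiplies the short-term spectrum, hence adds a constant vector in the cepstral domain, so subtracting each utterance's mean cepstrum removes the channel.  A minimal sketch, assuming cepstra are stored as a frames-by-coefficients array:

```python
import numpy as np

def cepstral_mean_subtraction(cepstra):
    """Subtract the per-utterance mean of each cepstral coefficient.

    `cepstra` is a (frames x coefficients) array for one utterance.
    A fixed channel adds a constant vector in the cepstral domain,
    so removing the mean removes the channel (blind deconvolution).
    """
    cepstra = np.asarray(cepstra, dtype=float)
    return cepstra - cepstra.mean(axis=0)
```

Two utterances differing only by a fixed channel offset become identical after this operation, which is the sense in which it acts as blind deconvolution.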

	But that is hardly the end of the story.  Although no front ends
were able to substantially improve performance, we have yet to get
performance numbers from the finished rate-of-change weighted models
(Bahler), the analysis-by-synthesis model (Bakis) or the
time/frequency kernel models (Atlas/Pitton).  In addition, speaker
dependent (or adaptive) models remain untested, mostly due to the slow 
turnaround for experiments on our Sparc-10 system (no implications about 
the slowness of the Sparc-10's, but rather about the computational 
complexity of the multiple-Gaussian output distribution Markov models used).

	One might ask whether the simple speech recognizer used in
this workshop limits performance to 30%.  There are several pieces of
evidence that this is not the case:

	a.  Leo Neumeyer of SRI took the SRI recognizer DECIPHER,
trained on the same data as the workshop recognizer, but allowing
use of male/female information to produce sex-specific models (producing 32%
accuracy), and allowed the recognizer to retrain output distribution
means and variances on the TEST data using the EM algorithm.  Thus the
word-level language model is legitimate, but the acoustics are
"adjusted" for the test.  The accuracy increased dramatically to 70%. 

	b. Steve Young recognized 1/5 of the training corpus.  In this
test, both the language model and the acoustics are cheats.  However,
the accuracy of 76% indicated that there was not a
structural limit to performance.

	c.  The BBN folks did several tests which speak to the
possibility of overtraining as the culprit.  They doubled the amount
of training material, adding phrases with miscellaneous noises or
interfering signals which had been excluded from the workshop standard
corpus.  (Accurate transcripts allowed the trainer to 
use the data appropriately.)  No difference in the performance was
found.  (Caution: the same speakers made up the
training material as before.)  They then split the training material
into males and females, and produced a male model, a female model, and
a mixed model.  Past experience indicated that a substantial
improvement should be seen, but in fact this model, while producing
the best accuracy of the workshop (33%), fell far short of
expectations.  

	d.  A second look at training was performed by the SRI folks,
who took a Wall Street Journal trained system (using 7000 training
sentences) and decoded the CC test material.  Their performance fell
from 32% correct to 29%.  Using the WSJ training as a starting point
and "sharpening" the model by hillclimbing from there using the
forward/backward algorithm on the CC training material, they were
able to achieve 33% accuracy, equal to the best.  
	
	While overtraining might be the culprit, I believe that it is
more likely that the acoustic models produce non-generalized training,
making any small subset of speakers likely to be recognized badly. 

	We have been able to demonstrate that standard recognition
performance by state-of-the-art recognizers and that of the workshop
toolkit-based recognizers are quite similar.  Simple fixes do not work
on this data.  On the other hand, our recognizer is capable of
credible performance (upwards of 70%).  The puzzle is offered as one
of the problems for the coming year. 

	There are several open problems with the standard workshop
recognition scheme.  The dictionary is single-pronunciation, an
obvious error. (See notes by David Talkin for interesting examples).
The phonetic models have minimum length of 3 frames,
and are "linear" models without allophones.  The bootstrap training
scheme results in overtraining locally, and can be made better (See
Bill Byrne's notes for one possible solution).  Many of these are
fixable in theory, and merely require work.  It is not clear a priori
how much improvement can be gained by any one of them.

	Thus we were able to calibrate the performance of reasonable 
recognizers on this difficult task, and demonstrate that the performance
of a manageable HTK recognizer was as good as the state-of-the-art
recognition system at SRI and BBN.  Much remains to be done.

	Underground problems:  Analysis without recognition.

	a.  Doug Nelson and Chris Miller worked on the implementation
of a complex cepstrum based parameterization (using MATLAB), and are
analyzing the clustering performance of their parameters for various
versions of the TIMIT (and NTIMIT) databases.  They hope to be able to
correlate speech recognition performance with some measure of
clustering performance in hopes that one can ultimately design speech
parameterizations without continually retraining a speech recogniton
system.  If successful, this will substantially improve our ability to
improve performance through rapid turnaround front end design.
The analysis should be completed during the coming year.

	b.  A similar program was being pursued by Derrick Butler, a
student of Jont Allen's, who visited the workshop often, and who had
access to our computers.  He was working with an auditory model
inspired by Jont Allen, and attempting to assess the goodness of
clustering performance imposed by the marked TIMIT text using d', a
ratio of inter-cluster to intra-cluster spacing.  Unfortunately, Derrick
has not finished his work.
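A cluster-separation measure of this kind might be sketched as follows.  This is an illustrative two-class version of my own; the exact statistic Butler used is not recorded here.

```python
import numpy as np

def d_prime(class_a, class_b):
    """Separation of two feature clusters: the distance between class
    means relative to the pooled within-class spread.  Larger values
    mean the classes are easier to tell apart.
    """
    a = np.asarray(class_a, dtype=float)
    b = np.asarray(class_b, dtype=float)
    between = np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))
    within = np.sqrt(0.5 * (a.var(axis=0).sum() + b.var(axis=0).sum()))
    return between / within
```

The hope stated above is that a front end yielding larger separations between phone classes would also yield better recognition, without the cost of retraining the recognizer for each candidate front end.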

	c.  The same question in another guise was explored by Mazin 
Rahim. He used a vector quantization error measure to look at the 
sensitivity of various front end processors to noise and channel 
distortion, based on several sentences from the TIMIT database.  He notes
that dynamic parameterizations are different from static ones in their 
sensitivity to mean adjustment, and his best predicted performance is
using RASTA (see Morgan, below).  This prediction remains unconfirmed 
in the speech recognition performance.

	No group was able to fully evaluate their strategies
during the workshop, but each was working hard to finish afterwards.

	In a different attempt to evaluate front end performance, Joe
Kupin and Jont Allen produced a speech-in-noise signal enhancer based
on Jont's suggestion that the correlations among nonlinear
functions of the outputs of auditory filters were an essential part of
an auditory model.  Joe programmed a signal enhancer in which each
auditory filter was summed at the output as a function of the
correlation of the zero crossings of the filters nearby.  When fed
with noisy speech, this process produced undistorted speech at an
improved signal-to-noise.

	A somewhat different process was produced by Hynek Hermansky, 
who applied RASTA processing (RASTA attempts to remove both very 
fast and very slow components of the speech at each frequency)
to the cubic-root compressed power spectrum of speech
in the overlap-add analysis-resynthesis system.
This process produced speech that was qualitatively better 
than the original noisy input, but was distorted.
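The RASTA idea of removing both the very fast and the very slow components of each frequency channel's trajectory amounts to band-pass filtering over time.  The toy sketch below uses moving averages with arbitrary window widths; it is my illustration, not the published RASTA IIR filter.

```python
import numpy as np

def rasta_like_bandpass(spectrogram, slow=25, fast=3):
    """Band-pass each frequency channel's time trajectory by
    subtracting a slow moving average (removes near-constant channel
    effects) from a fast one (removes frame-to-frame jitter).

    `spectrogram` is (frames x channels), e.g. compressed power
    spectra.  A toy stand-in for the RASTA filter.
    """
    s = np.asarray(spectrogram, dtype=float)

    def smooth(x, width):
        kernel = np.ones(width) / width
        return np.apply_along_axis(
            lambda col: np.convolve(col, kernel, mode="same"), 0, x)

    return smooth(s, fast) - smooth(s, slow)
```

Because a fixed channel adds a constant to each trajectory, the slow average absorbs it and the output is unchanged (away from the edges of the utterance), which is the property that makes RASTA attractive for degraded speech.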

	While both processes produce qualitatively improved speech
signals, neither process has been tested yet in the speech recognizer.
There are plans to do so.


	Speaker Identification

	A second underground problem was that of speaker ID.  The
focus was the well-known King database, in which there is a
substantial difference between the first 5 recordings for each speaker
and the second 5 (in the San Diego narrowband recordings).  It was
discovered that it is difficult to produce speaker identification
algorithms which were robust across this "great divide" (first
noted at CCR-P in 1988).  Best published performance has been about
80% speaker identification.

	Doug Reynolds took the opportunity of this workshop to
analyze various front end algorithms written by experts to feed his
Gaussian mixtures modeling scheme for speaker ID.  Several front
ends performed at about the published rate of 80%, but a
world-record-holding performance of 93% was obtained from using 23rd
order LPC (very high order).  This result remains to be verified on
other tasks, but is tantalizingly better than any previous work.  See
Doug's writeup on this CD for details.
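Speaker identification with Gaussian mixtures scores each test utterance's frames against a per-speaker mixture model and picks the highest-likelihood speaker.  The sketch below, with made-up two-speaker models, is illustrative only; Reynolds' actual front ends, model sizes, and training procedure differ.

```python
import numpy as np

def gmm_loglik(frames, weights, means, variances):
    """Average per-frame log-likelihood of `frames` (n x d) under a
    diagonal-covariance Gaussian mixture model."""
    frames = np.asarray(frames, dtype=float)
    log_probs = []
    for w, mu, var in zip(weights, means, variances):
        diff = frames - mu
        ll = (-0.5 * np.sum(diff ** 2 / var, axis=1)
              - 0.5 * np.sum(np.log(2.0 * np.pi * var))
              + np.log(w))
        log_probs.append(ll)
    # Log-sum-exp over mixture components, then average over frames.
    stacked = np.stack(log_probs)
    m = stacked.max(axis=0)
    return float(np.mean(m + np.log(np.exp(stacked - m).sum(axis=0))))

def identify(frames, speaker_models):
    """Pick the speaker whose mixture model scores the frames highest."""
    return max(speaker_models,
               key=lambda s: gmm_loglik(frames, *speaker_models[s]))
```

Swapping the front end changes only the `frames` fed to the scorer, which is what made it practical to compare many parameterizations against one modeling scheme.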

	Q. Lin worked in parallel on a different venue, assessing the
efficacy of arrays of microphones for speaker identification.  He
found a substantial positive effect of a linear array of microphones
on the ability to identify speakers in noise and reverberation.

	Simulated Channels and Noise

	Joe Kupin and John Hansen worked together to produce a shell
script which takes your favorite noise, your target speech, a defined
signal-to-noise ratio, and your favorite channel, and will produce a
database which looks like it has the defined characteristics.  This is
one of the gems of the workshop, and is available on this CD.  (A cleaned
up version is promised by John - contact him directly for further details).

Attendees and their interests:

The Participants

	The 25 members of this workshop are varied in their
backgrounds, diverse in skills, and to a person dedicated to
understanding the puzzles of speech as an information bearing signal.
I have listed the participants and their affiliations, along with
their user logons from the workshop (critical for finding things on the
accompanying CD).
This more detailed accounting of people's work might guide your perusal of the 
CD.  Listings are alphabetical by user ID.

Jim Flanagan CAIP:  Jim served as technical catalyst, workshop
administrator, and organizer.  His contributions and those of his
staff were low profile during the summer, but critical to the success
of this effort.  

(assaleh) Khaled Assaleh, CAIP:  Feature extraction.  Khaled's
robust feature extraction work (used in speaker recognition) is a
possible front end for future analyses.

(bahler) Larry Bahler, ITT Aerospace - Communications. Larry worked on
details of training and understanding the HMM models which were
produced by HTK.  He and Jordan Cohen worked on variable frame rate
encoding with little success.  He was able to demonstrate very good
performance with monophone models and large mixtures (up to 64
Gaussians per state).

(bakis) Raimo Bakis, IBM T. J. Watson Research Center.  Raimo worked on the
problem of speech recognition by synthesis, using a numerically trained
synthesizer ("abstract vocal tract model").  His work was being
finished as the workshop completed, and recognition results are
expected soon.  Raimo was a large user of the DEC Alpha workstation.

(bdevries) Bert de Vries - Sarnoff Research Center.  Bert worked on array
microphone processing.  He was an occasional visitor at the workshop.

(bbyrne) Bill Byrne, Entropic and University of Maryland.  Bill was
charged with making the HTK toolkit available to the researchers with
as little pain as possible, in which he was successful.  His support,
along with that of Steve Young, made it possible to perform experiments
from the first day of the workshop.  His own work on training HMM's,
found on this CD, is provocative.

(cemille) Chris Miller, National Security Agency. Chris, a relative
newcomer to speech processing, worked with Doug Nelson applying MATLAB
to their analysis and display efforts.  

(cwood) Clifford "Chip" Wood, Motorola:  Chip worked on speech-epoch-based
analysis of speech signals, with the emphasis on vocoding.

(dar) Doug A. Reynolds, MIT Lincoln Laboratory.  Doug spearheaded the speaker
identification effort, trying out different front ends with Gaussian
mixture modeling on the King database.  He was the largest user of
the DEC Alpha workstation when it arrived on our network.

(djnelso) Doug Nelson, National Security Agency. Doug spearheaded the
effort to evaluate front ends without doing recognition.  He also
worked on a novel magnitude-and-angle cepstral representation.
Results are promised.

(dt) David Talkin, Entropic:  The author of Waves, David provided
essential services in support for visualization and analysis of both
data and results during the workshop.  He verified the transcripts of
much of the database, and provided an analysis of the alignment
successes of the recognizer (in short, terrific).  David's notes
contain detailed analyses of much of the switchboard data.

(hynek) Hynek Hermansky, Center for Spoken Language Understanding,
Oregon Graduate Institute: Hynek worked with Nelson Morgan to
understand the PLP/RASTA front ends and their interaction with the
Markov models and this fluent speech database.  Their interesting
findings comparing the complexity of the Markov models and the RASTA
front end appear on this CD.  Hynek also produced a RASTA-based noise
suppression algorithm which remains to be tested.

(jba) Jont B. Allen, AT&T Bell Labs:  Jont provided many of the sparks
in this summer, offering several excellent reviews of auditory
processing from his unique perspective as a hands-on experimenter in
neurophysiology of hearing.  In addition, Jont participated vigorously
in discussions of front end analysis without recognition, and provided
several students from Bell Labs to work in this area.  With Joe Kupin,
Jont produced an auditory-model-based noise suppression algorithm
which remains to be tested.

(jfk) Jim Kaiser - Jim hobnobbed at the workshop, and worked on
simple-to-compute estimates of envelopes of signals, and on the extraction of 
single tones from tone-in-noise.

(jhlh) John Hansen, Duke University.  John brought his considerable skill in
noise simulation and processing of degraded speech.  His work with
Kupin produced the wireline simulator available here, and he began
working with HTK recognition of noisy speech.  (Initial results - Terrible!)

(jmcd) John McDonough, BBN.  With Manhung Siu, John worked on an
automatic verification scheme to sort training data into valid and
invalid classes for better model-building.  This work remains
unfinished.  In addition, he provided continuing interface activities
with the BBN speech group, passing results, information, and models in
both directions.  

(jrc) Jordan Cohen - Center for Communications Research - Princeton.
Workshop chairman and technical cheerleader.

(kupin) Joe Kupin, Center for Communications Research - Princeton.  Joe brought the
wireline simulator and an IDA-inspired X-based plot package, and got
both running during the workshop.  He then worked with John Hansen to
develop the noise-and-channel-degradation software, and with Jont
Allen to develop the first-pass auditory model for noise suppression.
All of these programs are available on this CD.

(leo)  Leo Neumeyer - SRI.  Leo, who shared the six weeks with Vassilios 
Digalakis, worked on workshop problems both at Rutgers and at Palo Alto. 
They report that a recognizer trained on WSJ and tested on Switchboard
is only 3% worse than self-trained, and that bootstrapping from that
model to a switchboard-trained model matches the 33% reported by BBN.

(les) Les Atlas - U. Washington.  Les joined us for the last week, and 
worked with Jim Pitton to get the time/frequency based cepstral analysis
ready for speech recognition trials, which are continuing.

(mende) Bob Mende - CAIP.  Bob was one of the primary support people for
our software/networks.  We all depended on him.

(mo) Mari Ostendorf, Boston University:  Mari pursued many interests
during this workshop.  She and her students at BU attempted to compare
the straightforward HTK recognizer to her segment-based Markov
recognizer.  In addition, she was instrumental in providing the
experimental platform to try out the Pitton/Atlas Time/Frequency
Kernel Transforms.  

(morgan) Nelson Morgan, Berkeley and International Computer Science Institute:
Nelson worked with Hynek Hermansky to assess PLP/RASTA processing.
Hynek and he rewrote the RASTA programs, and these general-purpose
portable codes are included in this CD.

(mrahim) Mazin Rahim, AT&T Bell Labs, formerly CAIP: Mazin worked in
the analysis of speech degradations (or front end efficacy) using
TIMIT and a novel measurement of differences in vector quantizer
assignments.  His notes describe this intriguing technique.

(msiu) Manhung Siu - BBN.  Manhung spent a week at the workshop exploring
verification of training data with John McDonough.  He assisted in using
the BBN speech recognition system to run parallel experiments.

(nagendra) Nagendra Kumar - Johns Hopkins.  Nagendra was a one week visiting
student, who experimented with standard triphone models, and started work
on a cochlear model following the work of the Hopkins group.  

(netsch) Lorin Netsch - Texas Instruments.  Lorin began to explore ideas in spectral
subtraction in the context of the HTK recognizer.

(neuburg) Ned Neuburg, Center for Communications Research - Princeton.  Ned worked on the
problem of reconstructing a signal from its spectrogram.  He considered
various signal processing techniques, and hillclimbing to get a
"valid" two dimensional probability distribution.

(porter) Adam Porter - CAIP.  Adam carried much of the load for system and
network support.  

(qlin) Qiguang Lin, CAIP, with Jim Flanagan. Q Lin worked on comparing
data recorded with a microphone array to that of a close talking
microphone.  He reports success with both a linear and 2-D microphone
array in enhancing speaker identification under adverse acoustic 
conditions.

(scarter) Stephen Carter - CAIP.  Steve "owned" the hardware and software and 
building in which we were housed.  He kept us in chairs, desks, keycards,
cycles, and various other necessities.  He was definitely "downhill" - and
did a terrific job.

(sjy) Steve Young, University of Cambridge: Principal author of the HTK
toolkit, Steve provided support, and worked on his favorite problem, that of
a multi-noise-model decoder for corrupted signals.

(vas) Vassilios Digalakis, SRI: Vassili served as a
spark with Mari in pursuing alternate decoder representations.  In
addition, he provided liaison with the efforts at SRI to shadow the
workshop efforts, and to provide interesting and timely feedback.

(yojimbo) Jim Pitton, University of Washington:  Jim and Les Atlas (a
latecomer) worked on cepstral analysis based on time/frequency kernels.
Their pictures were beautiful, but thus far the results are
counterintuitively poorer than the standard: a classic case of an
intuitively superior process providing poorer results.  The
interface is probably incorrect, and remains a puzzle.


The Workshop Schedule

One of the tenets of this workshop was cooperative effort, and to that
end we started each week with a "show and tell" to announce coming
events and encourage technical discussion.  Bull sessions were held on
an as-needed basis during the week, and Fridays were times when
outside visitors were especially encouraged (visitors were always
welcome).  We often hosted one or two outside speakers.  Bishnu
Atal chaired two ad-hoc discussion groups on dynamic features in
speech, and there were other miscellaneous events.  The calendar was
kept by the chairman, and was available via email to a list of
interested parties.  

The final calendar for the workshop may be found in Appendix I.

The time was filled with work, informal discussions, troubleshooting,
and general fellowship.  Each morning a continental breakfast was
available at CAIP starting at 8:30 AM, and "tea" was served around
3:00 PM.  

Facilities and Administration

The host site for the Workshop was the Center for Computer Aids
for Industrial Productivity (CAIP) at Rutgers University.  CAIP is an
Advanced Technology Center, chartered by the Commission on Science and
Technology, and devoted to applications of high-performance computing.
It is a consortium of industry, government and university.  One-fourth
of its support stems from its 30 member corporations, one-fourth from
the Commission, and one-half from contract research with industry and
government.

CAIP's operating budget is approximately $6 million per year, not
including faculty salaries.  This budget supports approximately 80
researchers, half of whom are faculty members and research staff, and half
of whom are advanced graduate thesis researchers.

CAIP is housed in 42,000 square feet of laboratory and office space in
the top two floors (6th and 7th) of a new research building on the
science campus (Busch) of the university.  CAIP owns and, with a group
of four professionals, maintains state-of-the-art computing facilities
which are networked internally (FDDI and Ethernet) and externally
(Internet).  The computing complex includes several mainframe
machines, servers, and disk storage.  CAIP also is a node on the
experimental fiber optic data network, XUNET, providing 45Mbps
connectivity to major computing sites across the country.

CAIP's business office, with a staff of four, reports to the Associate
Director, and provides an infrastructure for facility administration,
equipment acquisition, contracting, member company services, and planning
for seminars, short courses and workshops.

CAIP was selected by NSA to provide a collaborative and hospitable
research environment for approximately 25 accomplished scientists working
in speech processing.  The Center was asked to help organize, plan, and
implement the 6- week research effort, and to help select attendees
suitable to the technical objectives of the sponsor.  Arrangements for travel,
living accommodations, per diem payments and local auto rental
transportation were organized by the business office, and acquisition of
laboratory fixtures and networked computing facilities was conducted by
the CAIP computer services group.  Recreational facilities of the
university, including access to a new $20 million athletic center
adjoining CAIP, were also arranged for participants.

To stay within allocated budgets, the computing equipment selected for
the workshop included a new Sparc 10/20 with 64 Mbytes of memory for
each participant.  These workstations were set up with individual
desks, files and storage in the two largest rooms on the east end of
Floor 6.  These rooms are adjacent to spacious foyer areas and to a
third room which was used as a lounge for informal discussions, coffee
breaks, and library of relevant papers.  The workstations were
connected by a dedicated network to a large Sun Microsystems server
(4/690 MP) located in the CAIP computer complex, and supported by 30
Gbytes of disk storage.  In addition to the digital audio input/output
of the Sparc 10s, Ariel ProPorts were available for mono and stereo
I/O.  CD-ROM readers were also available at the workstations, and a
central CD-ROM writer was available in the main computer room.

In addition to the upstairs spaces and facilities, the first-floor
lecture hall of the CAIP building, seating approximately 125 persons
and providing projection and audio/video facilities, was made
available for all large-group discussions, periodic project reviews,
seminars, and invited tutorials (see activities schedule in Appendix
I).

A special effort was made to create a climate and atmosphere for
effective collaboration and information exchange.  The research topics
stressed pre-competitive knowledge acquisition more than competitive
system comparisons, in recognition that many of the attendees are
competitors for similar contract support.

These objectives seem to have been met, with remarkable team efforts being
mounted and with significant advances in fundamental understanding
resulting from the close collaborations.  In a number of instances, the
investigations initiated in the Workshop are continuing.  CAIP Log-ons for
participants and workshop databases are being kept active, so that
experiments can continue from participants' home locations.


Final Notes:

	The workshop machine remains on the Internet
(server.rutgers.edu), and workshop participants have been active since
August in pursuing their favorite ideas.  Comments and suggestions are
welcome.  Future workshops are anticipated.  Our thanks to BBN, BU,
SRI, Cambridge University, ICSI, and Entropic Systems for making our
efforts more interesting this summer.  Special thanks to the staff of
CAIP, and especially Marilyn Ballentine and Sandra Epstein, without
whom we would be thinner and much more haggard after dealing with the
Rutgers administration.

This workshop cannot be fully judged from this CD.  Work continues on
many of the problems, and the computers at CAIP continue to support
this effort.  You are welcome to join in the task, as a complete
description of the workshop efforts is included below.  Let us know if
you have success.

There are some gems on this CD:
	The HTK scripts allow the easy use of the HTK Toolkit for building a speech recognizer.
	The scripts by Hansen and Kupin allow the production of realistically degraded noisy speech.
	RASTA aficionados will find a completely rewritten and modularized version under morgan.
	Time/Frequency Kernel fans will find a general time/frequency analysis suite under yojimbo.
	Switchboard Credit Card Corpus details are provided by dt.
	A very nice X-based interactive graphics package is provided by Kupin.
	Gaussian Mixture Speaker ID programs are provided by Reynolds.
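For readers new to the Gaussian mixture approach to speaker ID: each speaker is modeled by a mixture density, and a test utterance is assigned to the speaker whose model gives it the highest total log-likelihood.  The toy one-dimensional sketch below illustrates only the scoring step; the actual programs on the CD use multivariate mixtures trained from data.

```python
import math

def gmm_loglik(x, weights, means, variances):
    # log-likelihood of a scalar feature x under a 1-D Gaussian mixture
    p = sum(w * math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)
            for w, m, v in zip(weights, means, variances))
    return math.log(p)

def identify(frames, speaker_models):
    # pick the speaker whose GMM gives the frames the highest total log-likelihood
    def score(model):
        return sum(gmm_loglik(x, *model) for x in frames)
    return max(speaker_models, key=lambda name: score(speaker_models[name]))
```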

There may be more - dig in and see.  Let us know what you think -
comments to jrc@server.rutgers.edu.  Complaints, comments, and
opinions welcome.


Jordan Cohen and Jim Flanagan
10 September 1993



Appendix I - The Final Workshop Schedule


7/6/93  8:30 -  	Registration			Lobby, CoRE Building
        9:30 a.m.	Continental Breakfast
                    	Available

7/6/93  9:30 a.m. 	Convocation			Lecture Hall
			Introductions			1st Floor,CoRE Building
			Administrative Procedures
			Workshop Schedule
			Workshop Objectives
			CAIP Overview
			Laboratory Assignments

7/6/93	10:30 a.m.	Coffee, Tea; Adjourn to Laboratory Offices

7/6/93	5:30 p.m.	Wine/Cheese Reception		CAIP Board Room
							7th Floor,CoRE Building

7/6/93	6:30 p.m.	Buffet Dinner			CAIP Board Room Foyer
							7th Floor,CoRE Building

7/7/93	 9:30 a.m.	Staff Self Introductions
	10:30 a.m.	Introduction to Computer	Lecture Hall
			Facilities			1st Floor,CoRE Building

	 2:00 p.m.	Introduction to Entropics	Lecture Hall
			Software			1st Floor,CoRE Building

	 3:30 p.m.      Entropics Help Sessions		715
7/8/93	10:00 a.m.	Nelson Morgan    		Lecture Hall
			RASTA signal processing		1st Floor,CoRE Building

	 2:00 p.m.	Jim Pitton      		Lecture Hall
			Time Frequency Representations	1st Floor,CoRE Building


7/9/93	10:00 a.m.	Doug Nelson      		Lecture Hall
			New Ideas in Signal Processing	1st Floor,CoRE Building

	1:30 p.m.	Jont Allen			Lecture Hall
			Human Speech Recognition -	1st Floor,CoRE Building
			Implications for modeling

	3:30 p.m.	Bull Session (Bishnu Atal)	6th Floor Foyer
			Dynamics in Speech Signals


7/12/93 10:00 a.m.      General Show and Tell           6th floor foyer
        10:30 a.m.      Steve Young			1st Floor,CoRE Building
                        HTK overview and tutorial       



7/13/93 10:00 a.m.	Richard Stern (CMU)		Lecture Hall
			Adaptation Algorithms		1st Floor,CoRE Building

         3:00 p.m.      Bull Session (Nelson)           6th floor foyer
                        Recognizer-independent analysis
                        of front ends

7/15/93 10:00 a.m.	Another Waves Demo		715
			David Talkin

	2:00 p.m.	Bull Session (Atal)		6th Floor Foyer
			Speech Dynamics

7/16/93	10:00 a.m.	Tutorial - Fred Juang		Lecture Hall
			Speech Recognition in Adverse	1st Floor,CoRE Building
			Environments

7/18/93  2:00 p.m.	Picnic and games		Cohens, Belle Mead
			(Note SUNDAY)			32 Dead Tree Run Road
							Belle Mead, NJ
							908-359-7926

7/19/93  10:00 a.m.   	Show and Tell			6th floor foyer
			Everybody

7/22/93   2:00 p.m.     Yariv Ephraim                   Lecture Hall 
			Speech Recognition in 		1st Floor,CoRE Building
			Noisy Environments

7/23/93	10:00 a.m.	Tutorial - Oded Ghitza		Lecture Hall
			Auditory Models and Human	1st Floor,CoRE Building
			Performance in Tasks Related
			To Speech Coding and Speech	
			Recognition

	 2:00 p.m.	Steve Young			Lecture Hall
			Noise Compensation in the	1st Floor,CoRE Building
			Parallel Model Combination
			Framework

7/26/93 10:00 a.m. 	Show and tell			6th floor foyer
			Everybody

7/28/93  2:30 p.m.	CAIP tour			All Over CAIP
			Everybody

7/29/93	10:00 a.m.	Who's Doing What		Lecture Hall
			Everybody			1st Floor,CoRE Building


7/30/93	10:00 a.m.	Tutorial - Moise Goldstein	Lecture Hall
			Auditory Periphery as 		1st Floor,CoRE Building
			Speech Processor

	 2:00 p.m.	Jont Allen			Lecture Hall
			Mechanics of Hearing		1st Floor,CoRE Building


8/6/93  10:00 a.m.	Tutorial - Alan Gorin		Lecture Hall
			Semantic Associations, Acoustic 1st Floor,CoRE Building
			Metrics and Adaptive Language 
			Acquisition


8/9/93  10:00 a.m.	Show and tell			6th floor foyer
			everybody

8/9/93	 2:00 p.m.	SGI Indy Demo			1st floor conference rm

8/10/93  2:00 p.m.      Planning for Thursday:		6th floor foyer
			Who says What When
			Everyone

(many visitors on 11 and 12 August)

(Note - watch this space for changes in time/location of the following 2 talks)
8/11/93 10:00 a.m.	Khaled Assaleh			1st floor conference rm
			Robust Features for Speaker ID

	10:45 a.m.	Kevin Farrell			1st floor conference rm
			Neural Tree Networks for 
			Speaker Recognition

8/12/93	10:00 a.m.	Staff and Invitees		1st floor conference rm
        to 4:00 p.m.	Wrapup Session (open to the public)

	4:30 p.m.	Reception			7th floor, CAIP Board rm

	7:00 p.m.	Dinner (and entertainment?)	Seoul House

8/13/93			Cleanup and Departure