TRANSCRIPTION OF NEW SPEAKING STYLES - VOICEMAIL

M. Padmanabhan, B. Ramabhadran, E. Eide, G. Ramaswamy, L. R. Bahl, P. S. Gopalakrishnan, S. Roukos
IBM T. J. Watson Research Center, P. O. Box 218, Yorktown Heights, NY 10598

1 INTRODUCTION

In this paper we describe a new testbed for developing speech recognition algorithms - a VoiceMail transcription task, analogous to other tasks such as the Switchboard, CallHome [1] and Hub 4 tasks [2] which are currently used by speech recognition researchers. Spontaneous speech occurring in day-to-day life can broadly be classified into two categories: (i) speech where the speaker does not receive any external feedback to direct his/her speech, and (ii) speech where the speaker receives external feedback from another person/machine/audience. Examples of the former category are radio broadcast news, voicemail, etc.; examples of the latter category are telephone conversations, natural language transaction systems (e.g., ATIS), seminars, etc. In general, to obtain the best performance in transcribing a certain style of speech, it is necessary to train the speech recognition system on training data of a similar style. Some of the speech categories mentioned above are quite well represented in currently existing databases. However, voicemail data is not well represented in any database, even though it represents a very large volume of real-world speech data. Consequently, there is a need for a voicemail database in order to improve transcription performance on a voicemail transcription task, and also to establish a new testbed for speech recognition algorithms.

Similar to the Switchboard/CallHome databases, the Voicemail database comprises telephone-bandwidth spontaneous speech. However, the difference with respect to the Switchboard and CallHome tasks is that the interaction is not between two humans, but rather between a human and a machine. Consequently, the speech is expected to be a little more formal in its nature, without the problems of cross-talk, barge-in, etc. This eliminates some of the variables and provides more controlled conditions, enabling one to concentrate on the aspects of spontaneous speech and the effects of the telephone channel. In this paper, we describe the modality of collection of the speech data and some algorithmic techniques that were devised based on this data. We also describe the initial results of transcription performance on this task.

2 DATA COLLECTION

For details of the data collection scheme see [3]. Briefly, some of the characteristics of the voicemail data are as follows:

- The data represents extremely spontaneous speech.
- The data contains both long-distance and local calls.
- Each voicemail message typically has a click at the beginning and/or end of the message, arising from the caller hanging up.
- The data is subject to the compression of the phonemail system, which leads to a small degradation in accuracy.
- The average length of a voicemail message is 31 seconds; however, the peak of the histogram of voicemail durations occurs at 18 seconds.
- The average rate of the speech is approximately 190 words per minute.
- The topics covered in the collected data ranged from personal messages to messages with technical or business-related content.
- The database was not quite gender balanced, with the percentage of male speakers being 38%.

3 SYSTEM OVERVIEW

We will first briefly describe the IBM large-vocabulary speech recognition system.
Essential aspects of the system used in the experiments here have been described earlier [4]; we summarize the main features below. The acoustic features used are 13-dimensional cepstra and their first and second differences, and a feature vector is extracted every 10 msec from the 8 kHz sampled voicemail data. Words are represented as sequences of phones, and each phone is further divided into 3 sub-phonetic units which correspond roughly to the beginning, middle, and end of the phone. The system uses context-dependent HMM acoustic models for these sub-phonetic units. For each sub-phonetic unit a decision tree is constructed from the training data [4]; each leaf of the tree corresponds to a different set of contexts. The acoustic observations that characterize the training data at each leaf are modelled as a mixture of gaussian pdfs with diagonal covariance matrices. The systems used in this paper had approximately 2700 leaves, and anywhere from 17,000 to 170,000 gaussians. The system uses an envelope-search algorithm [4] to hypothesize a sequence of words corresponding to the utterance. A simple word N-gram (bigram or trigram) model is used to compute the language model probabilities.

4 ACOUSTIC MODELS

In this section, we describe the construction of the acoustic models for this task. The first step is the construction of the decision trees to model context-dependent variations of the sub-phonetic units. The goal here is to model variations in pronunciation arising from context. However, as the voicemail data contains data from different environments, use of this data during the tree-growing process may result in trees that try to model the environment variations rather than the pronunciation variations. Further, the amount of voicemail data currently available is only around 20 hours. Consequently, we decided to bandlimit the Wall Street Journal SI-284 primary microphone data (WSJ-P) to 200-3400 Hz using a linear-phase 200-tap Lerner filter [5], and used this data to construct the decision trees and the gaussians modelling the leaves of the trees. The parameters of the acoustic model were then re-estimated via the E-M algorithm using the voicemail data. In order to model the clicks in the voicemail messages, we augmented the phone alphabet with a 'click' phone. We also added a 'mumble' phone to model inarticulate segments of the messages. Both the 'click' and 'mumble' phones were modelled with 3-state HMMs, just as for the other phones.

4.1 Clean-up of transcriptions

The initial transcriptions that we started off with for the 20 hours of voicemail data were not very clean and had a fair number of transcription errors. As it would have been impractical to verify all these transcriptions manually, we devised an automatic scheme to identify possible transcription errors. This tagged around 1% of the data, and we then corrected the tagged transcriptions manually. Very briefly, the main idea of the tagging scheme was to Viterbi-align the speech data against the (possibly incorrect) transcription, and then identify regions where the log-likelihood assigned to a phone by the alignment process was particularly low. For more details see [3]. This process identified script errors as well as baseform errors.
For example: (i) we originally had only one baseform for IRA, AY AA R EY (the acronym baseform); in the recorded data IRA occurred as a name with pronunciation AY R AA, and was flagged as an error; (ii) there were several instances where disfluencies such as 'UH' and 'UM' had not been transcribed, and the technique flagged a number of these errors.

4.2 Compound words

An additional observation arising from the tagged segments of the acoustic data was that cross-word co-articulation was very common in this data because of the casual nature of the speech and the fast speaking rate. For instance, the phrase 'going to take' would often be pronounced as 'gontake = G OW N T EY KD', in which case at least one of the phones in the phonetic representation of 'going to take' would be flagged. This was clearly not a transcription error, but we needed some mechanism to model such cross-word co-articulation effects (degemination, palatalization, etc.). For our initial experiments, we chose to model such effects by constructing compound words [9, 10]. For instance, going-to-take would be a compound word, with several possible baseform representations, one of which would be 'G OW N T EY KD'. We selected these compound words based on the tagged segments of the acoustic training data. Some examples of the compound words and their pronunciations are given in Table I (see postscript file).

The use of these compound words serves a dual purpose. Firstly, they enable the modelling of cross-word co-articulation effects. Secondly, it is generally the case that decoding errors are more common in shorter words; hence, as the compound words have relatively long baseforms, there are fewer errors in the compound words. We decided to extend the second piece of reasoning and apply it to model commonly occurring phrases in the voicemail data. Hence, we constructed compound words of the form 'give-me-a-call', 'thank-you', 'thanks-a-lot', 'when-you-get-a-chance', etc. The use of these compound words helped bring down the error rate, as shown in the section on experimental results.

4.3 Phonological rules

In order to model co-articulation effects in words other than compound words, we used some of the phonological rules described in [6]. Examples of such co-articulation effects are plosive deletion (deletion of the word-final TD in the word sequence 'excellent point'), palatalization (did-you being pronounced as 'D IH JH UW'), etc. Such effects can be accounted for using linguistic rules [6, 7, 8] that specify the conditions under which the boundary phones in a word may be deleted or replaced by other phones. In our implementation, we assumed that only the final and initial phones of the two words in question would be candidates for modification. Also, the changes to the boundary phones were determined using only the last two phones of the previous word and the initial phone of the succeeding word. Further, any number of words could be combined using these rules to produce one long word; for example, 'what-did-you' is a result of the application of two rules, one at the 'what did' juncture and the other at the 'did you' juncture. Finally, all the phonological rules used were optional, i.e., there were no compulsory replacements. Some of the rules that we implemented are listed below (see postscript file); Pn-1 and Pn denote the last two phones of the first word, and N1 denotes the first phone of the next word.

1. Geminate Deletion: If Pn = consonant and N1 = same consonant, then delete Pn. Example: this-street DH IH S T R IY TD
2. Palatalization: If Pn = D and N1 = Y, then replace Pn with JH and delete N1 (similarly, T followed by Y becomes CH). Examples: did-you D IH JH UW and what-you W AH CH UW
3. Plosive Deletion: If Pn-1 = N, Pn = plosive and N1 = plosive, then delete Pn. Example: went-down W EH N D AW N
4. If Pn-1 = N, Pn = D and N1 = DH, then delete Pn. Example: and-then AX N DH EH N
5. If Pn-1 = DH, Pn = AX and N1 = vowel, then replace Pn with IH. Example: the-apple DH IX AE P AX L
6. If Pn = S or Z and N1 = SH, then delete Pn. Example: this-show DH IH SH OW
7. If Pn-1 = vowel, Pn = T and N1 = vowel, then replace Pn with DX. Example: that-again DH AE DX AX G EH N

These rules helped bring down the error rate as indicated in Table VI. Also, analysis of the decoded output indicated that we did not introduce any new insertions or deletions in the process of combining words.
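To make the mechanics of these juncture rules concrete, the following minimal Python sketch generates the optional pronunciation variants at one word juncture. This is an illustration rather than our implementation; the phone set and the subset of rules shown are simplified.

    # Minimal sketch of optional word-juncture rules (illustrative phone set;
    # only three of the rules from Section 4.3 are shown).

    PLOSIVES = {"P", "B", "T", "D", "K", "G", "PD", "TD", "KD", "DD"}

    def juncture_variants(prev_phones, next_phones):
        """Return alternative (word1, word2) phone sequences for a word juncture.
        All rules are optional, so the unmodified juncture is always kept."""
        variants = [(prev_phones, next_phones)]            # no rule applied
        pn_1 = prev_phones[-2] if len(prev_phones) > 1 else None
        pn, n1 = prev_phones[-1], next_phones[0]           # boundary phones

        # Geminate deletion: this-street -> DH IH S T R IY TD
        if pn == n1:
            variants.append((prev_phones[:-1], next_phones))

        # Palatalization: did-you -> D IH JH UW
        if pn == "D" and n1 == "Y":
            variants.append((prev_phones[:-1] + ["JH"], next_phones[1:]))

        # Plosive deletion: went-down -> W EH N D AW N
        if pn_1 == "N" and pn in PLOSIVES and n1 in PLOSIVES:
            variants.append((prev_phones[:-1], next_phones))

        return variants

    # did-you: prints "D IH D + Y UW" and "D IH JH + UW"
    for w1, w2 in juncture_variants(["D", "IH", "D"], ["Y", "UW"]):
        print(" ".join(w1), "+", " ".join(w2))

In a system of this kind, each variant would simply become an additional baseform for the combined word, leaving the decoder free to choose either the canonical or the co-articulated pronunciation.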
4.4 Model Complexity Adaptation

As mentioned earlier, we model leaves in our system with mixtures of gaussians. In general, ad-hoc rules are used to determine the number of mixture components that will be used to model a particular leaf; for example, the number of components is made proportional to the amount of data, subject to a maximum number. This choice of the number of components may not necessarily provide the best classification performance; consequently, we introduced a discriminant measure to choose the number of mixture components in a more optimal manner. The details of this algorithm are given elsewhere [11], so we only summarize it briefly here. The essence of the algorithm is to start with a small baseline system and evaluate how well the gaussian mixture model for a leaf models the data for that leaf. This is done by computing the posterior probability of correct classification of the data for that leaf. If this probability is low, it implies that the model for the leaf does not match the data for the leaf very well; hence, the resolution of the model for the leaf is increased by adding more components to its model. In our implementation, we start with two systems (say S1 and S2), where S2 models each leaf with more gaussians than S1. Subsequently, we find those leaves that are not adequately modelled by S1 according to our discriminant criterion, and replace the model for each such leaf in S1 with the corresponding model from S2 (a short sketch of this replacement step is given at the end of Section 4.5 below).

4.5 VTL Adaptation

We implemented the VTL technique described in [12, 13, 14] to obtain speaker-normalized models. The technique of [12] uses a mixture of gaussians to model voiced speech, and tries to warp the frequencies of a speaker such that the likelihood of the warped data is maximized by the voiced-speech model. The initial generic voiced-speech model (a mixture of 512 gaussians) used to seed the iterative process was obtained from gender-balanced WSJ data (10 male and 10 female speakers). In order to determine the voiced frames of speech, we Viterbi-aligned the data and picked only the frames corresponding to vowels. We selected 17 discrete warp scales ranging from 0.80 to 1.12, signal-processed the speaker's data using each of these warp scales, and computed the likelihood of the warped features using the generic voiced-speech model. The warp scale that scored best was then selected. We repeated this process a few times, re-estimating the generic voiced-speech model at every iteration. Finally, the gaussians modelling the context-dependent sub-phonetic units were trained on the features corresponding to the best warp scale for each speaker, to obtain speaker-normalized models.
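The warp-scale search can be summarized by the minimal sketch below. The feature-extraction callback and the fitted voiced-speech model are supplied by the caller, and all names here are illustrative rather than taken from our implementation.

    import numpy as np

    WARP_SCALES = np.linspace(0.80, 1.12, 17)    # 17 discrete warp factors, 0.80 .. 1.12

    def best_warp_scale(warp_and_extract, vowel_frames, voiced_gmm):
        """Select the warp scale for one speaker.

        warp_and_extract(alpha) -> (n_frames, n_dims) array of cepstral features
            computed with the frequency axis warped by alpha (hypothetical front end).
        vowel_frames: indices of frames Viterbi-aligned to vowels.
        voiced_gmm: generic voiced-speech mixture model with a score() method.
        """
        best_alpha, best_ll = None, -np.inf
        for alpha in WARP_SCALES:
            feats = warp_and_extract(alpha)
            ll = voiced_gmm.score(feats[vowel_frames])   # log-likelihood of warped voiced frames
            if ll > best_ll:
                best_alpha, best_ll = alpha, ll
        return best_alpha

Here voiced_gmm could be, for instance, a scikit-learn GaussianMixture fitted on pooled voiced frames from the gender-balanced seed data, whose score() method returns the average log-likelihood per frame. In the full procedure, this selection, the re-estimation of the voiced-speech model, and the retraining of the context-dependent gaussians on the warped features are iterated a few times, as described above.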
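Returning to the model complexity adaptation of Section 4.4, the leaf-replacement step can be sketched as follows. The per-leaf posterior probability of correct classification is assumed to have been computed already, the threshold is an illustrative knob rather than a value from our system, and the discriminant criterion itself is described in [11].

    def complexity_adapt(leaf_models_s1, leaf_models_s2, p_correct, threshold):
        """Build the MCA model from a small system S1 and a larger system S2.

        leaf_models_s1, leaf_models_s2: dicts mapping leaf id -> gaussian mixture.
        p_correct: dict mapping leaf id -> posterior probability that frames
            belonging to the leaf are classified correctly by S1's mixture.
        threshold: leaves scoring below this take S2's larger mixture
            (illustrative value, chosen by the experimenter).
        """
        mca = {}
        for leaf, mixture in leaf_models_s1.items():
            if p_correct[leaf] >= threshold:
                mca[leaf] = mixture                    # S1 models this leaf adequately
            else:
                mca[leaf] = leaf_models_s2[leaf]       # poorly modelled: take S2's mixture
        return mca

Combining a 17K-gaussian S1 with a 175K-gaussian S2 in this manner produced the 32K-gaussian MCA model used in Section 6.3.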
Experimental results are tabulated in Table VI.

4.6 MLLR Adaptation

Finally, we used MLLR adaptation [15] to adapt the acoustic models. In brief, MLLR computes a linear transform that is applied to the means and variances of the gaussians in order to maximize the likelihood of the adaptation data computed with the transformed model. For this technique, it is necessary to have acoustic adaptation data and the corresponding transcription. We used the test data itself as adaptation data, along with the transcription produced by a speaker-independent system, to bootstrap the acoustic models. The acoustic models were adapted independently for every voicemail message in the test set (unsupervised sentence-based adaptation).

5 LANGUAGE MODEL

The transcription of the 20 hours of voicemail data contained approximately 220K words. This was adequate to build a bigram/trigram language model for the voicemail task. In addition, we attempted to make use of the 2M words of data from the Switchboard database by constructing a trigram language model from the Switchboard data and using a weighted mixture of the language model probabilities provided by the Voicemail and Switchboard language models in the decoder. Further, these language models were constructed from transcriptions that included compound words (i.e., the original transcriptions had been filtered to replace selected sequences of words with a compound word). Furthermore, in an attempt to use the small amount of voicemail data parsimoniously, we investigated the use of word classes. The classes were hand-selected based on semantics and/or transcription inconsistencies, and the trigram model used was:

    p(w3 | w2 w1) = p(c3 | c2 c1) p(w3 | c3)    (1)

where ci is the class of word wi and p(wi | ci) is the relative frequency of word i in its class, smoothed against a flat model. Some specimen classes are shown in Table II (see postscript file).
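As an illustration of equation (1), the following sketch computes the class-based trigram probability from count tables. The word-to-class map, the count dictionaries, and the interpolation weight are hypothetical inputs, and the flat model is taken here to be uniform over the words of a class.

    def class_trigram_prob(w1, w2, w3, word2class, class_tri, class_bi,
                           word_count, class_count, class_vocab, lam=0.9):
        """p(w3 | w2 w1) = p(c3 | c2 c1) * p(w3 | c3), as in equation (1).

        word2class: word -> class id
        class_tri / class_bi: dicts of class trigram / bigram counts
        word_count / class_count: word and class unigram counts
        class_vocab: class id -> list of member words
        lam: weight for smoothing the in-class relative frequency against a
             flat (uniform) model over the class (illustrative value)
        """
        c1, c2, c3 = word2class[w1], word2class[w2], word2class[w3]

        # class trigram probability p(c3 | c2 c1)
        p_class = class_tri.get((c1, c2, c3), 0) / max(class_bi.get((c1, c2), 0), 1)

        # in-class word probability p(w3 | c3), smoothed against a flat model
        rel_freq = word_count.get(w3, 0) / max(class_count.get(c3, 0), 1)
        flat = 1.0 / len(class_vocab[c3])
        p_word = lam * rel_freq + (1.0 - lam) * flat

        return p_class * p_word

The flat-model term guards against words that are rarely or never observed within their class in the 220K-word training set.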
6 EXPERIMENTAL RESULTS

Our first set of experiments was conducted when we had only 10 hours of training data available, and several of these experiments were repeated on 20 hours of training data. We present experimental results for both these training sets (referred to as Vmail10 and Vmail20), as the difference in performance gives an indication of the effect of increasing the amount of training data on different components of the recognizer (acoustic model, language models, etc.).

6.1 Test data

The test data was 43 voicemail messages (picked at random from the collected data, and not included in the training set). The size of the Vmail10 vocabulary was 6K words, and the out-of-vocabulary (o.o.v.) rate of the test data with respect to this vocabulary was 4.6%. The size of the Vmail20 vocabulary was 10K words, and the o.o.v. rate of the test data with respect to this vocabulary was 3.5%. The results in this paper are reported only on the development test data, as the evaluation set had not yet been defined at the time the paper was written.

6.1.1 Computation of word error rate -- In computing the word error rate, as disfluencies do not contribute to the semantic meaning of the utterance, we decided to filter out all instances of disfluencies in both the reference transcripts and the decoded transcripts before computing the word error rate of the decoded transcripts. Consequently, deletions of disfluencies in the original reference transcript would not be interpreted as errors, and substitutions of disfluencies in the original reference transcript with other disfluencies would likewise not be interpreted as errors. However, substitutions of disfluencies in the original reference transcript with words other than disfluencies would be interpreted as insertion errors. Also, as we are primarily concerned with the word error rate and not the compound-word error rate, we replaced all compound words in the reference and decoded transcripts with the corresponding sequence of words before computing the error rate.

6.1.2 Perplexity of test set -- We computed the word perplexity of the test set using various language models. As mentioned above, the filtered reference transcript did not contain any disfluencies; however, the language model data did contain disfluencies (as they are known to be useful linguistic predictors). Consequently, in our perplexity calculations, we computed the total log probability of the words in the unfiltered reference transcripts using the language model (hence the disfluencies were used to predict the word probabilities, and the log probabilities of the disfluencies were also included in the total), and subsequently subtracted out the log probabilities of all the disfluency words in the original reference transcript, before computing the average log probability per word. Also, this measure of perplexity was computed with compound words in the reference transcript, because the language model data also included compound words. The word perplexity measure was computed with a bigram and a trigram LM constructed from the 220K words of voicemail data, and with a weighted mixture of the voicemail trigram LM and a trigram LM constructed from Switchboard data in the proportion 0.8 to 0.2. We present the perplexity numbers both with and without taking into account the log probability of the disfluencies in the reference transcript (see Table III in postscript file).

6.2 Switchboard training

As the voicemail data and the Switchboard data both represent telephone-bandwidth spontaneous speech, we initially decoded the voicemail test data using the models used in the Switchboard '95 evaluation [1] (row 1 of Table IV). Subsequently, we replaced the Switchboard language model with a bigram that had been trained on 10 hours of voicemail (row 2 of Table IV). Finally, we re-estimated the parameters of the Switchboard acoustic model using the Vmail20 data (row 3 of Table IV). The word error rates are summarized in Table IV (see postscript file). The last row in this table represents a system bootstrapped from a Switchboard model and then trained on the voicemail data.

6.3 Vmail10 training set

The results of several experiments are summarized in Table V (see postscript file). For all experiments except the last two, only the Vmail10 training set was used for both the acoustic and language models. Results are presented in an incremental manner, i.e., each row of the table represents a single change that was made with respect to the previous row, and the description of this change is indicated in the row of the table. The row numbers referred to below are the rows of Table V.

(1) The baseline system corresponded to a system with 83.5K gaussians and a bigram LM (row 1).
(2) Next, we added compound words to the vocabulary (see Section 4.2) and decoded with the same acoustic models as before (row 2).
(3) Next, we cleaned up the transcriptions and retrained the acoustic models (see Section 4.1). The error rate corresponding to this condition is shown in row 3.
(4) Then, we estimated a model-complexity-adapted model by putting together a system with 17K gaussians and a system with 175K gaussians (S1 has 17K gaussians, S2 has 175K gaussians - see Section 4.4). The model-complexity-adapted (MCA) model had 32K gaussians. The parameters of this system were then re-estimated using the Vmail10 training set, and the corresponding error rate is shown in row 4.
(5) Next, we replaced the bigram LM with a trigram and used the new LM in conjunction with the MCA system described above (row 5).
(6) Then we used a class-based trigram LM (see Section 5) (row 6).
(7) Finally, the acoustic models were re-estimated using MLLR adaptation in unsupervised mode, on a per-sentence basis, and the adapted models were used with the class-based trigram language model (see Section 4.6) (row 7).

6.4 Vmail20 training set

We conducted a number of incremental experiments to observe the effect of adding additional training data to different components of the recognizer. The word error rates are given in Table VI (see postscript file); any reference to row numbers in the remainder of this section should be interpreted as a row of Table VI.

(1) We started with the system corresponding to row 3 of Table V, which gave an error rate of 49.75%, and simply re-estimated the parameters of this acoustic model using the Vmail20 database (the LM is a bigram estimated from the Vmail10 data). This dropped the error rate to 46.22% (row 1).
(2) Subsequently, we re-estimated the bigram LM using the Vmail20 database, and decoded the test data using the same acoustic model as in row 1. This dropped the error rate to 45.12% (row 2).
(3) Subsequently, we estimated a trigram LM using the Vmail20 database, and used this with the same acoustic model as in row 1. This dropped the error rate to 42.7% (row 3).
(4) Next, we used a weighted mixture of the Vmail trigram LM of row 3 and a trigram built from the Switchboard data (in the proportion 0.3 Switchboard LM probability + 0.7 Vmail20 LM probability). The error rate corresponding to this condition was 42.95% (row 4).
(5) Next, we estimated an MCA model by putting together a system (S1) with 83.5K gaussians and a system (S2) with 175K gaussians. The resulting MCA model had 78K gaussians. Using the mixture trigram LM of row 4 and the MCA model dropped the error rate to 42.20% (further details are given in the next section). Further, tuning the mixture weights in the language model reduced the error to 41.94% (the final weights were 0.2 Switchboard trigram and 0.8 Vmail20 trigram).
(6) Next, we used VTL to construct a speaker-normalized equivalent of the MCA model, and decoded using the weighted mixture LM of row 5. The error rate dropped to 40.52% (row 6).
(7) Next, we used the iterative MLLR technique to adapt the means of the gaussians of row 5, individually for each message, and used these adapted models with the weighted mixture LM of row 5. The error rate dropped to 39.43% (row 7).
(8) Next, we started with the speaker-normalized VTL models of row 6 and further applied the iterative MLLR technique to refine the means of the gaussians for each individual message. These adapted models were then used with the weighted mixture LM of row 5. As can be seen from the results (row 8), the effect of VTL and MLLR does appear to be additive.
(9) Finally, we applied the phonological rules of Section 4.3 in the decoding process, and used them with the models of row 8. This brought the error rate down to 38.18% (row 9).
6.4.1 Model Complexity Adaptation -- We now present some experimental results on model complexity adaptation (MCA) (see Section 4.4) which indicate that the new method of determining the complexity of the model yields consistent gains over standard methods. We constructed five models using the standard ad-hoc method of allocating a fixed number of gaussians for each leaf. These models respectively had a maximum of 7, 12, 35, 60, and 150 gaussians per mixture (gpm). Subsequently, we used MCA to construct models that replace the gaussian mixtures for some leaves in the 7 gpm model with gaussian mixtures from the 35 gpm model; this model will be referred to as 7x35. Table VII (see postscript file) tabulates the error rates and the sizes of several models, constructed by conventional means and using MCA. The error rate as a function of the number of gaussians in the model is plotted in Fig. 1 (see postscript file), and it can be seen that the MCA models consistently outperform the conventional models by around 5% (relative). Also, note that due to the limited amount of training data, the error rate starts increasing as the number of parameters increases beyond a certain point.

7 CONCLUSION

We reported transcription word error rates on a new testbed representing telephone-bandwidth spontaneous speech, i.e., the task of voicemail transcription. We described the process of bootstrapping the models starting from either the Switchboard data or bandlimited Wall Street Journal data; the results show that better performance was obtained in the latter case. We described several techniques that were used to construct the acoustic models, including (i) the use of compound words and linguistically derived phonological rules to model co-articulation effects that occur in spontaneous speech, and (ii) a new model-complexity adaptation technique that uses a discriminant measure to allocate gaussians to the mixtures modelling allophones. We also investigated the efficacy of some well-known acoustic adaptation techniques on this task, and described experiments related to building language models using the limited amount of training data available in this domain. We then reported experimental results that showed that most of the modelling techniques we investigated were useful in reducing the word error rate. Finally, we reported experimental results on two different-sized training sets to show the effect of increasing the training data on different (acoustic and linguistic) components of the recognizer.

8 ACKNOWLEDGEMENT

We would like to acknowledge the support of DARPA under Grant MDA972-97-C-0012 for funding this work.

REFERENCES

[1] Proceedings of the LVCSR Workshop, Oct. 1996, Maritime Institute of Technology.
[2] Proceedings of the ARPA Speech and Natural Language Workshop, 1995, Morgan Kaufmann Publishers.
[3] M. Padmanabhan, G. Ramaswamy, B. Ramabhadran, P. S. Gopalakrishnan, C. Dunn, "Issues involved in voicemail data collection", elsewhere in these proceedings.
[4] L. R. Bahl et al., "Performance of the IBM large vocabulary continuous speech recognition system on the ARPA Wall Street Journal task", Proceedings of ICASSP, pp. 41-44, 1995.
[5] K. Martin and M. Padmanabhan, "Resonator-in-a-loop filter-banks based on a Lerner grouping of outputs", Proceedings of ICASSP, 1992.
Lee, "Word juncture modeling using phonological rules for HMM-based continuous speech recognition", Computer, Speech and Language, pp 155- 168, Academic Press, 1991. [7] P. S. Cohen and R. L. Mercer, "The Phonological Component of an Automatic Speech Recognition System", Speech Recognition, (D. Raj Reddy ed.), Academic Press, pp 275 - 320, 1975. [8] B. T. Oshika, V. W. Zue, R. V. Weeks, H. New, and J. Aurback, "The Role of Phonological Rules in Speech Undersatnding Research", IEEE Transactions on ASSP, vol. 23, pp 104-112, 1975. [9] M. Finke and A. Waibel, "Speaking mode dependent pronunciation modeling in large vocabulary conversational speech recognition", Proceedings of EUROSPEECH 1997, vol. 5, pp 2379-2382. [10] P. Jeanrenaud, et al., "Reducing word error rate on conversational speech from the Switchboard corpus", Proceedings of ICASSP, 1995, pp 53-56. [11] L. R. Bahl, M. Padmanabhan, "A discriminant measure for model complexity adaptation", submitted to ICASSP 98. [12] S. Wegman, D. McAllaster, J. Orloff, and B. Peskin, "Speaker normalization on conversational telephone speech". [13] E. Eide and H. Gish, "A parametric approach to vocal tract length normalization", Proceedings of ICASSP, 1996, pp 346-348. [14] T. Kamm, G. Andreou, and J. Cohen, "Vocal tract normalization in speech recognition: compensating for systematic speaker variability", Proc. 15th Annual Speech Research Symposium, CLSP, Johns Hopkins University, Baltimore, June 1995, pp 175-178. [15] C. J. Legetter and P. C. Woodland, "Maximum Likelihood Linear Regression for Speaker Adaptation of Continuous density HMM's", Computer Speech and Language, vol. 9, no. 2, pp 171-186.