KING DATABASE DEVELOPMENT

1. Database Specification

The database contains ten short conversations from each of 51 adult male subjects, recorded over long-distance telephone lines. The recordings were made at two locations: DCD-West in San Diego, California and DCD-East in Nutley, New Jersey. Twenty-six subjects were recorded at DCD-West and twenty-five at DCD-East. The first five sessions were recorded at nominal intervals of one week, and the second five at nominal intervals of one month. In some cases, when scheduling conflicts existed or when subjects were not available for scheduled sessions, intervals between sessions were greater or less than the nominal interval. Each session contains at least 30 seconds of speech, excluding silence.

The recordings were made in quiet (not anechoic) rooms using high-quality recording equipment. A "loop-back" calling arrangement was employed, allowing simultaneous recording of the original "clean" speech and the degraded speech after transduction by a standard carbon-button microphone and long-distance transmission.

The speech material itself consisted of excerpts from conversations involving each subject and an interlocutor. The interlocutor's side of the conversations was not recorded. Conversational elicitation tasks were designed and employed to obtain natural, extemporaneous speech from the subjects.

After the database was recorded and processed, the speech material itself and the phonetic marks were stored digitally on magtapes to be delivered to the Government with this report. Three types of information are contained on the magtapes: (1) PCM files, containing digitized speech waveforms, (2) LOLA files (for "localized labels"), containing phonetic transcriptions and endpoints, and (3) WORDS files, containing orthographic transcriptions (English text) with endpoints of every word.

2. Equipment Setup

Figure 1 is an overall picture of the equipment setup. The subject and the interlocutor sat in separate, adjoining rooms with a window in between.
The subject spoke into a telephone handset and wore a small, lightweight headset through which he could hear the interlocutor. The telephone was a standard Western Electric rotary-dial phone, modified by the addition of a second microphone inside the mouthpiece of the handset. The second microphone was a small, high-quality electret microphone (Electronic Enterprises Model 5333). The interlocutor wore a pair of stereo headphones and a boom-mounted microphone. The two earpieces of the headphones monitored the clean speech and the telephone-degraded speech as they were recorded on the tape recorder. The interlocutor's microphone was connected through an audio amplifier to the subject's headset.

Two separate lines were used in the long-distance loop-back calling procedure. The subject's telephone was connected to Line 1. To initiate a loop-back connection, the subject placed a call by dialing the long-distance service and giving the access code, a customer code, the area code, and the telephone number corresponding to Line 2. The interlocutor "answered" by turning on a phone patch unit connected to Line 2. The Allnet long-distance service was used. At DCD-West, Allnet was reached simply by dialing "615"; the access code and customer code were not required. At DCD-East, Allnet was reached by dialing an "800" number. No attempt was made to select telephone lines based on their audio fidelity. Every conversation (session) used a different connection, and no connections were rejected because of poor quality.

Figure 2 is a more detailed diagram of the recording equipment. The central component of the setup is a recording system consisting of two units: a Sony model SL-2000 beta-format video cassette recorder and a Sony model PCM-701 ES digital audio processor. The inputs and outputs of the recording system are analog, but the audio is recorded digitally on video cassette tapes.
Recordings were made in stereo, with the clean speech on the right channel and the telephone speech on the left channel. At the beginning of each recording session, an audio sweep tone was included on the telephone channel for frequency-response calibration purposes. At DCD-West, the sweep tone was played out from a computer file by means of a digital-to-analog converter. At DCD-East, recordings of the sweep tone were played out from an audio cassette recorder. Interfaces between audio lines and telephone lines were handled by Heathkit model HD-15 phone patches. Phone Patch A was used to put the sweep tone on telephone Line 1. Phone Patch B was used to monitor Line 2, supplying audio to the left channel input of the recording system. Audio input to the right channel came from the electret microphone in the subject's telephone handset.

3. Subjects

The subjects were all adult male employees of ITTDCD. Most of the subjects were engineers; some were engineering managers. Almost all the subjects were native English speakers. Talkers from DCD-East largely speak the New York area dialect; talkers from DCD-West are more varied, coming from a wider geographic distribution. The 26 DCD-West subjects included all the male employees of the facility, and all of them finished all ten sessions. At DCD-East, 33 subjects started the recordings and 25 finished all ten sessions.

4. Speech Material

Subjects were first asked to read in a normal voice a printed page that contained (1) identification information, (2) an informed consent passage, (3) a short paragraph or group of sentences, and (4) a list of 3-digit phrases. They were then given one of five elicitation tasks by the interlocutor. The purpose of these tasks was to elicit natural, fluent speech from the subjects. Although the speech material was the subjects' side of conversations with the interlocutor, the interlocutor's participation was kept to a minimum. The following is a description of the elicitation tasks.

1. The Toys
This task involves the use of a prop made from plastic components of various shapes and sizes that are sold under the brand name of Construx. The subject is given a boxy object with turnable wheels, suction cups, a movable platform and several appendages. The interlocutor asks him to describe the object, how it is put together, and what it might be used for.

2. Shapes and Figures
The subject is given twelve cards, each containing an abstract figure. He is asked by the interlocutor to arrange them in a particular order and then to describe each one so that the interlocutor can put them in the same order.

3. The Conversation
This is an open conversation that usually requires more interaction with the interlocutor. The beginning topic is one in which the subject has indicated an interest. This is determined through a brief questionnaire filled out prior to the task. The interlocutor encourages the subject to discuss at length his own interests and experiences, and topics he is familiar with. Questions were chosen for interest and effectiveness, but were screened to avoid disclosing subjects' political views or sensitive personal information.

4. The Road Rally
The subject is given a map of a road rally that has thirteen place names missing. He must fill in the names on the map based on information from the questions he asks. The interlocutor has a similar map that contains all the place names. The interlocutor elicits as much verbal output as possible by asking for more details and clarification.

5. The Photographs
The subject views several photographs mounted on cards and is asked by the interlocutor to (1) describe each photograph, (2) guess what the person in the photograph is doing or thinking, and (3) say which photograph he likes best and why.

6. The Comic Book
In this task the subject is given a set of several comic strips from which the dialogue has been deleted.
The subject is then asked to make up a frame by frame reconstruction of the dialogue in his own words.

Sessions were terminated after an elapsed time of about five minutes of conversation. Segments of speech were then selected for digitization. Long, uninterrupted phrases and sentences were selected first. Long pauses were eliminated. For each session, about 30 seconds of speech, excluding pauses, was digitized. The video-format tapes contain the whole recorded sessions, not just the digitized portions.

5. Subjective Evaluation

The elicitation tasks were very effective in producing natural conversational speech. Even speakers known to be "quiet" had no difficulty talking fluently for the required period of time.

The audio fidelity of the "clean" channel of all the recordings is very good. The fidelity of the telephone channel is noticeably worse for recordings made at DCD-East than for recordings made at DCD-West. There are two reasons for this. First, the DCD-East telephone lines appear to have narrower bandwidth and greater loss in amplitude. Second, there is noticeable 60-Hz "hum" in the DCD-East recordings. This hum exists in spite of the fact that the same equipment was used at the two locations, and every possible step was taken to eliminate hum.

A second difference between the recordings made at the two locations is that there is practically no time delay between the clean and telephone channels of the DCD-West recordings, whereas there is often substantial delay (up to about one half second) in the DCD-East recordings. This observation suggests that the DCD-East connections may have sometimes involved a satellite link.

Several listeners who were experienced in processing telephone speech judged that the fidelity of the DCD-East calls was poorer than average, but that both the DCD-West and DCD-East recordings were within the range of typical long-distance calls.
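The delay between the clean and telephone channels noted in the Subjective Evaluation can be measured from the recordings themselves. The following is a minimal sketch of one common way to do so, a brute-force cross-correlation over candidate lags. It is not a procedure described in this report. The function name `estimate_delay`, the 8 kHz sample rate, and the synthetic test signal are all assumptions made for illustration.

```python
import random


def estimate_delay(clean, telephone, max_lag):
    """Return the lag (in samples) at which `telephone` best matches `clean`.

    A positive result means the telephone channel lags the clean channel.
    Brute-force cross-correlation over lags 0..max_lag (illustrative only).
    """
    best_lag, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        # Number of samples that overlap when telephone is shifted by `lag`.
        n = min(len(clean), len(telephone) - lag)
        if n <= 0:
            break
        score = sum(clean[i] * telephone[i + lag] for i in range(n))
        if score > best_score:
            best_lag, best_score = lag, score
    return best_lag


# Synthetic check with hypothetical values (not taken from the database).
SAMPLE_RATE_HZ = 8000   # assumed sample rate; the report does not state one
random.seed(0)
clean = [random.uniform(-1.0, 1.0) for _ in range(1000)]
true_lag = 400          # 0.05 s at the assumed 8 kHz rate
# Telephone channel modeled as a delayed, attenuated copy of the clean channel.
telephone = [0.0] * true_lag + [0.5 * x for x in clean]

lag = estimate_delay(clean, telephone, max_lag=500)
delay_seconds = lag / SAMPLE_RATE_HZ
print(f"estimated delay: {lag} samples ({delay_seconds:.3f} s)")
```

For real recordings, delays approaching one half second would require a correspondingly larger `max_lag`, and an FFT-based correlation would be far faster than this direct sum.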