readme.txt

The UCLA Speaker Variability Database
by Patricia Keating, Jody Kreiman, Abeer Alwan, Adam Chong, Yoonjeong Lee

Language (ID): English (eng)
Data Source: microphone speech, microphone conversation
Recommended Applications: linguistic analysis, speech recognition, speaker identification, phonetics, psycholinguistics

The UCLA Speaker Variability Database comprises high-quality audio recordings from 202 speakers, 101 men and 101 women, performing 12 brief speech tasks in English over three recording sessions (total amount of speech: 300-450 sec per speaker). This public database was designed to sample variability in speaking within individual speakers and across a large number of speakers. The speaker set, sampled from the current university community and similar in age, is gender-balanced and spans a variety of language backgrounds. The database can serve as a testing ground for research questions involving between-speaker variability, within-speaker variability, and text-dependent variability.

Apart from the reading tasks, the tasks are unscripted. The recordings have been orthographically transcribed, and dictionary broad transcriptions (in ARPAbet) have been force-aligned. Transcriptions are provided as separate .TextGrid files.

Recordings were made in a sound-attenuated booth using a Bruel & Kjaer microphone suspended from a baseball cap worn by the speaker. Recordings were made direct-to-disk at a 22 kHz sampling rate (bit rate: 705 kbps, bit depth: 32) using PCQuirerX and its hardware. The audio files available through LDC were downsampled to 16-bit, 16 kHz audio and FLAC-compressed. The higher-resolution .wav files (16-bit, 22 kHz audio) are available by request to the first author, Patricia Keating.

Each folder contains one speaker's data. The corpus is entirely in English, but speakers' language backgrounds vary: of the 202 speakers, 143 are (self-reported) monolingual English speakers. The Excel file "public_database_speaker_info" contains information about the speakers and the recordings.

Files are named with the speaker number, followed by the session (A/B/C), followed by the speech task (see the sketch after the list below). The vowels and sentences tasks were performed in all three sessions. The other tasks were:

- instructions in session A
- neutral in session A
- happy in session B
- phonecall in session B
- annoyed in session C
- video in session C
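Purely as an illustration (not part of the distribution), such names can be split with a short Python script. The separator assumed below is a guess; adjust the pattern to the actual file names, e.g. a name like "101A_sentences.flac":

    import re
    from pathlib import Path

    # Assumed pattern: speaker number, session letter A/B/C, then task name,
    # with an optional underscore separator. Adjust to the actual naming scheme.
    NAME_RE = re.compile(r"^(?P<speaker>\d+)(?P<session>[ABC])_?(?P<task>[a-z]+)$")

    def parse_name(path):
        """Return (speaker, session, task) parsed from a corpus file name."""
        m = NAME_RE.match(Path(path).stem)
        if m is None:
            raise ValueError(f"unrecognized file name: {path}")
        return m.group("speaker"), m.group("session"), m.group("task")

    # Example with an assumed name -> ('101', 'A', 'sentences')
    print(parse_name("101A_sentences.flac"))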
Instructions to the speakers for the tasks (shown on-screen) were:

Vowels task: "Please say the "aahh" vowel sound (as in the word "spa") three times, pausing in between, like this: "aahh" ... "aahh" ... "aahh""

Sentences task: "Each of the next 30 screens shows one sentence. Please read the sentence out loud, then click "Next" to move on to the next sentence. If you make a mistake, don't worry - no need to read it again." [The 5 IEEE/Harvard sentences recorded are: The boy was there when the sun rose. / Kick the ball straight and follow through. / Help the woman get back to her feet. / A pot of tea helps to pass the evening. / The soft cushion broke the man's fall.]

Instructions task: "Talk to the RA who is outside the booth. Give her either DIRECTIONS on how to go somewhere, or INSTRUCTIONS on how to do something (your choice - anything you like). Try to talk for 30 seconds." [a printed list of a few possible topics was available]

Neutral task: "Tell the RA about a CONVERSATION that wasn't important - not exciting, not upsetting, just normal. Repeat that conversation as best as you can, in a "FIRST SHE SAID...THEN I SAID" style. Try to talk for 30 seconds." [a printed list of a few possible topics was available]

Happy task: "Talk to the RA who is outside the booth. Tell her about a CONVERSATION you've had about something exciting, that made you really happy. Repeat that conversation as best as you can, in a "FIRST SHE SAID...THEN I SAID" style. Try to talk for 30 seconds." [a printed list of a few possible topics was available]

Phonecall task: "Using your own phone or ours, call the friend or relative you've arranged to talk to at this time. Talk about anything you want for a couple of minutes. Only your side of the conversation will be recorded."

Annoyed task: "Talk to the RA who is outside the booth. Tell her about a CONVERSATION you've had about something that really annoyed you. Repeat that CONVERSATION as best as you can, in a "FIRST HE SAID..., THEN I SAID ..." style. Try to talk for 30 seconds. Don't say anything that would embarrass anyone else!" [a printed list of a few possible topics was available]

Video task: "You're going to watch a 1-minute collection of videos of either kittens or puppies (your choice). Please talk out loud to the pets as you watch the videos. Can you be as cute as they are?"

Most of the .flac files have two corresponding Praat TextGrid files. "filename.TextGrid" has one tier, an orthographic sentence/utterance transcription, which was the input to forced alignment. "filename_FAVE.TextGrid" or "filename_darla.TextGrid" is the output of forced alignment (with either the Penn FAVE aligner or the Dartmouth DARLA aligner) and has two tiers: the first with the aligned phonemic transcription (in ARPAbet) and the second with the aligned orthographic word transcription. The vowel tasks have only one TextGrid, as these recordings were not force-aligned; in the vowel files, the three [a] vowels are segmented and labeled. For all of the recordings from the sentence reading task, the force-aligned segmentations were individually checked and manually corrected to provide precise alignments.

A more detailed description of the database is available at https://icphs2019.org/icphs2019-fullpapers/pdf/full-paper_46.pdf:

P. Keating, J. Kreiman & A. Alwan (2019), "A new speech database for within- and between-speaker variability," Proceedings of the 19th International Congress of Phonetic Sciences, Melbourne, Australia, ed. Calhoun et al., pp. 736-739. Canberra, Australia: Australasian Speech Science and Technology Association Inc.

If you have any questions, please email Patricia Keating (keating AT humnet DOT ucla DOT edu).

Date: December 9, 2020
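Finally, as an unofficial usage sketch, one recording and its forced-alignment TextGrid (described above) can be loaded in Python with the third-party soundfile and textgrid packages. These package choices and the file names below are illustrative assumptions only; any equivalent tools will work:

    import soundfile as sf   # reads FLAC via libsndfile
    import textgrid          # parses Praat TextGrid files

    # Hypothetical file names; substitute real files from a speaker's folder.
    audio, rate = sf.read("101A_sentences.flac")       # 16-bit, 16 kHz audio
    print(f"{len(audio) / rate:.1f} s of audio at {rate} Hz")

    tg = textgrid.TextGrid.fromFile("101A_sentences_FAVE.TextGrid")
    phones, words = tg[0], tg[1]   # tier 1: ARPAbet phones; tier 2: words
    for iv in words:
        if iv.mark:                # skip empty (pause/silence) intervals
            print(f"{iv.minTime:.3f}-{iv.maxTime:.3f}  {iv.mark}")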