Documentation: This database contains high-speed laryngeal videoendoscopic recordings of the vocal folds with synchronized audio recordings. It provides data for studying speech production theory and related topics. For example, it can be used to study the relationship between vocal fold vibration characteristics and the resulting voice quality: vibration characteristics can be extracted from the high-speed video recordings, and the corresponding voice characteristics from the synchronized audio recordings.

The database includes recordings from 9 subjects, none of whom had a history of voice disorder. Native language was not a recruitment criterion, so the subjects' native languages include English, Chinese Mandarin, Taiwanese Mandarin, Cantonese, German, etc.

Speakers were asked to sustain the vowel /i/ for approximately 10 s while holding voice quality, fundamental frequency (F0), and loudness as steady as possible. Across tokens, speakers varied their F0 (low, normal, and high) and voice quality (pressed, normal, and breathy) quasi-orthogonally, resulting in nine steady-state recordings from each speaker. In addition to these steady-state recordings, each speaker may also have glide phonations, such as a loudness glide, a voice quality glide, and an F0 glide. The vowel /i/ was selected to optimize the view of the vocal folds. Because every recording contains only /i/, no audio annotation is necessary.

Each folder S** (where ** denotes the subject number) contains the data from one subject; for example, S01 holds the recordings from subject 1. Video files are in .avi format and audio files are in .flac format. Each video file has a corresponding audio file with the same base file name but a different file extension.
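The folder and file naming described above can be used to pair each video with its audio file. The following is a minimal sketch, not part of the database itself: the function name `pair_recordings` is hypothetical, and it assumes that the .avi and .flac files sit directly inside each S** folder and use lowercase extensions.

```python
from pathlib import Path


def pair_recordings(root):
    """Map each subject folder name (e.g. "S01") to its (video, audio) pairs.

    Assumption: files are stored directly inside the subject folder and use
    lowercase .avi / .flac extensions.
    """
    pairs = {}
    for subject_dir in sorted(Path(root).glob("S*")):
        if not subject_dir.is_dir():
            continue  # skip stray files whose names happen to start with "S"
        matched = []
        for video in sorted(subject_dir.glob("*.avi")):
            # The audio file shares the base name and differs only in extension.
            audio = video.with_suffix(".flac")
            if audio.exists():
                matched.append((video, audio))
        pairs[subject_dir.name] = matched
    return pairs
```

Calling `pair_recordings(".")` from the directory where the database was unpacked would return a dictionary keyed by subject folder. Videos without a matching audio file are skipped rather than reported, which is a simplification.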
Audio signals were recorded synchronously with a Brüel & Kjær microphone (1.27 cm diameter; type 4193-L-004) and a conditioning amplifier (NEXUS 2690, Brüel & Kjær, Denmark), and were directly digitized at a sampling rate of 50 kHz. Microphone signals were bandpass filtered between 20 Hz and 22.4 kHz. The A/D converter (PCI-DAS64/M1/16, Measurement Computing, Norton, MA) had a voltage resolution of 16 bits with an input range of ±5 V. The audio recordings were later downsampled to 16 kHz and FLAC-compressed for analysis.

High-speed images of the vocal folds were recorded using a Phantom V210 camera (Vision Research, Wayne, NJ) at a sampling rate of 10 000 frames/s, with a resolution of 208 × 352 pixels. The camera was mounted on a Glidecam Camcrane 200 (Glidecam Industries, Kingston, MA). The A/D converter (Module 9223, National Instruments, Austin, TX) had a voltage resolution of 16 bits with an input range of ±10 V. Synchronized audio and high-speed images were recorded for 6 seconds. Each video recording has a size of about 5 GB. The video recordings included in the database have been converted to 5 fps for normal playback.

The recordings were collected between April 2012 and April 2013.
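Because the audio and video use different sampling rates, relating a video frame to the audio requires a small conversion. The sketch below shows that arithmetic using the rates stated above (10 000 frames/s for video, 16 kHz for the downsampled audio). It rests on two assumptions that should be checked against the actual files: that frame 0 and audio sample 0 coincide in time, and that the distributed .avi files keep every captured frame, with only the playback rate changed to 5 fps. The function names are illustrative.

```python
AUDIO_RATE_HZ = 16_000   # audio rate after downsampling (from the text)
VIDEO_RATE_FPS = 10_000  # camera capture rate (from the text)
DURATION_S = 6           # nominal recording length (from the text)


def frame_to_time(frame_index):
    """Capture time in seconds of a video frame, counted from frame 0."""
    return frame_index / VIDEO_RATE_FPS


def frame_to_audio_sample(frame_index):
    """Index of the audio sample nearest in time to the given video frame.

    Assumes frame 0 and audio sample 0 were captured at the same instant.
    """
    return round(frame_index * AUDIO_RATE_HZ / VIDEO_RATE_FPS)


# Nominal sizes of one 6 s recording at the stated rates.
expected_frames = DURATION_S * VIDEO_RATE_FPS   # 60 000 frames
expected_samples = DURATION_S * AUDIO_RATE_HZ   # 96 000 samples
```

At these rates there are 1.6 audio samples per video frame, so neighbouring frames can map to audio samples that are one or two indices apart.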