CSLU: ISOLET Spoken Letter Database Version 1.3

Item Name: CSLU: ISOLET Spoken Letter Database Version 1.3
Author(s): Ronald Cole, Y Muthusamy, Mark Fanty
LDC Catalog No.: LDC2008S07
ISBN: 1-58563-488-3
ISLRN: 707-184-716-094-7
Release Date: September 15, 2008
Member Year(s): 2008
DCMI Type(s): Sound
Sample Type: PCM
Sample Rate: 16000
Data Source(s): microphone speech
Application(s): speech synthesis, speaker identification, language modeling
Language(s): English
Language ID(s): eng
License(s): CSLU Agreement
Online Documentation: LDC2008S07 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Cole, Ronald, Y Muthusamy, and Mark Fanty. CSLU: ISOLET Spoken Letter Database Version 1.3 LDC2008S07. Web Download. Philadelphia: Linguistic Data Consortium, 2008.


CSLU: ISOLET Spoken Letter Database Version 1.3, Linguistic Data Consortium (LDC) catalog number LDC2008S07 and ISBN 1-58563-488-3, was created by the Center for Spoken Language Understanding (CSLU) at OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, Oregon.

CSLU: ISOLET Spoken Letter Database Version 1.3 is a database of letters of the English alphabet spoken in isolation under quiet laboratory conditions and associated transcripts. The data was collected in 1990 and consists of two productions of each letter by 150 speakers (7800 spoken letters) for approximately 1.25 hours of speech. The subjects were recruited through advertising and consisted of 75 male speakers and 75 female speakers. Each subject received a free dessert at a local restaurant in exchange for his or her participation in the data collection. All speakers reported English as their native language. Their ages varied from 14 to 72 years; the speakers' average age was 35 years.
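The corpus size quoted above follows directly from the stated counts. The short Python sketch below reproduces that arithmetic and, taking the 1.25-hour figure at face value, derives the average speech duration per utterance it would imply. The function names are illustrative only, and how the 1.25-hour total was measured is not stated in this description.

```python
# Counts taken from the corpus description.
SPEAKERS = 150
LETTERS = 26       # letters of the English alphabet
PRODUCTIONS = 2    # each speaker produced each letter twice


def total_utterances(speakers=SPEAKERS, letters=LETTERS, productions=PRODUCTIONS):
    """Number of spoken letters implied by the stated counts."""
    return speakers * letters * productions


def mean_speech_seconds(total_hours, n_utterances):
    """Average duration per utterance implied by a total given in hours."""
    return total_hours * 3600 / n_utterances


n = total_utterances()
print(n)                                     # 7800
print(round(mean_speech_seconds(1.25, n), 3))
```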


Speech was recorded in the OGI speech recognition laboratory. The room measured 15' by 15', had a tile floor, standard office wallboard, and a drop ceiling, and contained two Sun workstations and three disk drives.

The recording equipment was selected to mimic the equipment used to collect the TIMIT database as closely as possible. The speech was recorded with a Sennheiser HMD 224 noise-canceling microphone and low-pass filtered at 7.6 kHz. Data capture was performed using the AT&T DSP32 board installed in a Sun 4/110. The data was sampled at 16 kHz and converted to RIFF (.WAV) format.
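As a rough illustration of what the stated format implies, the Python sketch below reads the header of a WAV file with the standard-library wave module and checks it against the 16 kHz sampling rate. Because the corpus files are not available here, it builds an in-memory stand-in. The mono, 16-bit layout and the two-second length of that stand-in are assumptions for the demonstration; this description gives only the PCM sample type and the sampling rate, and distributed files may differ in length.

```python
import io
import wave

EXPECTED_RATE_HZ = 16000   # sampling rate stated in the corpus description
DEMO_SECONDS = 2           # assumed length of the synthetic stand-in file


def describe_wav(source):
    """Return basic header facts for a WAV file (path or binary file object)."""
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        frames = wav.getnframes()
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": rate,
            "frames": frames,
            "duration_s": frames / rate,
            "matches_expected_rate": rate == EXPECTED_RATE_HZ,
        }


def make_silent_wav(seconds=DEMO_SECONDS, rate=EXPECTED_RATE_HZ):
    """Build an in-memory mono 16-bit WAV of silence (assumed layout)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)      # assumption: single channel
        wav.setsampwidth(2)      # assumption: 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * (seconds * rate))
    buf.seek(0)
    return buf


info = describe_wav(make_silent_wav())
print(info["frames"], info["duration_s"], info["matches_expected_rate"])
```

With a real corpus file, describe_wav can be given the file path instead of the in-memory buffer.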

The subjects were seated in front of a Sun workstation and prompted with letters in random order. After each prompt, the subject would strike the return key and say the letter. Two seconds of speech were recorded and immediately played back for verification. If the subject spoke too soon or too late and missed the two-second buffer, or if the experimenter or subject decided that the letter was misspoken, the recording was repeated. There was no attempt to elicit ideal speech. A letter was judged to be misspoken only if there was a significant departure from normal pronunciation.

After the recording session, each utterance was checked by a human examiner in two steps. First, the examiner viewed a waveform of the utterance to confirm that the speech was padded with silence. The examiner then listened to the speech and noted any ambiguous or misspoken utterances. Every utterance noted by the examiner was reviewed by two additional human examiners. If a majority of the examiners judged an utterance to be abnormal, that utterance and all other utterances from the same speaker were removed from the corpus.
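The removal rule described above can be sketched in Python as a majority vote followed by a per-speaker filter. This is only an illustration of the stated rule, not the tooling used by CSLU. The utterance and speaker ids, the data layout, and the function names are all hypothetical.

```python
from typing import Dict, List


def majority_abnormal(votes: List[bool]) -> bool:
    """True when more than half of the examiners judged the utterance abnormal."""
    return 2 * sum(votes) > len(votes)


def filter_corpus(speaker_of: Dict[str, str],
                  votes: Dict[str, List[bool]]) -> Dict[str, str]:
    """Drop every utterance of any speaker who has a rejected utterance.

    speaker_of maps utterance id -> speaker id.
    votes maps each flagged utterance id -> one vote per examiner
    (True means "abnormal"); utterances nobody flagged are absent.
    """
    rejected = {speaker_of[utt] for utt, v in votes.items() if majority_abnormal(v)}
    return {utt: spk for utt, spk in speaker_of.items() if spk not in rejected}


# Hypothetical ids, for illustration only.
speaker_of = {"s01-a1": "s01", "s01-b1": "s01", "s02-a1": "s02", "s03-a1": "s03"}
votes = {
    "s01-b1": [True, True, False],   # 2 of 3 abnormal: speaker s01 is removed
    "s03-a1": [True, False, False],  # 1 of 3 abnormal: utterance is kept
}
kept = filter_corpus(speaker_of, votes)
print(sorted(kept))
```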

The transcriptions of the recorded speech are time-aligned phonetic transcriptions conforming to the CSLU Labeling standards. Time-aligned word transcriptions are represented in a standard orthography or romanization. Speech and non-speech phenomena are distinguished. The transcriptions are aligned to a waveform by placing boundaries to mark the beginning and ending of words. In addition to the specification of boundaries, this level of transcription includes additional commentary on salient speech and non-speech characteristics, such as glottalization, inhalation, and exhalation.


For an example of the data in this corpus, please listen to this audio sample (.WAV) of a speaker speaking the letter "a". The labeling for this sample can be seen below:

    MillisecondsPerFrame: 1.000000
    END OF HEADER
    0 95 .pau
    95 285 ^
    285 425 .pau
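The sample labeling above can be read programmatically. The Python sketch below assumes the layout suggested by that one sample: a header ending in "END OF HEADER", followed by one segment per line with a start frame, an end frame, and a label separated by whitespace. The full CSLU label format may include header fields or conventions not shown here, and the names Segment and parse_labels are illustrative.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_ms: float
    end_ms: float
    label: str


def parse_labels(text: str) -> List[Segment]:
    """Parse a label file laid out like the sample (assumed format)."""
    ms_per_frame = 1.0
    segments = []
    in_body = False
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if not in_body:
            # Header: pick up the frame duration, then wait for the end marker.
            if line.startswith("MillisecondsPerFrame:"):
                ms_per_frame = float(line.split(":", 1)[1])
            elif line == "END OF HEADER":
                in_body = True
            continue
        # Body: "<start frame> <end frame> <label>"
        start, end, label = line.split(None, 2)
        segments.append(
            Segment(int(start) * ms_per_frame, int(end) * ms_per_frame, label)
        )
    return segments


SAMPLE = """MillisecondsPerFrame: 1.000000
END OF HEADER
0 95 .pau
95 285 ^
285 425 .pau
"""

segs = parse_labels(SAMPLE)
speech = [s for s in segs if s.label != ".pau"]   # drop the pause segments
print(speech[0].label, speech[0].end_ms - speech[0].start_ms)
```

For this sample, the single non-pause segment spans frames 95 to 285, which is 190 ms at one millisecond per frame.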
