CSLU: ISOLET Spoken Letter Database Version 1.3


Item Name: CSLU: ISOLET Spoken Letter Database Version 1.3
Authors: R. A. Cole, Y. Muthusamy, and M. Fanty
LDC Catalog No.: LDC2008S07
ISBN: 1-58563-488-3
Release Date: Sep 15, 2008
Data Type: speech
Sample Rate: 16000 Hz
Sampling Format: PCM
Data Source(s): microphone speech
Application(s): language modeling, speaker identification, speech synthesis
Language(s): English
Language ID(s): eng
Distribution: 1 CD
Member fee: $0 for 2008 members
Non-member Fee: US $150.00
Reduced-License Fee: US $150.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: R. A. Cole, Y. Muthusamy, and M. Fanty. 2008. CSLU: ISOLET Spoken Letter Database Version 1.3. Philadelphia: Linguistic Data Consortium.

Introduction

CSLU: ISOLET Spoken Letter Database Version 1.3, Linguistic Data Consortium (LDC) catalog number LDC2008S07 and ISBN 1-58563-488-3, was created by the Center for Spoken Language Understanding (CSLU) at the OGI School of Science and Engineering, Oregon Health and Science University, Beaverton, Oregon.

CSLU: ISOLET Spoken Letter Database Version 1.3 is a database of letters of the English alphabet spoken in isolation under quiet laboratory conditions and associated transcripts. The data was collected in 1990 and consists of two productions of each letter by 150 speakers (7800 spoken letters) for approximately 1.25 hours of speech. The subjects were recruited through advertising and consisted of 75 male speakers and 75 female speakers. Each subject received a free dessert at a local restaurant in exchange for his or her participation in the data collection. All speakers reported English as their native language. Their ages varied from 14 to 72 years; the speakers' average age was 35 years.

Data

Speech was recorded in the OGI speech recognition laboratory. The room measured 15' by 15' with a tile floor, standard office wall board and drop ceiling and contained two Sun workstations and three disk drives.

The recording equipment was selected to mimic the equipment used to collect the TIMIT database as closely as possible. The speech was recorded with a Sennheiser HMD 224 noise-canceling microphone, low pass filtered at 7.6 kHz. Data capture was performed using the AT&T DSP32 board installed in a Sun 4/110. The data were sampled at 16 kHz and converted to RIFF(.WAV) format.
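As an informal sketch of what the stated format implies (16 kHz, 16-bit linear PCM in a RIFF .WAV container), the snippet below writes and reads back a one-second mono WAV file using Python's standard wave module. The filename and the sine tone are illustrative assumptions, not part of the corpus.

```python
import math
import struct
import wave

# Parameters matching the stated corpus format:
# 16 kHz sample rate, 16-bit linear PCM, mono RIFF (.WAV).
SAMPLE_RATE = 16000
path = "example_16khz.wav"  # hypothetical filename, not a corpus file

# Write one second of a 440 Hz sine tone as 16-bit PCM.
with wave.open(path, "wb") as w:
    w.setnchannels(1)
    w.setsampwidth(2)          # 2 bytes = 16-bit samples
    w.setframerate(SAMPLE_RATE)
    samples = (
        int(32767 * 0.3 * math.sin(2 * math.pi * 440 * n / SAMPLE_RATE))
        for n in range(SAMPLE_RATE)
    )
    w.writeframes(b"".join(struct.pack("<h", s) for s in samples))

# Read the header back to verify the format.
with wave.open(path, "rb") as r:
    print(r.getframerate(), r.getsampwidth() * 8, r.getnchannels())
    # prints: 16000 16 1
```

A tool along these lines can confirm that files in a delivery actually carry the advertised sample rate and bit depth before any downstream processing.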

The subjects were seated in front of a Sun workstation and prompted with letters in random order. After each prompt, the subject would strike the return key and say the letter. Two seconds of speech were recorded and immediately played back for verification. If the subject spoke too soon or too late and missed the two-second buffer, or if the experimenter or subject decided that the letter was misspoken, the recording was repeated. There was no attempt to elicit ideal speech. A letter was judged to be misspoken only if there was a significant departure from normal pronunciation.

After the recording session, each utterance was verified by a human examiner in two steps. First, the examiner viewed the waveform of the utterance to confirm that the speech was padded with silence. The examiner then listened to the speech and noted any ambiguous or misspoken utterances. All utterances flagged by the examiner were reviewed by two additional human examiners. If a majority of the examiners judged an utterance to be abnormal, that utterance, and the rest of the utterances from that speaker, were removed from the corpus.

The transcriptions of the recorded speech are time-aligned phonetic transcriptions conforming to the CSLU Labeling standards. Time-aligned word transcriptions are represented in a standard orthography or romanization. Speech and non-speech phenomena are distinguished. The transcriptions are aligned to a waveform by placing boundaries to mark the beginning and ending of words. In addition to the specification of boundaries, this level of transcription includes additional commentary on salient speech and non-speech characteristics, such as glottalization, inhalation, and exhalation.

Samples

For an example of the data in this corpus, please listen to this audio sample (.WAV) of a speaker speaking the letter "a". The label file for this sample is shown below:

MillisecondsPerFrame: 1.000000
END OF HEADER
0 95 .pau
95 285 ^
285 425 .pau
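As an informal sketch (not an official CSLU tool), the label layout in the sample — a MillisecondsPerFrame header, an END OF HEADER marker, then start/end/label triples in frames — could be parsed like this; the parser below is an assumption based only on the sample shown, and embeds that sample as its input.

```python
# Hedged sketch of a parser for the label layout shown in the sample:
# a "MillisecondsPerFrame:" header, an "END OF HEADER" marker, then
# "start end label" triples whose times are frames scaled to milliseconds.
sample = """MillisecondsPerFrame: 1.000000
END OF HEADER
0 95 .pau
95 285 ^
285 425 .pau"""

def parse_labels(text):
    ms_per_frame = 1.0
    segments = []
    in_body = False
    for line in text.splitlines():
        line = line.strip()
        if not in_body:
            if line.startswith("MillisecondsPerFrame:"):
                ms_per_frame = float(line.split(":", 1)[1])
            elif line == "END OF HEADER":
                in_body = True
        elif line:
            start, end, label = line.split(None, 2)
            segments.append((int(start) * ms_per_frame,
                             int(end) * ms_per_frame,
                             label))
    return segments

print(parse_labels(sample))
# [(0.0, 95.0, '.pau'), (95.0, 285.0, '^'), (285.0, 425.0, '.pau')]
```

Under this reading, the sample describes 95 ms of leading silence, the vowel of "a" from 95 ms to 285 ms, and trailing silence to 425 ms.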

Content Copyright

Portions © 1990, 1996, 2000, 2002 Center for Spoken Language Understanding, Oregon Health and Science University; © 2008 Trustees of the University of Pennsylvania