LDC Catalog No.: LDC2008S03
Release Date: March 19, 2008
Application(s): speech recognition, speech synthesis
License(s): LDC User Agreement for Non-Members
Licensing Instructions: Subscription & Standard Members, and Non-Members
|Morales, Nicolas. STC-TIMIT 1.0 LDC2008S03. Web Download. Philadelphia: Linguistic Data Consortium, 2008.
STC-TIMIT 1.0 is a telephone version of the TIMIT Acoustic-Phonetic Continuous Speech Corpus, LDC93S1 (TIMIT). TIMIT contains broadband recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. Created in 1993, TIMIT was designed to provide speech data for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. Since then, several corpora have been derived from the TIMIT database: NTIMIT, LDC93S2 (TIMIT recordings transmitted through a telephone handset and over various channels in the NYNEX telephone network, then redigitized); CTIMIT, LDC96S30 (TIMIT files passed through cellular telephone circuits); FFMTIMIT, LDC96S32 (TIMIT files re-recorded with a free-field microphone); and HTIMIT, LDC98S67 (a subset of TIMIT files re-recorded through different telephone handsets).
What differentiates STC-TIMIT 1.0 from other TIMIT-derived corpora is that the entire TIMIT database was passed through an actual telephone channel in a single call. Thus, a single type of channel distortion and noise affects the whole database.
The process was managed using a Dialogic switchboard for the calling and receiving ends. No transducer (microphone) was employed; the original digital signal was converted to analog using the switchboard's D/A converter, transmitted through a telephone channel and converted back to digital format before recording. As a result, the only distortion introduced is that of the telephone channel itself.
The STC-TIMIT 1.0 database is organized in the same manner as the original TIMIT corpus: 4620 files belonging to the training partition and 1680 files belonging to the test partition. Files were recorded using an 8 kHz sampling frequency and mu-law encoding. Additionally, four sets of two calibration tones were generated. One set was passed through the telephone line at approximately the start of each quarter of the database (both the source and the recorded calibration tones in each set are provided). The calibration tones are:
- 2 sec. 1 kHz tone
- 2 sec. sweep tone from 10 Hz to 4000 Hz.
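The two calibration tones can be approximated with a short script. This is a minimal sketch, not the procedure used to build the corpus: the function names are illustrative, and the sweep is assumed to be linear because the description does not say whether a linear or logarithmic sweep was used.

```python
import math

SAMPLE_RATE = 8000  # Hz, the sampling frequency stated for STC-TIMIT 1.0
DURATION = 2.0      # seconds, as stated for both calibration tones


def sine_tone(freq_hz, duration=DURATION, rate=SAMPLE_RATE):
    """Return a constant-frequency sine tone as a list of floats in [-1, 1]."""
    n = int(round(duration * rate))
    return [math.sin(2.0 * math.pi * freq_hz * i / rate) for i in range(n)]


def linear_sweep(f_start, f_end, duration=DURATION, rate=SAMPLE_RATE):
    """Return a sine sweep whose frequency rises linearly from f_start to f_end.

    ASSUMPTION: the corpus description does not specify the sweep shape;
    a linear sweep is used here for illustration only.
    """
    n = int(round(duration * rate))
    k = (f_end - f_start) / duration  # sweep rate in Hz per second
    samples = []
    for i in range(n):
        t = i / rate
        # The phase is the integral of the instantaneous frequency f_start + k*t.
        phase = 2.0 * math.pi * (f_start * t + 0.5 * k * t * t)
        samples.append(math.sin(phase))
    return samples


tone = sine_tone(1000.0)            # 2 s tone at 1 kHz
sweep = linear_sweep(10.0, 4000.0)  # 2 s sweep from 10 Hz to 4000 Hz
```

Note that the upper sweep limit of 4000 Hz equals the Nyquist frequency for 8 kHz sampling, so the sweep covers the full band the recordings can represent.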
Utterances in STC-TIMIT 1.0 are time-aligned with those of TIMIT with an average precision of 0.125 ms (1 sample), by maximizing the cross-correlation between pairs of files from each corpus. Thus, labels from TIMIT may be used for STC-TIMIT 1.0, and the effects of telephone channels may be studied on a frame-by-frame basis.
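One practical detail when reusing TIMIT labels: TIMIT transcription files give boundaries as sample indices at TIMIT's 16 kHz sampling rate, whereas STC-TIMIT 1.0 is sampled at 8 kHz. The sketch below rescales boundaries accordingly, assuming the TIMIT `start end label` line layout is applied directly. The function names and the example lines are hypothetical and are not taken from the corpus.

```python
TIMIT_RATE = 16000  # Hz, sampling rate of the original TIMIT recordings
STC_RATE = 8000     # Hz, sampling rate stated for STC-TIMIT 1.0


def convert_labels(lines, src_rate=TIMIT_RATE, dst_rate=STC_RATE):
    """Rescale 'start end label' sample boundaries from src_rate to dst_rate."""
    converted = []
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        start, end = int(parts[0]), int(parts[1])
        label = " ".join(parts[2:])
        # Integer arithmetic; odd boundaries are rounded down to a whole sample.
        new_start = start * dst_rate // src_rate
        new_end = end * dst_rate // src_rate
        converted.append((new_start, new_end, label))
    return converted


def samples_to_ms(n_samples, rate=STC_RATE):
    """Convert a sample count to milliseconds at the given sampling rate."""
    return 1000.0 * n_samples / rate


# Hypothetical label lines, invented for illustration only.
example = ["0 2000 h#", "2000 3501 sh", "3501 5000 iy"]
converted = convert_labels(example)
for start, end, label in converted:
    print(start, end, label)
```

With these definitions, `samples_to_ms(1)` gives 0.125, which matches the stated one-sample alignment precision at 8 kHz.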
First, a single WAV file was created by concatenating all files in the TIMIT database. This file was downsampled to 8 kHz and compressed using mu-law encoding.
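For readers unfamiliar with mu-law encoding, the following sketch shows the continuous mu-law companding curve with mu = 255, followed by a plain 8-bit quantizer. It is an illustration only: telephony codecs use a piecewise-linear approximation of this curve with a specific bit layout, which is not reproduced here, and the tool actually used to encode the corpus is not stated. The downsampling step is also omitted. All function names are illustrative.

```python
import math

MU = 255  # companding parameter used in mu-law telephony coding


def mulaw_compress(x, mu=MU):
    """Map a sample x in [-1, 1] to its companded value in [-1, 1]."""
    x = max(-1.0, min(1.0, x))  # clip out-of-range input
    return math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)


def mulaw_expand(y, mu=MU):
    """Invert mulaw_compress."""
    y = max(-1.0, min(1.0, y))
    return math.copysign(((1.0 + mu) ** abs(y) - 1.0) / mu, y)


def quantize_8bit(y):
    """Quantize a companded value in [-1, 1] to an integer code in 0..255."""
    return int(round((y + 1.0) / 2.0 * 255))


def dequantize_8bit(code):
    """Map an integer code in 0..255 back to a companded value in [-1, 1]."""
    return code / 255.0 * 2.0 - 1.0


def roundtrip(x):
    """Compress, quantize to 8 bits, then decode back to a linear sample."""
    return mulaw_expand(dequantize_8bit(quantize_8bit(mulaw_compress(x))))
```

The logarithmic curve spends more of the 256 codes on low-amplitude samples, so quiet speech is represented with finer resolution than it would be under uniform 8-bit quantization.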
Two telephone lines within the same building were connected to a Dialogic(R) card. One line served as the calling end and played the speech file, while the other served as the receiving end and recorded the new signal. The whole recording process was conducted in a single call. Incoming speech was recorded using an 8 kHz sampling frequency and mu-law encoding.
After recording, the file was pre-cut into segments according to the length of each corresponding TIMIT database file. Each resulting segment was then aligned to its corresponding file in TIMIT using the xcorr routine in Matlab(R). Based on these results, each segment was cut again from the original recording using the newly generated alignments. Thus, each file in STC-TIMIT 1.0 is aligned to its equivalent in TIMIT and has the same length.
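The pre-cut, align, and re-cut procedure can be sketched in plain Python. This is a brute-force illustration of choosing the lag that maximizes the cross-correlation, not the Matlab code used for the corpus. The function names, the `max_lag` search range, and the synthetic test signal (a scaled copy of the reference with silent padding) are all assumptions made for the example.

```python
import random


def best_lag(reference, recorded, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive lag means `recorded` is delayed relative to `reference`,
    i.e. recorded[i + lag] corresponds to reference[i].
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, ref_sample in enumerate(reference):
            j = i + lag
            if 0 <= j < len(recorded):
                score += ref_sample * recorded[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def cut_aligned(recording, rough_start, reference, max_lag):
    """Pre-cut a segment, estimate its lag, then re-cut from the full recording.

    Returns the aligned segment and its start index in `recording`.
    """
    length = len(reference)
    rough = recording[rough_start:rough_start + length]  # pre-cut by length
    lag = best_lag(reference, rough, max_lag)
    # Re-cut from the original recording, keeping the slice inside its bounds.
    start = max(0, min(rough_start + lag, len(recording) - length))
    return recording[start:start + length], start


# Synthetic demo: the "channel" only scales the signal and adds silent padding.
rng = random.Random(0)
reference = [rng.gauss(0.0, 1.0) for _ in range(200)]
recording = [0.0] * 50 + [0.5 * s for s in reference] + [0.0] * 50
aligned, start = cut_aligned(recording, 45, reference, max_lag=20)
```

In the demo the rough cut starts 5 samples early, and the lag search corrects this so that the re-cut segment has the same length as the reference. A real telephone channel also filters the signal and adds noise, and for full-length utterances an FFT-based correlation would be far faster than this nested loop.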
For an example of the data contained in this corpus, please listen to this audio sample.