File: STC-TIMIT_1_0.txt, updated: 4/October/2007.
Author: Nicolas Morales

The STC-TIMIT 1.0 Speech Corpus

HCTLab
ATVS Lab
Universidad Autónoma de Madrid
Escuela Politécnica Superior
Ciudad Universitaria de Cantoblanco
Calle Francisco Tomás y Valiente, 11
28049 - Madrid (Spain)

Copyright Pending

The STC-TIMIT 1.0 Speech Corpus is a telephone version of the widely
used TIMIT speech corpus. STC-TIMIT 1.0 was recorded by passing the
original files of the TIMIT corpus through an actual telephone channel.
The process was managed using a Dialogic switchboard for the calling and
receiving ends, and no transducer (microphone) was employed: the
original digital signal is converted to analog using the switchboard's
D/A converter, transmitted through a telephone channel and converted
back to digital format before recording. As a result, the only
distortion introduced is that of the telephone channel itself.

What makes this database special is that the whole original TIMIT
database was passed through the telephone channel in a single call.
Thus, a single type of channel distortion and noise affects the whole
database.

The database is organized in the same manner as the original TIMIT
corpus: 4620 files belonging to the training partition and 1680
belonging to the test partition. Files are recorded using 8 kHz sampling
frequency and mu-law encoding.

Additionally, 4 sets of 2 calibration tones were generated. These were
passed through the telephone line approximately at the start of each
quarter of the database (both the source and the recorded calibration
tones in each set are provided). The calibration tones are:
 - 2 sec. 1 kHz tone
 - 2 sec. sweep tone from 10 Hz to 4000 Hz

Utterances in STC-TIMIT are time-aligned with those of TIMIT with an
average precision of 0.125 ms (1 sample at 8 kHz), by maximizing the
cross-correlation between pairs of files from each corpus.
Thus, labels from TIMIT may be used for STC-TIMIT 1.0, and the effects
of telephone channels may be studied on a frame-by-frame basis (our
method is more efficient than sending start and end tones, as is done in
the NTIMIT corpus, and the result is a significantly more precise
alignment).

Development
-----------

First, a single wav file was created by concatenating all files in the
TIMIT database. This file was downsampled to 8 kHz and compressed using
mu-law encoding.

Two telephone lines within the same building were connected to a
Dialogic(R) card. One of the lines was used as the calling end and
played the speech file, while the other was set as the receiving end and
recorded the new signal. The whole recording process was conducted in a
single call. Incoming speech was recorded using 8 kHz sampling frequency
and mu-law encoding.

After recording, the file was pre-cut according to the lengths of the
TIMIT database files. Each resulting file was then aligned to its
corresponding file in TIMIT using the xcorr routine in Matlab(R). Based
on these results, each file was cut again from the original recording
using the newly generated alignments. Thus, each file in STC-TIMIT 1.0
is aligned to its equivalent in TIMIT and has the same length.

Related publications
--------------------

1. N. Morales. "Robust Speech recognition under band-limited channels
   and other channel distortions". PhD dissertation, Computer Science
   Department, Universidad Autónoma de Madrid, Spain. November, 2007.

2. N. Morales, D.T. Toledano, J.H.L. Hansen and J. Garrido.
   "Multivariate cepstral feature compensation on band limited data for
   robust speech recognition". Proceedings NODALIDA’07, pages 144-151.
   May, 2007.

3. N. Morales, J. Tejedor, D.T. Toledano and J. Garrido. "STC-TIMIT:
   Sending TIMIT through a real and single telephone channel". Submitted
   to Language Resources and Evaluation Conference. May, 2008.

Credits
-------

Original idea, database design and signal pre- and post-processing by
Nicolas Morales.
Dialogic(R) handling and the recording process were performed by Javier
Tejedor.
Supervised by Profs. Doroteo T. Toledano, Javier Garrido and Jose Colas.
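
Illustrative alignment sketch
-----------------------------

The cross-correlation alignment described under Development can be
sketched as follows. This is a minimal illustration and not the code
used to build the corpus: the original work used the xcorr routine in
Matlab(R), whereas this sketch uses plain Python with a brute-force lag
search. The function names (best_lag, slice_aligned), the max_lag search
window and the synthetic test signal are assumptions made for the
example.

```python
import math
import random

SAMPLE_RATE_HZ = 8000  # sampling frequency stated in this document


def best_lag(reference, recorded, max_lag):
    """Return the lag (in samples) that maximizes the cross-correlation.

    A positive lag means `recorded` is delayed relative to `reference`.
    `max_lag` is a hypothetical search window, not a value from the corpus.
    """
    best, best_score = 0, -math.inf
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for n, ref_sample in enumerate(reference):
            m = n + lag
            if 0 <= m < len(recorded):
                score += ref_sample * recorded[m]
        if score > best_score:
            best, best_score = lag, score
    return best


def slice_aligned(recording, start, lag, length):
    """Cut `length` samples from the full recording, moving the pre-cut start by `lag`."""
    begin = start + lag
    return recording[begin:begin + length]


# Synthetic stand-in for one utterance (assumption: white noise, 400 samples).
rng = random.Random(0)
reference = [rng.uniform(-1.0, 1.0) for _ in range(400)]

# Simulated single-call recording: lead-in, an unknown channel delay,
# an attenuated copy of the utterance, a tail, and a little additive noise.
lead_in, true_delay = 50, 7
recording = [0.0] * (lead_in + true_delay) + [0.6 * s for s in reference] + [0.0] * 50
recording = [s + rng.gauss(0.0, 0.01) for s in recording]

# Step 1: pre-cut using the known utterance length (ignores the channel delay).
pre_cut = recording[lead_in:lead_in + len(reference)]

# Step 2: estimate the residual delay by maximizing the cross-correlation.
lag = best_lag(reference, pre_cut, max_lag=20)

# Step 3: cut again from the original recording using the new alignment.
aligned = slice_aligned(recording, lead_in, lag, len(reference))

print("estimated lag:", lag, "samples =", 1000.0 * lag / SAMPLE_RATE_HZ, "ms")
print("aligned length matches reference:", len(aligned) == len(reference))
```

At 8 kHz one sample lasts 1000 / 8000 = 0.125 ms, which is the alignment
precision quoted above. The brute-force search is quadratic in the
signal length, so for full-length utterances an FFT-based
cross-correlation would normally be preferred.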