File: WTIMIT_1_0.txt, 17.12.2009
Author: Patrick Bauer

The WTIMIT 1.0 Speech Corpus

Abteilung Signalverarbeitung
Institut für Nachrichtentechnik
Technische Universität Braunschweig
Schleinitzstraße 22
38106 Braunschweig, Germany

(Copyright Pending)

WTIMIT 1.0 [1] is a wideband mobile telephony derivative of the TIMIT Acoustic-Phonetic Continuous Speech Corpus. With the advent of wideband speech codecs, wideband telephony speech transmission is already feasible, even in an increasing number of mobile networks. Hence, a real-world wideband mobile telephone bandwidth adjunct to TIMIT is desirable for a wide range of scientific investigations, as well as for the development and evaluation of systems [2].

WTIMIT 1.0 was recorded after transmission of the original TIMIT speech files over T-Mobile’s AMR-WB-capable 3G mobile network in The Hague, The Netherlands. The transmission was conducted by means of two Nokia 6220 mobile phones that were prepared for electrical input and output, with microphone equalization switched off and the AMR-WB codec in use.

WTIMIT 1.0 is organized according to the original TIMIT corpus. The training subset consists of 4620 speech files, while the test subset contains 1680 speech files. The speech format of the WTIMIT corpus is raw (without any header information) and specified as follows:

- 16 kHz sampling rate
- signed, 16 bit, 1-channel linear PCM sampling format
- little-endian byte order

Development
-----------

The original TIMIT speech files were converted into raw data by dropping the first 1024 bytes of header information and were then concatenated into 11 signal chunks of at most 30 min duration each. Each signal chunk was preceded by a 4 s calibration tone, comprising 2 s of a 1 kHz sine wave followed by 2 s of a linear sweep from 0 to 8 kHz.

At the sending end, the prepared speech chunks were played back by a laptop PC and digitally transferred via FireWire to an external DAC.
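As an illustration of the raw format and the calibration tone described above, the following Python sketch decodes headerless 16 bit little-endian PCM, drops a 1024 byte TIMIT header, and synthesizes a 4 s tone at 16 kHz. It is not the tooling used to build the corpus. The function names, the tone amplitude, and the zero start phase of the sweep are assumptions made for this example only.

```python
import array
import math
import sys

SAMPLE_RATE = 16000        # Hz, as specified for WTIMIT
TIMIT_HEADER_BYTES = 1024  # header size dropped from the original TIMIT files


def strip_timit_header(data: bytes) -> bytes:
    """Drop the first 1024 bytes, leaving only raw sample data."""
    return data[TIMIT_HEADER_BYTES:]


def pcm_from_bytes(data: bytes) -> array.array:
    """Decode headerless signed 16 bit little-endian mono PCM."""
    samples = array.array("h")
    samples.frombytes(data)
    if sys.byteorder == "big":
        samples.byteswap()  # file data is little-endian
    return samples


def calibration_tone(amplitude: float = 0.5) -> array.array:
    """2 s of a 1 kHz sine followed by 2 s of a linear sweep from 0 to 8 kHz.

    `amplitude` (fraction of full scale) is an assumed value; the level
    actually used for the corpus is not documented here.
    """
    n = 2 * SAMPLE_RATE
    peak = amplitude * 32767
    tone = array.array("h")
    for i in range(n):
        t = i / SAMPLE_RATE
        tone.append(int(round(peak * math.sin(2 * math.pi * 1000.0 * t))))
    # Linear sweep: instantaneous frequency f(t) = k * t with k = 8000 Hz / 2 s,
    # so the phase is 2 * pi * k * t^2 / 2 = pi * k * t^2.
    k = 8000.0 / 2.0
    for i in range(n):
        t = i / SAMPLE_RATE
        tone.append(int(round(peak * math.sin(math.pi * k * t * t))))
    return tone
```

A WTIMIT file itself has no header, so its bytes can be passed to pcm_from_bytes directly; strip_timit_header applies only to the original TIMIT files.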
The analog signal was electrically fed into the microphone input of the transmitting mobile phone using an audio quality test cable. The output attenuation of the DAC was adjusted in order to prevent analog saturation at the input circuit of the phone while ensuring optimal dynamic range.

For each speech chunk, a separate call was established to the phone at the receiving end. Using the field test monitoring software of the phones, it was confirmed that they were situated in different network cells at all times during transmission and that the AMR-WB speech codec was employed constantly at the data rate of 12.65 kbit/s. Note that this bitrate is by far the most widely used one.

At the receiving end, the analog headphone output of the receiving mobile phone was connected electrically to an ADC. The analog input gain of the ADC was adjusted once initially to exploit its dynamic range. Sampling was performed at a rate of 48 kHz, the native sampling rate of the ADC, and with 16 bit precision. The digital speech signals were transferred to a laptop PC, again via FireWire, and recorded onto a hard drive.

The transmitted speech chunks were decimated from 48 kHz to 16 kHz sampling rate and finally de-concatenated by maximizing the cross-correlation between them and the original speech files. Following the de-concatenation methodology of STC-TIMIT [3], the utterances in WTIMIT 1.0 can be considered to be time-aligned with those of TIMIT with an average precision of one sample, so TIMIT’s original label files are essentially valid for WTIMIT as well. However, misalignments of about 10 to 20 ms were found to be frequently produced by the channel, mainly during speech pauses. Parts of the affected speech files are therefore slightly misaligned against the original label information. These channel effects may be related to the packet switching domain in the UMTS Core Network.
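The de-concatenation step described above can be sketched as a search for the lag that maximizes the cross-correlation between an original utterance and the recorded chunk. The Python example below is a minimal brute-force illustration on synthetic data, not the procedure actually used for WTIMIT or STC-TIMIT. The names best_lag, cut_utterance, and max_lag are hypothetical, and a practical implementation would more likely use an FFT-based correlation.

```python
def best_lag(reference, recording, max_lag):
    """Return the lag in [0, max_lag] that maximizes the cross-correlation
    between `reference` and the equally long segment of `recording`."""
    best, best_score = 0, float("-inf")
    n = len(reference)
    for lag in range(max_lag + 1):
        segment = recording[lag:lag + n]
        if len(segment) < n:
            break  # reference no longer fits completely into the recording
        score = sum(r * s for r, s in zip(reference, segment))
        if score > best_score:
            best, best_score = lag, score
    return best


def cut_utterance(reference, recording, max_lag):
    """Cut the segment of `recording` that best aligns with `reference`."""
    lag = best_lag(reference, recording, max_lag)
    return recording[lag:lag + len(reference)]


# Synthetic demo: the "recording" is the reference at half amplitude,
# delayed by 37 samples and surrounded by silence.
reference = [((i * 41) % 101) - 50 for i in range(200)]
recording = [0.0] * 37 + [0.5 * v for v in reference] + [0.0] * 50

lag = best_lag(reference, recording, max_lag=80)      # 37
aligned = cut_utterance(reference, recording, max_lag=80)
```

Because the score is an unnormalized correlation, this sketch assumes that the search window contains no segment much louder than the true match; a normalized correlation would be more robust in that case.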
Depending on the traffic load in the network, packets are buffered and queued, which results in a variable packet delay (jitter).

Related publications
--------------------

[1] P. Bauer, T. Fingscheidt, and D. Scheler. "WTIMIT: The TIMIT Speech Corpus Transmitted Over the 3G AMR Wideband Mobile Network". Submitted to Language Resources and Evaluation Conference (LREC). May, 2010.

[2] P. Bauer and T. Fingscheidt. "A Statistical Framework for Artificial Bandwidth Extension Exploiting Speech Waveform and Phonetic Transcription". European Signal Processing Conference (EUSIPCO). August, 2009.

[3] N. Morales, J. Tejedor, J. Garrido, J. Colás, and D. T. Toledano. "STC-TIMIT: Generation of a Single-channel Telephone Corpus". Language Resources and Evaluation Conference (LREC). May, 2008.

Credits
-------

Original idea and supervision by Prof. T. Fingscheidt, Technische Universität Braunschweig, Braunschweig, Germany.

Database design, signal pre-/post-processing, and recording process by P. Bauer (assisted by D. Scheler), Technische Universität Braunschweig, Braunschweig, Germany.

General project support by D. Kistowski-Cames, Deutsche Telekom AG, Bonn, Germany.

Local SIM cards by P. Lang, T-Mobile NL, The Hague, The Netherlands.

Prepared mobile phones by P. Nevala, Nokia, Oulu, Finland.

Funded by German Research Foundation (DFG) under grant no. FI 1494/2-1.