|Item Name:||WTIMIT 1.0|
|Author(s):||Patrick Bauer, Tim Fingscheidt|
|LDC Catalog No.:||LDC2010S02|
|Release Date:||March 17, 2010|
|Sample Type:||1-channel signed linear PCM (raw)|
|Data Source(s):||telephone speech|
|Application(s):||speech recognition, speaker identification|
LDC User Agreement for Non-Members
|Online Documentation:||LDC2010S02 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Bauer, Patrick, and Tim Fingscheidt. WTIMIT 1.0 LDC2010S02. Web Download. Philadelphia: Linguistic Data Consortium, 2010.|
WTIMIT 1.0 is a wideband mobile telephony derivative of the TIMIT Acoustic-Phonetic Continuous Speech Corpus (TIMIT, LDC93S1). TIMIT contains wideband speech recordings (i.e., sampled at 16 kHz) of 630 speakers of American English from eight major dialect regions, each reading ten phonetically rich sentences. The TIMIT speech corpus was completed in 1993 and was intended for acoustic-phonetic studies as well as for the development and evaluation of automatic speech recognition (ASR) systems.
Since then, five TIMIT derivatives have been developed: FFMTIMIT, NTIMIT, CTIMIT, HTIMIT, and STC-TIMIT. FFMTIMIT (Free-Field Microphone TIMIT, LDC96S32) consists of the original TIMIT database as recorded by a free-field microphone. NTIMIT (Network TIMIT, LDC93S2) serves as a telephone bandwidth adjunct to TIMIT; it contains the TIMIT speech files after transmission over a telephone handset and the NYNEX telephone network, subject to a large variety of channel conditions. For the cellular bandwidth speech corpus CTIMIT (LDC96S30), the original TIMIT recordings were passed through cellular telephone circuits. HTIMIT (Handset TIMIT, LDC98S67) offers a TIMIT subset of 192 male and 192 female speakers played through different telephone handsets for the study of telephone transducer effects on speech. For the single-channel telephone corpus STC-TIMIT (LDC2008S03), the TIMIT recordings were sent through a real and, in contrast to NTIMIT, single telephone channel.
While some of these derivative TIMIT corpora consist of wideband speech, others are telephony corpora representing narrowband speech, i.e., sampled at 8 kHz and containing frequency components from about 300 Hz to 3.4 kHz. Until now, no real-world wideband telephony speech corpus has been publicly available. With the arrival of wideband speech codecs such as G.722, G.722.1, G.722.2 (i.e., Adaptive Multi-Rate Wideband, AMR-WB), and G.711.1, wideband telephony speech transmission is already feasible, even in an increasing number of mobile networks. Hence, a wideband telephone bandwidth adjunct to TIMIT is desirable for a wide range of scientific investigations, as well as for the development and evaluation of systems such as Interactive Voice Response (IVR) systems. WTIMIT 1.0 (Wideband Mobile TIMIT) contains the recordings of the original TIMIT speech files after transmission over a real 3G AMR-WB mobile network.
WTIMIT 1.0 is organized in the same way as the original TIMIT corpus. The training subset consists of 4620 speech files, while the test subset contains 1680 speech files (6300 files in total, i.e., 630 speakers with ten sentences each). The speech files of the WTIMIT corpus are raw (i.e., they carry no header information) and are specified as follows:
- 16 kHz sampling rate
- 16 bit, 1-channel linear PCM sampling format
- little-endian byte order
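Because the files carry no header, a reader has to supply the format parameters listed above itself. The following is a minimal Python sketch of how such a file could be decoded; the function names and the constant names are illustrative assumptions, not part of the corpus documentation.

```python
import struct

SAMPLE_RATE_HZ = 16_000   # sampling rate stated in the format list
BYTES_PER_SAMPLE = 2      # 16-bit linear PCM, one channel


def decode_raw_pcm(data: bytes) -> list[int]:
    """Decode headerless 16-bit little-endian mono PCM into signed integers."""
    if len(data) % BYTES_PER_SAMPLE:
        raise ValueError("byte count is not a multiple of the sample size")
    count = len(data) // BYTES_PER_SAMPLE
    # "<" forces little-endian on any host; "h" is a signed 16-bit integer.
    return list(struct.unpack(f"<{count}h", data))


def duration_seconds(samples: list[int]) -> float:
    """Duration implied by the sample count at the corpus sampling rate."""
    return len(samples) / SAMPLE_RATE_HZ
```

A file would then be read with something like `decode_raw_pcm(open(path, "rb").read())`, where `path` is a placeholder for the location of one of the raw speech files.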
Data preparation was conducted by converting the original TIMIT speech files into raw data (i.e., dropping the first 1024 bytes of header information) and concatenating them into 11 signal chunks of at most 30 minutes duration each. In order to allow precise de-concatenation after transmission, and in order to be able to examine codec influence and channel distortion, each signal chunk was preceded by a 4 s calibration tone. The tone comprises 2 s of a 1 kHz sine wave followed by another 2 s of a linear sweep from 0 to 8 kHz. The prepared speech chunks were stored on a laptop PC and were then ready for transmission over T-Mobile's AMR-WB-capable 3G mobile network in The Hague, The Netherlands.
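To make the structure of the calibration tone concrete, the sketch below synthesizes a signal with the durations and frequencies given above at a 16 kHz sampling rate. The amplitude, the zero starting phase of each segment, and all names are assumptions made for illustration; the description does not specify them, and this is not the authors' generation script.

```python
import math

SAMPLE_RATE_HZ = 16_000


def calibration_tone(amplitude: float = 0.5) -> list[float]:
    """2 s of a 1 kHz sine followed by a 2 s linear sweep from 0 to 8 kHz.

    The amplitude (as a fraction of full scale) is an assumed value.
    """
    tone_s, sweep_s = 2.0, 2.0
    f_tone, f_start, f_end = 1000.0, 0.0, 8000.0

    n_tone = int(tone_s * SAMPLE_RATE_HZ)
    n_sweep = int(sweep_s * SAMPLE_RATE_HZ)

    tone = [
        amplitude * math.sin(2.0 * math.pi * f_tone * n / SAMPLE_RATE_HZ)
        for n in range(n_tone)
    ]

    # For a linear sweep the instantaneous frequency is f_start + rate * t,
    # so the phase is its integral: 2*pi*(f_start*t + rate*t^2/2).
    rate = (f_end - f_start) / sweep_s  # Hz per second
    sweep = []
    for n in range(n_sweep):
        t = n / SAMPLE_RATE_HZ
        phase = 2.0 * math.pi * (f_start * t + 0.5 * rate * t * t)
        sweep.append(amplitude * math.sin(phase))

    return tone + sweep
```

At 16 kHz the full tone is 64,000 samples long, and the sweep reaches the Nyquist frequency of 8 kHz exactly at its end.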
At the sending end, the speech chunks were played back by a laptop PC. Via an IEEE 1394 link (FireWire), the data was transmitted digitally to an external DAC (digital-to-analog converter) of type RME Fireface 400. The analog signal was then fed electrically into the microphone input of the transmitting Nokia 6220 mobile phone. For this purpose, an audio quality test cable for Nokia mobile phones was used. Prior to the actual transmission, the output attenuation of the DAC was adjusted so as to prevent analog saturation at the input circuit of the phone while making the best possible use of the dynamic range. Furthermore, a call to the phone at the receiving end, a second mobile phone of type Nokia 6220, was established for each speech chunk separately. Using the field test monitoring software of the phones, we confirmed that they were situated in different network cells at all times during transmission; moreover, we verified that the proper speech codec, AMR-WB at a constant data rate of 12.65 kbit/s, was being employed. Note that this bitrate is by far the most widely used one. Finally, the internal microphone equalization of the transmitting mobile phone was switched off.
At the receiving end, the analog headphone output of the receiving mobile phone was connected electrically to an ADC (analog-to-digital converter) of type RME Fireface 400. The analog input gain of the latter device was adjusted once initially to exploit the dynamic range of the ADC. Sampling was performed at a rate of 48 kHz, the native sampling rate of the ADC, and with 16 bit precision. The digital speech signals were transferred to a laptop PC, again via an IEEE 1394 link, and recorded onto a hard drive. The transmitted speech chunks were decimated from 48 kHz to 16 kHz sampling rate using a high-quality lowpass filter.
Finally, the chunks were de-concatenated by maximizing the cross-correlation between them and the original speech files. We followed the de-concatenation methodology of STC-TIMIT, as described in "STC-TIMIT: Generation of a Single-channel Telephone Corpus", in order to ensure a precise sample alignment to the TIMIT speech files. Hence, utterances in WTIMIT 1.0 can be considered to be time-aligned with those of TIMIT with an average precision of 0.0625 ms (one sample at 16 kHz). In principle, TIMIT's original label files (*.TXT, *.WRD, *.PHN) are therefore valid for WTIMIT as well. However, the channel was found to frequently introduce misalignments of about 10 to 20 ms, mainly during speech pauses. Parts of the affected speech files are therefore slightly misaligned against the original label information. These channel effects may be related to the packet-switched domain in the UMTS Core Network: depending on the traffic load in the network, packets are buffered and queued, which results in a variable packet delay (jitter).
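The idea of aligning by maximizing the cross-correlation can be illustrated with a brute-force lag search. The sketch below is only an illustration under simplifying assumptions (a non-negative lag, a small search range, an undistorted channel); it is not the authors' implementation and not the STC-TIMIT procedure, and the function names are hypothetical.

```python
def best_lag(reference: list[float], received: list[float], max_lag: int) -> int:
    """Return the lag in 0..max_lag at which `received` best matches `reference`.

    The score for each lag is the plain (unnormalized) cross-correlation
    over the overlapping samples.
    """
    best, best_score = 0, float("-inf")
    for lag in range(max_lag + 1):
        overlap = min(len(reference), len(received) - lag)
        if overlap <= 0:
            break
        score = sum(reference[i] * received[lag + i] for i in range(overlap))
        if score > best_score:
            best, best_score = lag, score
    return best


def samples_to_ms(n_samples: int, sample_rate_hz: int = 16_000) -> float:
    """Convert a sample count to milliseconds at the given sampling rate."""
    return 1000.0 * n_samples / sample_rate_hz
```

With these helpers, `samples_to_ms(1)` gives the one-sample precision of 0.0625 ms quoted above, and the reported 10 to 20 ms misalignments correspond to 160 to 320 samples at 16 kHz.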
If you have any problems, questions or suggestions concerning WTIMIT, please send a brief email to Tim Fingscheidt (Technische Universität Braunschweig, Braunschweig, Germany): firstname.lastname@example.org.
The authors would like to thank Mr. Dirk Kistowski-Cames, Deutsche Telekom AG, Bonn, Germany, for providing general project support and SIM cards, and Mr. Petri Lang, T-Mobile NL, The Hague, The Netherlands, for local support and SIM cards. Thanks also to Mr. Panu Nevala, Nokia, Oulu, Finland, for providing the prepared mobile phones, which are in that form not available on the market.
This work was funded by the German Research Foundation (DFG) under grant no. FI 1494/2-1.