Top-Level Documentation for HUB-4 Mandarin Speech Data
------------------------------------------------------

This CD-ROM contains a portion of the acoustic data designated as the
training set for the 1997 DARPA HUB-4 Mandarin Benchmark.

All acoustic files contain a 1024-byte NIST SPHERE header, followed by
the sample data. The samples are 16-bit linear PCM with the high byte
first, and all recordings were made using a single channel and a 16-kHz
sampling frequency. The sample data are not compressed.

Most files contain approximately 30 minutes of recorded material, and
some contain approximately 60 or 120 minutes, from CC-TV, KAZN-AM, or
Voice of America broadcasts. Since the sampling format requires roughly
2 megabytes (MB) per minute of recording, the file sizes are typically
around 60 MB, with some files ranging up to 120 or 240 MB.
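
As a rough illustration of the file layout described above, the following Python sketch parses a SPHERE header, decodes high-byte-first 16-bit samples, and reproduces the size estimate of about 2 MB per minute. The header field names used here (sample_rate, channel_count, sample_n_bytes) follow common SPHERE conventions and are assumptions, since this document does not list them; the function names are illustrative only and are not part of any tool distributed with this CD-ROM.

```python
import struct

HEADER_SIZE = 1024  # header length stated in this documentation


def parse_sphere_header(header: bytes) -> dict:
    """Parse 'name -type value' lines of a SPHERE header into a dict.

    Assumes the usual layout: a 'NIST_1A' line, a header-size line,
    then one field per line, terminated by 'end_head'.
    """
    lines = header.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # string fields such as '-s2'
            fields[name] = value
    return fields


def decode_samples(raw: bytes) -> tuple:
    """Decode signed 16-bit samples stored high byte first (big-endian)."""
    n = len(raw) // 2
    return struct.unpack(">%dh" % n, raw[: 2 * n])


def bytes_per_minute(sample_rate=16000, bytes_per_sample=2, channels=1) -> int:
    """Uncompressed data volume for one minute of audio."""
    return sample_rate * bytes_per_sample * channels * 60


# 16000 samples/s * 2 bytes * 60 s = 1,920,000 bytes, i.e. roughly 2 MB
per_minute = bytes_per_minute()
half_hour = HEADER_SIZE + 30 * per_minute  # about 57.6 MB for a 30-minute file
```

To apply this to a real file, one would read the first 1024 bytes, pass them to parse_sphere_header, and pass some or all of the remaining bytes to decode_samples; for the larger files it is advisable to read the sample data in chunks rather than all at once.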