The corpus is distributed on two discs, both of which share the following directory structure:

    0readme.1st  - this file
    shorten      - path to Tony Robinson's ``shorten v2.0'' package
                   (source code and documentation)
    sphere       - path to NIST SPHERE software package
                   (source code and documentation)
    vaha         - path to corpus data
    vaha/doc     - path to corpus documentation & tables
    vaha/train   - path to speaker directories, which contain speech
                   files designated for use as training data

In addition, disc 2 of the set also contains:

    vaha/test    - path to speaker directories, which contain speech
                   files designated for use as test data

The distribution of training data over the two discs is as follows:

    disc 1 contains training speakers 0001 through 0799 (total of 455 speakers)
    disc 2 contains training speakers 0800 through 1449 (total of 360 speakers)

The speech data files are stored in compressed form, using the ``shorten'' algorithm for speech data compression developed by Tony Robinson at Cambridge University. The shorten algorithm is implemented both in a stand-alone program from Cambridge University, and in the ``w_encode'' and ``w_decode'' utilities provided in the SPHERE software package from NIST. Either implementation can be used to uncompress the waveform data, and the source code and documentation for both implementations are provided with this corpus. Please refer to the file ``vaha/doc/comprssn.doc'' for a description of the procedures.
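
As a rough illustration of the layout and decompression notes above, the following Python sketch maps a training speaker ID to the disc whose ID range covers it and builds a ``w_decode'' command line without running it. Several details are assumptions rather than facts taken from this file: the helper names are hypothetical, speaker directories are assumed to be named with the zero-padded four-digit ID, and the ``-o pcm'' output option should be checked against the SPHERE documentation and ``vaha/doc/comprssn.doc''. Note also that the ID ranges are sparse (455 speakers fall in 0001 through 0799), so an ID inside a range does not guarantee that the speaker exists.

```python
from pathlib import Path

# Training speaker ID ranges per disc, as stated in this file.
TRAIN_RANGES = {1: (1, 799), 2: (800, 1449)}


def disc_for_training_speaker(speaker_id):
    """Return the disc number whose ID range covers a training speaker ID."""
    for disc, (low, high) in TRAIN_RANGES.items():
        if low <= speaker_id <= high:
            return disc
    raise ValueError(f"speaker ID {speaker_id} is outside 0001-1449")


def training_speaker_dir(disc_root, speaker_id):
    """Build the expected path of a training speaker directory.

    Assumption: directories are named with the zero-padded four-digit ID.
    """
    return Path(disc_root) / "vaha" / "train" / f"{speaker_id:04d}"


def decode_command(src, dst):
    """Build a SPHERE w_decode command line; nothing is executed here.

    Assumption: "-o pcm" requests uncompressed PCM output. The procedure
    in vaha/doc/comprssn.doc takes precedence over this sketch.
    """
    return ["w_decode", "-o", "pcm", str(src), str(dst)]
```

The command list returned by decode_command could be passed to subprocess.run once the SPHERE utilities have been built and placed on the search path; the stand-alone shorten program is an equally valid alternative.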