The corpus is distributed on two discs, both of which share the following directory structure:
    0readme.1st  - this file
    shorten      - path to Tony Robinson's ``shorten v2.0'' package
                   (source code and documentation)
    sphere       - path to NIST SPHERE software package
                   (source code and documentation)
    vaha         - path to corpus data
    vaha/doc     - path to corpus documentation & tables
    vaha/train   - path to speaker directories, which contain
                   speech files designated for use as training
                   data
In addition, disc 2 of the set also contains:
    vaha/test    - path to speaker directories, which contain
                   speech files designated for use as test
                   data
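
As a rough illustration of the layout described above, the following Python sketch lists the paths one would expect to find on each disc. The names `COMMON_PATHS` and `expected_paths` are hypothetical and are not part of the corpus or its bundled software.

```python
# Paths common to both discs, as listed above.
COMMON_PATHS = [
    "0readme.1st",
    "shorten",
    "sphere",
    "vaha",
    "vaha/doc",
    "vaha/train",
]


def expected_paths(disc):
    """Return the paths expected on the given disc (1 or 2)."""
    if disc not in (1, 2):
        raise ValueError("the corpus consists of discs 1 and 2 only")
    paths = list(COMMON_PATHS)
    if disc == 2:
        # Only disc 2 carries the test speaker directories.
        paths.append("vaha/test")
    return paths
```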
The distribution of training data over the two discs is as follows:
    disc 1 contains training speakers 0001 through 0799
           (total of 455 speakers)
    disc 2 contains training speakers 0800 through 1449
           (total of 360 speakers)
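
The ranges above can be turned into a small lookup. The following Python sketch is only an illustration; the function names are hypothetical. Note that the speaker counts (455 and 360) are smaller than the ID ranges, so not every ID in a range corresponds to an actual speaker, and the sketch reports only which disc would hold a given ID.

```python
def disc_for_training_speaker(speaker_id):
    """Return the disc (1 or 2) whose ID range covers a training speaker.

    This does not check that the speaker actually exists: the ID ranges
    are wider than the number of speakers on each disc.
    """
    if 1 <= speaker_id <= 799:
        return 1
    if 800 <= speaker_id <= 1449:
        return 2
    raise ValueError("training speaker IDs run from 0001 through 1449")


def format_speaker_id(speaker_id):
    """Format an ID in the four-digit, zero-padded form used above."""
    return f"{speaker_id:04d}"
```

For example, `disc_for_training_speaker(800)` returns 2, and `format_speaker_id(1)` returns the string `0001`.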
The speech data files are stored in compressed form, using the
``shorten'' algorithm for speech data compression developed by Tony
Robinson at Cambridge University. The shorten algorithm is
implemented both in a stand-alone program from Cambridge University,
and in the ``w_encode'' and ``w_decode'' utilities provided in the SPHERE
software package from NIST. Either implementation can be used to
uncompress the waveform data, and the source code and documentation
for both implementations are provided with this corpus. Please refer
to the file ``vaha/doc/comprssn.doc'' for a description of the procedures.
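
For readers who want to script the decompression, the following Python sketch shows one way to call either tool once it has been built and placed on the search path. It is a sketch under stated assumptions, not the documented procedure: the ``-x'' (extract) option for shorten and the ``-o pcm'' option for w_decode should be checked against the documentation shipped in the shorten and sphere directories and against ``vaha/doc/comprssn.doc''. The function names are hypothetical.

```python
import subprocess


def build_decode_command(src, dst, tool="shorten"):
    """Build the command line that uncompresses src into dst."""
    if tool == "shorten":
        # Assumption: "-x" selects extraction in the stand-alone program.
        return ["shorten", "-x", src, dst]
    if tool == "w_decode":
        # Assumption: "-o pcm" requests uncompressed PCM output.
        return ["w_decode", "-o", "pcm", src, dst]
    raise ValueError("tool must be 'shorten' or 'w_decode'")


def decompress(src, dst, tool="shorten"):
    """Run the chosen tool; raises CalledProcessError if it fails."""
    subprocess.run(build_decode_command(src, dst, tool), check=True)
```

Keeping the command construction separate from the call makes it easy to print the command and compare it with the bundled documentation before running it on the data.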