VAHA Readme File

The VAHA Telephone Speech Corpus was collected by Texas Instruments in Dallas, TX, for the Linguistic Data Consortium at the University of Pennsylvania.

The corpus is distributed on two discs. Both discs share the following directory structure:

        0readme.1st     - this file
        shorten         - path to Tony Robinson's ``shorten v2.0'' package
                           (source code and documentation)
        sphere          - path to NIST SPHERE software package
                           (source code and documentation)
        vaha            - path to corpus data
        vaha/doc        - path to corpus documentation & tables
        vaha/train      - path to speaker directories, which contain
                           speech files designated for use as training
                           data

In addition, disc 2 contains:

        vaha/test       - path to speaker directories, which contain
                           speech files designated for use as test
                           data

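The distribution of training data over the two discs is as follows (training speaker numbers are not consecutive within each range, so each total is smaller than the range suggests):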

        disc 1 contains training speakers 0001 through 0799 
                (total of 455 speakers)

        disc 2 contains training speakers 0800 through 1449
                (total of 360 speakers)
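
As a quick sanity check after mounting a disc, the speaker totals listed above can be compared against what is actually on the disc. The following Python sketch is illustrative only. It assumes that each speaker directory sits directly under vaha/train (or vaha/test) and is named with its 4-digit speaker number; this file does not spell out that naming, so adjust the pattern if the layout differs. The function names and the mount point are hypothetical.

```python
import os
import re

# Assumption (not stated in this readme): speaker directories are named
# with a 4-digit speaker number, e.g. "0001".
SPEAKER_DIR = re.compile(r"^\d{4}$")


def list_speakers(disc_root, subset="train"):
    """Return the sorted speaker IDs found under <disc_root>/vaha/<subset>."""
    subset_dir = os.path.join(disc_root, "vaha", subset)
    if not os.path.isdir(subset_dir):
        # For example, disc 1 has no vaha/test directory.
        return []
    return sorted(
        name
        for name in os.listdir(subset_dir)
        if SPEAKER_DIR.match(name)
        and os.path.isdir(os.path.join(subset_dir, name))
    )


def check_disc(disc_root, expected_count, first_id, last_id):
    """Check the training speaker count and ID range for one disc."""
    speakers = list_speakers(disc_root, "train")
    # Zero-padded 4-digit IDs compare correctly as strings.
    all_in_range = all(first_id <= s <= last_id for s in speakers)
    return len(speakers) == expected_count and all_in_range


# Hypothetical usage, with the disc mounted at /mnt/vaha1:
#   check_disc("/mnt/vaha1", 455, "0001", "0799")
```

For disc 2 the corresponding call would use 360 speakers and the range 0800 through 1449.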

The speech data files are stored in compressed form, using the ``shorten'' algorithm for speech data compression developed by Tony Robinson at Cambridge University. The shorten algorithm is implemented both in a stand-alone program from Cambridge University and in the ``w_encode'' and ``w_decode'' utilities provided in the SPHERE software package from NIST. Either implementation can be used to uncompress the waveform data, and the source code and documentation for both implementations are provided with this corpus. Please refer to the file ``vaha/doc/comprssn.doc'' for a description of the decompression procedures.
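
For batch processing, the SPHERE utility can be driven from a script. The sketch below is a minimal Python wrapper and not the documented procedure. It assumes that ``w_decode'' has been built from the sphere directory and is on the PATH, and that it accepts the form "w_decode -o pcm <input> <output>"; confirm the options against the SPHERE documentation and ``vaha/doc/comprssn.doc'' before relying on it. The function names and file paths are hypothetical.

```python
import subprocess


def w_decode_command(src, dst, output_format="pcm"):
    """Build the argument list for one w_decode call.

    Assumption: w_decode takes "-o <format> <input> <output>".
    Check the SPHERE documentation shipped with the corpus.
    """
    return ["w_decode", "-o", output_format, src, dst]


def decompress(src, dst):
    """Uncompress one speech file; raises CalledProcessError on failure."""
    subprocess.run(w_decode_command(src, dst), check=True)


# Hypothetical usage (the file names are placeholders, not real corpus files):
#   decompress("/mnt/vaha1/vaha/train/0001/example.wav", "/tmp/example.wav")
```

Keeping the command construction separate from the call makes it easy to print and inspect the command line before any file is touched.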