ABOUT SPEECH FILE COMPRESSION AND UNCOMPRESSION
-----------------------------------------------

All speech data files in this release of the Callhome English corpus
(*.sph) are stored on the CD-ROMs in compressed form.  The particular
form of compression used is version 2.0 of ``shorten'', developed by
Tony Robinson at Cambridge University.  This algorithm is intended to
give optimal compression results for speech sample data.

Two software implementations of shorten are available.  Both will
provide the same sample data output when uncompressing the Callhome
speech data, and complete source code packages are included here for
both.

One is a stand-alone program developed by Tony Robinson (in the
``shorten'' directory).  This package includes a pre-compiled
executable program file for MS-DOS users (shorten.exe), so that MS-DOS
users do not need a C compiler system to get started.  (Users of other
operating systems will need a C compiler and related utilities to make
an executable program from the source files; but assuming these are
available, the actual compilation is very simple.)

The other is implemented as a set of functions embedded in the NIST
SPHERE software package (in the ``sphere'' directory).  Installing
this package involves building some object library files and several
executable utility programs; the installation process is designed for
the UNIX operating system, and is unlikely to be easily adaptable to
other systems.

As for which implementation to use: people who are not using a UNIX
platform should simply use the stand-alone shorten program, which is
sufficient to provide uncompressed sample data.

People who are using a UNIX system can choose between shorten and
sphere.  Stand-alone shorten is compact, and is easy and quick to
install and use, but it does only one thing: compression or
uncompression of speech files.  The sphere package is larger, takes
longer to install, and may require custom installation steps on some
UNIX systems.  Its execution speed may be slightly slower (but perhaps
not significantly so), and its usage is only slightly more
complicated, because it has options to support a wider range of
activities.  The sphere ``w_decode'' utility can produce uncompressed
output with selectable sample format and byte order.  It also makes
sure that the header of the resulting file is updated to reflect all
changes to the file contents; this allows the other sphere utilities
(w_edit, h_edit, h_read, etc.) to be used on the output data, which
can be very convenient.

UNIX users should NOT use both packages (e.g. shorten to uncompress
and other sphere utilities to do other things); shorten will not
modify the file headers, and this will cause the sphere utilities to
perform incorrectly on the resulting files.  If you intend to use
other sphere utilities (or other processes that recognize and use
sphere file headers), we strongly recommend the use of ``w_decode''
for uncompression.

The following explains how to use each package to uncompress the
waveform data; it will be assumed that the programs can be found in
the user's current execution path, and that the names ``infile.sph''
and ``outfile.sph'' represent suitable file names, with directory
paths included if necessary, to locate and identify the input and
output files.

SHORTEN:

	shorten -x -a 1024 infile.sph outfile.sph
    or
	shorten -x -d 1024 infile.sph outfile.sph

The "-a 1024" option specifies that the 1024-byte sphere header
should be passed through unmodified to the output file, whereas the
"-d 1024" option simply discards the header, leaving just the raw
(headerless) sample data on output; without one of these two options,
the command will fail.  The uncompressed sample data will always be in
its original mu-law form (one byte per sample).
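
For anyone who wants to run shorten over many files from a script,
the following Python sketch builds the same two command lines shown
above.  The function names are invented for this illustration; the
only shorten options used are the ones described in this section, and
the program is assumed to be on the execution path.

```python
import subprocess

HEADER_BYTES = 1024  # sphere header size, as stated in the text above


def shorten_command(infile, outfile, keep_header=True):
    """Build the argument list for uncompressing one file with shorten.

    keep_header=True  -> "-a 1024" (header passed through unmodified)
    keep_header=False -> "-d 1024" (header discarded, raw samples only)
    """
    header_flag = "-a" if keep_header else "-d"
    return ["shorten", "-x", header_flag, str(HEADER_BYTES), infile, outfile]


def uncompress(infile, outfile, keep_header=True):
    """Run shorten on one file; raises CalledProcessError on failure."""
    subprocess.run(shorten_command(infile, outfile, keep_header), check=True)
```

Calling uncompress("infile.sph", "outfile.sph") would run the first
command shown above; passing keep_header=False would run the second.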

SPHERE:

	w_decode -o ulaw infile.sph outfile.sph

The "ulaw" argument can be replaced with "pcm_10" or "pcm_01", to
force the output samples to be 16-bit linear samples, in either
high-byte-first or low-byte-first form, respectively (using just "pcm"
will default to the native byte order for the system on which the
utility is installed).  The command as shown above will produce output
samples in their original mu-law form (one byte per sample).
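
To illustrate what the change from mu-law to 16-bit linear samples
involves, here is a small Python sketch of the standard G.711 mu-law
expansion, followed by packing in either byte order.  It is offered
only as an illustration and is not taken from the w_decode source;
the function names are invented here, and the "10" / "01" arguments
simply mirror the suffixes of "pcm_10" and "pcm_01".

```python
import struct


def ulaw_to_linear(byte):
    """Expand one mu-law byte (0..255) to a signed 16-bit linear value."""
    u = ~byte & 0xFF                 # mu-law bytes are stored complemented
    sign = u & 0x80
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def ulaw_bytes_to_pcm(data, byte_order="10"):
    """Convert mu-law bytes to 16-bit linear samples.

    byte_order "10" = high byte first, "01" = low byte first.
    """
    prefix = ">" if byte_order == "10" else "<"
    samples = [ulaw_to_linear(b) for b in data]
    return struct.pack("%s%dh" % (prefix, len(samples)), *samples)
```

Each one-byte input sample becomes a two-byte output sample, so the
sample data of a "pcm" file is twice the size of the mu-law original.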

In all cases shown above, it is possible to replace "outfile.sph"
with a dash "-", representing stdout, in order to have the
uncompressed data piped directly to some other process (e.g. a D/A
playback program).  Bear in mind, however, that the data stream will
begin with the 1024-byte ASCII-format sphere header in two of the
three methods shown (shorten with "-a 1024", and w_decode); only
shorten with "-d 1024" writes bare sample data.
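
A process on the receiving end of such a pipe may need to skip the
header before treating the rest of the stream as samples.  A minimal
Python sketch of that step follows; the function names are invented
for this illustration, and the header is assumed to be exactly the
1024 bytes described above.

```python
import sys

HEADER_BYTES = 1024  # sphere header size, as stated in the text above


def split_header(stream, header_bytes=HEADER_BYTES):
    """Read a binary stream and return (header, samples) as bytes."""
    header = stream.read(header_bytes)
    if len(header) < header_bytes:
        raise ValueError("stream ended before a full header was read")
    # Everything that follows the header is sample data.
    return header, stream.read()


def copy_samples_from_stdin():
    """Copy stdin to stdout, dropping the leading sphere header."""
    _, samples = split_header(sys.stdin.buffer)
    sys.stdout.buffer.write(samples)
```

If this were saved under a (hypothetical) name such as skiphdr.py with
a call to copy_samples_from_stdin() at the end, it could be used as:

	w_decode -o ulaw infile.sph - | python3 skiphdr.py > outfile.raw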

Both software packages contain complete documentation on the use of
these utilities; please look there for further details and additional
options in their use.

Note that the compression has been done in a way that leaves the
sphere headers uncompressed; it is therefore possible to read the
headers without having to uncompress the files first.
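
As an illustration, the Python sketch below reads the header fields
straight from a compressed file.  It assumes the usual sphere header
layout: a "NIST_1A" line, a line giving the header size, then one
"name -type value" line per field (-i integer, -r real, -sN string),
ending with "end_head".  The function names are invented for this
example.

```python
HEADER_BYTES = 1024  # sphere header size, as stated in the text above


def parse_sphere_header(raw):
    """Parse raw header bytes into a dictionary of field values."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a sphere header")
    fields = {}
    for line in lines[2:]:           # skip the magic and size lines
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type code, value
        if len(parts) != 3:
            continue
        name, type_code, value = parts
        if type_code == "-i":
            fields[name] = int(value)
        elif type_code == "-r":
            fields[name] = float(value)
        else:                        # "-sN": string of N characters
            fields[name] = value
    return fields


def read_sphere_header(path):
    """Read and parse the header of a (possibly compressed) .sph file."""
    with open(path, "rb") as f:
        return parse_sphere_header(f.read(HEADER_BYTES))
```

For example, read_sphere_header("infile.sph") would return the header
fields of a compressed file without running shorten or w_decode.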