TRAINS Dialog Corpus


This CD-ROM contains a corpus of task-oriented dialogs.  These
dialogs were collected as part of the TRAINS project, a project to
develop a conversationally proficient planning assistant, which helps
a user construct a plan to achieve some task involving the
manufacturing and shipment of goods in a railroad freight system.  To
do this, we need to know what kinds of phenomena occur in such
dialogs, and how to deal with them.  To provide empirical data, we
have been collecting a corpus of dialogs in this domain with a
person playing the role of the system.  The collection procedure was
designed to make the setting as close to human-computer interaction
as possible, but was not a ``wizard'' scenario, where one person
pretends to be a computer.  Thus these dialogs provide a snapshot
of an ideal human-computer interface that would be able to engage in
fluent conversations.

Altogether, there are 98 dialogs included, collected using 20
different tasks and 34 different speakers.  This amounts to six and a
half hours of speech, about 5900 speaker turns, and 55000 transcribed
words.  The audio files, along with time-aligned word and phoneme
transcriptions, are in the `dialogs' subdirectory.  ASCII transcripts
of the dialogs are in the `transcripts' subdirectory.  Also included
are several technical notes in the `doc' subdirectory.  One of these,
trains_93_dialogs.ps, describes the task, the collection situation,
transcription conventions, and how to use the corpus.  A fourth
directory, `tools', contains tools that are useful for manipulating a
dialog (using WAVES).  A fifth directory, `sphere', contains tools
for manipulating the audio files, including the utility `w_decode',
for decompressing the audio files.

We plan to annotate the dialogs further in the future.  These
annotations will be available by anonymous ftp transfer from
ftp.cs.rochester.edu in pub/packages/trains/dialogs-93.  You can also
visit our WWW page at http://www.cs.rochester.edu/research.

A NOTE ABOUT THE CD-ROM FORMAT:

The data on this CD-ROM were developed and prepared on UNIX systems.
Preparation involved the inclusion of a NIST SPHERE header at the
beginning of each audio file, and compression of the audio data by
means of the SPHERE utility `w_encode -t shorten'.  The SPHERE utility
`w_decode' must be installed and run on the audio files in order to
access the audio data.  (Note that this command, if run with the flag
`-o pcm', will automatically write the 16-bit sample data in the byte
format that is native to the machine where it is installed.)
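
Since each audio file begins with a NIST SPHERE header, it can be
useful to inspect that header before decoding.  The following is a
minimal Python sketch, not part of the SPHERE distribution.  It
assumes the usual SPHERE layout: an ASCII header whose first line is
`NIST_1A', whose second line gives the header size in bytes, and whose
remaining lines have the form `name -type value' up to `end_head'.
The function name and the demo field values are made up for
illustration.  The sketch reads the header only; the audio data still
has to be decompressed with `w_decode'.

```python
import io


def read_sphere_header(stream):
    """Parse a NIST SPHERE header from a seekable binary stream.

    Returns (header_size, fields), where fields maps names to values.
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    # The second line holds the total header size in bytes.
    header_size = int(stream.readline().decode("ascii").strip())
    stream.seek(0)
    raw = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in raw.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) != 3:
            continue  # skip blank or unexpected lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN" string field
    return header_size, fields


# Synthetic header for demonstration; the values are made up and are
# not taken from the corpus.
DEMO_HEADER = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 16000\n"
    b"channel_count -i 2\n"
    b"sample_coding -s3 pcm\n"
    b"end_head\n"
).ljust(1024, b" ")

demo_size, demo_fields = read_sphere_header(io.BytesIO(DEMO_HEADER))
```

For a real file, open it in binary mode (`open(path, "rb")') and pass
the file object to the function.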

Installing the SPHERE utilities is quite simple for most UNIX users
(see sphere/readme.doc), but may be rather more difficult for others.
You may address questions to Jon Fiscus at NIST
<jon@jaguar.ncsl.nist.gov> or to David Graff at LDC
<graff@unagi.cis.upenn.edu> if you run into problems.

Another bias of the UNIX-based preparation of this corpus involves the
naming of directories and files.  This CD-ROM has been produced using
the ISO 9660 Level 1 data format with the so-called Rock Ridge
Extensions to ISO 9660.  The Level 1 format provides compatibility
with virtually all CD-ROM devices and computer systems commonly in use;
it does this in part by limiting the size and format of file names.
The Rock Ridge Extensions permit the storage and use of UNIX/POSIX
file names as `supplemental' information within the Level 1 directory
structure, so that these less-constrained file names are accessible to
UNIX users.  Non-UNIX users may find that the names they see in
directory listings are altered and/or truncated, relative to the
descriptions given in the documentation files.  In order to lessen the
possible confusion, each directory on this CD-ROM contains a file
called `namtrans.tbl' -- this file lists the original UNIX name and
the altered/truncated Level 1 name for each file in the corresponding
directory.
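
The name tables can also be read by a short script to recover the
original names.  The Python sketch below is only an illustration: the
exact layout of `namtrans.tbl' is not described here, so it ASSUMES
one file per line, with two whitespace-separated columns giving the
original UNIX name first and the Level 1 name second.  Check a real
table before relying on it.  The function name and the demo file
names are hypothetical.

```python
def parse_name_table(text, unix_first=True):
    """Return {level1_name: unix_name} from two-column table text.

    ASSUMPTION: each useful line holds exactly two whitespace-separated
    names. Set unix_first=False if the columns are the other way round.
    """
    mapping = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or unexpected lines
        if not unix_first:
            parts.reverse()
        unix_name, level1_name = parts
        mapping[level1_name] = unix_name
    return mapping


# Made-up table contents, not copied from the CD-ROM.
DEMO_TABLE = """\
long_example_name.transcript  LONG_EXA.TRA
another.long.name.words       ANOTHER_.WOR
"""

demo_mapping = parse_name_table(DEMO_TABLE)
```

With a mapping like this, a truncated name seen in a directory
listing can be looked up to find the name used in the documentation.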

ACKNOWLEDGEMENTS:

Funding for the corpus collection and the TRAINS project was
gratefully received from NSF under Grant IRI-90-13160, and from
ONR/DARPA under Grant N00014-92-J-1512.

CD-ROM production for this corpus was managed and funded by the
Linguistic Data Consortium, at the University of Pennsylvania.