TRAINS Dialog Corpus

This CD-ROM contains a corpus of task-oriented dialogs. The dialogs were collected as part of the TRAINS project, which aims to develop a conversationally proficient planning assistant that helps a user construct a plan to achieve some task involving the manufacturing and shipment of goods in a railroad freight system. To build such an assistant, we need to know what kinds of phenomena occur in such dialogs and how to deal with them. To provide empirical data, we have been collecting a corpus of dialogs in this domain with a person playing the role of the system. The collection procedure was designed to make the setting as close to human-computer interaction as possible, but it was not a ``wizard'' scenario, in which one person pretends to be a computer. These dialogs thus provide a snapshot of an ideal human-computer interface that would be able to engage in fluent conversations.

Altogether, 98 dialogs are included, collected using 20 different tasks and 34 different speakers. This amounts to six and a half hours of speech, about 5900 speaker turns, and 55000 transcribed words. The audio files, along with time-aligned word and phoneme transcriptions, are in the `dialogs' subdirectory. ASCII transcripts of the dialogs are in the `transcripts' subdirectory. Several technical notes are included in the `doc' subdirectory. One of these, trains_93_dialogs.ps, describes the task, the collection situation, the transcription conventions, and how to use the corpus. A fourth directory, `tools', contains tools that are useful for manipulating a dialog (using WAVES). A fifth directory, `sphere', contains tools for manipulating the audio files, including the utility `w_decode' for decompressing them.

We plan to annotate the dialogs further in the future. These annotations will be available by anonymous ftp from ftp.cs.rochester.edu in pub/packages/trains/dialogs-93, or from our WWW page.

A NOTE ABOUT THE CD-ROM FORMAT:

The data on this CD-ROM were developed and prepared on UNIX systems. Preparation involved the inclusion of a NIST SPHERE header at the beginning of each audio file, and compression of the audio data by means of the SPHERE utility `w_encode -t shorten'. The SPHERE utility `w_decode' must be installed and run on the audio files in order to access the audio data. (Note that this command, if run with the flag `-o pcm', will automatically write the 16-bit sample data in the byte format that is native to the machine where it is installed.)
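As an illustration of the two steps described above, the following Python sketch reads the text header of a SPHERE file and builds a `w_decode' command line. It is a minimal sketch under stated assumptions, not part of the SPHERE package: the function names are hypothetical, the header layout (a `NIST_1A' line, a header-size line, then `name -type value' fields up to `end_head') follows the general NIST SPHERE convention, and the argument order for `w_decode' (input file, then output file) should be checked against sphere/readme.doc.

```python
import subprocess
from typing import Dict, List, Union

HeaderValue = Union[int, float, str]


def parse_sphere_header(raw: bytes) -> Dict[str, HeaderValue]:
    """Parse the ASCII header found at the start of a NIST SPHERE file.

    `raw` must contain at least the complete header.
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if len(lines) < 2 or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields: Dict[str, HeaderValue] = {"header_size": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag, value
        if len(parts) < 3:
            continue
        name, type_flag, value = parts
        if type_flag == "-i":
            fields[name] = int(value)
        elif type_flag == "-r":
            fields[name] = float(value)
        else:  # "-sN": string field of length N
            fields[name] = value
    return fields


def read_sphere_header(path: str) -> Dict[str, HeaderValue]:
    """Read only the header bytes of a SPHERE file on disk."""
    with open(path, "rb") as handle:
        prefix = handle.read(16)  # "NIST_1A\n" plus the 8-byte size line
        size = int(prefix[8:16].decode("ascii").strip())
        rest = handle.read(size - 16)
    return parse_sphere_header(prefix + rest)


def w_decode_command(src: str, dst: str, out_format: str = "pcm") -> List[str]:
    """Build the command line; input-then-output order is an assumption."""
    return ["w_decode", "-o", out_format, src, dst]


def decode_audio(src: str, dst: str) -> None:
    """Run w_decode, which must already be installed and on the PATH."""
    subprocess.run(w_decode_command(src, dst), check=True)
```

Reading the header first is one way to confirm that a file is still compressed before decoding it: the `sample_coding' field, when present, would be expected to mention the coding applied to the sample data.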

Installing the SPHERE utilities is quite simple for most UNIX users (see sphere/readme.doc), but may be rather more difficult for others. If you run into problems, you may address questions to Jon Fiscus at NIST (jon@jaguar.ncsl.nist.gov) or to David Graff at LDC (graff@unagi.cis.upenn.edu).

Another bias of the UNIX-based preparation of this corpus involves the naming of directories and files. This CD-ROM has been produced using the ISO 9660 Level 1 data format with the so-called Rock Ridge Extensions to ISO 9660. The Level 1 format provides compatibility with virtually all CD-ROM devices and computer systems commonly in use; it does this in part by limiting the size and format of file names. The Rock Ridge Extensions permit the storage and use of UNIX/POSIX file names as `supplemental' information within the Level 1 directory structure, so that these less-constrained file names are accessible to UNIX users. Non-UNIX users may find that the names they see in directory listings are altered and/or truncated relative to the descriptions given in the documentation files. To lessen the possible confusion, each directory on this CD-ROM contains a file called `namtrans.tbl', which lists the original UNIX name and the altered/truncated Level 1 name for each file in the corresponding directory.
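For non-UNIX users, the lookup that `namtrans.tbl' supports can be sketched in Python as follows. The exact layout of the file is not described here, so this sketch assumes that each line holds two whitespace-separated fields, the original UNIX name followed by the Level 1 name; inspect a `namtrans.tbl' file and adjust the column order if it differs. The function names are hypothetical.

```python
from typing import Dict


def _normalize(level1_name: str) -> str:
    """Lower-case a Level 1 name and drop any ISO 9660 ';version' suffix."""
    return level1_name.split(";")[0].lower()


def parse_name_table(text: str) -> Dict[str, str]:
    """Map Level 1 names to original UNIX names.

    ASSUMPTION: column 1 is the UNIX name, column 2 the Level 1 name.
    """
    mapping: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        unix_name, level1_name = fields[0], fields[1]
        mapping[_normalize(level1_name)] = unix_name
    return mapping


def original_name(mapping: Dict[str, str], seen_name: str) -> str:
    """Return the UNIX name for a listed name, or the listed name itself."""
    return mapping.get(_normalize(seen_name), seen_name)
```

Normalizing case and the version suffix is a design choice made here because some systems display Level 1 names in upper case or with a trailing `;1'.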


ACKNOWLEDGEMENTS:

Funding for the corpus collection and the TRAINS project was gratefully received from NSF under Grant IRI-90-13160, and from ONR/DARPA under Grant N00014-92-J-1512.

CD-ROM production for this corpus was managed and funded by the Linguistic Data Consortium, at the University of Pennsylvania.