DCIEM SLEEP DEPRIVATION STUDY: MAP TASK DIALOGUES

PART 1
VERSION 1.0

Defence and Civil Institute of Environmental Medicine, North York, Ontario, Canada

Human Communication Research Centre, University of Edinburgh & University of Glasgow, UK

under the aegis of

NATO DRG Panel 3 Research Study Group 10 (Automatic Speech Processing)

LICENSE: The copyright holder grants to the purchaser of these CD-ROMs unrestricted license to use all the corpus materials (speech, transcription, maps, tools, documentation) included herein, subject only to the following restrictions: 1) No onward distribution of the corpus materials is allowed -- copies may be made only for use by the purchaser and his/her research group, for ease of use by that group, etc.; 2) The contributions of DCIEM and HCRC are acknowledged in any public presentation or publication of any work based on the corpus.

The DCIEM Sleep Deprivation Study Map Task Corpus carries no warranty of any kind.

Since DCIEM, HCRC, and RSG10 members continue to use the Corpus in our own research, we welcome contact with colleagues engaged in similar projects. For this reason we ask purchasers to notify us as a matter of courtesy of the topic of their intended work with these materials.

Funding by Department of National Defence, Canada; Economic and Social Research Council, UK; Linguistic Data Consortium, USA

Pre-mastering by Speech Data Services Ltd, Great Malvern, UK

This is CD-ROM 1 of a set of 3 in Part 1. Taken together, Parts 1 and 2 contain:

a) the materials used to collect all 216 spoken dialogues in the full corpus;
b) digital audio for those 216 dialogues;
c) transcriptions of the 216 dialogues;
d) documentation;
e) source code for tools.

Part 1 contains a) through e) for 54 dialogues selected to represent each of 6 runs and 3 drug conditions at different points in the sleep deprivation experiment. Part 2 contains a) through e) of the remaining dialogues.

The transcriptions of all of the dialogues are repeated together on all of the CDs for ease of access when the speech files are not required.

I. Directory Structure and File Contents

All CD-ROMs have a common structure.

The top-level directory contains the following files on each:

         0dir.txt   A complete listing of all files, giving the CD on which
                    each can be found.

      0direye.txt   A complete listing of all dialogues, giving the CD on
                    which each can be found, in a form more convenient for
                    visual scanning.

          read.me   This file, with the part and CD number changing
                    from one CD to the next.

The top-level of each CD contains the following directories in all cases:

             doc/   ASCII and/or PostScript(TM) versions of various documents
                    on the corpus: START HERE

             lib/   Resources for included tools

         trn_all/   All the transcripts

             etc/   Information about participants and maps.

             src/   UNIX(TM) scripts and C sources for useful tools,
                    Emacs interface, World Wide Web interface and
                    a Microsoft Windows(TM) sound playing program.

In addition to the common directories, this CD also contains

            run1/
            run2/

Any run/ directory contains sampled audio, transcripts, and maps for one of the six runs of the sleep deprivation experiment. Within each run 39 dialogues were collected, the first 3 of which were used for practice and are not included in the corpus. Part 1 of the corpus contains only 9 dialogues per run, three for each pair of speakers produced at roughly the same time of day on the successive days of sleep deprivation. (See doc/design.sgm for a description of the design.) Part 2 contains the remaining dialogues. Thus, Part 1 CD 1 contains dialogues 10-12, 21-23, and 34-36 for run1 and run2. The remaining dialogues from the first run will be found in the run1/ directories of CDs 1 and 2 of Part 2.

The ordinal numbers of dialogues within their run provide the names of the sub-directories.

For any run:

                    d04/

                    d05/

                    d06/        ...      Conversations

                    d07/

                    d08/ 

                    etc

Each conversation directory has the following files:

                    NIST header (.nst)

                    sampled speech (.ses)

                    annotated orthographic transcription (.trn)

                    giver's map (.gmp)

                    follower's map (.fmp)

                    TEI entry-point (.sgm)
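
As a rough aid for checking a copied conversation directory, the sketch below reports which of the six file types listed above are absent from a list of file names. The extension table comes from the list above; the function name and the case-insensitive matching are our own assumptions, not part of the distributed tools.

```python
# File types expected in each conversation directory, per the list above.
EXPECTED_TYPES = {
    "nst": "NIST header",
    "ses": "sampled speech",
    "trn": "annotated orthographic transcription",
    "gmp": "giver's map",
    "fmp": "follower's map",
    "sgm": "TEI entry-point",
}


def missing_file_types(filenames):
    """Return the expected extensions not found among filenames.

    Assumption: extensions are compared case-insensitively, since some
    systems present CD-ROM file names in upper case.
    """
    present = {
        name.rsplit(".", 1)[1].lower()
        for name in filenames
        if "." in name
    }
    # Preserve the order of the list above.
    return [ext for ext in EXPECTED_TYPES if ext not in present]
```

In practice the list of names might come from os.listdir() on a conversation directory such as one of the d04/, d05/ directories under a run.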

Note that the transcripts are linked to the sampled speech files by time-stamps on every turn. The file src/signal/player.el makes it possible to adapt Emacs so that portions of the speech (.ses) file can be accessed from their transcription (.trn). The directory src/web contains a World Wide Web-based interface that does the same thing. The directory src/windows contains a Microsoft Windows(TM) program that can play the speech files.

Please contact maptask@cogsci.ed.ac.uk for updated interfaces, or see http://www.cogsci.ed.ac.uk/hcrc/wgs/dialogue/corpora/.

II. File naming conventions

The names of files associated with dialogues indicate the experimental conditions under which the dialogue was produced. The conditions are described in full in doc/design.sgm. File headers give the condition information more verbosely than the filenames do; the filenames are kept short enough to be usable on a PC. File names are constructed to the following model:

The first digit is the run of the sleep deprivation experiment from which the dialogue came.
The next two digits represent the number of the dialogue in absolute order within a run. Corresponding dialogue numbers in different runs used the same map materials.
q denotes the `quad', a group of 8 dialogues produced using one of a number of standard map and subject assignment treatments.
c denotes the conversation within the quad. All dialogues qicj use the same maps.
The letter `f' or `p' refers to the size of the group of subjects involved in the dialogues of a quad. f = foursome; p = pair.

For example, 108q2c4f.ses is the sampled speech for conversation 4 of quad 2, performed as the 8th dialogue of Run 1, by two of the speakers assigned to the foursome. It will be found on CD 1 in Part 2.
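
The naming model above can be decoded mechanically. The following is a minimal sketch, not one of the tools supplied in src/. The regular expression is inferred from the model and the example, and it assumes that quad and conversation numbers are single digits and that the extension is optional. The names DialogueName and parse_dialogue_name are hypothetical.

```python
import re
from typing import NamedTuple

# Inferred pattern: run digit, two-digit dialogue number, q + quad,
# c + conversation, then f (foursome) or p (pair), optional extension.
# Assumption: quad and conversation numbers are single digits.
_NAME = re.compile(r"^(\d)(\d{2})q(\d)c(\d)([fp])(?:\.(\w+))?$", re.IGNORECASE)


class DialogueName(NamedTuple):
    run: int
    dialogue: int
    quad: int
    conversation: int
    group: str       # "foursome" or "pair"
    extension: str   # "" when the name has no extension


def parse_dialogue_name(name: str) -> DialogueName:
    """Split a dialogue file name into its experimental-condition fields."""
    match = _NAME.match(name.strip())
    if match is None:
        raise ValueError(f"not a recognised dialogue file name: {name!r}")
    run, dialogue, quad, conv, group, ext = match.groups()
    return DialogueName(
        run=int(run),
        dialogue=int(dialogue),
        quad=int(quad),
        conversation=int(conv),
        group="foursome" if group.lower() == "f" else "pair",
        extension=(ext or "").lower(),
    )
```

Applied to the example above, parse_dialogue_name("108q2c4f.ses") yields run 1, dialogue 8, quad 2, conversation 4, group "foursome" and extension "ses".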

These filenames omit the identification numbers of individual speakers taking part. A table including this information is found at the end of doc/design.sgm. The internal header of each .trn file also includes this information.

Since each conversation has an id and each turn has a number, an individual turn can be referred to in a standard way using the form DCIEMMTC:conversation-id:turn-number. For example, DCIEMMTC:111q2c6f:32 is the Instruction Follower saying "Okay."
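
Judging from the example above, a turn reference has three colon-separated parts: the fixed prefix DCIEMMTC, the conversation id, and the turn number. The sketch below builds and splits such references under that assumption. Both function names are hypothetical and are not part of the supplied tools.

```python
from typing import Tuple

PREFIX = "DCIEMMTC"  # corpus prefix, as in the example above


def format_turn_reference(conversation_id: str, turn: int) -> str:
    """Build a reference such as DCIEMMTC:111q2c6f:32."""
    return f"{PREFIX}:{conversation_id}:{turn}"


def parse_turn_reference(ref: str) -> Tuple[str, int]:
    """Split a reference into (conversation id, turn number).

    Raises ValueError if the reference does not have exactly three
    colon-separated parts, has the wrong prefix, or has a non-numeric turn.
    """
    prefix, conversation_id, turn = ref.strip().split(":")
    if prefix != PREFIX:
        raise ValueError(f"unexpected prefix in turn reference: {ref!r}")
    return conversation_id, int(turn)
```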

III. Use of SGML (Standard Generalized Markup Language)

The transcripts, documentation and some of the associated materials included in this corpus are marked up using SGML, following the draft guidelines of the Text Encoding Initiative (TEI), as used in the HCRC Map Task Corpus. We have attempted to observe the guidelines for document headers, where we have changed very little of what has been distributed by the TEI. In the body of the transcripts, mindful of the needs of those who will read them as they stand and/or process them with tools which are not sensitive to SGML markup, we have had to deviate rather more from TEI norms. All the files anywhere in the corpus with extension ".sgm" are SGML-conformant, as validated by version 1.0 of the public domain UNIX(TM) tool sgmls.

Of course, non-SGML-based tools can access the .trn files directly. The file doc/editorl.sgm is taken from the HCRC Map Task Corpus and provides detailed information about the editorial conventions and markup used in the transcripts.

Public entity references are used throughout for external references, and the script in src/mtei documents the search path which is required for those references to succeed.

For further information about these issues, see lib/tei/0read.me and the DTD files in the same directory.

IV. Contacts

The DCIEM Sleep Deprivation Study was designed at DCIEM (contact Martin Taylor). The Map Task as used here was based on the design in the HCRC Map Task Corpus (copyright HCRC 1992) and reported in Anderson et al. (1991) [Language and Speech, 34: 351-366]. The current corpus was designed by Ellen Gurman Bard, Cathy Sotillo, and Anne Anderson of HCRC (contact Ellen Bard) and Martin Taylor of DCIEM. Transcription was managed at HCRC by Cathy Sotillo. Documentation was adapted by Ellen Bard, David McKelvie and Cathy Sotillo from originals by Henry S. Thompson, Miles Bader, Cathy Sotillo, Jan McAllister and Ellen Bard. SGML and TEI files were created by David McKelvie after models by Henry Thompson. Time-stamping, pre-mastering, and CD production were done by Jim McQuillan of Speech Data Services Ltd, Great Malvern, UK.

HCRC is the distributor of this corpus. Email enquiries should go to maptask@cogsci.ed.ac.uk

Paper mail enquiries should be sent to:

          DCIEM Map Task
          Human Communication Research Centre
          University of Edinburgh
          2 Buccleuch Place
          Edinburgh EH8 9LW
          SCOTLAND

UNIX is a trademark of AT&T Bell Laboratories. PostScript is a trademark of Adobe Systems Incorporated.
