VERSION 1.0
Defence and Civil Institute of Environmental Medicine North York, Ontario, Canada
Human Communication Research Centre University of Edinburgh & University of Glasgow, UK
under the aegis of
NATO DRG Panel 3 Research Study Group 10 (Automatic Speech Processing)
Corpus Copyright 1995 DCIEM Distributed by HCRC and LDC
LICENSE: The copyright holder grants to the purchaser of these CD-ROMs unrestricted license to use all the corpus materials (speech, transcription, maps, tools, documentation) included herein, subject only to the following restrictions: 1) No onward distribution of the corpus materials is allowed -- copies may be made only for use by the purchaser and his/her research group, for ease of use by that group, etc.; 2) The contributions of DCIEM and HCRC are acknowledged in any public presentation or publication of any work based on the corpus.
The DCIEM Sleep Deprivation Study Map Task Corpus carries no warranty of any kind.
Since DCIEM, HCRC, and RSG10 members continue to use the Corpus in our own research, we welcome contact with colleagues engaged in similar projects. For this reason we ask purchasers to notify us as a matter of courtesy of the topic of their intended work with these materials.
Funding by Department of National Defence, Canada; Economic and Social Research Council, UK; Linguistic Data Consortium, USA
Pre-mastering by Speech Data Services Ltd, Great Malvern, UK
This is CD-ROM 1 of a set of 3 in Part 1. Taken together, Parts 1 and 2 contain:
The transcriptions of all of the dialogues are repeated together on all of the CDs for ease of access when the speech files are not required.
I. Directory Structure and File Contents
All CD-ROMs have a common structure.
The top-level directory contains the following files on each:
  0dir.txt      A complete listing of all files, giving the CD on which each can be found.
  0direye.txt   A complete listing of all dialogues, giving the CD on which each can be found, in a form more convenient for visual scanning.
  read.me       This file, with the part and CD number changing from one CD to the next.

The top-level of each CD contains the following directories in all cases:
  doc/       ASCII and/or PostScript(TM) versions of various documents on the corpus: START HERE
  lib/       Resources for included tools
  trn_all/   All the transcripts
  etc/       Information about participants and maps.
  src/       UNIX(TM) scripts and C sources for useful tools, emacs interface, world wide web interface and a Microsoft Windows(TM) sound playing program.

In addition to the common directories, this CD also contains:
  run1/
  run2/

The ordinal numbers of dialogues within their run provide the names of the sub-directories.
For any run:
  d04/ d05/ d06/ ...   Conversations
  d07/ d08/ etc

Each conversation directory has the following files:
  NIST header                            (.nst)
  sampled speech                         (.ses)
  annotated orthographic transcription   (.trn)
  giver's map                            (.gmp)
  follower's map                         (.fmp)
  TEI entry-point                        (.sgm)

Note that the transcripts are linked to the sampled speech files by time-stamps on every turn. The file src/signal/player.el makes it possible to adapt emacs so that portions of the speech (.ses) file can be accessed from their transcription (.trn). The directory src/web contains a world wide web based interface to do the same thing. The directory src/windows contains a Microsoft Windows(TM) program able to play the speech files.
Please contact maptask@cogsci.ed.ac.uk for updated interfaces, or see http://www.cogsci.ed.ac.uk/hcrc/wgs/dialogue/corpora/.
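
The directory layout described above (run directories holding d<NN> conversation directories, each with six file types) can be checked with a short script. The following is a minimal sketch in Python, not one of the tools supplied in src/. It assumes only what this file states: the conversation directory names and the six extensions. It matches files by extension alone, because the file naming model is not relied on here, and it compares names case-insensitively on the assumption that some systems present CD-ROM file names in upper case. The function names are hypothetical.

```python
from pathlib import Path

# Extensions listed in this read.me for each conversation directory.
EXPECTED_EXTENSIONS = (".nst", ".ses", ".trn", ".gmp", ".fmp", ".sgm")


def missing_extensions(conversation_dir):
    """Return the expected extensions that have no file in conversation_dir."""
    present = {
        p.suffix.lower()
        for p in Path(conversation_dir).iterdir()
        if p.is_file()
    }
    return [ext for ext in EXPECTED_EXTENSIONS if ext not in present]


def check_run(run_dir):
    """Map each d<NN> conversation directory in run_dir to its missing extensions."""
    report = {}
    for entry in sorted(Path(run_dir).iterdir()):
        name = entry.name.lower()
        # Conversation directories are named "d" plus the dialogue's ordinal number.
        if entry.is_dir() and name[:1] == "d" and name[1:].isdigit():
            report[entry.name] = missing_extensions(entry)
    return report
```

For example, check_run("/cdrom/run1") returns a dictionary keyed by conversation directory name, where an empty list means all six file types were found. The path /cdrom/run1 is a placeholder for wherever the CD is mounted.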
II. File naming conventions
The names for files associated with dialogues indicate the experimental conditions under which the dialogue was produced. The conditions are described in full in doc/design.sgm. File headers include the condition information in a more verbose manner than the filenames. The latter remain short enough to be accessed by PC. File names are constructed to the following model:
These filenames omit the identification numbers of individual speakers taking part. A table including this information is found at the end of doc/design.sgm. The internal header of each .trn file also includes this information.
Note that as each conversation has an id, and each turn has a number, to
refer to an individual turn in a standard way, use DCIEMMTC:
III. Use of SGML (Standard Generalized Markup Language)
The transcripts, documentation and some of the associated materials
included in this corpus are marked up using SGML, following the draft
guidelines of the Text Encoding Initiative (TEI), as used in the HCRC Map
Task Corpus. We have attempted to observe the guidelines for document
headers, where we have changed very little of what has been distributed by
the TEI. In the body of the transcripts, mindful of the needs of those who
will read them as they stand and/or process them with tools which are not
sensitive to SGML markup, we have had to deviate rather more from TEI
norms. All the files anywhere in the corpus with extension ".sgm" are
SGML-conformant, as validated by version 1.0 of the public domain UNIX(TM)
tool sgmls.
Of course, non-SGML-based tools can access the .trn files directly. The
file doc/editorl.sgm is taken from the HCRC Map Task Corpus and provides
detailed information about the editorial conventions and markup used in the
transcripts.
Public entity references are used throughout for external references, and
the script in src/mtei documents the search path which is required for
those references to succeed.
For further information about these issues, see lib/tei/0read.me and the DTD
files in the same directory.
IV. Contacts
The DCIEM Sleep Deprivation Study was designed at DCIEM (contact Martin
Taylor). The Map Task as used here was based on
the design in the HCRC Map Task Corpus (copyright HCRC 1992) and reported
in Anderson et al. (1991) [Language and Speech, 34: 351-366]. The current
corpus was designed by Ellen Gurman Bard, Cathy Sotillo, and Anne Anderson
of HCRC (contact Ellen Bard) and Martin Taylor of DCIEM.
Transcription was managed at HCRC by Cathy Sotillo.
Documentation was adapted by Ellen Bard, David McKelvie and Cathy Sotillo
(from originals by Henry S. Thompson, Miles Bader, Cathy Sotillo, Jan
McAllister and Ellen Bard). SGML and TEI files were created by David
McKelvie after models by Henry Thompson. Time-stamping, pre-mastering, and
CD production were done by Jim McQuillan of Speech Data Services Ltd,
Great Malvern, UK.
HCRC is the distributor of this corpus. Email enquiries should go to
maptask@cogsci.ed.ac.uk
Paper mail enquiries should be sent to
DCIEM Map Task
Human Communication Research Centre
University of Edinburgh
2 Buccleuch Place
Edinburgh EH8 9LW
SCOTLAND
UNIX is a trademark of AT&T Bell Laboratories.
PostScript is a trademark of Adobe Systems Incorporated.