DCIEM/HCRC

Item Name:	DCIEM/HCRC
Author(s):	Martin Taylor, Ellen Gurman Bard, Cathy Sotillo, David McKelvie, Anne Anderson
LDC Catalog No.:	LDC96S38
ISBN:	1-58563-089-6
ISLRN:	139-466-600-760-1
DOI:	https://doi.org/10.35111/4540-j072
Member Year(s):	1996
DCMI Type(s):	Sound, Text
Sample Type:	2-channel pcm
Sample Rate:	20000
Data Source(s):	microphone speech
Application(s):	speech recognition
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC96S38 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Taylor, Martin, et al. DCIEM/HCRC LDC96S38. Web Download. Philadelphia: Linguistic Data Consortium, 1996.
Related Works:	hasAnnotation
LDC2019S09 First DIHARD Challenge Development - Eight Sources
LDC2019S12 First DIHARD Challenge Evaluation - Nine Sources
LDC2021S10 Second DIHARD Challenge Development - Eleven Sources
LDC2022S06 Second DIHARD Challenge Evaluation - Eleven Sources
LDC2022S12 Third DIHARD Challenge Development
LDC2022S14 Third DIHARD Challenge Evaluation

Introduction

DCIEM/HCRC was developed by the Defence and Civil Institute of Environmental Medicine in Canada and the Human Communication Research Centre at the University of Edinburgh and the University of Glasgow. It contains approximately 23 hours of English speech data, along with corresponding transcripts, from 36 participants (34 male and 2 female). This release contains the materials used to collect all 216 spoken dialogues, together with digital audio, orthographic transcriptions, documentation, and source code for tools. The dialogues were selected to provide balanced representation at different points in a sleep deprivation experiment.

Data

The top-level directory contains the following files:

0dir.txt: A complete listing of all files, giving the CD on which each can be found.
0direye.txt: A complete listing of all dialogues, giving the CD on which each can be found, in a form more convenient for visual scanning.
read.me: A readme file, with the part and CD number changing from one CD to the next.

The top-level directory contains the following directories:

doc/ ASCII and/or PostScript(TM) versions of various documents on the corpus: START HERE
lib/ Resources for included tools
trn_all/ All the transcripts
etc/ Information about participants and maps
src/ UNIX(TM) scripts and C sources for useful tools, an Emacs interface, a World Wide Web interface, and a Microsoft Windows(TM) sound-playing program.

In addition to the common directories, each CD also contains:

run1/
run2/

Each run directory contains sampled audio, transcripts, and maps for one of the six runs of the sleep deprivation experiment.

Each conversation directory has the following files:

NIST header (.nst)
sampled speech (.ses)
annotated orthographic transcription (.trn)
giver's map (.gmp)
follower's map (.fmp)
TEI entry-point (.sgm)
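
The file list above can be checked against a local copy with a short script. The following is a minimal sketch in Python, assuming each conversation directory holds its files directly (not in subdirectories) and that the extensions are exactly those listed; the function names `inventory` and `missing` are illustrative and are not part of the corpus tools in src/.

```python
from pathlib import Path

# File types listed in the catalog entry for each conversation directory.
EXPECTED = {
    ".nst": "NIST header",
    ".ses": "sampled speech",
    ".trn": "annotated orthographic transcription",
    ".gmp": "giver's map",
    ".fmp": "follower's map",
    ".sgm": "TEI entry-point",
}


def inventory(conv_dir):
    """Map each expected extension to the matching files in conv_dir."""
    found = {ext: [] for ext in EXPECTED}
    for path in sorted(Path(conv_dir).iterdir()):
        ext = path.suffix.lower()  # tolerate upper-case names (assumption)
        if path.is_file() and ext in found:
            found[ext].append(path)
    return found


def missing(conv_dir):
    """Describe the expected file types that are absent from conv_dir."""
    return [EXPECTED[ext] for ext, paths in inventory(conv_dir).items() if not paths]
```

Calling `missing()` on a conversation directory returns an empty list when all six file types are present, which gives a quick completeness check before further processing.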

Audio data is presented as 2-channel, 16-bit, 20 kHz .ses files. Metadata including participant age, gender, and birthplace is included. The materials are designed to be easily accessible to users with different equipment and a variety of needs, from those who merely wish to generate hard copies of the orthographic transcriptions to those who require computational analyses of the speech material. All the text files (transcriptions and documentation) should be readable and printable on most systems. The maps are intended for printing on PostScript printers, and the speech files are provided with human-readable standard headers, enabling them to be played in a wide range of environments for processing sampled speech.
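
As an illustration of the stated audio format, the sketch below de-interleaves raw 2-channel, 16-bit PCM and estimates duration at 20 kHz. Several points are assumptions not confirmed by this entry: that a .ses file is headerless (the NIST header being the separate .nst file), that the two channels are interleaved sample by sample, and that samples are little-endian signed integers. Check the documentation in doc/ before relying on any of these; the function names are hypothetical.

```python
import array
import sys

SAMPLE_RATE = 20000   # Hz, from the catalog entry
CHANNELS = 2          # from the catalog entry
BYTES_PER_SAMPLE = 2  # 16-bit samples


def split_channels(raw, little_endian=True):
    """De-interleave raw 16-bit stereo PCM bytes into two lists of samples.

    The byte order and interleaved layout are assumptions, not corpus facts.
    """
    frame_bytes = CHANNELS * BYTES_PER_SAMPLE
    usable = len(raw) - len(raw) % frame_bytes  # drop a trailing partial frame
    samples = array.array("h")                  # signed 16-bit integers
    samples.frombytes(raw[:usable])
    if little_endian != (sys.byteorder == "little"):
        samples.byteswap()                      # convert to native byte order
    return samples[0::2].tolist(), samples[1::2].tolist()


def duration_seconds(num_bytes):
    """Duration of num_bytes of audio at 20 kHz, 2 channels, 16 bits."""
    return num_bytes / (SAMPLE_RATE * CHANNELS * BYTES_PER_SAMPLE)
```

At these settings one second of audio occupies 20000 × 2 × 2 = 80,000 bytes, so the roughly 23 hours of speech correspond to about 6.6 GB of uncompressed samples.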
