WSJCAM0 Readme file

WSJCAM0

THE CAMBRIDGE VERSION OF THE CONTINUOUS SPEECH RECOGNITION CORPUS

This release of WSJCAM0 is version 1.1 of the corpus, which Cambridge University originally released on tape on 31 August 1994. The contents are organized as follows:

training data from head-mounted microphone:

wsjcam0/data/primary_microphone/si_tr/

development test data from head-mounted microphone, plus first set of evaluation test data:

wsjcam0/data/primary_microphone/si_dt/
wsjcam0/data/primary_microphone/si_et_1/

training data from desk-mounted microphone:

wsjcam0/data/secondary_microphone/si_tr/

development test data from desk-mounted microphone, plus second set of evaluation test data:

wsjcam0/data/secondary_microphone/si_dt/
wsjcam0/data/secondary_microphone/si_et_2/

Each speaker directory contains a set of .dot, .ifo, .pmt and .ptx files per recording session, and a set of .wv1, .wv2, .phn and .wrd files per utterance. The meanings of the file extensions are:

dot:	detailed orthographic transcription 
ifo:	speaker information (age, sex, dialect, microphones used, ...) 
ptx:	source prompting text 
pmt:	prompts as they appeared on the screen 

wv1:	waveforms from the head-mounted microphone in compressed SPHERE format
wv2:	waveforms from the desk-mounted microphone in compressed SPHERE format
phn:	phoneme alignments in TIMIT format 
wrd:	word alignments in TIMIT format
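
As an illustration of how the .phn and .wrd alignment files might be read, the following is a minimal Python sketch. It assumes the usual TIMIT layout of one segment per line (begin sample, end sample, label, separated by whitespace). The 16 kHz sampling rate, the function names and the example labels are assumptions made for this sketch and are not stated in this README; the sample text is invented and is not corpus data.

```python
from typing import List, NamedTuple

# Assumption: 16 kHz sampling rate (not stated in this README).
SAMPLE_RATE_HZ = 16000


class Segment(NamedTuple):
    start_sample: int
    end_sample: int
    label: str


def parse_timit_alignment(text: str) -> List[Segment]:
    """Parse TIMIT-style alignment text: '<begin> <end> <label>' per line."""
    segments = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or malformed lines
        start, end, label = parts
        segments.append(Segment(int(start), int(end), label.strip()))
    return segments


def duration_seconds(seg: Segment, sample_rate: int = SAMPLE_RATE_HZ) -> float:
    """Convert a segment's length from samples to seconds."""
    return (seg.end_sample - seg.start_sample) / sample_rate


# Invented example in the shape of a .wrd file (not taken from the corpus).
example = "0 3200 sil\n3200 9600 the\n9600 20800 cat\n"
segments = parse_timit_alignment(example)
for seg in segments:
    print(seg.label, duration_seconds(seg))
```

The same function would apply to .phn files, since both are described as TIMIT format; only the labels differ (phonemes rather than words).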

In addition to the directories of waveform data listed above, each CD-ROM contains both a ``wsjcam0/data/primary_microphone/doc'' and a ``wsjcam0/etc'' directory, whose contents are identical across all discs. The ``wsjcam0/data/primary_microphone/etc'' directory contains the recommended training set and development test sets. The development test sets are split into 5k closed-vocabulary and 20k open-vocabulary (really 64k closed) tests in accordance with the supplied word lists (files wlist5c.nvp and wlist20o.nvp). Each development test set is further divided in two, giving test sets of about 370 utterances each.

Thus, the following partitions are defined; taken together, they encompass all the recorded waveforms except the adaptation utterances:

si_tr		official training data 
si_trx		training data that has been excluded due to mispronunciation 
si_dt5a		the primary development test set for the 5k task 
si_dt5b		the secondary development test set for the 5k task 
si_dt20a	the primary development test set for the 20k task 
si_dt20b	the secondary development test set for the 20k task 
si_dt20x	development data that has been excluded due to mispronunciation

For each non-excluded partition, four files exist, with the following suffixes:

dbl:		base path names (e.g. wsjcam0/si_tr/c02/c02c0202) 
dot:		the concatenation of the .dot files for this partition
ref:		reference transcription (in upper case) 
prf:		.dbl entry followed by reference transcription (in lower case)
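
The partition files above could be read along the following lines. This is a sketch only: it assumes one entry per line and assumes that, in a .prf file, whitespace separates the .dbl entry from the transcription. The function names are hypothetical, and the transcript text in the example is invented; only the base path is the example given in this README.

```python
from typing import Dict, List


def parse_dbl(text: str) -> List[str]:
    """Return the base path names in a .dbl file (assumed one per line)."""
    return [line.strip() for line in text.splitlines() if line.strip()]


def parse_prf(text: str) -> Dict[str, str]:
    """Map each .dbl entry to its lower-case reference transcription.

    Assumption: the first whitespace-delimited field is the base path
    and the rest of the line is the transcription.
    """
    entries: Dict[str, str] = {}
    for line in text.splitlines():
        fields = line.split(None, 1)
        if not fields:
            continue  # skip blank lines
        transcript = fields[1].strip() if len(fields) > 1 else ""
        entries[fields[0]] = transcript
    return entries


def utterance_id(base: str) -> str:
    """Last path component of a base path name, e.g. 'c02c0202'."""
    return base.rsplit("/", 1)[-1]


def waveform_name(base: str, microphone: str = "head") -> str:
    """Append .wv1 (head-mounted) or .wv2 (desk-mounted) to a base path."""
    return base + (".wv1" if microphone == "head" else ".wv2")


# Base path from this README; the transcript is invented for illustration.
dbl_text = "wsjcam0/si_tr/c02/c02c0202\n"
prf_text = "wsjcam0/si_tr/c02/c02c0202 an invented transcript line\n"

bases = parse_dbl(dbl_text)
refs = parse_prf(prf_text)
for base in bases:
    print(utterance_id(base), waveform_name(base), refs.get(base, ""))
```

Note that the base path names may not match the directory layout of a given release exactly, so how they are mapped to files on disk is left to the reader.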

See the files wsjcam0/data/primary_microphone/doc/abstract.{tex,ps} and wsjcam0/data/primary_microphone/doc/wsjcam0.{tex,ps} for more details.

Version 1.0 was released on 5 April 1994.

Since April 1994, several transcription errors have been corrected, a pronunciation dictionary has been developed (currently available by FTP at ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/data/beep-0.3.tar.gz), and automatically generated phoneme and word alignments have been added.

Version 1.1 was released on 31 August 1994.

Published on CD-ROM on 15 February 1995 by:
Linguistic Data Consortium
University of Pennsylvania

Updated to Web Download in October 2015.

Enquiries to Tony Robinson (ajr@eng.cam.ac.uk)

Please send comments regarding this page to avm@unagi.cis.upenn.edu