This release of WSJCAM0 is version 1.1 of the corpus, which Cambridge University first released on tape on 31 August 1994. The contents are organized as follows:
training data from head-mounted microphone:
wsjcam0/data/primary_microphone/si_tr/
development test data from head-mounted microphone, plus
first set of evaluation test data:
wsjcam0/data/primary_microphone/si_dt/
training data from desk-mounted microphone:
wsjcam0/data/secondary_microphone/si_tr/
development test data from desk-mounted microphone, plus
second set of evaluation test data:
wsjcam0/data/secondary_microphone/si_dt/
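
The directory layout above can be expressed as a small lookup. The following is a minimal sketch, not part of the corpus distribution: the function names, the short keys ("head", "desk", "train", "dev_eval") and the corpus root location are all hypothetical, and it assumes that the desk-microphone training data sits under secondary_microphone/si_tr (mirroring si_dt) and that speaker directories such as c02 sit directly under each subset directory.

```python
from pathlib import Path

# Directory names taken from the layout above; the short keys are
# illustrative assumptions, not names used by the corpus itself.
MICROPHONES = {"head": "primary_microphone", "desk": "secondary_microphone"}
SUBSETS = {"train": "si_tr", "dev_eval": "si_dt"}


def subset_dir(corpus_root, microphone, subset):
    """Return the directory for one microphone/subset combination."""
    return (Path(corpus_root) / "wsjcam0" / "data"
            / MICROPHONES[microphone] / SUBSETS[subset])


def speaker_dirs(corpus_root, microphone, subset):
    """List speaker directories under a subset, sorted by name.

    Assumes speaker directories are immediate children of the subset
    directory; returns an empty list if the directory is missing.
    """
    base = subset_dir(corpus_root, microphone, subset)
    if not base.is_dir():
        return []
    return sorted(p for p in base.iterdir() if p.is_dir())
```

For example, subset_dir("/corpora", "desk", "dev_eval") would point at /corpora/wsjcam0/data/secondary_microphone/si_dt, where /corpora is a placeholder for wherever the download was unpacked.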
Each speaker directory contains a set of .dot, .ifo, .pmt and .ptx files
per recording session, and a set of .wv1, .wv2, .phn and .wrd files per
utterance. The meanings of the file extensions are:

  dot: detailed orthographic transcription
  ifo: speaker information (age, sex, dialect, microphones used, ...)
  ptx: source prompting text
  pmt: prompts as they appeared on the screen
  wv1: waveforms from the head-mounted microphone in compressed SPHERE format
  wv2: waveforms from the desk-mounted microphone in compressed SPHERE format
  phn: phoneme alignments in TIMIT format
  wrd: word alignments in TIMIT format

In addition to the directories of waveform data listed above, each
CD-ROM contains both a ``wsjcam0/data/primary_microphone/doc'' and a
``wsjcam0/etc'' directory,
whose contents are identical across all discs. The
``wsjcam0/data/primary_microphone/etc''
directory contains the recommended training set and development test
sets. The development test sets are split into 5k closed and 20k open
(really 64k closed) tests in accordance with the word lists (supplied as
files wlist20o.nvp and wlist5c.nvp). Each development test set is
further subdivided into two to give test set sizes of about 370
utterances.

Thus, the following partitions are defined; taken together they
encompass all the recorded waveforms except the adaptation utterances:

  si_tr     official training data
  si_trx    training data that has been excluded due to mispronunciation
  si_dt5a   the primary development test set for the 5k task
  si_dt5b   the secondary development test set for the 5k task
  si_dt20a  the primary development test set for the 20k task
  si_dt20b  the secondary development test set for the 20k task
  si_dt20x  development data that has been excluded due to mispronunciation

For each non-excluded partition, four files exist with the following
suffixes:

  dbl: base path names (e.g. wsjcam0/si_tr/c02/c02c0202)
  dot: the sum of the .dot files for this partition
  ref: reference transcription (in upper case)
  prf: .dbl entry followed by reference transcription (in lower case)

See the files wsjcam0/data/primary_microphone/doc/abstract.{tex,ps} and
wsjcam0/data/primary_microphone/doc/wsjcam0.{tex,ps} for more details.

Version 1.0 was released on 5 April 1994.
Several transcription errors have been corrected since April, a
pronunciation dictionary has been developed (currently available by ftp
as ftp://svr-ftp.eng.cam.ac.uk/pub/comp.speech/data/beep-0.3.tar.gz) and
automatically generated phoneme and word alignments have been added.
Version 1.1 was released on 31 August 1994.

Published on CD-ROM on 15 February 1995 by:
Linguistic Data Consortium
University of Pennsylvania
Updated to Web Download in October, 2015.

Enquiries to Tony Robinson (ajr@eng.cam.ac.uk)
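
As an illustration of how the .dbl base path names and the TIMIT-format .phn and .wrd alignments described above might be consumed, here is a minimal sketch. It assumes the usual TIMIT convention of one "<start sample> <end sample> <label>" triple per line, that the per-utterance files are named by appending the extension to a .dbl base path, and a 16 kHz sample rate; none of these is confirmed by this page, and the names Segment, parse_timit_alignment, utterance_files and duration_seconds are hypothetical.

```python
from typing import Dict, List, NamedTuple


class Segment(NamedTuple):
    start: int  # sample index where the segment begins
    end: int    # sample index where the segment ends
    label: str  # phoneme (.phn) or word (.wrd)


def parse_timit_alignment(text: str) -> List[Segment]:
    """Parse lines of the form '<start> <end> <label>' (TIMIT convention)."""
    segments = []
    for line in text.splitlines():
        parts = line.split(None, 2)
        if len(parts) < 3:
            continue  # skip blank or incomplete lines
        segments.append(Segment(int(parts[0]), int(parts[1]), parts[2].strip()))
    return segments


def utterance_files(base_path: str) -> Dict[str, str]:
    """Map each per-utterance extension to a file name for one .dbl entry.

    Assumes file names are formed as '<base path>.<extension>'.
    """
    return {ext: f"{base_path}.{ext}" for ext in ("wv1", "wv2", "phn", "wrd")}


def duration_seconds(segment: Segment, sample_rate: int = 16000) -> float:
    """Segment length in seconds; the 16 kHz default is an assumption."""
    return (segment.end - segment.start) / sample_rate
```

A caller would read one line of a .dbl file, pass it to utterance_files to find the matching .phn or .wrd file, and hand that file's text to parse_timit_alignment. The waveform files themselves are in compressed SPHERE format and need a SPHERE-aware reader, which this sketch does not cover.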
Please send comments regarding this page to
avm@unagi.cis.upenn.edu