                            SR4X Corpus
                            Release 1.2

              Center for Spoken Language Understanding


UPDATED: 23 August 2002


Overview
--------
This corpus consists of speech from 36 speakers, each recorded on four
different channels while repeating the following eleven words:

  startrek
  supernova
  tektronix
  generation
  nebula
  processing
  singularity
  71523
  abracadabra
  sungeeta
  computer

Each speaker repeated each word six times on each channel.  Each
utterance is recorded as a separate file.  File names have the form:

  SD-1309-tektronix-t2-52.wav

SD is an abbreviation that identifies the corpus (speaker dependent)
1309 is the speaker number
tektronix is the word spoken for this utterance
t2 indicates that this is for channel 2
52 is a serial number assigned during the course of each call.
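
The naming scheme above can be taken apart mechanically.  The following
Python sketch is illustrative only and is not part of the distribution;
the names UtteranceName and parse_utterance_name are hypothetical, and
the sketch assumes every file follows the five-field pattern of the
example exactly.

```python
from typing import NamedTuple


class UtteranceName(NamedTuple):
    corpus: str    # corpus tag, "SD" in the example above
    speaker: str   # speaker number, kept as a string
    word: str      # word spoken for this utterance
    channel: int   # channel number taken from the "tN" field
    serial: int    # serial number assigned during the call


def parse_utterance_name(filename: str) -> UtteranceName:
    """Split a name such as SD-1309-tektronix-t2-52.wav into its fields."""
    # Drop any leading directory and the .wav extension.
    stem = filename.rsplit("/", 1)[-1]
    if stem.lower().endswith(".wav"):
        stem = stem[:-4]
    parts = stem.split("-")
    # Assumption: exactly five hyphen-separated fields, channel written "tN".
    if len(parts) != 5 or not parts[3].startswith("t"):
        raise ValueError(f"unexpected file name: {filename!r}")
    corpus, speaker, word, channel, serial = parts
    return UtteranceName(corpus, speaker, word, int(channel[1:]), int(serial))
```

For the example name, parse_utterance_name("SD-1309-tektronix-t2-52.wav")
yields speaker "1309", word "tektronix", channel 2, and serial 52.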

The four channels used are:

 1 - office phone
 2 - home phone
 3 - carbon microphone telephone
 4 - speaker phone (through speaker)
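
As a convenience, the channel list can be kept as a lookup table.  This
is a minimal Python sketch, assuming the channel field in a file name is
always the letter "t" followed by one of the numbers above (only "t2"
appears in this document); channel_description is a hypothetical helper
name.

```python
# Channel numbers and descriptions as listed in this document.
CHANNELS = {
    1: "office phone",
    2: "home phone",
    3: "carbon microphone telephone",
    4: "speaker phone (through speaker)",
}


def channel_description(code: str) -> str:
    """Return the description for a channel field such as "t2"."""
    # Assumption: the field is "t" followed by the channel number.
    if not code.startswith("t") or not code[1:].isdigit():
        raise ValueError(f"unexpected channel field: {code!r}")
    return CHANNELS[int(code[1:])]
```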


Gender Information
------------------
The following table shows the gender of each of the participants based
on their speaker number.

1030 m
1063 m 
1111 f 
1159 m 
1227 f 
1234 m 
1305 m 
1309 m 
1348 m 
1381 f 
1430 f 
1436 m 
1561 f 
1584 m 
1637 m 
1648 m 
1683 f 
2222 f 
3333 m 
3335 m 
3745 m 
4444 f 
5555 m 
6666 f 
7011 m 
7308 f 
7315 m 
7329 f 
7339 f 
7341 m 
7382 f 
7488 m 
7496 m 
7502 f 
7523 m 
7876 m 

Male: 22
Female: 14


Verification
------------
We classified each utterance in the corpus as one of four categories:
good, bad, noisy, or different.  We classified the whole corpus once,
then classified it a second time.  We compared the results of the two
passes and reviewed every utterance on which the passes disagreed.
Agreement was about 85%.  The following confusion matrix shows where
most of the disagreements occurred.

        g       b       n       d

g       6877    0       314     628
b       31      45      0       2
n       142     3       414     60
d       119     6       25      305

1330 mismatches out of 8971 files
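
The agreement figure can be checked directly from the matrix.  The
Python sketch below copies the counts from the table above and treats
the empty cells as zero, which is an assumption, although it is
consistent with the stated totals.

```python
LABELS = ["g", "b", "n", "d"]  # good, bad, noisy, different

# Counts copied from the confusion matrix; blank cells are taken as 0.
MATRIX = [
    [6877, 0, 314, 628],
    [31, 45, 0, 2],
    [142, 3, 414, 60],
    [119, 6, 25, 305],
]

total = sum(sum(row) for row in MATRIX)
# Diagonal cells are utterances given the same label in both passes.
agreed = sum(MATRIX[i][i] for i in range(len(LABELS)))
mismatches = total - agreed
agreement = agreed / total

print(f"{mismatches} mismatches out of {total} files")
print(f"agreement: {agreement:.1%}")
```

The off-diagonal cells sum to 1330 and all cells sum to 8971, so
agreement is 7641 / 8971, or about 85.2%.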

The four categories are defined in the document speaker.ps that is
included in the /docs directory of this distribution.

The results of the verification process are contained in four files:

	good.txt
	bad.txt
	noisy.txt
	different.txt

in the /docs directory of this distribution.