User information for the KIDS database


DATABASE CONTENTS

This database comprises sentences read aloud by children. It was originally designed to create a training set of children's speech for the SPHINX II automatic speech recognizer, for its use by the LISTEN project at Carnegie Mellon University. That project uses the recognizer to follow children reading text from a screen, so that it can intervene when they are stuck or make an error. In the past, the recognizer had been trained on samples of female speech.

The children range in age from 6 to 11 (see details below) and were in first through third grades (the 11-year-old was in 6th grade) at the time of recording. There were 24 male and 52 female speakers. Although the girls outnumber the boys, we feel that the small difference in vocal tract length between the two at this age should make the effect of this imbalance negligible. There are 5180 utterances in all.

The speakers come from two separate populations. Since the LISTEN reading coach needed good examples of reading aloud, it was decided that the majority of the speakers should be "good" readers. They were recorded on site in the summer of 1995 and were enrolled in either the Chatham College Summer Camp or the Mount Lebanon Extended Day Summer Fun program in Pittsburgh. This set will hereafter be called SUM95; it contains 44 speakers and 3333 utterances.

The LISTEN system also needed examples of errorful reading and dialectal variants. The readers who supplied this type of speech come from a school with a high population of children who are at risk of growing up to be poor readers, and who could therefore benefit from any reading tutor or other system built upon this database. They come from Fort Pitt School in Pittsburgh and were recorded in April 1996. This subset will be referred to as FP; it contains 32 speakers and 1847 utterances.

The list of speakers, the set each belongs to, and the number of sentences per speaker can be found in the "tables" directory, in the file named "speaker.tbl".

It should be noted that although there is some dialectal variation in the speech of the SUM95 subset, the speech of the FP subset gives a very good representation of the dialects of the children who may be targeted for the LISTEN system. However, the user should be aware that the speakers' dialect partly reflects what is locally called "Pittsburghese".


DATA COLLECTION

Speech was recorded directly on a NeXT machine using software developed at Carnegie Mellon to record people reading aloud. The sentences were presented on the screen; after the speaker clicked on a button, the system waited until speech was detected to begin recording and, when silence was again detected, turned the recording off automatically. Although a function for rereading a sentence exists, it was rarely used with the children. The microphone used was a Sennheiser headset.
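The LISTEN recording software itself is not part of this corpus, but the speech/silence detection it relied on is easy to illustrate. The following is a minimal sketch of energy-based endpointing, not the actual CMU code; the frame size, energy threshold, and hang-over length are assumptions chosen purely for illustration.

    import numpy as np

    def endpoint(samples, rate=16000, frame_ms=20,
                 energy_thresh=500.0, hang_frames=25):
        """Rough speech/silence endpointing over 16-bit PCM samples.

        Returns (start, end) sample indices of the detected utterance,
        or None if no speech is found.  A real recorder would run this
        incrementally; here we scan a finished buffer for simplicity.
        """
        frame_len = rate * frame_ms // 1000
        n_frames = len(samples) // frame_len
        frames = samples[:n_frames * frame_len].astype(np.float64)
        rms = np.sqrt((frames.reshape(n_frames, frame_len) ** 2).mean(axis=1))
        speech = rms > energy_thresh
        if not speech.any():
            return None
        first = int(np.argmax(speech))                      # recording "turns on"
        last = int(n_frames - 1 - np.argmax(speech[::-1]))  # last speech frame
        end = min(n_frames, last + 1 + hang_frames)         # wait out a short silence
        return first * frame_len, end * frame_len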
When recording children's speech, our expectations must take into account the characteristics of the speakers. The first difference from adult speakers is that patience and attention levels are much lower. Whereas we can correct adults practically as often as we need to, we can permit ourselves only a limited number of comments to children before we lose our speaker. The amount of text we can ask an adult to read off a screen also far exceeds the amount we can ask of even a good young reader.

Because the amount of data coming from each speaker is sparser, we tended to keep sentences that we would have thrown away, or had reread, with an adult. The quality of the speech data reflects this, and our labeling notations (in the .trn and .pnt files, explained below) were constructed to take into account all of the factors contributing to less-than-perfect sentences. Rather than finding this bothersome, a user who wants to construct an application for children will find that the deviations in the KIDS sentences represent reality: the typical reading errors, hesitations, etc. that children make. The user may find the .pnt and .trn files extremely helpful in pinpointing important events.

Another aspect is that adults, if asked to, will read the text without adding comments or changing the subject. Children, especially if they find themselves in difficulty, will make sounds, sing, or change prosody to relieve the pressure. These unforeseen events have been left in if they fall within a sentence. If they came at the end of a sentence and were preceded by 500 ms of silence, they were eliminated in order to save space. Details may be found below.

Since we wanted to record many speakers, recording on site was the best option. Children were sent to the "computer room" from their classes or camp activities. In the case of the FP recordings, the user will find that the noise level reflects the school environment, where noise from the changing of classes, etc. is sometimes present.

From past experience, we knew that children tend to touch the microphone and wires and to kick the tables, creating considerable noise over the speech signal. The persons carrying out the recording were instructed to: 1) tell the children not to touch the microphone, headset, or wires, and 2) watch the children and repeat this instruction if necessary. This has resulted in quieter recordings than in the past, but kids are kids and there is still some noise. This is unavoidable, and the signals here give a realistic idea of the amount of speaker-generated noise that must be dealt with when working with children of this age range.
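The trailing-silence truncation mentioned above (and described in detail in the quality-control notes near the end of this document) can be sketched as follows. This is our own illustration, not the tool actually used: it reads samples directly past the fixed 1024-byte SPHERE header described under FILE CONTENT INFORMATION below, and the frame size and energy threshold are arbitrary illustrative values. The one rule taken from the text is that 500 ms of silence are left after the last detected speech.

    import numpy as np

    def trim_trailing_silence(path, frame_ms=20, energy_thresh=500.0):
        """Return the samples of a *.sph file with long trailing silence
        removed, keeping 500 ms after the last detected speech."""
        rate = 16000                          # fixed for this corpus
        with open(path, "rb") as f:
            f.seek(1024)                      # skip the fixed-size SPHERE header
            samples = np.frombuffer(f.read(), dtype=">i2")   # 16-bit, high-byte first
        frame_len = rate * frame_ms // 1000
        n_frames = len(samples) // frame_len
        frames = samples[:n_frames * frame_len].astype(np.float64)
        rms = np.sqrt((frames.reshape(n_frames, frame_len) ** 2).mean(axis=1))
        voiced = np.nonzero(rms > energy_thresh)[0]
        if len(voiced) == 0:
            return samples                    # no speech detected; leave file alone
        last = (int(voiced[-1]) + 1) * frame_len      # end of last speech frame
        return samples[:min(last + rate // 2, len(samples))]   # plus 500 ms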
TEXT USED FOR READING

The text presented to the children was obtained from Weekly Reader stories. Weekly Reader is a four-page color reading supplement given out to children in many classrooms. It was chosen so that the LISTEN project could use materials that teachers were already familiar with, making the system more acceptable and easier to integrate into existing curricula.

The Weekly Reader has a different version for each grade. We used stories for grades 1 to 3, presented one sentence at a time; one wavefile therefore corresponds to one sentence. Photographs and drawings illustrate the stories. Since we were told that we could have permission to reproduce only the text, not the images, some of the sentences were modified so that they could stand alone without the visual support. Some sentences were also modified to take into account the difference between reading to oneself and reading aloud; possible tongue twisters were eliminated, for example.

An effort was made to preserve a sense of story. Groups of these sentences make up stories, and the persons recording the children were told to make an effort to start recordings at the beginning of a story and to end a session at the end of a story. The grade level of the sentences for each student was chosen on the fly to reflect the student's real reading level (rather than grade level); this information was obtained in advance from the teachers.

The number of sentences per speaker is variable. While the goal was to have about 120 sentences from the "good" readers and about 40 from the "poor" ones, the children read until the limits of their attention spans were reached or they tired from the effort of reading.

A total of 356 sentences are represented in the recorded data; these sentences comprise a total of 3576 words, of which 878 are unique. The number of times a given sentence was successfully recorded may be found in the file "tables/sentence.tbl"; the file "tables/wordfreq.tbl" contains the list of words and their frequencies of occurrence in the prompting texts and recordings (excluding mispronunciations).


ORGANIZATION OF THE CORPUS

The root directory of each CD-ROM contains the following items:

  - readme.1st : brief summary of corpus contents and organization
  - kids.doc   : this file
  - data       : directory containing speech and related data files,
                 subdivided by speaker and file type
  - tables     : directory containing supporting data in tabular form
                 (lists of speakers, sentences, words, etc.)

The data directory contains a subdirectory for each speaker (fnjs, for example). In each speaker directory there are four subdirectories: signal, trans, label, and point. These contain, respectively, the speech files, transcription files, lexical/phonetic segmentation files, and files providing information about phonetic variations in the utterances.


FILE NAMING CONVENTION

Each data file name encodes the type of file (i.e. the type of data it contains), the speaker gender and identification, the grade level and index of the sentence used as prompting text for that utterance, and whether the utterance was classified by annotators into "bin1" (sentences that were read correctly) or "bin2" (sentences containing one or more divergences from the intended utterance). Here are sample sets of file names (including their full directory paths) for a given speaker (fmbb) and two particular sentences (1aa and 1ab):

  data/fmbb/signal/fmbb1aa1.sph  # speech (waveform) data
  data/fmbb/trans/fmbb1aa1.trn   # transcription
  data/fmbb/label/fmbb1aa1.lbl   # lexical/phonetic segments

  data/fmbb/signal/fmbb1ab2.sph  # speech (waveform) data
  data/fmbb/trans/fmbb1ab2.trn   # transcription
  data/fmbb/label/fmbb1ab2.lbl   # lexical/phonetic segments
  data/fmbb/point/fmbb1ab2.pnt   # comments on phonetic divergence

The speaker identification consists of "m" or "f" (for male or female), followed by three initials. The sentence identification consists of a single digit (the grade level of the sentence), followed by two letters (ranging in ascending order: aa, ab, ac, ... az, ba, bb, bc, ... ex). The eighth character of the file name is "1" if the sentence was read correctly by that speaker; this means that no manual modification of the transcript was needed for that utterance, and a label file (establishing the time alignment of word and phonetic segment boundaries in the speech file) could be generated directly from the prompting text. If the eighth character is "2", the speaker produced an utterance with additions, elisions or alterations in the intended phonetic sequence; these required manual editing of the transcript file, and for these a "point" file was created to summarize the nature and locations of the effects. The file name extension (the three letters following the ".") indicates the type of data contained in the file. Each data format is described in the next section.
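Given this convention, a file name can be decoded mechanically. The sketch below is our own illustration (the function and field names are not part of the corpus); it simply applies the rules just described.

    import re
    from typing import NamedTuple

    class KidsFile(NamedTuple):
        gender: str     # "m" or "f"
        initials: str   # three-letter speaker initials
        grade: int      # grade level of the prompting sentence (1-3)
        sentence: str   # two-letter sentence index, "aa" .. "ex"
        bin: int        # 1 = read correctly, 2 = divergences
        ext: str        # "sph", "trn", "lbl", or "pnt"

    _NAME = re.compile(r"^([mf])([a-z]{3})([1-3])([a-z]{2})([12])\.(sph|trn|lbl|pnt)$")

    def parse_name(filename):
        """Decode a KIDS file name such as 'fmbb1ab2.sph'."""
        m = _NAME.match(filename)
        if m is None:
            raise ValueError("not a KIDS file name: " + filename)
        g, ini, grade, sent, b, ext = m.groups()
        return KidsFile(g, ini, int(grade), sent, int(b), ext)

For example, parse_name("fmbb1ab2.sph") yields KidsFile(gender='f', initials='mbb', grade=1, sentence='ab', bin=2, ext='sph').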
FILE CONTENT INFORMATION

The signal files (*.sph) are in SPHERE format, which means that they contain a fixed-length 1024-byte header that describes the file contents in terms of sampling rate (16 kHz), sample coding (16-bit linear PCM), sample byte order (high-byte first), number of channels (one), number of samples per file, and sample min and max values in the file. This information is presented in a plain, self-describing ASCII format according to a specification developed and maintained by the National Institute of Standards and Technology (NIST). Users of UNIX systems can make use of a set of utilities and software libraries for manipulating these files; this software is developed, maintained and provided free of charge by NIST, and the source code distribution of the software package has been included in this CD-ROM publication (in the "sphere" directory). Note that the use of this software is not essential for handling or manipulating the signal files; the header format is simple enough that a variety of methods can be employed to use both the header information and the sample data without installing the NIST SPHERE software package.

The transcript files are plain ASCII text files, typically quite small. In the case of the "bin1" files (correctly read), the text is simply what would be found in the table of sentence prompts. In the "bin2" files, there are special bracketed notations for phonetic spellings of mispronounced words and for miscellaneous acoustic events in the recording. Phonetic spellings are strings of upper-case phonetic segment labels, separated by spaces and bounded by slashes -- e.g. "/AE N/". Miscellaneous noises are indicated by square brackets containing no spaces -- e.g. "[begin_whisper]", "[noise]", etc.

The label files provide word-boundary and phone-boundary timing information in units of centiseconds, arranged in tabular form. The first column of each line gives the file name followed by ":word" or ":phone", the second column gives the word or segment, and the third and fourth columns give the begin and end time points.

The point files contain comments about phonetic insertions, deletions and alterations in the utterance, as determined by an annotator.


THE KIDS LABELING PROCEDURE

In order to make labeling faster and of better quality, we set up a two-pass labeling system. During the first pass, the labeler listens to the speech file to determine whether it follows the corresponding text (the text that was displayed on the screen). If the speech follows the text, with no extraneous noises, repeats, etc. (and the pronunciation of each word is the same as in CMUDICT 4.0), it is sent to "bin1". The bin1 subset of the data was then put aside for automatic labeling: the .trn and .lbl files for these utterances were generated automatically, and there is no corresponding .pnt file.

If the speech did not conform to the above criteria, it was put into the "bin2" subset. The labeler, still in the first pass, then created a .pnt file with a corresponding filename. The labeler filled the point file with the context (a string of one or more words) where the speech deviates from the text, and the reason why it deviated (missing word, word repeated, etc.). It should be noted that some point files are missing. The files in "bin2" were then sent to the second-pass labelers, who created the .trn files, filling them with orthographic text (words) where the speech followed the text, and with phonetic labeling (noise and whispers noted) where it did not. The SPHINX phone notation system was used.
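With these conventions, a bin2 transcript can be split mechanically into ordinary words, phonetic spellings, and bracketed events; this is useful, for example, for pulling out mispronounced words. The reader below is our own sketch, not a tool shipped with the corpus, and it assumes the notations are exactly as described above (slash-bounded upper-case phone strings, bracketed event tags containing no spaces).

    import re

    # One token class per notation described above:
    #   /AE N/           -> phonetic spelling of a mispronounced word
    #   [begin_whisper]  -> miscellaneous acoustic event
    #   anything else    -> ordinary orthographic word
    _TOKEN = re.compile(r"(/[A-Z ]+?/|\[[^\s\]]+\]|\S+)")

    def read_trn(text):
        """Classify the tokens of a .trn transcript string."""
        out = []
        for tok in _TOKEN.findall(text):
            if tok.startswith("/") and tok.endswith("/"):
                out.append(("phones", tok[1:-1].split()))
            elif tok.startswith("[") and tok.endswith("]"):
                out.append(("event", tok[1:-1]))
            else:
                out.append(("word", tok))
        return out

For example, read_trn("the cat [noise] /AE N/ sat") returns [('word', 'the'), ('word', 'cat'), ('event', 'noise'), ('phones', ['AE', 'N']), ('word', 'sat')].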
All speech files, together with the text and .trn files, were sent through Carnegie Mellon's SPHINX II speech recognition system in forced-alignment mode. The .lbl files contain the results of the forced alignment: word and phone alignments. Sometimes, due to noise simultaneous with the speech, forced alignment failed; when this happened, no corresponding .lbl file was produced. This occurred for 57 speech files in the corpus. (A complete listing of the data file inventory, indicating which file types are present for each speaker/utterance combination, can be found in the "tables" directory, in the file named "invntory.tbl".)

During quality control at the end of the project, we decided to eliminate inordinately long silences at the ends of files. We never eliminated silences at the beginning of a file or in the middle of speech, and we left 500 ms of silence after the end of the speech signal in every file we truncated. These long silences were in part due to low recording levels or crosstalk. Since we eliminated the ends of these wavefiles after they had gone through forced alignment, the content at the end of the forced-alignment files for these utterances will not correspond to the end of the wavefile. The quality of the forced alignment is in no way affected (the alignment from the beginning of the file is still valid).

The user may find the .pnt and .trn files of great interest for automatically finding subsets of the data containing reading errors, hesitations, false starts, repetitions, and dialectal variants of given words.

For further information about this database, please contact:

  Dr. Maxine Eskenazi (max@cs.cmu.edu)
    - recording, labeling, database structure, final data processing
  Dr. Jack Mostow (mostow@cs.cmu.edu)
    - forced alignment, automatic generation of bin1 .trn files,
      children and reading


ACKNOWLEDGEMENTS

The CMU LISTEN Project was funded by NSF Grant No. IRI-9528984. CD-ROM publication of the corpus was made possible by the Linguistic Data Consortium at the University of Pennsylvania. David Graff at the LDC provided assistance in adapting the data files and organizing the corpus for publication; he also created or regenerated the various table files to be consistent with the published form of the corpus, and contributed to the documentation.

Special reprint permission granted by Weekly Reader (R),
published by Weekly Reader Corporation.
Copyright (c) 1994, 1995 by Weekly Reader Corporation.
All Rights Reserved.