User information for the KIDS database


DATABASE CONTENTS

This database comprises sentences read aloud by children. It was originally designed to create a training set of children's speech for the SPHINX II automatic speech recognizer, for its use by the LISTEN project at Carnegie Mellon University. That project uses the recognizer to follow children reading text from a screen, so that it can intervene when they are stuck or make an error. In the past, the recognizer had been trained on samples of female speech.

The children range in age from 6 to 11 (see details below) and were in first through third grades (the 11-year-old was in 6th grade) at the time of recording. There were 24 male and 52 female speakers. Although the girls outnumber the boys, we feel that the small difference in vocal tract length between the two at this age should make the effect of this imbalance negligible. There are 5180 utterances in all.

The speakers come from two separate populations. Since the LISTEN reading coach needed good examples of reading aloud, it was decided that the majority of the speakers should be "good" readers. They were recorded on site in the summer of 1995 and were enrolled in either the Chatham College Summer Camp or the Mount Lebanon Extended Day Summer Fun program in Pittsburgh. This set will hereafter be called SUM95; it contains 44 speakers and 3333 utterances.

The LISTEN system also needed examples of errorful reading and dialectal variants. The readers who supplied this type of speech come from a school with a high population of children who are at risk of growing up to be poor readers, and who could therefore benefit from any reading tutor or other system built upon this database. They come from Fort Pitt School in Pittsburgh and were recorded in April 1996. This subset will be referred to as FP; it contains 32 speakers and 1847 utterances.

The list of speakers, the set each belongs to, and the number of sentences per speaker can be found in the "tables" directory, in the file named "speaker.tbl".

It should be noted that although there is some dialectal variation in the speech of the SUM95 subset, the speech of the FP subset gives a very good representation of the dialects of the children who may be targeted for the LISTEN system. However, the user should be aware that the speakers' dialect partly reflects what is locally called "Pittsburghese".


DATA COLLECTION

Speech was recorded directly on a NeXT machine using software developed at Carnegie Mellon to record people reading aloud. The sentences were presented on the screen; after the speaker clicked on a button, the system waited until speech was detected to begin recording and, when silence was again detected, turned the recording off automatically. Although a function for rereading a sentence exists, it was rarely used with the children. The microphone used was a Sennheiser headset.
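The LISTEN recording software itself is not part of this corpus, but the speech/silence detection it relied on is easy to illustrate. The following is a minimal sketch of energy-based endpointing, not the actual CMU code; the frame size, energy threshold, and hang-over length are assumptions chosen purely for illustration.

    import numpy as np

    def endpoint(samples, rate=16000, frame_ms=20,
                 energy_thresh=500.0, hang_frames=25):
        """Rough speech/silence endpointing over 16-bit PCM samples.

        Returns (start, end) sample indices of the detected utterance,
        or None if no speech is found.  A real recorder would run this
        incrementally; here we scan a finished buffer for simplicity.
        """
        frame_len = rate * frame_ms // 1000
        n_frames = len(samples) // frame_len
        frames = samples[:n_frames * frame_len].astype(np.float64)
        rms = np.sqrt((frames.reshape(n_frames, frame_len) ** 2).mean(axis=1))
        speech = rms > energy_thresh
        if not speech.any():
            return None
        first = int(np.argmax(speech))                      # recording "turns on"
        last = int(n_frames - 1 - np.argmax(speech[::-1]))  # last speech frame
        end = min(n_frames, last + 1 + hang_frames)         # wait out a short silence
        return first * frame_len, end * frame_len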
When recording children's speech, our expectations must take into account the characteristics of the speakers. The first difference from adult speakers is that patience and attention levels are much lower. Whereas we can correct adults practically as often as we need to, we can permit ourselves only a limited number of comments to children before we lose our speaker. The amount of text we can ask an adult to read off a screen also far exceeds the amount we can ask of even a good young reader.

Because the amount of data coming from each speaker is sparser, we tended to keep sentences that we would have thrown away, or had reread, with an adult. The quality of the speech data reflects this, and our labeling notations (in the .trn and .pnt files, explained below) were constructed to take into account all of the factors contributing to less-than-perfect sentences. Rather than finding this bothersome, a user who wants to construct an application for children will find that the deviations in the KIDS sentences represent reality: the typical reading errors, hesitations, etc. that children make. The user may find the .pnt and .trn files extremely helpful in pinpointing important events.

Another aspect is that adults, if asked to, will read the text without adding comments or changing the subject. Children, especially if they find themselves in difficulty, will make sounds, sing, or change prosody to relieve the pressure. These unforeseen events have been left in if they fall within a sentence. If they came at the end of a sentence and were preceded by 500 ms of silence, they were eliminated in order to save space. Details may be found below.

Since we wanted to record many speakers, recording on site was the best option. Children were sent to the "computer room" from their classes or camp activities. In the case of the FP recordings, the user will find that the noise level reflects the school environment, where noise from the changing of classes, etc. is sometimes present.

From past experience, we knew that children tend to touch the microphone and wires and to kick the tables, creating considerable noise over the speech signal. The persons carrying out the recording were instructed to: 1) tell the children not to touch the microphone, headset, or wires, and 2) watch the children and repeat this instruction if necessary. This has resulted in quieter recordings than in the past, but kids are kids and there is still some noise. This is unavoidable, and the signals here give a realistic idea of the amount of speaker-generated noise that must be dealt with when working with children of this age range.
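The trailing-silence truncation mentioned above (and described in detail in the quality-control notes near the end of this document) can be sketched as follows. This is our own illustration, not the tool actually used: it reads samples directly past the fixed 1024-byte SPHERE header described under FILE CONTENT INFORMATION below, and the frame size and energy threshold are arbitrary illustrative values. The one rule taken from the text is that 500 ms of silence are left after the last detected speech.

    import numpy as np

    def trim_trailing_silence(path, frame_ms=20, energy_thresh=500.0):
        """Return the samples of a *.sph file with long trailing silence
        removed, keeping 500 ms after the last detected speech."""
        rate = 16000                          # fixed for this corpus
        with open(path, "rb") as f:
            f.seek(1024)                      # skip the fixed-size SPHERE header
            samples = np.frombuffer(f.read(), dtype=">i2")   # 16-bit, high-byte first
        frame_len = rate * frame_ms // 1000
        n_frames = len(samples) // frame_len
        frames = samples[:n_frames * frame_len].astype(np.float64)
        rms = np.sqrt((frames.reshape(n_frames, frame_len) ** 2).mean(axis=1))
        voiced = np.nonzero(rms > energy_thresh)[0]
        if len(voiced) == 0:
            return samples                    # no speech detected; leave file alone
        last = (int(voiced[-1]) + 1) * frame_len      # end of last speech frame
        return samples[:min(last + rate // 2, len(samples))]   # plus 500 ms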
TEXT USED FOR READING

The text presented to the children was obtained from Weekly Reader stories. Weekly Reader is a four-page color reading supplement given out to children in many classrooms. It was chosen so that the LISTEN project could use materials that teachers were already familiar with, making the system more acceptable and easier to integrate into existing curricula.

The Weekly Reader has a different version for each grade. We used stories for grades 1 to 3, presented one sentence at a time; one wavefile therefore corresponds to one sentence. Photographs and drawings illustrate the stories. Since we were told that we could have permission to reproduce only the text, not the images, some of the sentences were modified so that they could stand alone without the visual support. Some sentences were also modified to take into account the difference between reading to oneself and reading aloud; possible tongue twisters were eliminated, for example.

An effort was made to preserve a sense of story. Groups of these sentences make up stories, and the persons recording the children were told to make an effort to start recordings at the beginning of a story and to end a session at the end of a story. The grade level of the sentences for each student was chosen on the fly to reflect the student's real reading level (rather than grade level); this information was obtained in advance from the teachers.

The number of sentences per speaker is variable. While the goal was to have about 120 sentences from the "good" readers and about 40 from the "poor" ones, the children read until the limits of their attention spans were reached or they tired from the effort of reading.

A total of 356 sentences are represented in the recorded data; these sentences comprise a total of 3576 words, of which 878 are unique. The number of times a given sentence was successfully recorded may be found in the file "tables/sentence.tbl"; the file "tables/wordfreq.tbl" contains the list of words and their frequencies of occurrence in the prompting texts and recordings (excluding mispronunciations).


ORGANIZATION OF THE CORPUS

The root directory of each CD-ROM contains the following items:

  - readme.1st : brief summary of corpus contents and organization
  - kids.doc   : this file
  - data       : directory containing speech and related data files,
                 subdivided by speaker and file type
  - tables     : directory containing supporting data in tabular form
                 (lists of speakers, sentences, words, etc.)

The data directory contains a subdirectory for each speaker (fnjs, for example). In each speaker directory there are four subdirectories: signal, trans, label, and point. These contain, respectively, the speech files, transcription files, lexical/phonetic segmentation files, and files providing information about phonetic variations in the utterances.


FILE NAMING CONVENTION

Each data file name encodes the type of file (i.e. the type of data it contains), the speaker gender and identification, the grade level and index of the sentence used as prompting text for that utterance, and whether the utterance was classified by annotators into "bin1" (sentences that were read correctly) or "bin2" (sentences containing one or more divergences from the intended utterance). Here are sample sets of file names (including their full directory paths) for a given speaker (fmbb) and two particular sentences (1aa and 1ab):

  data/fmbb/signal/fmbb1aa1.sph  # speech (waveform) data
  data/fmbb/trans/fmbb1aa1.trn   # transcription
  data/fmbb/label/fmbb1aa1.lbl   # lexical/phonetic segments

  data/fmbb/signal/fmbb1ab2.sph  # speech (waveform) data
  data/fmbb/trans/fmbb1ab2.trn   # transcription
  data/fmbb/label/fmbb1ab2.lbl   # lexical/phonetic segments
  data/fmbb/point/fmbb1ab2.pnt   # comments on phonetic divergence

The speaker identification consists of "m" or "f" (for male or female), followed by three initials. The sentence identification consists of a single digit (the grade level of the sentence), followed by two letters (ranging in ascending order: aa, ab, ac, ... az, ba, bb, bc, ... ex). The eighth character of the file name is "1" if the sentence was read correctly by that speaker; this means that no manual modification of the transcript was needed for that utterance, and a label file (establishing the time alignment of word and phonetic segment boundaries in the speech file) could be generated directly from the prompting text. If the eighth character is "2", the speaker produced an utterance with additions, elisions or alterations in the intended phonetic sequence; these required manual editing of the transcript file, and for these a "point" file was created to summarize the nature and locations of the effects. The file name extension (the three letters following the ".") indicates the type of data contained in the file. Each data format is described in the next section.
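Given this convention, a file name can be decoded mechanically. The sketch below is our own illustration (the function and field names are not part of the corpus); it simply applies the rules just described.

    import re
    from typing import NamedTuple

    class KidsFile(NamedTuple):
        gender: str     # "m" or "f"
        initials: str   # three-letter speaker initials
        grade: int      # grade level of the prompting sentence (1-3)
        sentence: str   # two-letter sentence index, "aa" .. "ex"
        bin: int        # 1 = read correctly, 2 = divergences
        ext: str        # "sph", "trn", "lbl", or "pnt"

    _NAME = re.compile(r"^([mf])([a-z]{3})([1-3])([a-z]{2})([12])\.(sph|trn|lbl|pnt)$")

    def parse_name(filename):
        """Decode a KIDS file name such as 'fmbb1ab2.sph'."""
        m = _NAME.match(filename)
        if m is None:
            raise ValueError("not a KIDS file name: " + filename)
        g, ini, grade, sent, b, ext = m.groups()
        return KidsFile(g, ini, int(grade), sent, int(b), ext)

For example, parse_name("fmbb1ab2.sph") yields KidsFile(gender='f', initials='mbb', grade=1, sentence='ab', bin=2, ext='sph').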
FILE CONTENT INFORMATION

The signal files (*.sph) are in SPHERE format, which means that they contain a fixed-length 1024-byte header that describes the file contents in terms of sampling rate (16 kHz), sample coding (16-bit linear PCM), sample byte order (high-byte first), number of channels (one), number of samples per file, and sample min and max values in the file. This information is presented in a plain, self-describing ASCII format according to a specification developed and maintained by the National Institute of Standards and Technology (NIST). Users of UNIX systems can make use of a set of utilities and software libraries for manipulating these files; this software is developed, maintained and provided free of charge by NIST, and the source code distribution of the software package has been included in this CD-ROM publication (in the "sphere" directory). Note that the use of this software is not essential for handling or manipulating the signal files; the header format is simple enough that a variety of methods can be employed to use both the header information and the sample data without installing the NIST SPHERE software package.

The transcript files are plain ASCII text files, typically quite small. In the case of the "bin1" files (correctly read), the text is simply what would be found in the table of sentence prompts. In the "bin2" files, there are special bracketed notations for phonetic spellings of mispronounced words and for miscellaneous acoustic events in the recording. Phonetic spellings are strings of upper-case phonetic segment labels, separated by spaces and bounded by slashes -- e.g. "/AE N/". Miscellaneous noises are indicated by square brackets containing no spaces -- e.g. "[begin_whisper]", "[noise]", etc.

The label files provide word-boundary and phone-boundary timing information in units of centiseconds, arranged in tabular form. The first column of each line gives the file name followed by ":word" or ":phone", the second column gives the word or segment, and the third and fourth columns give the begin and end time points.

The point files contain comments about phonetic insertions, deletions and alterations in the utterance, as determined by an annotator.


THE KIDS LABELING PROCEDURE

In order to make labeling faster and of better quality, we set up a two-pass labeling system. During the first pass, the labeler listens to the speech file to determine whether it follows the corresponding text (the text that was displayed on the screen). If the speech follows the text, with no extraneous noises, repeats, etc. (and the pronunciation of each word is the same as in CMUDICT 4.0), it is sent to "bin1". The bin1 subset of the data was then put aside for automatic labeling: the .trn and .lbl files for these utterances were generated automatically, and there is no corresponding .pnt file.

If the speech did not conform to the above criteria, it was put into the "bin2" subset. The labeler, still in the first pass, then created a .pnt file with a corresponding filename. The labeler filled the point file with the context (a string of one or more words) where the speech deviates from the text, and the reason why it deviated (missing word, word repeated, etc.). It should be noted that some point files are missing. The files in "bin2" were then sent to the second-pass labelers, who created the .trn files, filling them with orthographic text (words) where the speech followed the text, and with phonetic labeling (noise and whispers noted) where it did not. The SPHINX phone notation system was used.
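With these conventions, a bin2 transcript can be split mechanically into ordinary words, phonetic spellings, and bracketed events; this is useful, for example, for pulling out mispronounced words. The reader below is our own sketch, not a tool shipped with the corpus, and it assumes the notations are exactly as described above (slash-bounded upper-case phone strings, bracketed event tags containing no spaces).

    import re

    # One token class per notation described above:
    #   /AE N/           -> phonetic spelling of a mispronounced word
    #   [begin_whisper]  -> miscellaneous acoustic event
    #   anything else    -> ordinary orthographic word
    _TOKEN = re.compile(r"(/[A-Z ]+?/|\[[^\s\]]+\]|\S+)")

    def read_trn(text):
        """Classify the tokens of a .trn transcript string."""
        out = []
        for tok in _TOKEN.findall(text):
            if tok.startswith("/") and tok.endswith("/"):
                out.append(("phones", tok[1:-1].split()))
            elif tok.startswith("[") and tok.endswith("]"):
                out.append(("event", tok[1:-1]))
            else:
                out.append(("word", tok))
        return out

For example, read_trn("the cat [noise] /AE N/ sat") returns [('word', 'the'), ('word', 'cat'), ('event', 'noise'), ('phones', ['AE', 'N']), ('word', 'sat')].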
All speech files, together with the text and .trn files, were sent through Carnegie Mellon's SPHINX II speech recognition system in forced-alignment mode. The .lbl files contain the results of the forced alignment: word and phone alignments. Sometimes, due to noise simultaneous with the speech, forced alignment failed; when this happened, no corresponding .lbl file was produced. This occurred for 57 speech files in the corpus. (A complete listing of the data file inventory, indicating which file types are present for each speaker/utterance combination, can be found in the "tables" directory, in the file named "invntory.tbl".)

During quality control at the end of the project, we decided to eliminate inordinately long silences at the ends of files. We never eliminated silences at the beginning of a file or in the middle of speech, and we left 500 ms of silence after the end of the speech signal in every file we truncated. These long silences were in part due to low recording levels or crosstalk. Since we eliminated the ends of these wavefiles after they had gone through forced alignment, the content at the end of the forced-alignment files for these utterances will not correspond to the end of the wavefile. The quality of the forced alignment is in no way affected (the alignment from the beginning of the file is still valid).

The user may find the .pnt and .trn files of great interest for automatically finding subsets of the data containing reading errors, hesitations, false starts, repetitions, and dialectal variants of given words.

For further information about this database, please contact:

  Dr. Maxine Eskenazi (max@cs.cmu.edu)
    - recording, labeling, database structure, final data processing
  Dr. Jack Mostow (mostow@cs.cmu.edu)
    - forced alignment, automatic generation of bin1 .trn files,
      children and reading


ACKNOWLEDGEMENTS

The CMU LISTEN Project was funded by NSF Grant No. IRI-9528984. CD-ROM publication of the corpus was made possible by the Linguistic Data Consortium at the University of Pennsylvania. David Graff at the LDC provided assistance in adapting the data files and organizing the corpus for publication; he also created or regenerated the various table files to be consistent with the published form of the corpus, and contributed to the documentation.

Special reprint permission granted by Weekly Reader (R),
published by Weekly Reader Corporation.
Copyright (c) 1994, 1995 by Weekly Reader Corporation.
All Rights Reserved.