Description of the contents of the "tables" directory
-----------------------------------------------------

The "tables" directory contains the following files:

  alignmnt.dic : pronouncing dictionary used in time-alignment
  invntory.tbl : list of utterance ids and the files present for each
  point.tbl    : list of point file contents
  sentence.tbl : list of sentences used as prompting texts
  speaker.tbl  : list of speakers and their attributes
  transcrp.tbl : list of transcripts from all utterances
  wordfreq.tbl : list of words and their frequencies in the corpus

The content and format of each file are explained below.


ALIGNMNT.DIC
------------

As explained in the file "doc/alignmnt.doc", this file contains the
pronunciation lexicon that was used in the forced-recognition time
alignment to establish the boundaries of words and phonetic segments
in the utterances.  This lexicon was adapted from the CMU SPHINX
pronunciation lexicon (CMUDICT.v02) by Jack Mostow, in order to
include entries that were needed specifically for this corpus.

The file contains one lexical entry per line, and each entry consists
of two fields separated by a tab character.  The first field is the
orthographic form of the word; in most entries (those that derive
from the original CMUDICT file) this field is padded on the right
with space characters to give it a constant width.  The second field,
following the tab character, is the phonetic reference, which consists
of one or more phonetic segments separated by single space characters.
Each phonetic segment is a token of one, two or three upper-case
letters.

The file "doc/alignmnt.doc" lists the phonetic segments that are used,
and explains the meaning and purpose of special notations in the
orthographic field, as well as the rationale for the entries that are
particular to this corpus.

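The entry format described above can be read with a few lines of
Python.  The following is only a minimal sketch, not a tool shipped
with the corpus: the function name "parse_lexicon_line" and the sample
entry are hypothetical, and the strict segment check simply encodes the
"one to three upper-case letters" rule stated above.

```python
import re

# Assumption: every segment is one to three upper-case ASCII letters,
# as stated in the description of the second field.
SEGMENT_RE = re.compile(r"[A-Z]{1,3}")


def parse_lexicon_line(line):
    """Split one lexicon line into (word, list_of_segments)."""
    word_field, sep, phones = line.rstrip("\r\n").partition("\t")
    if not sep:
        raise ValueError("no tab separator in entry: %r" % line)
    # Drop the right-hand space padding found in CMUDICT-derived entries.
    word = word_field.rstrip(" ")
    segments = phones.strip().split(" ")
    for seg in segments:
        if not SEGMENT_RE.fullmatch(seg):
            raise ValueError("unexpected segment %r in: %r" % (seg, line))
    return word, segments


# Hypothetical entry for illustration; it is not copied from the file.
sample = "ABOUT       \tAX B AW T\n"
print(parse_lexicon_line(sample))
```

Rejecting malformed segments outright is a design choice; a more
lenient reader could collect such lines for inspection instead.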
For further information about CMUDICT, please refer to the CMU Web
page:

  http://www.speech.cs.cmu.edu/cgi-bin/cmudict


INVNTORY.TBL
------------

For each utterance presented in this publication of the KIDS Corpus,
there is, at minimum, a signal file and a transcript file; for all but
57 of the published utterances, there is also a label file.
Utterances that were placed in the "bin2" set (because the speaker
diverged from the intended phonetic content of the prompt text) also
have a corresponding point file.

The inventory table lists all utterance identifiers (each consisting
of speaker-id, sentence-id and bin number), and for each utterance-id
it lists the file types that are present.  Each line is a set of
space-separated tokens: the utterance-id comes first, followed by the
file name extensions that are present.  For example:

  fabm2aa1 sph trn lbl
  fabm2ab2 sph trn lbl pnt

There are always between 3 and 5 tokens per line (i.e. the
utterance-id plus between 2 and 4 extension strings).  The "sph" and
"trn" entries always come first, in that order, after the
utterance-id; if both "lbl" and "pnt" are present, they always follow
in that order.

Note that of the 57 utterances that lack a label file, 50 do have a
point file; these 50 entries in the table have "pnt" in the fourth
column, immediately after "trn".  Also, there are six utterances that
are listed as being in "bin2" but lack a point file.


POINT.TBL
---------

For each of the 4338 utterances that have "point" files (*.pnt) in
this corpus, this table provides a single line containing the file
name, followed by a tab character, followed by the content of the
point file.  The point file content consists of space-separated
tokens, of which the first two are numeric values and the remainder
are annotator remarks about the discrepancies between the prompting
text and the utterance.

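The line formats of invntory.tbl and point.tbl described above could
be read along the following lines.  This is a sketch under stated
assumptions: the function names are hypothetical, the two leading
numbers of a point entry are converted with float() because their type
and meaning are not specified here, and the point.tbl sample line is
invented for illustration.

```python
# Extension patterns permitted after "sph trn", per the ordering rules
# given for invntory.tbl.
ALLOWED_TAILS = ([], ["lbl"], ["pnt"], ["lbl", "pnt"])


def parse_inventory_line(line):
    """Return (utterance_id, extensions) for one invntory.tbl line."""
    tokens = line.split()
    if not 3 <= len(tokens) <= 5:
        raise ValueError("expected 3 to 5 tokens: %r" % line)
    utt_id, exts = tokens[0], tokens[1:]
    if exts[:2] != ["sph", "trn"] or exts[2:] not in ALLOWED_TAILS:
        raise ValueError("unexpected extension list: %r" % line)
    return utt_id, exts


def parse_point_line(line):
    """Return (file_name, first, second, remarks) for a point.tbl line."""
    name, sep, content = line.rstrip("\r\n").partition("\t")
    tokens = content.split()
    if not sep or len(tokens) < 2:
        raise ValueError("malformed point entry: %r" % line)
    # Assumption: both leading tokens parse as numbers.
    first, second = float(tokens[0]), float(tokens[1])
    return name, first, second, " ".join(tokens[2:])


print(parse_inventory_line("fabm2ab2 sph trn lbl pnt"))
# Hypothetical point.tbl line; the values and remark are made up.
print(parse_point_line("fabm2ab2\t1 2 example annotator remark"))
```

An utterance with a point file can then be detected by checking
whether "pnt" appears in the returned extension list.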

SENTENCE.TBL
------------

Prompting texts for every sentence present in the utterance set are
provided in this table, together with the sentence-id and the number
of speakers for whom the sentence is present in the corpus.  The table
presents each sentence on one line, preceded by the sentence-id and
the number of utterance files present, in that order.  The first two
elements of each line are delimited by tab characters, and the
sentence text itself is a space-separated string of words.  For
example, the first four lines of the table appear as follows:

  1aa   43   Storms in the spring can bring lightning
  1ab   42   Lightning moves through the air
  1ac   40   It sometimes hits the ground
  1ad   43   Lightning can hurt people

So, the sentence identified as "1aa" in all utterance file names was
present in the recordings of 43 speakers, and so on.  The sentences
range in length from 10 to 122 characters.


SPEAKER.TBL
-----------

For each of the speakers in the corpus, this table provides the
speaker-id, where the speaker was recorded (SUM95 or FP; refer to
kids.doc), the grade and age of the speaker, the total number of
sentences present in the corpus for the speaker, and the number of
sentences categorized into the "bin2" set.  The information is
presented in that order, as tab-separated text fields.  For example:

  # 76 speakers
  # ID   LOC     GR/AGE  TOT   BIN2
  fabm   SUM95   3/9     100   62
  facs   SUM95   2/8     90    55

Note that the first two lines of the file are "comments", as flagged
by the initial "#" character.


TRANSCRP.TBL
------------

This table provides a full collection, in one file, of all 5180
transcripts in the corpus.  Each line consists of the utterance-id
(that is, the eight-character file name for the utterance), followed
by a tab character and then the content of the corresponding ".trn"
file.  Some of the transcripts run to as many as 565 characters, so
the corresponding lines of this table are long as well.

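A line of transcrp.tbl, as described above, can be split into its
utterance-id parts and transcript text as sketched below.  The
4 + 3 + 1 character split of the utterance-id is an assumption
inferred from the examples in this document ("fabm" as a speaker-id,
"2aa" as a sentence-id, and a final bin digit); the names "Utterance"
and "parse_transcript_line" and the sample transcript text are
hypothetical.

```python
from collections import namedtuple

Utterance = namedtuple("Utterance", "speaker_id sentence_id bin transcript")


def parse_transcript_line(line):
    """Split one transcrp.tbl line into an Utterance record."""
    utt_id, sep, text = line.rstrip("\r\n").partition("\t")
    if not sep or len(utt_id) != 8:
        raise ValueError("malformed transcript entry: %r" % line)
    # Assumed layout: 4-char speaker-id, 3-char sentence-id, 1-char bin.
    return Utterance(utt_id[:4], utt_id[4:7], utt_id[7], text)


# Hypothetical line; the transcript text is invented for illustration.
record = parse_transcript_line("fabm2aa1\texample transcript text\n")
print(record.speaker_id, record.sentence_id, record.bin)
```

Because only the first tab is treated as a separator, any tab
characters inside a long transcript are left in the text field.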

WORDFREQ.TBL
------------

This table provides a sorted (all upper-case) listing of the words
present in the prompting texts and label files, one word per line.
Preceding each word are two numbers, separated by tab characters.  The
first number indicates how many times the word appears in the full set
of prompting texts (i.e. the "text frequency" of the word in the
prompting materials).  The second number indicates how many times the
word occurs in the 5123 label files in the corpus (i.e. the number of
times the word was successfully read by the subjects, excluding the
57 utterances for which no label file exists).

The table does not indicate the "expected" number of occurrences in
the utterances -- that is, how many times each word ought to have
appeared in the label and transcript files if the subjects had always
read the word correctly.  This can be determined for each word by
searching for the word in the "sentence.tbl" file and summing the
values of the second column of that table for each occurrence of the
word.

Here is a sample of the wordfreq.tbl file:

  76    1059   A
  3     5      ABLE
  11    56     ABOUT
  1     2      ACHIEVE
  1     16     ACROSS
  1     2      ADD
  0     1      ADDING

Entries in this table that have a zero in the first column are cases
in which the reader produced words that were not in the original
prompting text.  Entries with a zero in the second column are words
that were never read successfully by any subject.
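
The "expected" count described above can be computed from sentence.tbl
roughly as follows.  This is a sketch, not the procedure used to build
wordfreq.tbl: the function name "expected_counts" is hypothetical, and
the tokenization (runs of letters and apostrophes, folded to upper
case) is an assumption that may not match how the words in
wordfreq.tbl were delimited.

```python
import re
from collections import Counter

# Assumption: a word is a run of letters and apostrophes.
WORD_RE = re.compile(r"[A-Z']+")


def expected_counts(sentence_lines):
    """Sum the per-sentence utterance counts for every word occurrence."""
    expected = Counter()
    for line in sentence_lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        # Fields: sentence-id, number of utterance files, sentence text.
        _sent_id, n_utts, text = line.split("\t", 2)
        for word in WORD_RE.findall(text.upper()):
            expected[word] += int(n_utts)
    return expected


# The first four lines of sentence.tbl, as quoted in this document.
sample = [
    "1aa\t43\tStorms in the spring can bring lightning",
    "1ab\t42\tLightning moves through the air",
    "1ac\t40\tIt sometimes hits the ground",
    "1ad\t43\tLightning can hurt people",
]
counts = expected_counts(sample)
print(counts["LIGHTNING"])  # 43 + 42 + 43 = 128
print(counts["THE"])        # 43 + 42 + 40 = 125
```

Comparing these totals with the second column of wordfreq.tbl gives a
rough indication of how often each word was not read successfully,
bearing in mind that the 57 utterances without label files are not
counted there.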