Description of the contents of the "tables" directory
-----------------------------------------------------

The "tables" directory contains the following files:

  alignmnt.dic : pronouncing dictionary used in time-alignment
  invntory.tbl : list of utterance ids and files present for each
  point.tbl    : list of point file contents
  sentence.tbl : list of sentences used as prompting texts
  speaker.tbl  : list of speakers and their attributes
  transcrp.tbl : list of transcripts from all utterances
  wordfreq.tbl : list of words and their frequencies in the corpus

The content and format of each file are explained below.


ALIGNMNT.DIC
------------

As explained in the file "doc/alignmnt.doc", this file contains the
pronunciation lexicon that was used in the forced-recognition time
alignment to establish the boundaries of words and phonetic segments
in the utterances.  This lexicon was adapted from the CMU SPHINX
pronunciation lexicon (CMUDICT.v02) by Jack Mostow, in order to
include entries that were needed specifically for this corpus.

The file contains one lexical entry per line, and each entry consists
of two fields separated by a tab character.  The first field is the
orthographic form of the word, and in most entries (those which derive
from the original CMUDICT file) this field is padded to the right with
space characters to normalize the field to a constant width.  The
second field, following the tab character, is the phonetic reference,
which consists of a series of one or more phonetic segments, separated
by single space characters.  Each phonetic segment is a token
consisting of one, two or three upper-case letters.  The file
"doc/alignmnt.doc" lists the phonetic segments that are used, and
explains the meaning and purpose of special notations in the
orthographic field, as well as the rationale for the entries that are
particular to this corpus.  For further information about CMUDICT,
please refer to the CMU Web page:

	http://www.speech.cs.cmu.edu/cgi-bin/cmudict
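
The entry format described above can be read with a few lines of
Python.  The following is only a minimal sketch, not a tool shipped
with the corpus: the function name "parse_lexicon_line" and the sample
entry are hypothetical, and the check on segment shape assumes that
every segment is one to three upper-case ASCII letters, as stated
above.

```python
import re

# Assumption: a phonetic segment is one to three upper-case ASCII letters.
SEGMENT = re.compile(r"[A-Z]{1,3}")


def parse_lexicon_line(line):
    """Split one alignmnt.dic line into (word, [segments])."""
    # The two fields are separated by a single tab character.
    word_field, phone_field = line.rstrip("\n").split("\t", 1)
    # Entries derived from CMUDICT pad the word with spaces on the right.
    word = word_field.rstrip(" ")
    # Segments are separated by spaces.
    segments = phone_field.split()
    bad = [s for s in segments if not SEGMENT.fullmatch(s)]
    if bad:
        raise ValueError(f"unexpected segment(s) {bad} for word {word!r}")
    return word, segments


# Hypothetical entry, padded the way CMUDICT-derived entries are described.
print(parse_lexicon_line("ABOUT           \tAH B AW T\n"))
# -> ('ABOUT', ['AH', 'B', 'AW', 'T'])
```

Special notations in the orthographic field are passed through
unchanged by this sketch; "doc/alignmnt.doc" explains how to interpret
them.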


INVNTORY.TBL
------------

For each utterance presented in this publication of the KIDS Corpus,
there is, at minimum, a signal file and a transcript file; for all but
57 of the published utterances, there is also a label file.
Utterances that were cast into the "bin2" set (because the speaker
diverged from the intended phonetic content of the prompt text) also
have a corresponding point file.  The inventory table lists all
utterance identifiers (consisting of speaker-id, sentence-id and bin
number), and for each utterance-id, it gives a listing of the file
types that are present.  The listing is given as a set of
space-separated tokens, with the utterance-id first on each line,
followed by the file name extensions that are present.  For example:

fabm2aa1 sph trn lbl
fabm2ab2 sph trn lbl pnt

There will always be between 3 and 5 tokens per line (i.e.
utterance-id plus between 2 and 4 extension strings); the "sph" and
"trn" entries always come first, in that order, after the
utterance-id; if both "lbl" and "pnt" are present, they always follow
in that order.  Note that of the 57 utterances that lack a label file,
50 do have a point file; these 50 entries in the table have "pnt" in
the fourth column, immediately after "trn".  Also, six utterances are
listed as being in "bin2" but lack a point file.
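
The ordering rules above leave only four possible extension sequences,
which makes each line easy to validate.  The sketch below is
illustrative only: the function name "parse_inventory_line" is
hypothetical, and the split of the utterance-id into a four-character
speaker-id, a three-character sentence-id and a one-digit bin number
is an assumption inferred from the examples in this file.

```python
# Extension sequences permitted by the ordering rules described above.
ALLOWED = {
    ("sph", "trn"),
    ("sph", "trn", "lbl"),
    ("sph", "trn", "pnt"),
    ("sph", "trn", "lbl", "pnt"),
}


def parse_inventory_line(line):
    """Parse one invntory.tbl line into a dict describing the utterance."""
    tokens = line.split()
    utt_id, exts = tokens[0], tuple(tokens[1:])
    if exts not in ALLOWED:
        raise ValueError(f"unexpected extension list for {utt_id}: {exts}")
    # Assumed id layout: 4-char speaker-id, 3-char sentence-id, 1-digit bin.
    return {
        "utt_id": utt_id,
        "speaker": utt_id[:4],
        "sentence": utt_id[4:7],
        "bin": utt_id[7:],
        "extensions": list(exts),
    }


entry = parse_inventory_line("fabm2ab2 sph trn lbl pnt")
print(entry["speaker"], entry["sentence"], entry["bin"])
# -> fabm 2ab 2
```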


POINT.TBL
---------

For each of the 4338 utterances that have "point" files (*.pnt) in
this corpus, this table provides a single line containing the file
name followed by a tab character followed by the content of the point
file.  The point file content consists of space-separated tokens, of
which the first two are numeric values and the remainder are annotator
remarks about the discrepancies between the prompting text and the
utterance.
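
A line of this table can be split as sketched below.  This is a
minimal illustration under stated assumptions: the function name
"parse_point_line" and the sample line are invented for the example,
and the two numeric tokens are converted with float() because this
file does not say whether they are integers or decimals.

```python
def parse_point_line(line):
    """Split one point.tbl line into (file name, (num1, num2), remarks)."""
    # The file name is separated from the point file content by a tab.
    name, content = line.rstrip("\n").split("\t", 1)
    tokens = content.split()
    # Assumption: float() accepts both numeric tokens, whatever their form.
    numbers = (float(tokens[0]), float(tokens[1]))
    # Everything after the two numbers is the annotator's remark.
    remarks = " ".join(tokens[2:])
    return name, numbers, remarks


# Invented sample line; real remarks and values will differ.
print(parse_point_line("fabm2ab2\t1 2 skipped a word\n"))
# -> ('fabm2ab2', (1.0, 2.0), 'skipped a word')
```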


SENTENCE.TBL
------------

Prompting texts for every sentence present in the utterance set are
provided in this table, together with the sentence-id and the number
of speakers for whom the sentence is present in the corpus.  The table
presents each sentence on one line, preceded by the sentence-id
and the number of utterance files present, in that order.  The
sentence-id and the count are each followed by a tab character, and
the sentence text itself is a space-separated string of words.  For
example, the first four lines of the table appear as follows:

1aa     43      Storms in the spring can bring lightning
1ab     42      Lightning moves through the air
1ac     40      It sometimes hits the ground
1ad     43      Lightning can hurt people

So, the sentence identified as "1aa" in all utterance file names was
present in the recordings of 43 speakers, and so on.  The sentences
range in length between 10 and 122 characters.
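
Reading this table takes one split per line.  The following sketch
uses a hypothetical function name, "parse_sentence_line", and splits
the first two fields on any run of whitespace so that it works whether
or not the tabs have been expanded to spaces.

```python
def parse_sentence_line(line):
    """Split one sentence.tbl line into (sentence-id, count, text)."""
    # Split off the first two fields only; the rest is the sentence text.
    sentence_id, count, text = line.strip().split(None, 2)
    return sentence_id, int(count), text


print(parse_sentence_line("1aa\t43\tStorms in the spring can bring lightning\n"))
# -> ('1aa', 43, 'Storms in the spring can bring lightning')
```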


SPEAKER.TBL
-----------

For each of the speakers in the corpus, this table provides the
speaker-id, where the speaker was recorded (SUM95 or FP; refer to
kids.doc), the grade and age of the speaker, the total number of
sentences present in the corpus for the speaker, and the number of
sentences categorized into the "bin2" set.  The information is
presented in that order, as tab-separated text fields.  For example:

# 76 speakers
# ID    LOC     GR/AGE  TOT     BIN2
fabm    SUM95   3/9     100     62
facs    SUM95   2/8     90      55

Note that the first two lines of the file are "comments", as flagged
by the initial "#" character.
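
A reader for this table needs to skip the comment lines and split the
combined grade/age field.  The sketch below is one possible approach,
not a tool supplied with the corpus: the function name
"parse_speaker_table" and the dictionary keys are hypothetical, and
grade and age are kept as strings because this file does not state
that they are always numeric.

```python
def parse_speaker_table(lines):
    """Return one dict per speaker, skipping "#" comment lines."""
    speakers = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Fields appear in the order ID, LOC, GR/AGE, TOT, BIN2.
        speaker_id, location, grade_age, total, bin2 = line.split()
        grade, age = grade_age.split("/")
        speakers.append({
            "id": speaker_id,
            "location": location,
            "grade": grade,   # kept as a string (assumption, see above)
            "age": age,       # kept as a string (assumption, see above)
            "total": int(total),
            "bin2": int(bin2),
        })
    return speakers


sample = [
    "# 76 speakers",
    "# ID\tLOC\tGR/AGE\tTOT\tBIN2",
    "fabm\tSUM95\t3/9\t100\t62",
    "facs\tSUM95\t2/8\t90\t55",
]
for spk in parse_speaker_table(sample):
    print(spk["id"], spk["grade"], spk["age"], spk["total"], spk["bin2"])
# -> fabm 3 9 100 62
# -> facs 2 8 90 55
```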


TRANSCRP.TBL
------------

This table simply provides a full collection, in one file, of all 5180
transcripts in the corpus.  Each line consists of the utterance-id
(that is, the eight-character file name for the utterance), followed
by a tab character and then the content of the corresponding ".trn"
file.  Some of the transcripts are quite long, ranging up to 565
characters in length, so the corresponding lines of this table are
quite long as well.


WORDFREQ.TBL
------------

This table provides an alphabetically sorted, all upper-case listing
of the words present in the prompting texts and label files, one word
per line.
Preceding each word are two numbers, separated by tab characters; the
first number indicates how many times that word appears in the full
set of prompting texts (i.e. the "text frequency" of the word in the
prompting materials); the second number indicates how many times that
word occurs in the 5123 label files in the corpus (i.e. the number of
times that word was successfully read by the subjects, but excluding
the 57 utterances for which no label file exists).

The table does not indicate the "expected" number of occurrences in
the utterances -- that is, how many times each word ought to have
appeared in the label and transcript files if the subjects had always
read the word correctly; but this can easily be determined for each
word by searching for the word in the "sentence.tbl" file and summing
the values of the second column of that table for each occurrence of
the word.  Here is a sample of the wordfreq.tbl file:

76      1059    A
3       5       ABLE
11      56      ABOUT
1       2       ACHIEVE
1       16      ACROSS
1       2       ADD
0       1       ADDING

Entries in this table that have a zero in the first column are cases
in which the reader produced words that were not in the original
prompting text.  Entries with zeros in the second column are cases of
words that were never read successfully by any subject.
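
The "expected" number of occurrences described above can be computed
from "sentence.tbl" as sketched below.  This is only an illustration:
the function name "expected_counts" is hypothetical, and the
tokenization (runs of letters and apostrophes, folded to upper case)
is an assumption about how words in the prompting texts correspond to
the entries of this table.

```python
import re
from collections import Counter


def expected_counts(sentence_lines):
    """Sum the utterance count of every sentence in which a word occurs."""
    expected = Counter()
    for line in sentence_lines:
        _sentence_id, count, text = line.strip().split(None, 2)
        # Assumed tokenization: letters and apostrophes, upper-cased.
        for word in re.findall(r"[A-Z']+", text.upper()):
            # Each occurrence of the word adds the sentence's count.
            expected[word] += int(count)
    return expected


# The first four lines of sentence.tbl, as quoted in this file.
sample = [
    "1aa\t43\tStorms in the spring can bring lightning",
    "1ab\t42\tLightning moves through the air",
    "1ac\t40\tIt sometimes hits the ground",
    "1ad\t43\tLightning can hurt people",
]
counts = expected_counts(sample)
print(counts["LIGHTNING"], counts["CAN"], counts["THE"])
# -> 128 86 125
```

Comparing these totals with the second column of wordfreq.tbl gives a
rough picture of how often each word was read as prompted, bearing in
mind that the second column excludes the 57 utterances without label
files.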