My Science Tutor (MyST) Children's Conversational Speech
========================================================

Boulder Learning, Inc.                           (v1.0)


Corpus Overview
===============

The Children's Speech Corpus was created as part of the My Science Tutor
(MyST) project. We will refer to it as the MyST corpus. It consists of
473 hours of speech collected from 1,371 students in the 3rd, 4th and
5th grades. Students conversed with a virtual science tutor in 8 areas
of science, resulting in a total of 10,496 sessions and a total of
228,874 utterances. 45% of the utterances have been transcribed at the
word level.


---------------------------------------------------
Students     Sessions     Total        Transcribed
                          Utterances   Utterances
---------------------------------------------------

1,371          10,496       227,567       102,433
                          (473 hours)
---------------------------------------------------


Expected Usage
==============

We expect users of the corpus to mine the data and conduct research. The
MyST corpus is ideally suited for training recognizers and classifiers
and for evaluating speech recognition performance on held-out evaluation
data. It is our hope that researchers will publish their results. The MyST
corpus is an excellent resource for evaluating recognition of children's
speech and for monitoring advances in new approaches to speech recognition,
as it contains about an order of magnitude more transcribed speech data
than all other currently available children's speech corpora combined.

Practitioner -- Application Developer
-------------------------------------

Application developers can use the corpus to train recognizers for a
variety of applications. The corpus was collected in educational
settings at 16 kHz, with students using noise-cancelling microphones.
Below we list a few applications that could benefit from this corpus:

Researcher -- Speech Recognition Researcher
-------------------------------------------

The corpus enables researchers to conduct research to improve automatic
speech recognition (ASR) technology. Our hope is that the 
MyST corpus will stimulate research and enable researchers to
compare recognition results after training on a large corpus of speech
data using the same evaluation data sets. To facilitate this research,
we have partitioned the data into separate categories as elaborated in
the following subsection.

Data Partitioning for ASR Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the convenience of the ASR community, we partitioned and structured
the corpus into training, development and test sets. The partitions
were generated ensuring that they reasonably represent speech data
gathered across the 8 science modules students talked about, and that
each student's data is in only one of the three partitions. These three
data sets are in three separate directories in the corpus release.  (See
Corpus Structure, below.)
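
The speaker-disjointness property described above can be checked on a
local copy of the release. The following is a minimal sketch, assuming the
data/<partition>/<student_id> layout shown under Corpus Structure; the
function names are illustrative and not part of the release.

```python
from pathlib import Path

PARTITIONS = ("train", "development", "test")


def students_by_partition(data_dir):
    """Map each partition name to the set of student-ID directory names in it."""
    root = Path(data_dir)
    result = {}
    for name in PARTITIONS:
        part_dir = root / name
        if part_dir.is_dir():
            result[name] = {p.name for p in part_dir.iterdir() if p.is_dir()}
        else:
            result[name] = set()  # partition missing from this local copy
    return result


def overlapping_students(by_partition):
    """Return the student IDs that appear in more than one partition."""
    first_seen = {}
    overlaps = set()
    for name, students in by_partition.items():
        for student_id in students:
            if student_id in first_seen and first_seen[student_id] != name:
                overlaps.add(student_id)
            first_seen.setdefault(student_id, name)
    return overlaps
```

An empty set returned by overlapping_students(students_by_partition("myst_child_conv_speech/data"))
would indicate that no student appears in more than one partition.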

Data Provenance
---------------

The University of Colorado's Institutional Review Board and an
independent agent, WestEd, appointed by the IES, reviewed and approved
all components of the My Science Tutor project to assure student
privacy.

Consent and Assent
~~~~~~~~~~~~~~~~~~

The review board approved the Parental Consent forms and the Student
Assent forms. For every utterance in the corpus, these forms were signed
by the student's parent or guardian, and by the student. The final
Parental Consent and
Student Assent forms approved by the IRB explicitly provide permission
for anonymous student speech data and transcriptions to be distributed
for both research and commercial use. We manually verified that we had
parental consent and student assent for every student in the corpus.


Data Collection
===============

The following section describes the process that was used to collect
this data.

Methodology
-----------

The MyST corpus was collected in 2 stages--Phase I and Phase II--over the
years 2008-2017. In both phases, spoken dialogs with the virtual tutor
were aligned to classroom instruction using the Full Option Science
System (FOSS). The 8 FOSS science modules consisted of an average
of 16 small-group classroom science investigations. Following the
investigations, students conversed with the virtual science tutor for 15
to 20 minutes. The tutor asked open-ended questions about media
presented on-screen, and students produced spoken answers. For example,
the FOSS Magnetism and Electricity module included 4 classroom
investigations (and 4 discussions with the virtual tutor). The speech
data collected during these conversations comprise the MyST Corpus (Ward
et al., 2011, 2013; Pradhan et al., 2016).

The MyST conversations were strictly turn-taking; the tutor presented
information, asked a question and waited for the student to respond. To
respond, the student pressed the spacebar on the laptop, held it down
while speaking, and released it when done. Each student turn was
recorded as a separate audio file. When transcribed, an utterance-level
transcript file was created for each audio file. No identifying
information was stored with the data except for anonymized IDs of
schools and students. All students and their parents signed consent
forms allowing Boulder Learning Inc. to enter and distribute their
anonymous speech data.

Descriptive Statistics
----------------------

Some characteristics of the data collected in the two phases are
described below.

Phase I
~~~~~~~

The Phase I corpus contains sessions from students in grades 3-5. All of
the sessions from this phase have been transcribed. The following
modules were included in this phase.

1. ME - Magnetism and Electricity

2. MS - Mixtures and Solutions

3. VB - Variables

4. WA - Water

     Number of Students:  421
     Number of Sessions: 1509 (109 hours)
   Transcribed Sessions: 1509 (109 hours)
 Untranscribed Sessions: -

During this phase, there was no attempt to have any individual student
cover all of the parts for a module. The focus of the collection during
this phase was to get a wide variety of students rather than try to get
complete coverage of material for individual students.

Phase II
~~~~~~~~

The Phase II corpus contains sessions from students in grades 4-5. It
included the following 5 modules, with an average of 10 parts each:

1. EE - Energy and Electromagnetism

2. MX - Mixtures

3. SMP - Sun, Moon and Planets

4. SRL - Soil, Rocks and Landforms

5. LS - Living Systems

     Number of Students:   950
     Number of Sessions: 8,987 (364 hours)
   Transcribed Sessions: 2,063 (115 hours)
 Untranscribed Sessions: 6,924 (249 hours)

In this collection, teachers were asked to have students complete all
parts for 2 modules. However, many teachers did not want to cover 2
modules, and whatever data was collected was kept, even if students
didn't complete the sequence.


Transcription Guidelines
========================

During Phase I of the project we used rich (slow, expensive)
transcription guidelines--the ones typically used by speech recognition
researchers. However, we realized that for the purposes of this project,
we did not need to get that level of richness in the transcriptions, and
therefore during Phase II, we decided to use a reduced (quick, cheaper)
version of those guidelines which allowed us to transcribe more data. We
have included the guidelines used for manual transcription in the
release documentation.


Corpus Structure
================

The directory structure for the corpora is as shown in the figure below.
Variables are enclosed in angle-brackets (<variable>) and can take
values as described immediately after.


myst_child_conv_speech/
├── docs
│   ├── BLI-pronunciation-lexicon-v0.0.10-061470a.dict
│   ├── BLI-speech-transcription-guidelines.v0.1.6.pdf
│   ├── checksums
│   │   ├── ffps.txt
│   │   └── md5sums.txt
│   └── MyST-corpus-README.txt
...
...
├── data
│   ├── <partition>
│   │   ├── <student_id>
│   │   │   ├── <session_id>
│   │   │   │   ├── <corpus>_<student_id>_<date>_<time>_<module>_<investigation>_<part>.<file-extension>
...
...

Where,

<partition> is one of train, development or test.

<student_id> is a 6-digit ID with the first 3 digits representing
the school code and the next 3 digits the student number.

<session_id> is the ID for a particular session and is further
represented as
<corpus>_<student_id>_<date>_<time>_<module>_<investigation>

<date> is represented as <YYYY>-<MM>-<DD>

<time> is represented as <hh>-<mm>-<ss>, where <hh> represents the
hour, <mm> the minute, and <ss> the second. In Phase I, we did not
capture hour/minute/second for each session, so the corresponding
fields for sessions in Phase I are set to 00.

<module> is a two- or three-character string enumerated in the Phase
sections above.

<investigation> is a decimal number representing the respective
investigation for a module.

<part> is the utterance ID within a session. Numbers 001 onward
represent the index of each utterance in a session.

<file-extension> is one of the following:

.flac  - The audio file, originally a .wav compressed using FLAC.
         Each file represents a single utterance.
.trn   - Transcription of the corresponding audio file.


Below is a small snapshot with actual values filled in for these slots,
showing a part of each of the three partitions (train, development and test).

myst_child_conv_speech/
├── docs
│   ├── BLI-pronunciation-lexicon-v0.0.10-061470a.dict
│   ├── BLI-speech-transcription-guidelines.v0.1.6.pdf
│   ├── checksums
│   │   ├── ffps.txt
│   │   └── md5sums.txt
│   └── MyST-corpus-README.txt
...
...
├── data
│   ├── train
│   │   ├── 001082
│   │   │   ├── myst_001082_2014-03-12_07-53-58_MX_1.1
│   │   │   │   ├── myst_001082_2014-03-12_07-53-58_MX_1.1_001.flac
│   │   │   │   ├── myst_001082_2014-03-12_07-53-58_MX_1.1_001.trn
...
...
│   ├── development
│   │   ├── 004029
│   │   │   ├── myst_004029_2013-11-18_09-03-05_EE_1.1
│   │   │   │   ├── myst_004029_2013-11-18_09-03-05_EE_1.1_001.flac
│   │   │   │   └── myst_004029_2013-11-18_09-03-05_EE_1.1_001.trn
...
...
│   └── test
│       ├── 002116
│       │   ├── myst_002116_2014-02-27_09-29-01_LS_1.1
│       │   │   ├── myst_002116_2014-02-27_09-29-01_LS_1.1_001.flac
│       │   │   ├── myst_002116_2014-02-27_09-29-01_LS_1.1_001.trn
...
...
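
The naming scheme above lends itself to mechanical parsing. The sketch
below splits an utterance file name into the documented fields; the regular
expression and the function name are illustrative and are inferred from the
field descriptions and the snapshot above, not taken from any tool shipped
with the corpus. It assumes module codes are 2 or 3 uppercase letters, as in
the Phase I and Phase II lists.

```python
import re
from datetime import datetime

# Pattern inferred from the documented layout:
# <corpus>_<student_id>_<date>_<time>_<module>_<investigation>_<part>.<ext>
UTTERANCE_NAME = re.compile(
    r"^(?P<corpus>[^_]+)_(?P<student_id>\d{6})_"
    r"(?P<date>\d{4}-\d{2}-\d{2})_(?P<time>\d{2}-\d{2}-\d{2})_"
    r"(?P<module>[A-Z]{2,3})_(?P<investigation>\d+(?:\.\d+)?)_"
    r"(?P<part>\d+)\.(?P<extension>flac|trn)$"
)


def parse_utterance_name(file_name):
    """Split an utterance file name into its documented fields."""
    match = UTTERANCE_NAME.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    fields = match.groupdict()
    # The first 3 digits are the school code, the next 3 the student number.
    fields["school_code"] = fields["student_id"][:3]
    fields["student_number"] = fields["student_id"][3:]
    # Phase I sessions carry 00-00-00 as the time, which still parses.
    fields["timestamp"] = datetime.strptime(
        f"{fields['date']} {fields['time']}", "%Y-%m-%d %H-%M-%S"
    )
    # The session directory name is the file name without _<part>.<ext>.
    fields["session_id"] = file_name.rsplit("_", 1)[0]
    return fields
```

For example, parse_utterance_name("myst_001082_2014-03-12_07-53-58_MX_1.1_001.flac")
yields module "MX", investigation "1.1" and part "001".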


Data Format
===========

The audio files in the release are compressed using the FLAC audio codec
and have the extension .flac. The original audio files were
collected as .wav files and the audio signal was captured in a
format that is a standard across most ASR corpora. The main
characteristics of the audio signal are as listed below:

      Channels : 1
   Sample Rate : 16000
     Precision : 16-bit
Sample Encoding: 16-bit Signed Integer PCM

Even though the transcripts do not contain any non-ASCII characters, we
have converted the text transcripts to use the UTF-8 character encoding
standard as it is the standard for text data stored on disk.
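
The audio properties listed above can be checked without decoding the
audio, because every FLAC file records them in its STREAMINFO metadata
block. The following is a minimal sketch using only the Python standard
library; the function names are illustrative, and it assumes the file starts
with the "fLaC" marker followed directly by STREAMINFO, as the FLAC format
specifies (files with a prepended tag block would need extra handling).

```python
def parse_flac_streaminfo(data):
    """Decode the STREAMINFO block from the first 42 bytes of a FLAC file."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    block_type = data[4] & 0x7F            # low 7 bits; 0 means STREAMINFO
    length = int.from_bytes(data[5:8], "big")
    body = data[8:42]
    if block_type != 0 or length < 34 or len(body) < 34:
        raise ValueError("STREAMINFO block missing or truncated")
    # Bytes 10..17 pack: sample rate (20 bits), channels-1 (3 bits),
    # bits-per-sample-1 (5 bits), total samples (36 bits).
    packed = int.from_bytes(body[10:18], "big")
    sample_rate = packed >> 44
    total_samples = packed & ((1 << 36) - 1)
    return {
        "sample_rate": sample_rate,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": total_samples,
        "duration_seconds": total_samples / sample_rate if sample_rate else 0.0,
        "audio_md5": body[18:34].hex(),    # MD5 of the unencoded audio data
    }


def read_flac_streaminfo(path):
    """Read and decode STREAMINFO from a FLAC file on disk."""
    with open(path, "rb") as handle:
        return parse_flac_streaminfo(handle.read(42))


def matches_myst_format(info):
    """True if the file is mono, 16 kHz, 16-bit, as listed above."""
    return (info["channels"], info["sample_rate"], info["bits_per_sample"]) == (
        1, 16000, 16)
```

Because STREAMINFO also stores the number of samples, the same call gives
each utterance's duration without decoding it.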


Data Cleanup and Pre-processing
===============================

We did a pass over the corpus to clean up various types of errors that
could be identified using statistics on the underlying audio and
potentially erroneous data collection. Following are the various
criteria, along with a short description for each, that we used during
this process.

Session Quality
---------------

Bad (empty or corrupted) sessions were removed using simple heuristics
and checks for missing data.


Session Length
~~~~~~~~~~~~~~

Sessions that were shorter than a minimum threshold (< 10 minutes
long) or longer than a maximum threshold (> 1 hour long) were
inspected and corrected or removed.


Missing audio files
~~~~~~~~~~~~~~~~~~~

Sessions that were missing audio files for a significant number of
utterances were deleted.

Audio Quality
-------------

All utterances were processed to identify unacceptable recordings,
which were then removed from the database. We performed the
following checks for audio quality.


Clipping Rate
~~~~~~~~~~~~~

If a significant number of frames in an audio file (exceeding a certain
threshold) were clipped, we either removed or marked the file. If
clipping affected more than a certain fraction of the utterances in a
session, we removed those files along with the whole session from the
release. If only a small number of files had a large fraction of clipping,
we tagged them in a report file, so that users can determine whether
to include or exclude that data from their study.
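
For users who want to apply their own clipping criterion to the tagged
files, the sketch below shows one simple way to measure clipping in 16-bit
PCM audio. It is an illustration only: the thresholds used for this release
are not stated here, the measure below is per sample rather than per frame,
and the function name and the "at full scale" definition of clipping are
assumptions.

```python
INT16_MIN = -32768
INT16_MAX = 32767


def clipping_rate(samples):
    """Fraction of 16-bit samples that sit at the minimum or maximum value.

    `samples` is any sequence of decoded signed 16-bit integers. Treating a
    sample at full scale as clipped is an assumption made for this sketch.
    """
    if len(samples) == 0:
        return 0.0
    clipped = sum(1 for s in samples if s <= INT16_MIN or s >= INT16_MAX)
    return clipped / len(samples)
```

A user-chosen cutoff on this rate can then decide whether an utterance is
kept for a given study.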

Silence
~~~~~~~

Sometimes there were significant amounts of leading and trailing silence
in the audio files. We trimmed all such silence. We did not, however,
remove or compress silence that occurred within an utterance.
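
The behavior described above (trim the edges, keep pauses inside the
utterance) can be illustrated with a simple amplitude rule. This is a
sketch under stated assumptions, not the procedure used for the release:
the function name, the per-sample amplitude test and the default threshold
are all hypothetical.

```python
def trim_silence(samples, threshold=500):
    """Drop leading and trailing samples whose magnitude is <= threshold.

    Silence between the first and last loud sample is left untouched.
    The threshold value is an arbitrary placeholder for illustration.
    """
    loud = [i for i, s in enumerate(samples) if abs(s) > threshold]
    if not loud:
        return []  # the whole recording is below the threshold
    return list(samples[loud[0]:loud[-1] + 1])
```

Note that the zeros between the two loud samples in
trim_silence([0, 0, 5, 0, 0, 7, 0], threshold=1) are preserved in the result.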

Background Noise
~~~~~~~~~~~~~~~~

Utterances with a significant amount of noise or cross talk were
removed. This was only possible for the cases that were transcribed or
fell in the fraction of sample utterances that we manually verified.

Transcription Quality
---------------------

We fixed obvious spelling errors in the transcriptions. We tried to
retain explicitly mispronounced words as much as possible.


Additional Metadata
===================

In addition to the data in this release, we have also included relevant
metadata that can be useful to the user.

Checksums
---------

It is good practice to verify that no data was lost or corrupted in the
process of obtaining the corpus. Typically, for text files, this is done
by computing md5sums of the files on the receiving end and comparing them
with the sums distributed with the release. The file md5sums.txt
under the docs/checksums directory in the release contains the
md5sums of all the transcripts. Since the original audio files contain
metadata in addition to the binary (audio) data, md5sums are not the
best option, as the sums can change with superficial changes to the
data or the metadata. For this reason, we used shntool to generate
FLAC fingerprints, which are very similar to md5sums except that they
use only the binary audio data for computing the sums. These are stored
in a similar fashion to the md5sums in a file called ffps.txt, which is a
sibling of the md5sums.txt file.
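
The transcript check described above can be sketched in a few lines of
Python. The layout of md5sums.txt is not specified in this document; the
sketch assumes the usual output format of the md5sum utility (a hex digest,
whitespace, then a path relative to some root directory). The function
names are illustrative.

```python
import hashlib
from pathlib import Path


def parse_md5_listing(text):
    """Parse '<md5-hex>  <path>' lines into a {path: digest} dict.

    Assumes md5sum-style lines; a leading '*' on the path (binary-mode
    marker) is dropped.
    """
    expected = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        digest, path = line.split(None, 1)
        expected[path.lstrip("*")] = digest.lower()
    return expected


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(expected, root):
    """Return listed paths that are missing under root or whose MD5 differs."""
    root = Path(root)
    bad = []
    for rel_path, digest in expected.items():
        target = root / rel_path
        if not target.is_file() or md5_of_file(target) != digest:
            bad.append(rel_path)
    return bad
```

An empty list from find_mismatches means every listed transcript is present
and unchanged. The root directory to pass depends on how the paths in the
listing are written, which should be checked against the file itself.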


Updated Pronunciation Dictionary
================================

We also make available an updated pronunciation dictionary, the .dict
file under the docs directory. We used CMU's pronunciation dictionary
as a starting point and manually added/corrected pronunciations for
words that were new to this corpus.


References
==========

Ward, W., Cole, R., Bolaños, D., Buchenroth-Martin, C., Svirsky, E.,
Van Vuuren, S., Weston, T., Zheng, J., and Becker, L. (2011). My
Science Tutor: A Conversational Multimedia Virtual Tutor for Elementary
School Science. ACM Transactions on Speech and Language Processing (TSLP),
7 (4), 1-29.

Ward, W., Cole, R., Bolaños, D., Buchenroth-Martin, C., Svirsky, E., and
Weston, T. (2013). My Science Tutor: A Conversational Multimedia Virtual
Tutor. Journal of Educational Psychology, 105 (4), 1115.

Pradhan, S., Cole, R., and Ward, W. (2016, August). My Science
Tutor -- Learning Science with a Conversational Virtual Tutor.
In Proceedings of ACL-2016 System Demonstrations (pp. 121-126).