My Science Tutor (MyST) Children's Conversational Speech
========================================================

Boulder Learning, Inc. (v1.0)

Corpus Overview
===============

The Children's Speech Corpus was created as part of the My Science Tutor
(MyST) project. We will refer to it as the MyST corpus. It consists of 473
hours of speech collected from 1,371 students in the 3rd, 4th and 5th
grades. Students conversed with a virtual science tutor in 8 areas of
science, resulting in a total of 10,496 sessions and a total of 228,874
utterances. 45% of the utterances have been transcribed at the word level.

---------------------------------------------------
Students   Sessions      Total        Transcribed
                         Utterances   Utterances
---------------------------------------------------
1,371      10,496        227,567      102,433
           (473 hours)
---------------------------------------------------

Expected Usage
==============

We expect users of the corpus to mine the data and conduct research. The
MyST corpus is well suited for training recognizers and classifiers and
for evaluating speech recognition performance on held-out evaluation data.
It is our hope that researchers will publish their results. The MyST
corpus is an excellent resource for evaluating recognition of children's
speech and for monitoring the progress of new approaches to speech
recognition, as it contains about an order of magnitude more transcribed
speech data than all currently available corpora combined.

Practitioner -- Application Developer
-------------------------------------

Application developers can use the corpus to train recognizers for a
variety of applications. The corpus was collected in educational settings
at 16 kHz, with students using noise-cancelling microphones. Below we list
a few applications that could benefit from this corpus:

Researcher -- Speech Recognition Researcher
-------------------------------------------

The corpus enables researchers to conduct research to improve automatic
speech recognition (ASR) technology.
Our hope is that the MyST corpus will stimulate research and enable
researchers to compare recognition results after training on a large
corpus of speech data using the same evaluation data sets. To facilitate
this research, we have partitioned the data into separate categories, as
described in the following subsection.

Data Partitioning for ASR Evaluation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

For the convenience of the ASR community, we partitioned and structured
the corpus into training, development and test sets. The partitions were
generated so that they reasonably represent the speech data gathered
across the 8 science modules students talked about, and so that each
student's data is in only one of the three partitions. These three data
sets are in three separate directories in the corpus release. (See Corpus
Structure, below.)

Data Provenance
---------------

The University of Colorado's Institutional Review Board (IRB) and an
independent agent, WestEd, appointed by the IES, reviewed and approved all
components of the My Science Tutor project to assure student privacy.

Consent and Assent
~~~~~~~~~~~~~~~~~~

The review board approved the Parental Consent forms and the Student
Assent forms. For every utterance in the corpus, these forms were signed
by the student's parent or guardian and by the student. The final Parental
Consent and Student Assent forms approved by the IRB explicitly provide
permission for anonymous student speech data and transcriptions to be
distributed for both research and commercial use. We manually verified
that we had parental consent and student assent for every student in the
corpus.

Data Collection
===============

The following section describes the process that was used to collect this
data.

Methodology
-----------

The MyST corpus was collected in 2 stages, Phase I and Phase II, over the
years 2008-2017. In both phases, spoken dialogs with the virtual tutor
were aligned to classroom instruction using the Full Option Science System
(FOSS).
The 8 FOSS science modules consisted of an average of 16 small-group
classroom science investigations. Following the investigations, students
conversed with the virtual science tutor for 15 to 20 minutes. The tutor
asked open-ended questions about media presented on-screen, and students
produced spoken answers. For example, the FOSS Magnetism and Electricity
module included 4 classroom investigations (and 4 discussions with the
virtual tutor). The speech data collected during these conversations
comprise the MyST corpus (Ward et al., 2011, 2013; Pradhan et al., 2016).

The MyST conversations were strictly turn-taking: the tutor presented
information, asked a question and waited for the student to respond. To
respond, the student pressed the spacebar on the laptop, held it down
while speaking, and released it when done. Each student turn was recorded
as a separate audio file. When transcribed, an utterance-level transcript
file was created for each audio file. No identifying information was
stored with the data except for anonymized IDs of schools and students.
All students and their parents signed consent forms allowing Boulder
Learning Inc. to enter and distribute their anonymous speech data.

Descriptive Statistics
----------------------

Some characteristics of the data collected in the two phases are described
below.

Phase I
~~~~~~~

The Phase I corpus contains sessions from students in grades 3-5. All of
the sessions from this phase have been transcribed. The following modules
were included in this phase:

1. ME - Magnetism and Electricity
2. MS - Mixtures and Solutions
3. VB - Variables
4. WA - Water

Number of Students:     421
Number of Sessions:     1,509 (109 hours)
Transcribed Sessions:   1,509 (109 hours)
Untranscribed Sessions: -

During this phase, there was no attempt to have any individual student
cover all of the parts for a module.
The focus of the collection during this phase was to get a wide variety of
students rather than to get complete coverage of the material for
individual students.

Phase II
~~~~~~~~

The Phase II corpus contains sessions from students in grades 4-5. It
included the following 5 modules, with an average of 10 parts each:

1. EE - Energy and Electromagnetism
2. MX - Mixtures
3. SMP - Sun, Moon and Planets
4. SRL - Soil, Rocks and Landforms
5. LS - Living Systems

Number of Students:     950
Number of Sessions:     8,987 (364 hours)
Transcribed Sessions:   2,063 (115 hours)
Untranscribed Sessions: 6,924 (249 hours)

In this collection, teachers were asked to have students complete all
parts for 2 modules. However, many teachers did not want to cover 2
modules, and whatever data was collected was kept, even if students did
not complete the sequence.

Transcription Guidelines
========================

During Phase I of the project we used rich (slow, expensive) transcription
guidelines, the ones typically used by speech recognition researchers.
However, we realized that this project did not need that level of richness
in the transcriptions. During Phase II we therefore used a reduced (quick,
cheaper) version of those guidelines, which allowed us to transcribe more
data. We have included the guidelines used for manual transcription in the
release documentation.

Corpus Structure
================

The directory structure of the corpus is shown below. Variables are
enclosed in angle brackets (<>) and can take values as described
immediately after.

myst_child_conv_speech/
├── docs
│   ├── BLI-pronunciation-lexicon-v0.0.10-061470a.dict
│   ├── BLI-speech-transcription-guidelines.v0.1.6.pdf
│   ├── checksums
│   │   ├── ffps.txt
│   │   └── md5sums.txt
│   └── MyST-corpus-README.txt
...
...
├── data
│   ├──
│   │   ├──
│   │   │   ├──
│   │   │   │   ├── ___