MyST Children's Conversational Speech

Item Name: MyST Children's Conversational Speech
Author(s): Sameer Pradhan, Ronald Allan Cole, Wayne Ward
LDC Catalog No.: LDC2021S05
ISBN: 1-58563-967-2
ISLRN: 848-818-101-134-5
Release Date: June 15, 2021
Member Year(s): 2021
DCMI Type(s): Sound, Text
Sample Type: pcm
Sample Rate: 16000
Data Source(s): microphone conversation
Application(s): machine reading, phonetics, pronunciation modeling, speech recognition, spoken dialogue modeling, spoken dialogue systems
Language(s): English
Language ID(s): eng
License(s): MyST Children’s Conversational Speech Agreement
Online Documentation: LDC2021S05 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Pradhan, Sameer, Ronald Cole, and Wayne Ward. MyST Children's Conversational Speech LDC2021S05. Web Download. Philadelphia: Linguistic Data Consortium, 2021.


MyST (My Science Tutor) Children's Conversational Speech was developed by Boulder Learning Inc. It comprises approximately 470 hours of English speech from 1,371 students in grades 3-5 conversing with a virtual science tutor in eight areas of science instruction, along with transcripts and a pronunciation dictionary.

Data was collected in two phases between 2008 and 2017. In both phases, spoken dialogs with the virtual tutor were aligned to classroom instruction using the Full Option Science System (FOSS), a research-based science curriculum for grades K-8. The eight FOSS science modules represented in this data set averaged 16 small-group classroom science investigations each. Following the investigations, students conversed with the virtual science tutor for 15-20 minutes. The tutor asked open-ended questions about media presented on-screen, and students produced spoken answers.


Speech data was collected in 10,496 sessions for a total of 227,567 utterances. Approximately 45% of those utterances (102,433) were transcribed. All data collected in Phase I was transcribed using rich transcription guidelines; data collected in Phase II was partially transcribed using a reduced version of those guidelines. The transcription guidelines are included in this release.
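The counts above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch that uses only the figures quoted in this description; the constant and function names are illustrative and are not part of the release.

```python
# Figures quoted in the corpus description (not read from the data itself).
TOTAL_SESSIONS = 10_496
TOTAL_UTTERANCES = 227_567
TRANSCRIBED_UTTERANCES = 102_433


def share(part: int, whole: int) -> float:
    """Return part/whole expressed as a percentage."""
    return 100.0 * part / whole


# Share of utterances that have a transcript.
transcribed_pct = share(TRANSCRIBED_UTTERANCES, TOTAL_UTTERANCES)

# Utterances that remain untranscribed.
untranscribed = TOTAL_UTTERANCES - TRANSCRIBED_UTTERANCES

# Mean number of utterances recorded per tutoring session.
utterances_per_session = TOTAL_UTTERANCES / TOTAL_SESSIONS

print(f"transcribed share: {transcribed_pct:.1f}%")
print(f"untranscribed utterances: {untranscribed:,}")
print(f"mean utterances per session: {utterances_per_session:.1f}")
```

The untranscribed remainder may be relevant to semi-supervised or self-supervised training setups, since that audio is included without reference text.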

Data is divided into development, test, and training partitions for use with ASR systems.

Speech is presented as single-channel, 16 kHz, 16-bit audio in FLAC-compressed WAV format. Transcripts are UTF-8 encoded plain text.
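Before feeding audio to an ASR pipeline, it can be useful to confirm that a file matches the stated format. The sketch below reads the STREAMINFO metadata block that the FLAC format places at the start of a stream, using only the Python standard library. It assumes the file begins directly with the "fLaC" marker (no prepended ID3 tag); the function names and any file path passed to them are hypothetical and not part of the release.

```python
from typing import NamedTuple


class StreamInfo(NamedTuple):
    sample_rate: int      # Hz
    channels: int
    bits_per_sample: int
    total_samples: int    # per channel; 0 means unknown


def parse_flac_streaminfo(data: bytes) -> StreamInfo:
    """Decode the STREAMINFO block from the first 42 bytes of a FLAC stream."""
    if data[:4] != b"fLaC":
        raise ValueError("not a FLAC stream: missing 'fLaC' marker")
    if len(data) < 42:
        raise ValueError("need at least 42 bytes to read STREAMINFO")
    block_type = data[4] & 0x7F                 # low 7 bits: block type
    block_len = int.from_bytes(data[5:8], "big")
    if block_type != 0 or block_len < 34:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 8-17 hold block and frame size limits. Bytes 18-25 pack:
    # sample rate (20 bits), channels-1 (3), bits per sample-1 (5),
    # total samples (36).
    packed = int.from_bytes(data[18:26], "big")
    return StreamInfo(
        sample_rate=packed >> 44,
        channels=((packed >> 41) & 0x7) + 1,
        bits_per_sample=((packed >> 36) & 0x1F) + 1,
        total_samples=packed & ((1 << 36) - 1),
    )


def read_streaminfo(path: str) -> StreamInfo:
    """Read STREAMINFO from a FLAC file on disk (path is caller-supplied)."""
    with open(path, "rb") as handle:
        return parse_flac_streaminfo(handle.read(42))


def matches_corpus_format(info: StreamInfo) -> bool:
    """Check for the single-channel, 16 kHz, 16-bit format described above."""
    return (
        info.sample_rate == 16000
        and info.channels == 1
        and info.bits_per_sample == 16
    )
```

Dividing total_samples by sample_rate gives the duration of an utterance in seconds when the sample count is present in the header.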


Please view this student answer audio sample (FLAC) and transcript sample (TXT).


Updates: None at this time.
