SRI Speech-Based Collaborative Learning Corpus
Item Name: | SRI Speech-Based Collaborative Learning Corpus |
Author(s): | Colleen Richey, Cynthia D'Angelo, Nonye Alozie, Harry Bratt, Elizabeth Shriberg |
LDC Catalog No.: | LDC2019S01 |
ISBN: | 1-58563-870-6 |
ISLRN: | 199-041-455-836-2 |
DOI: | https://doi.org/10.35111/1jsy-0150 |
Release Date: | January 15, 2019 |
Member Year(s): | 2019 |
DCMI Type(s): | Sound, Text |
Sample Type: | pcm |
Sample Rate: | 16000 |
Data Source(s): | microphone conversation |
Application(s): | discourse analysis, speech recognition |
Language(s): | English |
Language ID(s): | eng |
License(s): | SRI Speech-Based Collaborative Learning Corpus Agreement |
Online Documentation: | LDC2019S01 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Richey, Colleen, et al. SRI Speech-Based Collaborative Learning Corpus LDC2019S01. Web Download. Philadelphia: Linguistic Data Consortium, 2019. |
Introduction
SRI Speech-Based Collaborative Learning Corpus was developed by SRI International and comprises approximately 120 hours of English speech from 134 US middle school students working collaboratively. The data set also contains orthographic transcriptions, manual annotations of collaboration, log files, and supporting documentation.
This collection was part of a project investigating the utility of a speech-based learning analytics approach to collaborative learning. The goal was to determine whether detectable patterns exist in student speech that correlate with collaborative learning indicators and to provide a means of assessing collaboration quality. The participants were students in middle schools (grades six, seven and eight) located in California. Students worked in groups of three on sets of short mathematics problems based on the "cloze" task. Each student was assigned one blank, and each problem required the students to work together and talk to each other to coordinate their three answers. The problems were presented on iPads through a custom software application.
Data
The audio data was captured by both head-mounted and table-top microphones and is released as FLAC-compressed 16 kHz, 16-bit PCM WAV files.
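The stated audio format (16 kHz, 16-bit) can be spot-checked without decoding any audio, because a FLAC stream records these values in its leading STREAMINFO block. The following is a minimal sketch using only the Python standard library. It assumes the released files are native FLAC streams (starting with the `fLaC` marker); the function name `read_flac_streaminfo` is illustrative and not part of the corpus documentation.

```python
def read_flac_streaminfo(data: bytes) -> dict:
    """Parse sample rate, channels, and bit depth from the start of a FLAC stream.

    `data` must hold at least the first 26 bytes of the stream.
    """
    if len(data) < 26 or data[:4] != b"fLaC":
        raise ValueError("not a native FLAC stream")
    # Byte 4 is the metadata block header: 1 flag bit + 7-bit block type.
    if data[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    # The STREAMINFO body starts at byte 8. After 10 bytes of block-size and
    # frame-size fields come 64 bits packed as:
    #   sample rate (20) | channels - 1 (3) | bits per sample - 1 (5) | total samples (36)
    packed = int.from_bytes(data[18:26], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def check_file(path: str) -> bool:
    """Return True if the file at `path` (a hypothetical local path) is 16 kHz, 16-bit."""
    with open(path, "rb") as f:
        info = read_flac_streaminfo(f.read(42))
    return info["sample_rate"] == 16000 and info["bits_per_sample"] == 16
```

Calling `check_file` on each downloaded recording would flag any file that deviates from the documented format; `total_samples` divided by `sample_rate` gives the duration in seconds when the encoder recorded the sample count.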
Recording sessions were manually annotated with codes that mark indicators of collaboration (I codes) and that assess the overall collaboration quality of the interaction (Q codes). Annotations are presented as UTF-8 CSV files.
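As a starting point for working with the annotation files, the sketch below tallies how often each code appears in one CSV column. The column name `code` and the code values in the sample are hypothetical placeholders, since the actual column layout and code labels are defined in the corpus documentation, not here.

```python
import csv
import io
from collections import Counter


def count_codes(csv_text: str, column: str) -> Counter:
    """Count the non-empty values found in `column` of a CSV document."""
    reader = csv.DictReader(io.StringIO(csv_text))
    if reader.fieldnames is None or column not in reader.fieldnames:
        raise KeyError(f"column {column!r} not found in CSV header")
    return Counter(row[column] for row in reader if row[column])


# Hypothetical sample: the header and code values are invented for illustration.
SAMPLE = "segment,code\n1,I_a\n2,I_b\n3,I_a\n"

if __name__ == "__main__":
    print(count_codes(SAMPLE, "code"))
```

For a real annotation file, the text can be read with `open(path, encoding="utf-8", newline="").read()` and passed to `count_codes` along with the actual column name from the documentation.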
Also included in this corpus are orthographic transcripts for a subset of the audio recordings and log files for iPad usage; both are released as UTF-8 encoded plain text.
Samples
Please view this speech sample and transcript sample.
Updates
None at this time.