=========================================================
SRI's Speech-Based Collaborative Learning Corpus (SBCLC)
=========================================================

1.0 INTRODUCTION

The SRI Speech-Based Collaborative Learning Corpus (SBCLC) was collected as
part of a project investigating the utility of a speech-based learning
analytics approach to collaborative learning. The goal is to determine
whether detectable patterns exist in student speech that correlate with
collaborative learning indicators and that provide a means of assessing
collaboration quality. The corpus contains audio recordings, orthographic
transcriptions, manual annotations of collaboration, log files, and
supporting documentation.

The audio recordings come from 21 visits to middle schools, with 134
participating students. The schools were all in California, United States,
and the students were in grades 6, 7, or 8. Most students participated in
two sessions of problem solving.

Students worked together in groups of three on sets of short mathematics
problems (items) that require collaboration to solve. The items were a
collaborative variation of the cloze task (fill in the blank), in which each
student was assigned one blank and each problem required the students to
work together and talk to each other to coordinate their three answers. The
collaborative mathematics problems were delivered on iPads with a
custom-built software application. To respond correctly to an item, all
three students had to enter correct answers at the same time.

All sessions were manually annotated with codes that (1) mark indicators of
collaboration and (2) assess the overall collaboration quality of the
interaction.

When citing SBCLC please cite the following paper:

C. Richey, C. D'Angelo, N. Alozie, H. Bratt, E.
Shriberg, "The SRI speech-based collaborative learning corpus", in
INTERSPEECH 2016 - 17th Annual Conference of the International Speech
Communication Association, Proceedings, San Francisco, California, USA,
September 8-12, 2016.

2.0 NAMING CONVENTIONS

The corpus is organized into "sessions". The name for each session contains
the following information:

  (1) corpus name (SBCLC)
  (2) code for school visit
  (3) problem set (1 or 2)
  (4) student group (A or B)

SBCLC contains a total of 80 sessions. The naming convention for each
session is:

  ___

3.0 DATA SET ORGANIZATION

                                        SBCLC
                                          |
                         -----------------------------------
                         |                                 |
                       data                              docs
                         |
    -------------------------------------------
    |           |             |               |
 speech       logs       annotations     transcripts

4.0 CONTENTS

Directory: data/speech
======================

This directory contains the audio recordings. Audio recordings are grouped
into subdirectories by session. Each session subdirectory contains 4 audio
files:

  _1.flac   audio from head-mounted mic worn by student sitting on the
            left (speaker 1)
  _2.flac   audio from head-mounted mic worn by student sitting in the
            middle (speaker 2)
  _3.flac   audio from head-mounted mic worn by student sitting on the
            right (speaker 3)
  _S.flac   audio from table-top stereo microphone placed in front of the
            students

During the data collection, each student wore a head-mounted
noise-cancelling microphone (Audio-Technica PRO 8HEx). Students 1 and 2 wore
the headset so that the microphone pointed away from the group in order to
cancel as much audio from the other two students as possible. A ZOOM H6
portable digital recorder was used to record audio from these three
microphones plus its built-in stereo microphone. All audio channels were
digitized at 48 kHz with 24-bit PCM encoding using a shared clock (and were
therefore sample synchronous). The released audio has a 16 kHz sample rate
and 16-bit PCM encoding.

Directory: data/logs
====================

This directory contains log files for a subset of the sessions.
When students worked on problem set 1, all controller use, clicks on the
screen, and responses selected were saved in a time-stamped log file.
Summaries of the log files are provided in comma-separated format (csv).
The naming convention is:

  .logs.csv

Each summary log file starts with a header line that defines the fields.
Each subsequent line records one moment when all three students submitted
their responses simultaneously. Each line consists of the following fields:

  1. Session name
  2. Time elapsed (in seconds) since the iPad application was started
  3. Difference in start times (in seconds) between the iPad application
     and the audio recording (if known)
  4. Problem number (01-12)
  5. Whether the students submitted the correct set of responses
     (correct, incorrect)
  6. Response submitted by speaker 1
  7. Response submitted by speaker 2
  8. Response submitted by speaker 3

There were 39 sessions with problem set 1. However, two of the log files
had errors and are not included.

Directory: data/transcripts
===========================

This directory contains time-aligned orthographic transcriptions for 13 of
the sessions. The transcription files are simple text files. The naming
convention is:

  .trans.txt

Each line of a transcription file corresponds to a span of speech that was
surrounded by short pauses. Each line starts with an identification code
for the span of speech, followed by the orthographic transcription. The
identification codes have the following format:

  ___

The start and end times are given in milliseconds. For details about the
segmentation of speech and the transcription conventions, see the
transcription conventions file in the docs directory.

Directory: data/annotations
===========================

This directory contains 3 text files with manual annotations. Each session
was manually annotated with "Icodes" (indicators of collaboration) and
"Qcodes" (overall collaboration quality of the interaction).
For Icodes, the annotators themselves determined the region in the audio
corresponding to a particular indicator of collaboration. Thus, the
duration of Icodes varies. For Qcodes, the audio recordings were divided
into the math problems or "items", and annotators assigned a Qcode to each
entire item. The duration of an item depends on how quickly the students
worked through it. Additionally, each item was divided into 30-second
windows, and annotators assigned a Qcode to each window.

  SBCLC_Icodes.txt                 annotation of indicators of
                                   collaboration
  SBCLC_Qcodes_items.txt           annotation of collaboration quality for
                                   each item
  SBCLC_Qcodes_30sec_windows.txt   annotation of collaboration quality at
                                   roughly 30-second intervals

See the header of each file for information about the fields. For more
information, see the definition files in the docs directory or the full
corpus paper.

Directory: docs/
================

This directory contains several files providing background and
meta-information about the SBCLC corpus.

  SBCLC_session_info.txt                date, problem set, group, and
                                        speakers for each session
  SBCLC_speaker_info.txt                grade and gender of each speaker
  SBCLC_Icodes_definitions.txt          definitions of Icodes
  SBCLC_Qcodes_definitions.txt          definitions of Qcodes
  SBCLC_transcription_conventions.pdf   transcription conventions in pdf
                                        format
  SBCLC_transcription_conventions.txt   transcription conventions in txt
                                        format

POINT OF CONTACT:

  Colleen Richey
  SRI International
  colleen.richey@sri.com
  650-859-4741
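
EXAMPLE: READING A SUMMARY LOG FILE

The eight-field layout described under data/logs can be read with a few
lines of Python. The following is a minimal sketch, not part of the corpus
tooling. It assumes that the columns appear in the order listed in that
section, that the header line can simply be skipped, and that an unknown
start-time difference shows up as a blank or otherwise non-numeric value.
The names Submission and parse_log_summary, the header text, the session
name EXAMPLE_SESSION, and the sample rows are all invented for illustration.

```python
import csv
import io
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Submission:
    session: str
    elapsed_s: float                  # seconds since the iPad application started
    start_offset_s: Optional[float]   # app/audio start difference; None if unknown
    problem: str                      # "01".."12", kept as text to preserve padding
    correct: bool
    responses: List[str]              # speakers 1, 2, 3, in that order


def _float_or_none(value: str) -> Optional[float]:
    """Return a float, or None when the value is blank or not numeric."""
    try:
        return float(value)
    except ValueError:
        return None


def parse_log_summary(text: str) -> List[Submission]:
    """Parse the text of one summary log file (assumed column order)."""
    rows = csv.reader(io.StringIO(text))
    next(rows, None)  # skip the header line that defines the fields
    submissions = []
    for row in rows:
        if len(row) < 8:
            continue  # ignore blank or short lines
        submissions.append(Submission(
            session=row[0].strip(),
            elapsed_s=float(row[1]),
            start_offset_s=_float_or_none(row[2].strip()),
            problem=row[3].strip(),
            correct=row[4].strip().lower() == "correct",
            responses=[cell.strip() for cell in row[5:8]],
        ))
    return submissions


# Made-up sample data; real header names and values may differ.
SAMPLE = (
    "session,time,offset,problem,result,r1,r2,r3\n"
    "EXAMPLE_SESSION,12.5,,01,incorrect,3,4,5\n"
    "EXAMPLE_SESSION,30.0,1.25,01,correct,3,4,6\n"
)

if __name__ == "__main__":
    for sub in parse_log_summary(SAMPLE):
        print(sub.problem, sub.correct, sub.responses)
```

The sketch leaves the start-time difference as a separate value and does
not convert log times to audio times, because the sign convention of that
field is not stated here. Check the header line of an actual log file
before relying on the column positions.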