Interactive Systems Laboratories
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213

=========================
ISL MEETING CORPUS Part 1 (ISL MC1)
2000 - 2001
=========================

=======
AUTHORS
=======

Susanne Burger (sburger@cs.cmu.edu)
Victoria MacLaren (vicky@cs.cmu.edu)
Alex Waibel (ahw@cs.cmu.edu)

=======
CONTACT
=======

Susanne Burger (sburger@cs.cmu.edu)

========
OVERVIEW
========

The ISL Meeting Corpus Part 1 (ISL-MC1) is the first published subset
of the ISL Meeting Corpus (112 meetings). It contains 18 meetings
collected at the Interactive Systems Laboratories at Carnegie Mellon
University in Pittsburgh, PA during the years 2000-2001. The recorded
meetings were either natural meetings where participants needed to
meet in the real world, or artificial meetings, which were designed
explicitly for the purposes of data collection but still had real
topics and tasks. The ISL-MC1 meetings range in length from 8 to 64
minutes and average 34 minutes.

This document contains an overview of the ISL Meeting Data collection,
including the collection, transcription, and data preparation process.
Further details are provided in the other documentation in this
directory.

As part of this release, we provide the following material for
meetings

  m035 m036 m038 m039 (parts m039a and m039b) m042 m043 m045 m046 m048
  m051 m052 m053 m054 m055 m057 m061 m063 m064

* Audio -- for 18 meetings, a directory containing simultaneous
  recordings of up to 8 channels: lapel microphone channels for each
  participant, plus a single wav file containing a mix of all channel
  tracks
  (Note that meeting m039 has two parts, m039a and m039b)
  File format: MS wav files (PCM, 16 kHz, 16 Bit, mono, little endian)

* Transcripts -- for 18 meetings: a word-level orthographic
  transcription, containing additional annotations for spontaneous
  speech events and disfluencies; available in the form of a "trl"
  file; transcription system according to VERBMOBIL_II conventions,
  see also isl_transcr_lex.pdf
  File format: ASCII text files

* Time stamps - for each meeting, the begin and end time stamps of
  each speaker turn were segmented and saved in a "mar" file. "mar"
  file turns and "trl" file turns can be linked via the turn
  identification.
  (Note that meeting m039 has two parts, m039a.mar and m039b.mar)
  File format: ASCII text files

* islmc1_readme.txt - this file
  File format: ASCII text file

* icslp2002_islmeetingcorpus.pdf - technical paper about the ISL
  Meeting Corpus published at ICSLP 2002
  File format: pdf file

* isl_transcr_lex.pdf - lexicon of transcription conventions
  File format: pdf file

* islmc1_vocabulary.txt - a list of word types with total numbers of
  word occurrences for all 18 meetings
  File format: ASCII text file

* islmc1_speakerinfo.txt - list of the speaker identifications of all
  speakers who participated in the 18 meetings + gender + native
  language + meeting(s) in which they participated
  File format: ASCII table separated by tabs

* islmc1_meetinginfo.txt - detailed meeting information (recording
  format, size, duration, topic, additional information etc.)
  File format: ASCII table separated by tabs

* islmc1_channelinfo.txt - channel information: who spoke on which
  channel (mar file name, speaker identification, turn number, channel
  identification, audio file name)
  File format: ASCII table separated by tabs

====================
The Recording Set-up
====================

During meeting recordings, each speaker wore an individual lapel
microphone and was recorded via a multi-channel mix board and a
multi-channel sound card. This setup was devised to obtain a consumer-
or application-style sound quality. All meetings were recorded in the
same instrumented meeting area. The audio was collected at a 16 kHz
sample rate. Audio files for each meeting are provided as separate
time-synchronous recordings for each channel, encoded as 16-bit
(little-endian) wave files.

All meetings were recorded in an open-plan office and lab environment
with the typical background noises and artificial light. The meeting
area (roughly 3m x 5m) is separated from the larger office area by
three cubicle walls. Participants sat around an oval table, and
sometimes made use of a smart board, white board, wall projectors or a
TV.

============================
Meeting Names, Meeting Types
============================

Meeting names begin with "m" for meeting, followed by a number
assigned in order of recording date, e.g. m100.wav. All data linked to
a particular recording starts with the same identification.

Turn identifications within transcriptions have the form:

  Meeting_name_ChannelId_Turnnumber_speakerId_00

Turn numbering starts with 0000.

"mar" time stamp files contain:

  samplepoint_begin sample_point_end SpeakerId_Turnnumber_ChannelId

Meeting types:

The recorded meetings were either natural meetings where participants
needed to meet in the real world, or artificial meetings, which were
designed explicitly for the purposes of data collection.
The natural meetings were work-related; the participants had either
scheduled a meeting in the meeting space of the recording lab or had
been invited there. The meeting agenda was always real, unrelated to
the ISL recording, and known beforehand.

The artificial meetings provided topics to the participants. The topic
could be a controversial discussion subject (controversial subjects
were used to elicit the most active discussion), or it could be an
open-ended instruction to 'just chat' about whatever came to mind.
Participants were also given games to play: board games, card games,
and role-playing all appear in the ISL corpus. Typically, participants
needed to solve problems, answer questions or role-play by acting out
characters in a made-up situation.

ISL-MC1 contains:

Project meeting (2):
  Participants:   Project teams working on parts of a larger project
  Hierarchy:      Team leaders, team members
  Speaking Style: Slow dynamic (few turns per minute and words per
                  minute), many very long turns and very short turns
  Vocabulary:     Domain-dependent (one topic)

Discussion (9):
  Participants:   Individuals
  Hierarchy:      Pro and con positions, speaker alliances, active
                  speakers, eager speakers
  Speaking Style: Fast dynamic (many turns per minute and words per
                  minute), many very short turns
  Vocabulary:     Topic-centered, open domain

Chatting (1):
  Participants:   Individuals
  Hierarchy:      Balanced
  Speaking Style: Fast dynamic, laughing, few of the very short and
                  very long turns
  Vocabulary:     Open domain

Game playing (6):
  Participants:   Individuals and loose alliances
  Hierarchy:      Game-dependent
  Speaking Style: Laughing, few of the very short and long turns
  Vocabulary:     Topic (game)-dependent

====================
Meeting Participants
====================

There are a total of 31 unique speakers in the corpus. Meetings
involved anywhere from 3 to 9 participants, averaging 5. The corpus
contains a significant proportion of non-native English speakers,
varying in fluency but still understandable.
Each speaker was asked to complete a speaker questionnaire of basic
demographic information (including sex, age, regional dialect,
education level, etc.). Some information was required, some optional.
The speaker data are archived at ISL and are not shareable, apart from
the information available in islmc1_speakerinfo.txt. Participants were
project partners, groups from other labs, students and co-workers.

Speakers are identified in the corpus via a 3- or 6-letter code which
is assigned randomly by the speaker database. While speakers are
identified exclusively by ID throughout the corpus documentation, we
made no effort to eliminate names that occurred naturally in the
meeting discussions (although all speakers were given the opportunity
of suppressing any speech they wished removed, including identifiers:
see the "Participant Approval" note below).

============================
Transcriptions and "trl" Files
============================

Complete word-level orthographic transcriptions are provided for all
meetings. These transcriptions were generated by a team of
transcribers listening to the lapel microphone channels. In addition
to the spoken words, the transcriptions include annotations of
non-lexical vocalizations (laughs, coughs, ...), and notations
regarding mangled pronunciations, use of non-English words,
unintelligible speech, marks for repeated, interrupted and corrected
phrases, and other qualifications and comments.

The transcripts are provided in a derivative of the VERBMOBIL II
conventions for the transcription of spontaneous speech. A description
of these conventions is provided in isl_transcr_lex.pdf.

Transcripts were prepared by means of the TransEdit transcription
application. This application was developed for the transcription of
multi-channel recordings and displays a synchronized multi-track view
of all channels of a meeting, with listening and segmentation
functions for each channel separately.
For more information about TransEdit, send email to
sburger@cs.cmu.edu.

Transcriptions are saved as a "trl" file. The begin and end time
stamps of speaker turns, in sample points, are produced during the
transcription process by the same application, and saved in a separate
file with the same filename and the extension "mar". They can be
linked to the transcribed turns by means of the turn identification
(appropriate scripts are available at ISL).

================================
Participant Approval & Censoring
================================

All participants of the ISL Meeting Corpus have agreed to have the
recordings of their meeting contributions used in research. They
signed a permission sheet approved by Carnegie Mellon University's
Institutional Review Board. The permission sheets are archived at ISL.

Speakers were given the opportunity to review the meetings in which
they participated in order to approve (or request modifications to or
deletions in) the transcriptions generated. Very few requests were
made to have content expunged. We did not censor any data except as
explicitly requested by the participants, including identifying names,
etc. that may have occurred in the speech stream.

Segments of meetings that participants wished deleted were replaced by
a white noise tone on all channels (a necessary step due to potential
leakage across channels). The content was removed from the transcripts
and replaced by a segment containing only the "<%>" label, which marks
speech that is not understandable.

============================
Known Problems, Useful Facts
============================

Corpus users should be warned of several known problems with this
material:

off-mike speakers -- In general, we tried to ensure that all meeting
participants wore lapel microphones, but there are occasional
exceptions to this rule, noted in the headers of individual meetings.
While in most meetings the off-mike speaker was the recording person,
who gave very few instructions, some meetings do contain significant
amounts of off-mike speech from meeting participants (due to
microphone failures, off-mike observers offering comments, etc.). The
table islmc1_channelinfo.txt lists all speakers per meeting, the
number of turns each speaker spoke, and the audio files which served
as the basis for transcription. Some audio files have more than one
speaker. These additional speakers are the off-mike speakers linked to
the audio file which served as the source for transcription. One
meeting has an audio file of a speaker who did not say anything during
the entire meeting (m039b_2).

M039: the recording stopped during the meeting and was then continued.
Therefore, the meeting is separated into two parts, m039a and m039b.
The transcription was also made in two parts, to match the time
stamps, which started at 0000 again for the second part. Note that the
turn IDs within the transcription files do not contain an additional
letter showing which part of the meeting they belong to. This could
confuse further data processing on the turn level, because there are
several turns with the same identification.

=======================
For More Information...
=======================

More information is available on the ISL Meeting_Room website:

  www.is.cs.cmu.edu/meeting_room/

Please send us comments, corrections, and other feedback at
sburger@cs.cmu.edu.

================
Acknowledgments
================

The collection and preparation of this corpus was made possible in
large part through funding from DARPA, both through the GENOA project
and through ROAR and CPoF.
Many thanks to the transcription and data collection team at ISL:

  Daniel Schneider, Debra Vlasak, John Helman, Nils Hammer,
  Denise Hill, Robert Isenberg, David Chekan, Rodolfo Vega,
  Raina Jones

Finally, special thanks to all the meeting participants for allowing
us to record their meetings and for participating in our discussion
and game meetings.
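
The turn-level bookkeeping described under "Meeting Names, Meeting
Types" and the duplicate turn IDs noted for m039 under "Known
Problems" can be illustrated with a short Python sketch. This is not
one of the ISL scripts, and it rests on assumptions that should be
checked against the actual files: that each "mar" line holds three
whitespace-separated fields, that the third field splits on "_" into
speaker ID, turn number and channel ID, and that prefixing the "mar"
base name is an acceptable way to tell m039a turns from m039b turns.
The names MarTurn, parse_mar_line and part_qualified_key are
hypothetical.

```python
from typing import NamedTuple

# Stated in this readme: all audio is sampled at 16 kHz.
SAMPLE_RATE_HZ = 16_000


class MarTurn(NamedTuple):
    """One speaker turn from a "mar" file (field layout is assumed)."""
    begin_sample: int
    end_sample: int
    speaker_id: str
    turn_number: str   # kept as text to preserve leading zeros ("0000")
    channel_id: str

    @property
    def begin_sec(self) -> float:
        # Convert the begin sample point to seconds.
        return self.begin_sample / SAMPLE_RATE_HZ

    @property
    def end_sec(self) -> float:
        # Convert the end sample point to seconds.
        return self.end_sample / SAMPLE_RATE_HZ


def parse_mar_line(line: str) -> MarTurn:
    """Parse 'begin end SpeakerId_Turnnumber_ChannelId' (assumed layout)."""
    begin, end, label = line.split()
    speaker_id, turn_number, channel_id = label.split("_")
    return MarTurn(int(begin), int(end), speaker_id, turn_number, channel_id)


def part_qualified_key(mar_basename: str, turn: MarTurn) -> str:
    """Build a key that stays unique across m039a and m039b.

    The prefix is the "mar" file base name (e.g. "m039a"); this scheme
    is an illustration, not a corpus convention.
    """
    return f"{mar_basename}:{turn.speaker_id}_{turn.turn_number}_{turn.channel_id}"
```

For example, a made-up line "16000 48000 ABC_0000_1" would parse to a
turn running from 1.0 to 3.0 seconds, and read from m039b.mar it would
receive the key "m039b:ABC_0000_1". Keeping the turn number as a
string avoids losing the leading zeros used in the turn
identification.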