ICSI Meeting Transcripts


Item Name: ICSI Meeting Transcripts
Authors: Adam Janin, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabeth Shriberg, Andreas Stolcke, and Chuck Wooters
LDC Catalog No.: LDC2004T04
ISBN: 1-58563-286-4
Release Date: Jan 30, 2004
Data Type: text
Data Source(s): meeting speech
Application(s): discourse analysis, speech recognition
Language(s): English
Language ID(s): eng
Distribution: Web Download
Member fee: $0 for 2004 members
Non-member Fee: US $600.00
Reduced-License Fee: US $300.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Adam Janin, et al. 2004. ICSI Meeting Transcripts. Linguistic Data Consortium, Philadelphia.

Introduction

ICSI Meeting Transcripts is released by the Linguistic Data Consortium (LDC) as catalog number LDC2004T04, ISBN 1-58563-286-4.

The ICSI Meeting corpus is a collection of 75 meetings recorded at the International Computer Science Institute in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible, given that speakers are wearing close-talking microphones and are fully aware of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each. The speech files are available separately as ICSI Meeting Speech.

Data

This corpus consists of 75 word-level transcripts (one transcript file per meeting), time-synchronized to digitized audio recordings. The transcripts contain approximately 795K words in total, of which about 13K are unique.

The meetings were recorded with close-talking and far-field microphones. The transcripts were based mostly on the close-talking microphones, either separately or blended together in a so-called "mixed" channel. The focus of the transcripts was on capturing the flow of audible events, especially the words which were spoken, and who spoke them.

Transcripts were prepared by means of the "Channeltrans" interface. Channeltrans is an extension of the "Transcriber" interface.

There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from three to ten participants, averaging six. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.

Sponsorship

The collection and preparation of this corpus were made possible in large part through funding from DARPA (both through the Communicator project and through a ROAR "seedling"), the Swiss IM2 project (National Centre of Competence in Research, sponsored by the Swiss National Science Foundation), and a supplementary award from IBM.

Updates

There are no updates available at this time. More information is available at http://www.ICSI.Berkeley.EDU/Speech/mr.

Content Copyright

Portions © 2000-2003 International Computer Science Institute, © 2004 Trustees of the University of Pennsylvania