ICSI Meeting Speech


Introduction

ICSI Meeting Speech was produced by the Linguistic Data Consortium (LDC) as catalog number LDC2004S02, ISBN 1-58563-285-6.

The ICSI Meeting corpus is a collection of 75 meetings recorded at the International Computer Science Institute in Berkeley during the years 2000-2002. The meetings included are "natural" meetings in the sense that they would have occurred anyway: they are generally regular weekly meetings of various ICSI working teams, including the team working on the ICSI Meeting Project. In recording meetings of this type, we hoped to capture meeting dynamics and speaking styles that are as natural as possible, given that speakers are wearing close-talking microphones and are fully cognizant of the recording process. The speech files range in length from 17 to 103 minutes, but generally run just under an hour each. Word-level orthographic transcriptions are available as ICSI Meeting Transcripts.

Data

The collection includes 922 speech files, for a total of approximately 72 hours of meeting-room speech. The speech is organized as one subdirectory per meeting, each containing one wavefile per channel and, where applicable, a .blp file specifying any censored intervals.
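The per-meeting layout described above can be indexed with a short script. The following is a minimal sketch only: the function name index_meetings, the corpus_root argument, and the ".sph" audio suffix are assumptions made for illustration and are not stated in this document, so check file.tbl in the doc directory for the actual file names before relying on it.

```python
from pathlib import Path


def index_meetings(corpus_root, audio_suffix=".sph"):
    """Map each meeting subdirectory to its channel files and any .blp files.

    audio_suffix is an assumption; adjust it to match the actual wavefile names.
    """
    index = {}
    for meeting_dir in sorted(Path(corpus_root).iterdir()):
        if not meeting_dir.is_dir():
            continue  # skip stray files at the top level
        files = sorted(p for p in meeting_dir.iterdir() if p.is_file())
        index[meeting_dir.name] = {
            # one wavefile per recorded channel
            "channels": [p.name for p in files if p.suffix == audio_suffix],
            # optional file listing censored intervals
            "censor_files": [p.name for p in files if p.suffix == ".blp"],
        }
    return index
```

Calling index_meetings on the directory that holds the meeting subdirectories returns a dictionary keyed by meeting name, which makes it easy to count channels per meeting or to find the meetings that have censored intervals.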

The audio was collected at a 48 kHz sample rate and downsampled on the fly to 16 kHz. Audio files for each meeting are provided as separate time-synchronous recordings for each channel, encoded as 16-bit linear (big-endian) wavefiles, shorten-compressed in NIST SPHERE format.
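The audio parameters above can be checked against the plain-text header that begins every NIST SPHERE file. The sketch below reads only that header; it assumes the standard SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" fields terminated by "end_head"), and the function names are hypothetical. It does not decode the shorten-compressed samples, which requires a separate tool.

```python
def parse_sphere_header(raw):
    """Parse the ASCII header of a NIST SPHERE file (bytes in, dict out)."""
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {"header_bytes": int(lines[1].strip())}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore blank or malformed lines
        name, type_tag, value = parts
        if type_tag == "-i":      # integer field
            fields[name] = int(value)
        elif type_tag == "-r":    # real-valued field
            fields[name] = float(value)
        else:                     # "-sN": string field of N characters
            fields[name] = value
    return fields


def read_sphere_header(path):
    """Read and parse the header of the SPHERE file at path."""
    with open(path, "rb") as f:
        preamble = f.read(16)  # "NIST_1A\n" plus the 8-byte header-size line
        size = int(preamble[8:15].decode("ascii"))
        return parse_sphere_header(preamble + f.read(size - 16))


def duration_seconds(fields):
    """Duration implied by the header: sample_count / sample_rate."""
    return fields["sample_count"] / fields["sample_rate"]
```

For a file matching the description above, one would expect sample_rate to be 16000, sample_n_bytes to be 2, and sample_byte_format to be "10" (most significant byte first), with the shorten compression recorded in the sample_coding field.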

The meetings were simultaneously recorded using close-talking microphones for each speaker (generally head-mounted, but early meetings contain some lapel microphones), as well as six table-top microphones: 4 high-quality omnidirectional PZM microphones arrayed down the center of the conference table, and 2 inexpensive microphone elements mounted on a mock PDA. All meetings were recorded in the same instrumented meeting room.

There are a total of 53 unique speakers in the corpus. Meetings involved anywhere from 3 to 10 participants, averaging 6. The corpus contains a significant proportion of non-native English speakers, varying in fluency from nearly-native to challenging-to-transcribe.

The following documentation files are provided:
overview.txt: overview of the corpus
trans_guide.txt: description of the transcription process and transcription conventions
Meeting1_annotated.dtd: text file specifying the MRT format, including annotations and examples of use
Meeting1_annotated_dtd.html: html file specifying the MRT format, including annotations and examples of use
icsi1.spk: a compilation of speaker information (XML format)
naming.txt: text file describing the naming conventions for meetings, participant IDs, microphone types, and transmission types
naming.html: html file describing the naming conventions for meetings, participant IDs, microphone types, and transmission types
seatingchart.txt: a simple diagram of the meeting table with rough indications of seat and table-top microphone placement
all.blp: a listing of all intervals where speech is censored, in plain text format

For more information please consult the overview.txt file in the doc directory.

For a complete listing of the files please see file.tbl in the doc directory.

Sponsorship

The collection and preparation of this corpus were made possible in large part through funding from DARPA (through both the Communicator project and a ROAR "seedling"), the Swiss IM2 project (National Centre of Competence in Research, sponsored by the Swiss National Science Foundation), and a supplementary award from IBM.

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus, LDC2004S02.

Please contact mrcontact@icsi.berkeley.edu with any questions regarding this corpus.

Content Copyright

Portions © 2000-2003 International Computer Science Institute, © 2004 Trustees of the University of Pennsylvania


Contact: ldc@ldc.upenn.edu
© 2004 Linguistic Data Consortium, Trustees of the University of Pennsylvania. All Rights Reserved.