Czech Broadcast Conversation Speech


Item Name: Czech Broadcast Conversation Speech
Authors: Jachym Kolar, Jan Svec, Josef Psutka
LDC Catalog No.: LDC2009S02
ISBN: 1-58563-519-7
Release Date: Jul 17, 2009
Data Type: speech
Sample Rate: 22050 Hz
Sampling Format: 16 bit PCM
Data Source(s): broadcast conversation
Application(s): speaker identification, speech recognition
Language(s): Czech
Language ID(s): ces
Distribution: 2 DVD
Member fee: $0 for 2009 members
Non-member Fee: US $1400.00
Reduced-License Fee: US $700.00
Extra-Copy Fee: US $400.00
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Jachym Kolar, Jan Svec, Josef Psutka. 2009. Czech Broadcast Conversation Speech. Linguistic Data Consortium, Philadelphia.

Introduction

Czech Broadcast Conversation Speech was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic, and consists of 40 hours of speech recorded from Czech Radio 1 in 2003. Transcripts corresponding to the audio files in this corpus are provided in Czech Broadcast Conversation MDE Transcripts (LDC2009T20). These corpora join LDC's other Czech broadcast data sets: Czech Broadcast News Speech (LDC2004S01), Czech Broadcast News Transcripts (LDC2004T01), Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89), and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53).

Czech Broadcast Conversation Speech consists of 72 single-channel recordings of Radioforum, a live talk program broadcast by Czech Radio 1 (CRo1) every weekday evening. In this program, invited guests (most often politicians) spontaneously answer topical questions posed by one or two interviewers. The number of interviewees in a single program varies from one to three; typically, one interviewer and two interviewees appear. The material includes passages of interactive dialogue, but longer stretches of monologue-like speech make up the majority of the collected data. Radioforum also has an interactive segment in which listeners call the studio and ask their own questions. This telephone speech is not transcribed in the current release.

Data

Individual recordings range from 27 to 36 minutes in length. The recordings were collected from February 12, 2003 through June 26, 2003. The signal is mono, sampled at 22.05 kHz with 16-bit resolution, and stored in Windows PCM waveform format. The names of the audio files refer to the broadcast date (rfYYMMDD.wav).
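The naming scheme and audio format described above can be checked programmatically. The following is a minimal sketch in Python, not part of the corpus distribution. The function names are hypothetical, the file name used in the usage note is illustrative rather than taken from the corpus listing, and the sketch assumes that the two-digit year YY refers to the 2000s, which matches the 2003 collection period.

```python
import re
import wave
from datetime import date

# Collection period as stated in the corpus description.
COLLECTION_START = date(2003, 2, 12)
COLLECTION_END = date(2003, 6, 26)

# File names follow the documented pattern rfYYMMDD.wav.
_NAME_RE = re.compile(r"rf(\d{2})(\d{2})(\d{2})\.wav")


def broadcast_date(filename: str) -> date:
    """Parse the broadcast date from a name of the form rfYYMMDD.wav."""
    match = _NAME_RE.fullmatch(filename)
    if match is None:
        raise ValueError(f"not an rfYYMMDD.wav name: {filename!r}")
    yy, mm, dd = (int(group) for group in match.groups())
    # Assumption: two-digit years belong to the 2000s (recordings are from 2003).
    return date(2000 + yy, mm, dd)


def in_collection_period(day: date) -> bool:
    """Return True if the date falls inside the documented collection period."""
    return COLLECTION_START <= day <= COLLECTION_END


def matches_documented_format(path: str) -> bool:
    """Check a WAV header against the documented mono, 22050 Hz, 16-bit format."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == 1
            and wav.getframerate() == 22050
            and wav.getsampwidth() == 2  # 2 bytes per sample = 16 bit
        )
```

For example, broadcast_date("rf030212.wav") would yield the date February 12, 2003, and in_collection_period would accept it. A name that does not follow the pattern raises ValueError instead of being silently skipped.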

The table below contains details about the audio files and the transcripts:

Number of shows                   72
Number of word tokens             292.6k
Number of unique words            30.5k
Duration of transcribed speech    33.0 h
Total number of speakers          128
Male speakers                     108
Female speakers                   20
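A few per-show and per-minute averages follow from the figures in the table. The sketch below is only an illustration of that arithmetic in Python. The constant and key names are hypothetical, and because the token and vocabulary counts are given in rounded form (292.6k and 30.5k), the derived values are approximate.

```python
# Figures copied from the table; the "k" values are rounded in the source.
SHOWS = 72
WORD_TOKENS = 292_600
UNIQUE_WORDS = 30_500
TRANSCRIBED_HOURS = 33.0
SPEAKERS = 128
MALE_SPEAKERS = 108
FEMALE_SPEAKERS = 20


def derived_stats() -> dict:
    """Compute approximate averages implied by the table."""
    transcribed_minutes = TRANSCRIBED_HOURS * 60
    return {
        # Average amount of transcribed speech per show, in minutes.
        "minutes_per_show": transcribed_minutes / SHOWS,
        # Average number of word tokens per show.
        "tokens_per_show": WORD_TOKENS / SHOWS,
        # Average speaking rate over the transcribed speech.
        "words_per_minute": WORD_TOKENS / transcribed_minutes,
        # Share of male speakers among all speakers.
        "male_share": MALE_SPEAKERS / SPEAKERS,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.2f}")
```

Under these figures, each show contributes on average 27.5 minutes of transcribed speech, at roughly 148 word tokens per minute.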

Sponsorship

The completion of this corpus was facilitated by funding provided by the Ministry of Education of the Czech Republic under projects No. ME909 and 2C06020.

Content Copyright

Portions © 2003 Cesky rozhlas 1 Radiozurnal, © 2009 Trustees of the University of Pennsylvania