Czech Broadcast News MDE Transcripts

Item Name: Czech Broadcast News MDE Transcripts
Author(s): Jachym Kolar, Jan Svec
LDC Catalog No.: LDC2010T02
ISBN: 1-58563-534-0
ISLRN: 539-629-573-162-3
Release Date: January 20, 2010
Member Year(s): 2010
DCMI Type(s): Text
Data Source(s): broadcast news
Application(s): natural language processing, spoken dialogue modeling
Language(s): Czech
Language ID(s): ces
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2010T02 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Kolar, Jachym, and Jan Svec. Czech Broadcast News MDE Transcripts LDC2010T02. Web Download. Philadelphia: Linguistic Data Consortium, 2010.

Introduction

Czech Broadcast News MDE Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2010T02 and isbn 1-58563-534-0, was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic. It consists of metadata extraction (MDE) annotations for the approximately 26 hours of transcribed broadcast news speech in Czech Broadcast News Transcripts (LDC2004T01). The audio files corresponding to the transcripts in this corpus are contained in Czech Broadcast News Speech (LDC2004S01). Czech Broadcast News MDE Transcripts joins LDC's other holdings of Czech broadcast data: Czech Broadcast Conversation Speech (LDC2009S02), Czech Broadcast Conversation MDE Transcripts (LDC2009T20), Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89) and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53).

The audio recordings were collected from February 1, 2000 through April 22, 2000 from three Czech radio stations (Cesky rozhlas 1 Radiozurnal - CRo1, Cesky rozhlas 2 Praha - CRo2 and Cesky rozhlas 3 Vlatva - CRo3) and two television stations (Ceska televize - CTV and Prima TV). The broadcasts included both public and commercial subjects and were presented in various styles, ranging from a formal style to a colloquial style more typical for commercial broadcast companies that do not primarily focus on news.

The goal of MDE research is to take raw speech recognition output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation, standardized spelling and sensible conventions for representing speaker turns and identity are further elements in the readable transcript.

The transcripts and annotations in this corpus are stored in two formats: QAn (Quick Annotator), and RTTM. Character encoding in all files is ISO-8859-2.

More information can be found on the website Structural Metadata Annotation for Czech.

Sponsorship

The completion of this corpus was facilitated by funding provided by the Ministry of Education of the Czech Republic under projects No. 2C06020 and ME909.

Samples

Available Media

View Fees





Login for the applicable fee