Czech Broadcast News MDE Transcripts, Linguistic Data Consortium (LDC) catalog number LDC2010T02 and ISBN 1-58563-534-0, was prepared by researchers at the University of West Bohemia, Pilsen, Czech Republic. It consists of metadata extraction (MDE) annotations for the approximately 26 hours of transcribed broadcast news speech in Czech Broadcast News Transcripts (LDC2004T01). The audio files corresponding to the transcripts in this corpus are contained in Czech Broadcast News Speech (LDC2004S01). Czech Broadcast News MDE Transcripts joins LDC's other holdings of Czech broadcast data: Czech Broadcast Conversation Speech (LDC2009S02), Czech Broadcast Conversation MDE Transcripts (LDC2009T20), Voice of America (VOA) Czech Broadcast News Audio (LDC2000S89) and Voice of America (VOA) Czech Broadcast News Transcripts (LDC2000T53).
The audio recordings were collected from February 1, 2000 through April 22, 2000 from three Czech radio stations (Cesky rozhlas 1 Radiozurnal - CRo1, Cesky rozhlas 2 Praha - CRo2 and Cesky rozhlas 3 Vltava - CRo3) and two television stations (Ceska televize - CTV and Prima TV). The broadcasts included both public and commercial subjects and were presented in various styles, ranging from a formal style to a colloquial style more typical of commercial broadcast companies that do not primarily focus on news.
The goal of MDE research is to take raw speech recognition output and refine it into forms that are of more use to humans and to downstream automatic processes. In simple terms, this means the creation of automatic transcripts that are maximally readable. This readability might be achieved in a number of ways: removing non-content words like filled pauses and discourse markers from the text; removing sections of disfluent speech; and creating boundaries between natural breakpoints in the flow of speech so that each sentence or other meaningful unit of speech might be presented on a separate line within the resulting transcript. Natural capitalization, punctuation, standardized spelling and sensible conventions for representing speaker turns and identity are further elements in the readable transcript.
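The cleanup steps described above can be illustrated with a minimal Python sketch. It assumes a token stream that already carries MDE-style labels; the Token class, the label names (filler, discourse_marker, edit) and the ends_unit flag are hypothetical and are not taken from the corpus annotation scheme.

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    label: str = "word"      # hypothetical labels: word, filler, discourse_marker, edit
    ends_unit: bool = False  # True if a sentence-like unit ends after this token

# Labels treated as non-content or disfluent material (an assumption of this sketch).
DROP = {"filler", "discourse_marker", "edit"}

def render_readable(tokens):
    """Drop non-content tokens and emit one sentence-like unit per line."""
    lines, current = [], []
    for tok in tokens:
        if tok.label not in DROP:
            current.append(tok.text)
        if tok.ends_unit:
            if current:
                lines.append(" ".join(current))
            current = []
    if current:  # flush a trailing unit that has no explicit boundary
        lines.append(" ".join(current))
    return lines

tokens = [
    Token("uh", "filler"),
    Token("we", "edit"),  # repeated word, marked as disfluent
    Token("we"), Token("went"), Token("home", ends_unit=True),
    Token("you know", "discourse_marker"),
    Token("it"), Token("rained", ends_unit=True),
]
print("\n".join(render_readable(tokens)))
```

Capitalization, punctuation and speaker-turn formatting are left out of this sketch; they would be applied to the lines it returns.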
The transcripts and annotations in this corpus are stored in two formats: QAn (Quick Annotator) and RTTM. Character encoding in all files is ISO-8859-2.
More information can be found on the website Structural Metadata Annotation for Czech.
The completion of this corpus was facilitated by funding provided by the Ministry of Education of the Czech Republic under projects No. 2C06020 and ME909.
- Quick Annotator Transcript
- RTTM Annotation
Portions © 2000 Ceska televize, © 2000 Cesky rozhlas 1 Radiozurnal, © 2000 Cesky rozhlas 2 Praha, © 2000 Cesky rozhlas 3 Vltava, © 2000 FTV Premiera, © 2004, 2010 Trustees of the University of Pennsylvania