CSR-IV HUB4


Item Name: CSR-IV HUB4
Authors: John Garofolo, Jonathan Fiscus, William Fisher, and David Pallett
LDC Catalog No.: LDC96S31
NIST Catalog No.: 26-1.1, 26-2.1, 26-6.1
ISBN: 1-58563-087-X
Data Type: speech
Sample Rate: 16000 Hz
Sampling Format: 1-channel pcm
Data Source(s): broadcast news
Project(s): DARPA-CSR
Application(s): speech recognition
Language(s): English
Language ID(s): eng
Distribution: 1 DVD
Member Fee: $0 for 1996 members
Non-member Fee: US $4000.00
Reduced-License Fee: US $2000.00
Extra-Copy Fee: US $450.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: John Garofolo, et al. 1996. CSR-IV HUB4. Linguistic Data Consortium, Philadelphia.

This set of CD-ROMs contains all of the speech data provided to sites participating in the DARPA CSR November 1995 HUB4 (Radio) Broadcast News tests. The data consists of digitized waveforms of MarketPlace (tm) business news radio shows, provided by KUSC through an agreement with the Linguistic Data Consortium, together with detailed transcriptions of those broadcasts. The software that NIST used to process and score the output of the test systems is also included.

The data is organized as follows:

CD26-1: Training Data. Ten complete half-hour broadcasts with minimally verified transcripts. The transcripts are time-aligned with the waveforms at the story-boundary level.

CD26-2: Development-Test Data. Six complete half-hour broadcasts with verified transcripts. The transcripts are time-aligned with the waveforms at the story- and turn-boundary level. Index files are included that specify how the data may be partitioned into two test sets.

CD26-6: Evaluation-Test Data. Five complete half-hour broadcasts with verified/adjudicated transcripts. The transcripts are time-aligned with the waveforms at the story-, turn-, and music-boundary level. An index file is included that specifies how the data was partitioned into the test set used in the CSR 1995 HUB4 tests.
