CSR-II (WSJ1) Sennheiser

Item Name:	CSR-II (WSJ1) Sennheiser
Author(s):	Linguistic Data Consortium, NIST Multimodal Information Group, Janet M. Baker
LDC Catalog No.:	LDC94S13B
ISBN:	1-58563-031-4
ISLRN:	418-053-774-232-3
DOI:	https://doi.org/10.35111/5jkw-xt28
Member Year(s):	1994, 1997
DCMI Type(s):	Sound
Sample Type:	1-channel pcm compressed
Sample Rate:	16000
Data Source(s):	microphone speech
Project(s):	DARPA-CSR
Application(s):	speech recognition
Language(s):	English
Language ID(s):	eng
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC94S13B Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Linguistic Data Consortium, NIST Multimodal Information Group, and Janet Baker. CSR-II (WSJ1) Sennheiser LDC94S13B. Web Download. Philadelphia: Linguistic Data Consortium, 1994.
Related Works:	isPartOf: LDC94S13A CSR-II (WSJ1) Complete
	isPartWith: LDC94S13C CSR-II (WSJ1) Other
	isSimilarWith: LDC93S6A CSR-I (WSJ0) Complete; LDC93S6B CSR-I (WSJ0) Sennheiser; LDC93S6C CSR-I (WSJ0) Other; LDC95S23 CSR-III Speech; LDC95T6 CSR-III Text; LDC96S31 CSR-IV HUB4; LDC96S33 CSR-IV HUB3

LDC94S13A - Complete CSR-II corpus

LDC94S13B - CSR-II Sennheiser speech

LDC94S13C - CSR-II Other speech

Data

The complete WSJ1 corpus contains approximately 78,000 training utterances (73 hours of speech), 4,000 of which are the result of spontaneous dictation by journalists with varying degrees of experience in dictation. The corpus contains approximately 8,200 conventional development test utterances (eight hours of speech), 6,800 of which are from spontaneous dictation. As with the pilot corpus, the entire corpus was collected using two microphones, so the amount of speech in the entire corpus is about 162 hours.
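
The 162-hour figure follows from the per-microphone totals quoted above. A small sketch of that arithmetic, assuming the two-microphone collection simply doubles the training and development test hours (the variable names are illustrative):

```python
# Per-microphone hours quoted in the paragraph above.
training_hours = 73
dev_test_hours = 8
microphones = 2  # every utterance was recorded on two microphones

total_hours = (training_hours + dev_test_hours) * microphones
print(total_hours)  # 162
```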

In early 1993, a Hub and Spoke test paradigm was designed, calling for eleven test sets, each a specific variation of the basic or hub condition. The eleven Hub and Spoke Development and Evaluation Test sets each contain approximately 7,500 waveforms (eleven hours of speech).

WSJ1 waveforms have been compressed by about 2:1 using the SPHERE-embedded Shorten compression algorithm developed at Cambridge University.
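
Because the compression is embedded in the SPHERE file itself, the ASCII header at the start of each waveform file records how the samples are coded. The following is a minimal sketch of reading that header in Python, assuming the standard NIST SPHERE layout (a `NIST_1A` line, a header-size line, then `name -type value` fields up to `end_head`). The function names and the sample header are illustrative and are not copied from the corpus.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file."""
    lines = data.split(b"\n", 2)
    if len(lines) < 3 or lines[0] != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(lines[1].strip())  # commonly 1024 bytes
    text = data[:header_size].decode("ascii", errors="replace")
    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type tag, value
        if len(parts) == 3:
            name, type_tag, value = parts
            # "-i" marks an integer field; other fields are kept as strings.
            fields[name] = int(value) if type_tag == "-i" else value
    return fields


def is_shorten_compressed(fields: dict) -> bool:
    """True if the sample_coding field mentions Shorten compression."""
    return "shorten" in str(fields.get("sample_coding", ""))


# Illustrative header, padded to the declared 1024 bytes.
example = (
    b"NIST_1A\n   1024\n"
    b"sample_rate -i 16000\n"
    b"channel_count -i 1\n"
    b"sample_coding -s26 pcm,embedded-shorten-v2.00\n"
    b"end_head\n"
).ljust(1024, b" ")

fields = parse_sphere_header(example)
print(fields["sample_rate"], is_shorten_compressed(fields))
```

This only inspects the header. Decoding the Shorten-compressed samples requires a SPHERE-aware tool, such as the NIST SPHERE utilities or LDC's sph2pipe.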

Updates

Please note that even though the file wsj1/doc/lng_modl/base_lm/tcb20onp.z (WSJ1/DOC/LNG_MODL/BASE_LM/TCB20ONP.Z on a Windows OS) has the .z extension, it is not a compressed file. To use the file, simply ignore the .z extension.
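
Scripts that decompress every .z file automatically may fail on this one. One possible workaround, sketched here as an assumption rather than an LDC-provided procedure, is to confirm that the file does not begin with a known compression magic number and then copy it to a name without the extension. The helper names are hypothetical.

```python
import os
import shutil

# Leading bytes of common Unix compressed formats.
COMPRESSION_MAGIC = {
    b"\x1f\x9d": "compress",
    b"\x1f\x1e": "pack",
    b"\x1f\x8b": "gzip",
}


def detect_compression(path):
    """Return the compressor name if the file starts with a known magic number, else None."""
    with open(path, "rb") as f:
        return COMPRESSION_MAGIC.get(f.read(2))


def copy_without_z(path):
    """Copy an uncompressed file that carries a .z suffix to the same name minus the suffix."""
    root, ext = os.path.splitext(path)
    if ext.lower() != ".z":
        raise ValueError("expected a .z or .Z file")
    if detect_compression(path) is not None:
        raise ValueError("file really is compressed; decompress it instead")
    shutil.copyfile(path, root)
    return root
```

For example, `copy_without_z("wsj1/doc/lng_modl/base_lm/tcb20onp.z")` would write an identical copy named `tcb20onp` in the same directory and leave the original in place.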
