West Point Russian Speech

Item Name:	West Point Russian Speech
Author(s):	Stephen A. LaRocca, Christine Tomei
LDC Catalog No.:	LDC2003S05
ISBN:	1-58563-277-5
ISLRN:	741-782-638-900-9
DOI:	https://doi.org/10.35111/7rt8-8x28
Release Date:	December 18, 2003
Member Year(s):	2003
DCMI Type(s):	Sound
Sample Type:	1-channel pcm
Sample Rate:	22050
Data Source(s):	microphone speech
Application(s):	speech recognition
Language(s):	Russian
Language ID(s):	rus
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2003S05 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	LaRocca, Stephen A., and Christine Tomei. West Point Russian Speech LDC2003S05. Web Download. Philadelphia: Linguistic Data Consortium, 2003.
Related Works (isSimilarWith):	LDC2002S02 West Point Arabic Speech; LDC2005S28 West Point Croatian Speech; LDC2005S30 West Point Company G3 American English Speech; LDC2006S37 West Point Heroico Spanish Speech; LDC2006S36 West Point Korean Speech; LDC2008S04 West Point Brazilian Portuguese Speech
Related Works (relatesTo):	LDC2008S08 LDC Spoken Language Sampler; LDC2013S06 LDC Spoken Language Sampler - Second Release

Introduction

West Point Russian Speech was developed at the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) at the United States Military Academy at West Point. The purpose of the corpus is to provide a set of recordings for the training and development of speaker-independent speech recognition systems for use by West Point cadets enrolled in the Russian language program.

Data

The corpus consists of 4,181 speech files in SPHERE format, totalling approximately four hours of speech. Of these, 2,290 files are from native informants and 1,891 are from non-native informants.

The following tables show the breakdown of corpus content in terms of male, female, native and non-native speakers.

Number of speakers:

	male	female	total
native	13	16	29
non-native	16	10	26
totals	29	26	55

Number of speech files:

	male	female	total
native	1027	1263	2290
non-native	1103	788	1891
totals	2130	2051	4181

The speech data was collected using laptop computers running Windows NT. Recordings were captured as 16-bit PCM at a sampling rate of 22,050 Hz using a Shure SM10A microphone and a RANE Model MS1 pre-amplifier. For each prompt, the informant was shown the sentence on screen and heard a digital recording of it read by a native speaker. The informant then pressed the Enter key to record the utterance. The recording was played back for review, and the utterance was re-recorded if necessary.
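As an illustration of how the recording parameters above might be checked in a SPHERE file, the following minimal Python sketch parses the plain-text NIST SPHERE header. It assumes the standard header layout (a `NIST_1A` line, a header-size line, then `name -type value` lines ending with `end_head`) and the conventional field names `sample_rate`, `sample_n_bytes`, `channel_count` and `sample_count`. The demo header is synthetic; whether the corpus files carry exactly these fields, or use compressed sample coding, is not stated here.

```python
import io


def read_sphere_header(stream):
    """Parse the text header of a NIST SPHERE file into a dict.

    Returns (fields, header_size). The stream must be positioned at byte 0.
    """
    magic = stream.readline().strip()
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(stream.readline().strip())

    # Re-read the whole header block, which is padded to header_size bytes.
    stream.seek(0)
    text = stream.read(header_size).decode("ascii", errors="replace")

    fields = {}
    for line in text.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comments
        name, kind, value = line.split(None, 2)
        if kind == "-i":
            fields[name] = int(value)
        elif kind == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N characters
            fields[name] = value
    return fields, header_size


def make_demo_header():
    """Build a synthetic 1024-byte header (illustrative values only)."""
    body = (
        b"NIST_1A\n   1024\n"
        b"channel_count -i 1\n"
        b"sample_rate -i 22050\n"
        b"sample_n_bytes -i 2\n"
        b"sample_count -i 44100\n"
        b"end_head\n"
    )
    return body.ljust(1024, b" ")


fields, size = read_sphere_header(io.BytesIO(make_demo_header()))
duration = fields["sample_count"] / fields["sample_rate"]  # seconds
print(fields["sample_rate"], fields["sample_n_bytes"] * 8, duration)
```

For a real file, the same function could be called on a file object opened in binary mode; the audio samples begin at byte offset `header_size`.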

The collection script consists of 96 sentences with a total of 528 tokens and 351 types.

Each waveform file has word-level and monophone-level transcriptions in HTK master label file (MLF) format. A concatenated version of the master label files is provided at both the word level and the phone level.
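A concatenated HTK master label file could be read along the following lines. This is a minimal sketch that assumes the usual MLF layout: a `#!MLF!#` header line, then for each utterance a quoted file pattern, one label per line (optionally preceded by start and end times in 100 ns units), and a terminating line containing a single period. The function name `parse_mlf`, the file name `*/example.lab` and the labels in the demo text are invented for illustration and are not taken from the corpus.

```python
def parse_mlf(text):
    """Parse HTK master label file text.

    Returns a dict mapping each file pattern to a list of
    (start, end, label) tuples; start/end are None when absent.
    """
    lines = text.splitlines()
    if not lines or lines[0].strip() != "#!MLF!#":
        raise ValueError("missing #!MLF!# header")

    entries = {}
    current = None  # label list of the entry being read, if any
    for raw in lines[1:]:
        line = raw.strip()
        if not line:
            continue
        if current is None:
            # A new entry starts with a quoted file pattern.
            current = entries.setdefault(line.strip('"'), [])
        elif line == ".":
            # A lone period closes the current entry.
            current = None
        else:
            parts = line.split()
            times = []
            # Up to two leading integers are start and end times.
            while len(parts) > 1 and len(times) < 2 and parts[0].isdigit():
                times.append(int(parts.pop(0)))
            start = times[0] if times else None
            end = times[1] if len(times) > 1 else None
            current.append((start, end, parts[0]))
    return entries


# Hypothetical word-level MLF text with invented labels and times.
DEMO_MLF = """#!MLF!#
"*/example.lab"
0 3000000 da
3000000 7500000 net
.
"""

entries = parse_mlf(DEMO_MLF)
for start, end, label in entries["*/example.lab"]:
    # Convert HTK 100 ns units to seconds.
    print(label, start / 1e7, end / 1e7)
```

The same function would apply to a phone-level file, since only the labels differ.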

The lexicon contains 690 distinct orthographic word forms, including all words found in the collection script.
