West Point Croatian Speech

Item Name: West Point Croatian Speech
Author(s): Stephen A. LaRocca, Christine Tomei, Milan Sokolich
LDC Catalog No.: LDC2005S28
ISBN: 1-58563-359-3
ISLRN: 531-836-688-808-6
DOI: https://doi.org/10.35111/e542-fj42
Release Date: October 15, 2005
Member Year(s): 2005
DCMI Type(s): Sound, Text
Sample Type: pcm
Sample Rate: 22050
Data Source(s): microphone speech
Language(s): Croatian
Language ID(s): hrv
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2005S28 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: LaRocca, Stephen A., Christine Tomei, and Milan Sokolich. West Point Croatian Speech LDC2005S28. Web Download. Philadelphia: Linguistic Data Consortium, 2005.


West Point Croatian Speech was developed by the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) at the United States Military Academy at West Point. It contains approximately 21 hours of read and free-response Croatian speech.

The corpus was collected by staff and faculty of DFL and CTELL to develop acoustic models for speech recognition systems. The US government uses these systems to provide language-learning courseware enhanced with speech recognition to government linguists and to students enrolled in various government language programs. In addition, parts of the corpus were designed to model question-answer dialogues for use in domain-specific speech-to-speech translation systems.

The corpus consists of two subcorpora collected in 2000 and 2001 in Zagreb, Croatia. Informants were recruited from the English department at the University of Zagreb and from the Croatian Military Academy. The 2000 subcorpus consists entirely of read speech, while the 2001 subcorpus includes free-response answers to questions in addition to read speech.


The read speech in the two subcorpora was elicited from two different prompt scripts, which together contain 6,329 distinct sentences. Each informant in 2000 attempted to read 100 sentences drawn from a set of 200 carefully designed sentences written by Dr. Christine Tomei. Informants in 2001 read short text passages extracted from Croatian-language webpages. The script used to elicit free-response answers contains 143 questions. Each speaker in the 2001 subcorpus attempted to record 105 utterances by reading 75 sentences and giving 35 free-response answers to 35 questions.

The recordings were transcribed by Milan Sokolich, who also wrote a pronunciation lexicon that includes grammatical tags.

Speech data was collected on 450 MHz Pentium laptop computers running Windows 2000, at a 16-bit sample size and a sampling rate of 22,050 Hz. The recording script displayed the sentence to be recorded; the informant pressed a key and spoke the sentence. The recording was then played back for review, and the utterance could be re-recorded if necessary. A member of the data collection team was on hand during each recording session to verify recordings and to provide technical assistance if equipment malfunctioned.
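As an illustration of the recording format described above, the following Python sketch checks whether an audio file has a 16-bit sample size and a 22,050 Hz sampling rate. It assumes the recordings are stored as standard WAV files readable by Python's built-in wave module; the function name check_recording is hypothetical and not part of the corpus distribution.

```python
import wave

# Values taken from the corpus description; everything else is illustrative.
EXPECTED_RATE_HZ = 22050    # sampling rate of 22,050 Hz
EXPECTED_SAMPLE_BYTES = 2   # 16-bit samples = 2 bytes per sample


def check_recording(path):
    """Return (ok, info) for a WAV file.

    ok is True when the file matches the stated 16-bit, 22,050 Hz format.
    info holds the observed rate, sample width, and duration in seconds.
    Assumption: the file is an uncompressed PCM WAV file.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()

    info = {
        "rate_hz": rate,
        "sample_bytes": width,
        "duration_s": frames / rate if rate else 0.0,
    }
    ok = rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_BYTES
    return ok, info
```

Calling check_recording on the path of a recording returns a flag and the observed parameters; summing duration_s over all files would give a rough check against the stated total of approximately 21 hours.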




