West Point Croatian Speech

Item Name: West Point Croatian Speech
Author(s): Stephen LaRocca, Christine Tomei, Milan Sokolich
LDC Catalog No.: LDC2005S28
ISBN: 1-58563-359-3
ISLRN: 531-836-688-808-6
Release Date: October 15, 2005
Member Year(s): 2005
DCMI Type(s): Sound
Sample Type: pcm
Sample Rate: 22050
Data Source(s): microphone speech
Language(s): Croatian
Language ID(s): hrv
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2005S28 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: LaRocca, Stephen, Christine Tomei, and Milan Sokolich. West Point Croatian Speech LDC2005S28. Web Download. Philadelphia: Linguistic Data Consortium, 2005.


This file contains documentation on West Point Croatian Speech, Linguistic Data Consortium (LDC) catalog number LDC2005S28 and ISBN 1-58563-359-3.

West Point Croatian Speech is a database of digital recordings of spoken Croatian. It was collected by staff and faculty of the Department of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL) to develop acoustic models for speech recognition systems. The US government uses these systems to provide speech-recognition-enhanced language learning courseware to government linguists and students enrolled in various government language programs. In addition, parts of this corpus were designed to model question-answer dialogues for use in domain-specific speech-to-speech translation systems.

The corpus consists of two subcorpora collected in 2000 and 2001 in Zagreb, Croatia. Informants were recruited from the English department at the University of Zagreb and the Croatian Military Academy. The 2000 subcorpus consists entirely of read speech, while the 2001 subcorpus includes free-response answers to questions in addition to read speech.

The read speech in the two subcorpora was elicited from two different prompt scripts. Each informant in 2000 attempted to read 100 sentences from a total of 200 carefully designed sentences. These sentences were written by Christine Tomei. Dr. Tomei's design analysis can be found in the file design-2000.txt. Informants in 2001 read short text passages extracted from Croatian language webpages. In total, the scripts used to record read speech contain 6,329 distinct sentences. The read speech prompts are listed in the files read-200[01].txt in the transcripts directory. Each line of these files has two fields separated by a tab: the first is the base name of the waveform file, and the second is the prompt used in recording the utterance. The read speech data are stored under the Recordings Croatian directory.
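The two-field layout of the prompt files can be read with a few lines of Python. This is a minimal sketch, not a tool shipped with the corpus: the function name parse_prompts and the sample lines are hypothetical, and the text encoding of the files is not stated here, so the function takes already-decoded lines and leaves file opening to the caller.

```python
from typing import Dict, Iterable


def parse_prompts(lines: Iterable[str]) -> Dict[str, str]:
    """Map waveform base name -> prompt text from tab-separated lines."""
    prompts: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so the prompt text is kept whole.
        base_name, sep, prompt = line.partition("\t")
        if not sep:
            raise ValueError(f"expected a tab-separated line, got: {line!r}")
        prompts[base_name.strip()] = prompt.strip()
    return prompts


# Hypothetical sample lines; real base names and prompts come from
# read-2000.txt and read-2001.txt in the transcripts directory.
sample = ["s001\tdobar dan\n", "s002\tkako ste\n"]
prompts = parse_prompts(sample)
print(prompts["s001"])
```

Splitting on the first tab only is a deliberate choice: it keeps the prompt intact even if a prompt happens to contain a tab of its own.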

The script used to elicit free response answers contains 143 questions. The text that was actually presented to the informants is in the file named questions.txt in the transcripts directory. Data recorded from these prompts are stored in the Answers Croatian directory.

The human-performed transcriptions of the informants' answers are listed in the answers.txt file in the transcripts directory. Again, each line of this file has two fields separated by a tab. The first field contains two numbers separated by a slash: the first number is an identification index for the speaker, and the second number is an index to the question. The second field contains a word-level transcription of the informant's answer to that question. So, for example, in the line:

    1/15	eh roena je u splitu

the text "eh roena je u splitu" is a transcription of the response that speaker 1 gave to question 15. The corresponding waveform is stored in the file 15.wav in the directory Answers Croatian1.
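The structure of an answers.txt line can be illustrated with a short Python sketch. The names Answer and parse_answer_line are hypothetical, not part of the corpus. The sketch derives only the waveform file name (for example 15.wav) and does not build a full directory path.

```python
from typing import NamedTuple


class Answer(NamedTuple):
    speaker: int        # identification index for the speaker
    question: int       # index to the question
    transcription: str  # word-level transcription of the answer
    wav_name: str       # waveform file name, e.g. "15.wav"


def parse_answer_line(line: str) -> Answer:
    """Parse one 'speaker/question<TAB>transcription' line."""
    key, tab, text = line.rstrip("\r\n").partition("\t")
    if not tab:
        raise ValueError(f"expected a tab-separated line, got: {line!r}")
    speaker_str, slash, question_str = key.partition("/")
    if not slash:
        raise ValueError(f"expected 'speaker/question', got: {key!r}")
    speaker = int(speaker_str)
    question = int(question_str)
    return Answer(speaker, question, text.strip(), f"{question}.wav")


# The example line from the documentation above.
answer = parse_answer_line("1/15\teh roena je u splitu")
print(answer.speaker, answer.question, answer.wav_name)
```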

These recordings were transcribed by Milan Sokolich. Mr. Sokolich also wrote a pronouncing dictionary that includes grammatical tags. His work is stored in the file named raw-lexicon.txt. The file lexicon.txt contains a processed version of the raw-lexicon.txt file.

Each speaker in the 2001 subcorpus attempted to record 105 utterances by reading 75 sentences and giving 35 free response answers to 35 questions.

Speech data was collected using Pentium 450 MHz laptop computers running Windows 2000, with a 16-bit sample size and a sampling rate of 22,050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was then played back for review, allowing the utterance to be re-recorded if necessary. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment.
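A user may want to confirm that a waveform file matches the stated recording format (16-bit samples at 22,050 Hz). The following is a minimal sketch using Python's standard wave module. It assumes the files are standard RIFF WAV files containing PCM data. The function names are hypothetical, and because the channel count is not stated in this documentation, it is not checked. The demonstration builds a silent file in memory rather than reading corpus data.

```python
import io
import wave

EXPECTED_RATE_HZ = 22050         # sampling rate stated in the documentation
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16-bit samples


def matches_corpus_format(source) -> bool:
    """Return True if a WAV file (path or file object) has the stated format."""
    with wave.open(source, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
        )


def make_silent_wav(rate_hz: int, width_bytes: int = 2) -> io.BytesIO:
    """Build a short silent WAV in memory, for demonstration only."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)  # assumption: mono; not stated in the documentation
        wav.setsampwidth(width_bytes)
        wav.setframerate(rate_hz)
        wav.writeframes(b"\x00" * width_bytes * 100)
    buffer.seek(0)
    return buffer


ok = matches_corpus_format(make_silent_wav(22050))
bad = matches_corpus_format(make_silent_wav(16000))
print(ok, bad)
```

To check a real file, pass its path to matches_corpus_format in place of the in-memory buffer.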


For an example of the speech in this corpus, please listen to this audio sample.
