West Point Croatian Speech
Item Name: | West Point Croatian Speech |
Author(s): | Stephen A. LaRocca, Christine Tomei, Milan Sokolich |
LDC Catalog No.: | LDC2005S28 |
ISBN: | 1-58563-359-3 |
ISLRN: | 531-836-688-808-6 |
DOI: | https://doi.org/10.35111/e542-fj42 |
Release Date: | October 15, 2005 |
Member Year(s): | 2005 |
DCMI Type(s): | Sound, Text |
Sample Type: | pcm |
Sample Rate: | 22050 |
Data Source(s): | microphone speech |
Language(s): | Croatian |
Language ID(s): | hrv |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2005S28 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | LaRocca, Stephen A., Christine Tomei, and Milan Sokolich. West Point Croatian Speech LDC2005S28. Web Download. Philadelphia: Linguistic Data Consortium, 2005. |
Related Works: | View |
Introduction
West Point Croatian Speech was developed by the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) at the United States Military Academy at West Point. It contains approximately 21 hours of read and free-response Croatian speech.
The corpus was collected by staff and faculty of DFL and CTELL to develop acoustic models for speech recognition systems. The US government uses these systems to provide language-learning courseware enhanced with speech recognition to government linguists and to students enrolled in various government language programs. In addition, parts of this corpus were designed to model question-answer dialogues for use in domain-specific speech-to-speech translation systems.
The corpus consists of two subcorpora collected in 2000 and 2001 in Zagreb, Croatia. Informants were recruited from the English department at the University of Zagreb and from the Croatian Military Academy. The 2000 subcorpus consists entirely of read speech, while the 2001 subcorpus includes free-response answers to questions in addition to read speech.
Data
The read speech in the two subcorpora was elicited from two different prompt scripts. The scripts used to record read speech contain a total of 6,329 distinct sentences. Each informant in 2000 attempted to read 100 sentences from a total of 200 carefully designed sentences written by Dr. Christine Tomei. Informants in 2001 read short text passages extracted from Croatian-language webpages. The script used to elicit free-response answers contains 143 questions. Each speaker in the 2001 subcorpus attempted to record 105 utterances by reading 75 sentences and giving 35 free-response answers to 35 questions.
These recordings were transcribed by Milan Sokolich, who also wrote a pronunciation lexicon that includes grammatical tags.
Speech data was collected on Pentium 450 MHz laptop computers running Windows 2000, at a sample size of 16 bits and a sampling rate of 22,050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was then played back for review, and the utterance could be re-recorded if necessary. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment.
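As an illustration of the recording format described above, the following minimal Python sketch checks whether an audio file matches the stated 16-bit, 22,050 Hz parameters. It assumes the file is a standard RIFF WAV file readable by Python's built-in `wave` module; the function name `check_wav_format` and the returned field names are hypothetical and are not part of the corpus documentation.

```python
import wave

EXPECTED_RATE_HZ = 22050      # sampling rate stated in the corpus description
EXPECTED_SAMPLE_WIDTH = 2     # 16-bit samples occupy 2 bytes


def check_wav_format(source):
    """Compare a WAV file's header against the documented recording format.

    `source` may be a file path or a binary file-like object.
    Returns (ok, details), where `ok` is True only if both the sampling
    rate and the sample width match the documented values.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        frames = wav.getnframes()

    details = {
        "sample_rate_hz": rate,
        "bits_per_sample": width * 8,
        "duration_s": frames / rate if rate else 0.0,
    }
    ok = rate == EXPECTED_RATE_HZ and width == EXPECTED_SAMPLE_WIDTH
    return ok, details
```

Summing `duration_s` over every file in a local copy of the corpus would give a rough check against the approximately 21 hours of speech reported in the introduction.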
Samples
For an example of the data in this corpus, please listen to this sample (WAV).
Updates
None at this time.