THE WEST POINT CROATIAN SPEECH
                              CORPUS

   The Center For Technology Enhanced Language Learning
                  United States Military Academy
                 Department Of Foreign Languages
                          745 Brewerton Road
                        West Point, NY 10996
                       Email: gj8285@usma.edu
                         Phone: 845-938-4077
                           Fax: 845-938-3585

                           September 4, 2003

   The West Point Croatian Speech Corpus is a database of digital recordings
of spoken Croatian. It was collected by staff and faculty of the Department
of Foreign Languages (DFL) and the Center for Technology Enhanced Language
Learning (CTELL) to develop acoustic models for speech recognition systems.
The US government uses these systems to provide speech-recognition-enhanced
language learning courseware to government linguists and students enrolled
in various government language programs. In addition, parts of this corpus
were designed to model question-answer dialogues for use in domain-specific
speech-to-speech translation systems.
   The corpus consists of two subcorpora collected in 2000 and 2001 in
Zagreb, Croatia. Informants were recruited from the English department at
the University of Zagreb and the Croatian Military Academy. The 2000
subcorpus consists entirely of read speech, while the 2001 subcorpus
includes free response answers to questions in addition to read speech.
   The read speech in the two subcorpora was elicited from two different
prompt scripts. Each informant in 2000 attempted to read 100 sentences from
a total of 200 carefully designed sentences. These sentences were written
by Christine Tomei. Dr. Tomei's design analysis can be found in the file
design-2000.txt. Informants in 2001 read short text passages extracted from
Croatian-language webpages. Together, the scripts used to record read speech
contain a total of 6329 distinct sentences. The read speech prompts are listed
in the files read-200[01].txt in the transcripts directory. Each line of these
files has two fields separated by a tab: the first gives the base name of the
waveform file, and the second gives the prompt used in recording the utterance.
The read speech data are stored under the Recordings Croatian directory.
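
   The two-field prompt format described above can be read with a few lines
of Python. This is a minimal sketch, not part of the corpus tools: the
function name and the sample base names and prompts are hypothetical, and the
text encoding of the real files is not stated here, so it is left to the
caller.

```python
import io

def parse_prompt_lines(lines):
    """Return (basename, prompt) pairs from tab-separated prompt lines."""
    entries = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the first tab only, so tabs inside a prompt are kept.
        basename, sep, prompt = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator in line: {line!r}")
        entries.append((basename.strip(), prompt.strip()))
    return entries

# Hypothetical sample lines; real entries come from the read-2000.txt and
# read-2001.txt files in the transcripts directory.
sample = io.StringIO("s0001\tdobar dan\ns0002\thvala lijepo\n")
entries = parse_prompt_lines(sample)
for basename, prompt in entries:
    print(basename, "->", prompt)
```

   To read a real prompt file, pass an open file object instead of the
sample, choosing the encoding that matches the distributed files.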
   The script used to elicit free response answers contains 143 questions.
The text that was actually presented to the informants is in the file named
questions.txt in the transcripts directory. Data recorded from these prompts
are stored in the Answers Croatian directory.
   The human-performed transcriptions of the informants' answers are listed
in the answers.txt file in the transcripts directory. Again, each line of this
file has two fields separated by a tab. The first field contains two numbers
separated by a slash (/). The first number is an identification index for the
speaker. The second number is an index to the question. The second field
on the line contains a word-level transcription of the informant's answer to
the question indexed by the second number in the first field. So, for example,
in the line:
   1/15 eh roena je u splitu
   eh roena je u splitu is a transcription of the response speaker 1 gave to
question 15. The corresponding waveform is stored in the file 15.wav in
the directory Answers Croatian1.
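
   As an illustration of the answers.txt layout, the sketch below splits one
line into its speaker index, question index, and transcription, and derives
the waveform file name from the question index. The names Answer,
parse_answer_line, and wav_name are hypothetical. Only the file name is
built, since the speaker index selects the directory and the exact directory
layout should be checked against the distribution.

```python
from typing import NamedTuple

class Answer(NamedTuple):
    speaker: int        # identification index of the speaker
    question: int       # index of the question that was answered
    transcription: str  # word-level transcription of the answer

def parse_answer_line(line):
    """Parse one 'speaker/question<TAB>transcription' line."""
    key, sep, text = line.rstrip("\r\n").partition("\t")
    if not sep:
        raise ValueError(f"no tab separator in line: {line!r}")
    speaker_str, slash, question_str = key.partition("/")
    if not slash:
        raise ValueError(f"no slash in first field: {key!r}")
    return Answer(int(speaker_str), int(question_str), text.strip())

def wav_name(answer):
    """File name of the recording, taken from the question index."""
    return f"{answer.question}.wav"

# The example line from the text, with the two fields separated by a tab.
answer = parse_answer_line("1/15\teh roena je u splitu")
print(answer.speaker, answer.question, wav_name(answer))
```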
   These recordings were transcribed by Milan Sokolich. Mr. Sokolich also
wrote a pronouncing dictionary that includes grammatical tags. His work
is stored in the file named raw-lexicon.txt. The file lexicon.txt contains a
processed version of the raw-lexicon.txt file.
   Each speaker in the 2001 subcorpus attempted to record 105 utterances
by reading 75 sentences and giving 35 free response answers to 35 questions.
   Speech data were collected using Pentium 450 MHz laptop computers
running Windows 2000, with a 16-bit sample size and a sampling rate of
22050 Hz. The recording script presented a visual display of the sentence to
be recorded. The informant pressed a key and spoke the sentence. The
recording was then played back for review, allowing the utterance to be
re-recorded. A member of the data collection team was on hand during the
recording session to verify recordings and provide technical assistance in
case of malfunctioning equipment.
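
   The recording parameters given above (16-bit samples at 22050 Hz) can be
checked against a waveform file with Python's standard wave module. This is
a minimal sketch under stated assumptions: the function names are
hypothetical, the number of channels is not stated in this document and so
is not checked, and the test signal is synthetic silence generated in memory
rather than corpus data.

```python
import io
import wave

EXPECTED_SAMPLE_WIDTH = 2    # bytes per sample, i.e. 16 bits
EXPECTED_FRAME_RATE = 22050  # samples per second (Hz)

def matches_corpus_format(wav_file):
    """True if the WAV has 16-bit samples at 22050 Hz.

    wav_file may be a path or a binary file-like object.
    """
    with wave.open(wav_file, "rb") as w:
        return (w.getsampwidth() == EXPECTED_SAMPLE_WIDTH
                and w.getframerate() == EXPECTED_FRAME_RATE)

def make_silence(n_frames, frame_rate=EXPECTED_FRAME_RATE, channels=1):
    """Build an in-memory 16-bit WAV of silence (mono is an assumption)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(channels)
        w.setsampwidth(EXPECTED_SAMPLE_WIDTH)
        w.setframerate(frame_rate)
        w.writeframes(b"\x00\x00" * channels * n_frames)
    buf.seek(0)  # rewind so the buffer can be read back
    return buf

ok = matches_corpus_format(make_silence(2205))
bad = matches_corpus_format(make_silence(2205, frame_rate=16000))
print(ok, bad)
```

   For a real recording, pass the path of a .wav file from the corpus to
matches_corpus_format instead of the synthetic buffer.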
