THE WEST POINT CROATIAN SPEECH CORPUS

The Center for Technology Enhanced Language Learning
United States Military Academy
Department of Foreign Languages
745 Brewerton Road
West Point, NY 10996
Email: gj8285@usma.edu
Phone: 845-938-4077
Fax: 845-938-3585

September 4, 2003

The West Point Croatian Speech Corpus is a database of digital recordings of spoken Croatian. It was collected by staff and faculty of the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) to develop acoustic models for speech recognition systems. The US government uses these systems to provide speech-recognition-enhanced language learning courseware to government linguists and to students enrolled in various government language programs. In addition, parts of this corpus were designed to model question-answer dialogues for use in domain-specific speech-to-speech translation systems.

The corpus consists of two subcorpora collected in 2000 and 2001 in Zagreb, Croatia. Informants were recruited from the English department at the University of Zagreb and from the Croatian Military Academy. The 2000 subcorpus consists entirely of read speech, while the 2001 subcorpus includes free response answers to questions in addition to read speech.

The read speech in the two subcorpora was elicited from two different prompt scripts. Each informant in 2000 attempted to read 100 sentences from a total of 200 carefully designed sentences. These sentences were written by Christine Tomei. Dr. Tomei's design analysis can be found in the file design-2000.txt. Informants in 2001 read short text passages extracted from Croatian language webpages. Thus the scripts used to record read speech contain a total of 6329 distinct sentences. The read speech prompts are listed in the files read-200[01].txt (that is, read-2000.txt and read-2001.txt) in the transcripts directory.
Each line of these files has two fields separated by a tab: the first gives the base name of the waveform file, and the second gives the prompt used in recording the utterance. The read speech data are stored under the Recordings Croatian directory.

The script used to elicit free response answers contains 143 questions. The text that was actually presented to the informants is in the file named questions.txt in the transcripts directory. Data recorded from these prompts are stored in the Answers Croatian directory. The human-performed transcriptions of the informants' answers are listed in the answers.txt file in the transcripts directory. Again, each line of this file has two fields separated by a tab. The first field contains two numbers separated by a slash (/). The first number is an identification index for the speaker. The second number is an index to the question. The second field on the line contains a word-level transcription of the informant's answer to the question indexed by the second number in the first field. For example, in the line

    1/15    eh roena je u splitu

the text "eh roena je u splitu" is a transcription of the response that speaker 1 gave to question 15. The corresponding waveform is stored in the file 15.wav in the directory Answers Croatian1.

These recordings were transcribed by Milan Sokolich. Mr. Sokolich also wrote a pronouncing dictionary that includes grammatical tags. His work is stored in the file named raw-lexicon.txt. The file lexicon.txt contains a processed version of the raw-lexicon.txt file.

Each speaker in the 2001 subcorpus attempted to record 105 utterances by reading 75 sentences and giving 35 free response answers to 35 questions.

Speech data was collected using Pentium 450 MHz laptop computers running Windows 2000, with a 16-bit sample size and a sampling rate of 22050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence.
The recording was then played back for review, allowing the utterance to be re-recorded if necessary. A member of the data collection team was on hand during the recording session to verify recordings and to provide technical assistance in case of malfunctioning equipment.
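
The file formats described above can be read with a few lines of code. The following Python sketch is an illustration only and is not part of the corpus distribution. The names ReadPrompt, Answer, parse_read_prompt, parse_answer and matches_recording_format are hypothetical. The sketch assumes that each transcript line has already been decoded to text (the character encoding of the transcript files is not stated here) and that the waveform files are standard RIFF WAV files readable by Python's wave module.

```python
import wave
from typing import NamedTuple


class ReadPrompt(NamedTuple):
    """One line of a read-speech prompt file (hypothetical name)."""
    basename: str  # base name of the waveform file
    prompt: str    # sentence shown to the informant


class Answer(NamedTuple):
    """One line of answers.txt (hypothetical name)."""
    speaker: int        # identification index of the speaker
    question: int       # index of the question
    transcription: str  # word-level transcription of the answer


def parse_read_prompt(line: str) -> ReadPrompt:
    # Two tab-separated fields: waveform base name, then the prompt text.
    basename, prompt = line.rstrip("\r\n").split("\t", 1)
    return ReadPrompt(basename, prompt)


def parse_answer(line: str) -> Answer:
    # Two tab-separated fields: "speaker/question", then the transcription.
    key, transcription = line.rstrip("\r\n").split("\t", 1)
    speaker, question = key.split("/", 1)
    return Answer(int(speaker), int(question), transcription)


def matches_recording_format(path: str) -> bool:
    # True if the file uses 16-bit samples (2 bytes) at 22050 Hz.
    with wave.open(path, "rb") as wav:
        return wav.getsampwidth() == 2 and wav.getframerate() == 22050
```

For the example line given earlier, parse_answer("1/15\teh roena je u splitu") returns an Answer with speaker 1, question 15 and the transcription "eh roena je u splitu". Splitting on the first tab only keeps the transcription intact even if it happens to contain further tab characters.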