22 Language Corpus
Release Version 1.2
Center for Spoken Language Understanding
UPDATED: 3 June 2002

Overview
--------

The 22 Language Corpus consists of telephone speech from 21 languages:
Eastern Arabic, Cantonese, Czech, Farsi, German, Hindi, Hungarian,
Japanese, Korean, Malay, Mandarin, Italian, Polish, Portuguese, Russian,
Spanish, Swedish, Swahili, Tamil, Vietnamese, and English. The corpus
contains fixed-vocabulary utterances (e.g. days of the week) as well as
fluent continuous speech. Each of the 50191 utterances has been verified
by native speakers to determine whether the caller followed instructions
when answering the prompts. For this release, approximately 19758
utterances have corresponding orthographic transcriptions.

Recording Details
-----------------

All of the data in this corpus were collected over digital telephone
lines. The digital data were recorded with the CSLU T1 digital data
collection system. These files were sampled at 8 kHz, 8 bits per sample,
and stored as ulaw files. All of the data are stored in standard 16-bit
linear RIFF format.

Verification
------------

Each utterance included in the 22 Language Corpus has gone through a
process of verification. Verification was done by native speakers of
each language. The verifiers were asked to listen to each utterance and
decide whether the speaker responded appropriately to the prompt. In
addition, the verifiers made judgements about the age, gender, and
dialect of each speaker. Two native speakers verified the utterances in
each language independently.

Protocol
--------

The protocol is the "script" of questions and prompts that our recording
system played for the callers. The protocol was reproduced in each
language by native speakers, so that callers heard the prompts and
questions in their own language. The prompts and questions should be the
same in all languages, but sometimes there were subtle differences due to
translation idiosyncrasies.

The English version:

Thank you for calling the Oregon Graduate Institute language database.
We are currently recording speech in <language>. We are studying the
different languages of the world. To do this, we need to record samples
of speech from fluent speakers of <language>. Please respond to the
following questions and instructions in <language> only. This will take
about seven minutes. Please wait for the beep before speaking.

 1. What is your native language?
 2. What language do you speak most of the time?
 3. What language do you speak at home?
 4. What other languages do you speak and understand?
 5. How old are you?
 6. What is your date of birth?
 7. Are you male or female?
 8. How long have you been in the United States?
 9. What city and state did you spend most of your childhood?
10. What is your zipcode?
11. What area code are you calling from?
12. What day is today?
13. What time is it?
14. Say a familiar telephone number?
15. How would you ask someone if they speak <language>?
16. Give us the greeting you usually use when answering the phone.
17. For each of the following descriptions, we will record the first ten
    seconds of your answer. Begin speaking at the beep. A second beep
    will indicate when we have finished recording your answer to each
    question. (pause)
18. Describe the route you take to work or to the store.
19. Tell us something that you like about your hometown.
20. Tell us about the climate in your hometown.
21. Describe the room you are calling from.
22. Describe your most recent meal.
23. We now want you to talk for a longer period of time. We do not care
    what you say as long as you keep talking. You can tell us anything
    about yourself, your hobbies and interests, the city that you live
    in, and the sports that you like. Or you can make up a story, tell a
    fairy-tale or recite a poem. You will have one minute to speak. We
    will now give you ten seconds to think about what to say. Please do
    not read anything, we would prefer you make something up. (pause)
24. Please begin talking at the beep. You will hear a second beep when
    you have ten seconds left.
25. For the last question, we would like you to tell us something about
    yourself in English. If you do not speak English, you may push any
    button on your phone, or simply wait for twenty seconds. At the
    beep, please tell us something about yourself in English.
26. If you are calling from a touch tone phone, please push the number
    two button.
27. Please tell us where you heard about this number.
28. In appreciation for your call, we are conducting a drawing for 5000
    dollars. The odds of winning are approximately 1 in 1600. Would you
    like to be entered into the drawing for 5000 dollars?
29. Thank you for your participation. Please leave your name, address,
    and telephone number so we may contact you if you are the winner of
    the drawing. Your name and address will be kept confidential. You
    may hang up when you are through.
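
Example: Checking the Audio Format
----------------------------------

The Recording Details section says the calls were captured as 8 kHz,
8-bit ulaw and that the data are stored as 16-bit linear RIFF. The
sketch below is not part of the corpus tools. It only illustrates those
two statements, under the assumption that the files are ordinary
RIFF/WAVE PCM files that Python's standard `wave` module can open. The
function names `ulaw_to_linear` and `describe_wav`, and the file name in
the usage comment, are hypothetical.

```python
import wave


def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law (G.711) code to a signed 16-bit linear sample."""
    u = ~byte & 0xFF               # mu-law codes are stored bit-inverted
    sign = u & 0x80                # top bit: sign
    exponent = (u >> 4) & 0x07     # next three bits: segment number
    mantissa = u & 0x0F            # low four bits: step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def describe_wav(path):
    """Return basic header facts for a RIFF/WAVE PCM file (assumed format)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return {
            "channels": wav.getnchannels(),
            "sample_rate_hz": rate,
            "bits_per_sample": wav.getsampwidth() * 8,
            "duration_s": frames / rate,
        }


# Usage (hypothetical file name, not taken from the corpus):
#   info = describe_wav("utterance.wav")
#   print(info["sample_rate_hz"], info["bits_per_sample"])
```

If the description in Recording Details holds for a given file, one
would expect `describe_wav` to report a sample rate of 8000 Hz and 16
bits per sample. A file that reports something else, or that `wave`
refuses to open, would suggest the assumption about the container
format does not hold for that file.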