=====================================================
     West Point Arabic Corpus (Project SANTIAGO)
=====================================================

Developers: COL Stephen A. LaRocca, Rajaa Chouairi, John J. Morgan

Authors: COL Stephen A. LaRocca and Rajaa Chouairi

The Center For Technology Enhanced Language Learning
Department Of Foreign Languages
United States Military Academy
745 Brewerton Road
West Point, NY 10996

Email: gs0416@usma.edu
Phone: 845-938-5286
Fax:   845-938-3585


Introduction:

Staff and Faculty of the Department of Foreign Languages (DFL) and the
Center for Technology Enhanced Language Learning (CTELL) designed the
SANTIAGO Arabic corpus to provide a set of recordings for the training
and development of speaker-independent speech recognition systems for
use by West Point cadets enrolled in the Arabic language program.


The Collection Scripts

The scripts directory contains two portable document format (*.pdf)
files for each of four different prompting scripts. The files named
"t1" through "t4" are Romanized transliterations of the Arabic
sentences (reading left-to-right), while the files labeled "s1"
through "s4" contain Arabic script orthography (reading
right-to-left).

* Collection Script 1: 155 sentences, used by all 74 native Arabic
  speakers: Script 1 has a total of 1152 tokens and 724 types. The
  prompts are labeled "s001" through "s155" in the PDF listing of the
  script.
* Collection Script 2: 40 sentences, used by 23 of the non-native
  speakers: Script 2 has a total of 150 tokens and 124 types. These
  sentences are labeled "s01" through "s40" in the PDF listing. (Note
  that the sentences with three-character labels "s01" through "s40"
  here are distinct from those with four-character labels "s001"
  through "s040" in Script 1.)
* Collection Script 3: 41 sentences, used by 4 of the non-native
  speakers: Script 3 has a total of 138 tokens and 84 types. These 41
  sentences are labeled "T01" through "T41" in the PDF listing.
* Collection Script 4: 22 sentences, used by 9 of the non-native
  speakers, all of them third-year Arabic students at the USMA:
  Script 4 has a total of 72 tokens and 59 types. These 22 sentences
  are labeled "0" through "22" in the PDF listing.

In all cases, each recorded utterance (each sentence from a prompt
sheet) is saved in a separate data file whose name corresponds to the
prompt label as follows:

    s1_001.sph - s1_155.sph : utterances from Script 1   s001 - s155
    s2_01.sph  - s2_40.sph  : utterances from Script 2   s01  - s40
    s3_01.sph  - s3_41.sph  : utterances from Script 3   T01  - T41
    s4_01.sph  - s4_22.sph  : utterances from Script 4   0    - 22

In the word counts associated with the scripts, a "word" is defined as
a sequence of characters delimited by white space. So "wa-man" ("and
who" in English) is considered to be a single word.

All scripts were written with Modern Standard Arabic (MSA) as the
target language. Text was encoded in a 7-bit ASCII dialect of LaTeX
known as ArabTeX. The scripts were also formatted in Unicode. This
encoding was used to develop an automated recording program for
Scripts 1, 2 and 3 using WinCALIS, a multimedia courseware authoring
system. Each WinCALIS data collection program corresponded to one of
the three scripts and contained both a visual and an aural
representation of the prompt to be read. Script 4 was recorded using
an automated recording script written in Perl on a computer running
Linux.


The Lexicon

The "lexicon" directory contains the file "santiago.dct", which has
1128 distinct orthographic word forms, including all words found in
the prompting scripts. Each line of the lexicon contains one word
entry: the ArabTeX orthography is given first, followed by a tab
character, then the phone string for the word, with space characters
separating the individual phone symbols. All phone strings end with
the "sp" (short pause) segment.
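
As a rough illustration of the lexicon format described above, the
following Python sketch splits each line at the tab into the ArabTeX
orthography and its list of phones. The function name "parse_lexicon"
and the sample entry are made up for this illustration (the
pronunciation shown is not copied from "santiago.dct"), so treat it as
a sketch under those assumptions rather than as part of the corpus
tools.

```python
def parse_lexicon(lines):
    """Map each ArabTeX word form to its list of phone symbols.

    Assumes the layout described above: orthography, one tab, then
    space-separated phone symbols ending in "sp".
    """
    lexicon = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # tolerate blank lines
        word, sep, phone_string = line.partition("\t")
        if not sep:
            raise ValueError(f"no tab separator in lexicon line: {line!r}")
        phones = phone_string.split()
        if not phones or phones[-1] != "sp":
            raise ValueError(f"phone string does not end in 'sp': {line!r}")
        lexicon[word] = phones
    return lexicon


# Hypothetical entry for illustration only; not copied from santiago.dct.
sample = ["wa-man\tw ae m ae n sp\n"]
lexicon = parse_lexicon(sample)
print(lexicon["wa-man"])
```

Because the function accepts any iterable of lines, an open file
handle for the lexicon can be passed to it directly. Since ArabTeX
text is 7-bit ASCII, opening the file with an ASCII encoding should be
sufficient, though that has not been verified here.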


The Transcriptions

Each waveform file has a monophone and word level master label file
(*.mlf) transcription in HTK-format. These files contain a multi-line
entry for every speech file in the corpus -- the first line of each
entry gives the file name, and the phones are provided in sequence on
the following lines, one phone per line. Master label files are
provided at both the word level and the phone level. Phone level
labels are provided both with and without "sp". All sentence
transcripts begin and end with the "sil" (silence) segment. These
files are in the labels directory.

Note: The label data without the "short pause" (sp) segment represents
a direct phonemic transcription of the prompting text, replacing each
Arabic orthographic form with the exact phoneme sequence provided for
the word in the dictionary file, whereas the "+sp" version involves
the addition of "short pause" segments and hand labeling of some
utterances. For example, a phonological rule that deletes a word
initial glottal stop and coalesces the preceding and following vowels
into a single phone was applied in some cases.

    / iy # Q ah l / -> / ih l /

That is, the sequence of word-final high-front tense vowel followed by
the definite article "al-" is pronounced as a single syllable with a
high-front lax vowel. This hand labeling is not standardized and was
applied in some instances and not in others.
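
For readers who want to load the transcriptions without HTK, here is a
minimal Python sketch of a master label file reader. It assumes the
usual HTK layout (a "#!MLF!#" header line, a quoted file name opening
each entry, one label per line, and a line containing only "." closing
each entry) and treats any leading integer fields on a label line as
time stamps. The function name "parse_mlf" and the sample text are
illustrative and are not taken from the corpus.

```python
def parse_mlf(text):
    """Return {file name: [labels]} from the text of an HTK master label file."""
    entries = {}
    current = None  # name of the entry being read, or None between entries
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue  # skip blank lines and the header
        if current is None:
            # The first line of an entry is the (usually quoted) file name.
            current = line.strip('"')
            entries[current] = []
        elif line == ".":
            current = None  # a lone period ends the entry
        else:
            fields = line.split()
            # Skip optional start/end times: the label is taken to be
            # the first field that is not a plain integer.
            label = next((f for f in fields if not f.isdigit()), fields[0])
            entries[current].append(label)
    return entries


# Hypothetical entry for illustration only; not copied from the corpus.
sample = '''#!MLF!#
"*/s2_01.lab"
sil
w
ae
sp
sil
.
'''
labels = parse_mlf(sample)
print(labels["*/s2_01.lab"])
```

The same function can be applied to the word level and phone level
files, since it only relies on the entry structure and not on the
label inventory.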


The Phones

symbol  description
-----------------------------------------------------
C       voiced pharyngeal fricative
D       velarized voiced alveolar stop
G       voiced velar fricative
H       voiceless pharyngeal fricative
Q       voiceless glottal stop
S       velarized voiceless alveolar fricative
T       velarized voiceless alveolar stop
TH      velarized voiced interdental fricative
Z       voiced interdental fricative
ae      low front vowel
ah      low back vowel
aw      back upgliding diphthong
ay      front upgliding diphthong
b       bilabial voiced stop
d       voiced alveolar stop
ey      upper mid front vowel
f       voiceless labiodental fricative
g       voiced velar stop
h       voiceless glottal fricative
ih      high front lax vowel
iy      high front tense vowel
j       voiced palato-alveolar fricative
k       voiceless velar stop
l       voiced alveolar lateral
m       voiced bilabial nasal
n       voiced alveolar nasal
q       voiceless uvular stop
r       voiced alveolar flap
s       voiceless alveolar fricative
sh      voiceless palato-alveolar fricative
sil     silence
sp      short pause
t       voiceless alveolar stop
th      voiceless interdental fricative
uw      high back rounded vowel
w       voiced bilabial approximant
x       voiceless velar fricative
y       voiced palatal approximant
z       voiced alveolar fricative


The Data

The "speech" directory on each CD-ROM contains a set of
sub-directories, one for each speaker in the collection. The names of
the speaker directories indicate the speaker's sex, identification
number, native language, and the prompting script that was used. For
example:

    m01arabic1  : male, ID#01, native Arabic speaker, reading script 1
    f26english2 : female, ID#26, native English speaker, reading script 2

Each directory contains the SPHERE-formatted speech files, with one
recorded utterance in each file. Note that individual speech file
names are _not_ unique across speaker directories; e.g. many of the
"script 1" directories contain files named "s1_001.sph",
"s1_002.sph", etc.

The data was collected between July 1997 and August 2001 at 5
different sites.
Native Arabic speech was collected at the Defense Language Institute
English Language School in San Antonio, Texas, and in an Arabic
community in Toronto, Canada. Army linguists from Fort Bragg, North
Carolina, and the George C. Marshall Center in Garmisch, Germany, and
West Point cadets enrolled in advanced Arabic courses contributed to
the non-native speech corpus.

Speech data for Scripts 1, 2 and 3 was collected using Pentium 133 MHz
laptop computers running Windows NT. Recordings were captured as
16-bit samples at a sampling rate of 22050 Hz using a Shure SM10A
microphone and a RANE Model MS1 pre-amplifier. The WinCALIS script
presented a visual display in ArabTeX of the sentence to be recorded,
along with a digital recording of the sentence as read by a native
speaker. The informant pressed the Enter key to record their
utterance. The informant's recording was played back for review, and
the utterance was re-recorded if necessary. Several different versions
of the WinCALIS script, corresponding to the 3 data collection
scripts, were used. Some native informants read all 155 sentences from
Script 1; most, however, read a 90-sentence subset of this script. The
non-native informants attempted to read all of the prompts from either
Script 2 (40 sentences) or Script 3 (41 sentences).

The speech data for Script 4 was recorded on Pentium 166 MHz desktop
computers running Red Hat Linux 6.2. Recordings were captured at the
same sampling rate and with the same microphone as for Scripts 1, 2
and 3. Informants for Script 4 attempted to read all 22 sentences in
the script.

The corpus consists of 8,516 speech files. Of these, 7,270 are from
native informants and 1,246 are from non-native informants (Table 4).
The following tables show the breakdown of corpus content in terms of
male, female, native and non-native speakers.

Table 1: number of speakers

                    male  female   total
    native:           41      34      75
    non-native:       25      10      35
    totals:           66      44     110

Table 2: hours of data

                    male  female   total
    native:          6.0     4.4    10.4
    non-native:     0.74    0.28    1.02
    totals:         6.74    4.68   11.42

Table 3: megabytes of data

                    male  female   total
    native:          913     663    1576
    non-native:      111    42.4   153.4
    totals:         1024   705.4  1729.4

Table 4: number of speech files

                    male  female   total
    native:         4107    3163    7270
    non-native:      883     363    1246
    totals:         4990    3526    8516

Many of the recording sessions include a handful of utterances that
were cut short because of pronunciation mistakes or unexpected
interruptions (e.g. phones ringing or doors slamming). These partial
utterances have been retained in the waveform directories and are
distinguished from the full-sentence recordings by having "-u"
appended to the file names, just before the ".sph" file extension
(e.g. "s1_102-u.sph" instead of "s1_102.sph").

More information on the waveform data is available in the file
"recording-summary.pdf".


The Acoustic Models

Acoustic models were created using the Hidden Markov Model Toolkit
(HTK) developed at Cambridge University:

    http://htk.eng.cam.ac.uk

The file "21m_0001.mmf", located in the doc directory, contains the
HTK binary-formatted mixture model data. Except for the short pause
(sp), all of the phones described above were modeled with the standard
3-emitting-state Hidden Markov Model (HMM). The mean vectors and
covariance matrices have 39 dimensions, corresponding to 12 Mel
Frequency Cepstral Coefficients plus energy (13 static features)
together with their delta and delta-delta components, extracted from
the waveform data. The training procedure was stopped after
incrementing to 21 mixtures. The sp model is a 1-emitting-state HTK
t-model whose emitting state is tied to the center state of the
silence (sil) model.


Acknowledgments:

Dr. Kathleen Egan, Department of Defense, provided encouragement and
support for this and other projects concerning the use of speech
technologies for language learning.

Lieutenant General (Retired) Claudia Kennedy, while serving as the US
Army's Deputy Chief of Staff - Intelligence, provided encouragement
and support for this project.

Dr. Jonathan Kaplan and Dr. Robert Siedel of the Army Research
Institute are acknowledged for sponsoring the project and using the
USMA acoustic models in the Military Language Tutor (MILT).

Planning, execution and development of the SANTIAGO corpus were
performed by the following members of the Center for Technology
Enhanced Language Learning: John J. Morgan, COL Stephen A. LaRocca,
LTC Kevin Kenny, Maj David Resendez, Charles Ruscelli, Linda Asmann
and Sherri Bellinger.

Colonel Terrence M. Potter, Department of Foreign Languages, provided
invaluable linguistic insight for the design and development of this
corpus.

The following members of the Department Of Foreign Languages at West
Point are acknowledged: Rajaa Chouairi, LTC David and Noelle Jesmer,
LTC David Bartlett, and LTC Mark English.

The Linguistic Data Consortium provided encouragement, continued
assistance and technical guidance in the publication of this corpus.
In particular, Chris Cieri, Dave Graff and Andy Cole are acknowledged.

John Morgan thanks Dr. T.V. Raman for his audio interface to emacs
called emacspeak.

The author of ArabTeX is:

    Prof. Klaus Lagally
    Institut fuer Informatik
    Universitaet Stuttgart
    Breitwiesenstrasse 20-22
    D-70565 Stuttgart, Germany

    Email: lagally@informatik.uni-stuttgart.de

ArabTeX is Copyright (c) 1990 - 1998, Klaus Lagally