=====================================================
West Point Arabic Corpus (Project SANTIAGO)
=====================================================

Developers:   COL Stephen A. LaRocca, Rajaa Chouairi, John J. Morgan

Authors:  COL Stephen A. LaRocca and Rajaa Chouairi

The Center For Technology Enhanced Language Learning
Department Of Foreign Languages
United States Military Academy
745 Brewerton Road
West Point, NY 10996
Email: gs0416@usma.edu
Phone: 845-938-5286
Fax: 845-938-3585


Introduction:

Staff and Faculty of the Department of Foreign Languages (DFL) and the
Center for Technology Enhanced Language Learning (CTELL) designed the
SANTIAGO Arabic corpus to provide a set of recordings for the training
and development of speaker-independent speech recognition systems for
use by West Point cadets enrolled in the Arabic language program.


The Collection Scripts

The scripts directory contains two portable document format (*.pdf)
files for each of four different prompting scripts.  The files named
"t1" through "t4" are Romanized transliterations of the Arabic
sentences (reading left-to-right), while the files labeled "s1"
through "s4" contain Arabic script orthography (reading
right-to-left).

* Collection Script 1: 155 sentences, used by all 74 native Arabic speakers:

Script 1 has a total of 1152 tokens and 724 types.  The prompts are
labeled "s001" through "s155" in the PDF listing of the script.
	
* Collection Script 2: 40 sentences, used by 23 of the non-native speakers:

Script 2 has a total of 150 tokens and 124 types.  These sentences are
labeled "s01" through "s40" in the PDF listing.  (Note that the
sentences with three-character labels "s01" through "s40" here are
distinct from those with four-character labels "s001" through "s040"
in Script 1.)

* Collection Script 3: 41 sentences, used by 4 of the non-native speakers:

Script 3 has a total of 138 tokens and 84 types. These 41 sentences
are labeled "T01" through "T41" in the PDF listing.

* Collection Script 4: 22 sentences, used by 9 of the non-native speakers,
		       all of them third year Arabic students at the USMA:

Script 4 has a total of 72 tokens and 59 types.  These 22 sentences
are labeled "0" through "22" in the PDF listing.

In all cases, each recorded utterance (each sentence from a prompt
sheet) is saved in a separate data file whose name corresponds to the
prompt label as follows:

  s1_001.sph - s1_155.sph : utterances from Script 1 s001 - s155
  s2_01.sph  - s2_40.sph  : utterances from Script 2  s01 - s40
  s3_01.sph  - s3_41.sph  : utterances from Script 3  T01 - T41
  s4_01.sph  - s4_22.sph  : utterances from Script 4    0 - 22
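
The naming pattern above can be sketched in a few lines of Python.
This is only an illustration: the sentence counts and zero-padding
widths are read off the listing above, and the names SCRIPT_LAYOUT and
expected_filenames are hypothetical helpers, not part of the corpus
distribution.

```python
# Sentence count and zero-padding width for each script, as read off
# the file-name listing in this document (assumed to hold throughout).
SCRIPT_LAYOUT = {1: (155, 3), 2: (40, 2), 3: (41, 2), 4: (22, 2)}


def expected_filenames(script):
    """Return the full-utterance file names expected for one script."""
    count, width = SCRIPT_LAYOUT[script]
    return [f"s{script}_{i:0{width}d}.sph" for i in range(1, count + 1)]


if __name__ == "__main__":
    names = expected_filenames(1)
    print(names[0], names[-1], len(names))
```

A given speaker directory may hold fewer files than this list, since
not every informant read every prompt.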


In the word counts associated with the scripts, a "word" is defined as
the set of characters delimited by white space.  So "wa-man" ("and
who" in English) is considered to be a single word.  All scripts were
written with Modern Standard Arabic (MSA) as the target language.
Text was encoded in a 7-bit ASCII dialect of LaTeX known as ArabTeX.
The scripts were also formatted in Unicode.  The Unicode encoding was
used to develop an automated recording program for Scripts 1, 2 and 3
using WinCALIS, a multimedia courseware authoring system.  Each
WinCALIS data collection program corresponded to one of the three
scripts and contained both a visual and aural representation of the
prompt to be read.  Script 4 was recorded using an automated recording
script written in Perl on a computer running Linux.
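
The whitespace-based word definition used for the token and type
counts can be illustrated as follows.  This is a minimal sketch: the
sample strings are placeholders chosen for illustration (only
"wa-man" appears in this document) and are not prompts from the
collection scripts.

```python
def count_tokens_and_types(sentences):
    """Count whitespace-delimited tokens and distinct types."""
    tokens = [word for sentence in sentences for word in sentence.split()]
    return len(tokens), len(set(tokens))


if __name__ == "__main__":
    # Placeholder strings, not actual corpus prompts.
    sample = ["wa-man huwa", "man huwa"]
    n_tokens, n_types = count_tokens_and_types(sample)
    # "wa-man" counts as one token because it contains no white space.
    print(n_tokens, n_types)
```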


The Lexicon

The "lexicon" directory contains the file "santiago.dct", which has
1128 distinct orthographic word forms, including all words found in
the prompting scripts.  Each line of the lexicon contains one word
entry: the ArabTeX orthography is given first, followed by a tab
character, then the phone string for the word, with space characters
separating the individual phone symbols.  All phone strings end with
the "sp" (short pause) segment.
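
A lexicon in this layout could be read with a short Python routine
such as the one below.  It is a sketch under stated assumptions: the
file is assumed to be plain ASCII with exactly one tab between the
orthography and the phone string, and the sample entry in the usage
lines is made up for illustration rather than copied from
"santiago.dct".

```python
def parse_lexicon_line(line):
    """Split one entry into (orthography, list of phone symbols)."""
    word, phone_string = line.rstrip("\n").split("\t", 1)
    return word, phone_string.split()


def load_lexicon(path):
    """Read a lexicon file into a dict mapping word -> phone list."""
    lexicon = {}
    # ASCII is assumed because ArabTeX orthography is 7-bit ASCII.
    with open(path, encoding="ascii") as handle:
        for line in handle:
            if line.strip():
                word, phones = parse_lexicon_line(line)
                lexicon[word] = phones
    return lexicon


if __name__ == "__main__":
    # Hypothetical entry; the phone string is illustrative only.
    word, phones = parse_lexicon_line("wa-man\tw ae m ae n sp\n")
    print(word, phones)
```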


The Transcriptions

Each waveform file is transcribed in HTK-format master label files
(*.mlf) at both the word level and the phone (monophone) level.  These
files contain a multi-line entry for every speech file in the corpus:
the first line of each entry gives the file name, and the labels
(words or phones) follow in sequence on the following lines, one per
line.  Phone level labels are provided both with and without "sp".
All sentence transcripts begin and end with the "sil" (silence)
segment.  These files are in the labels directory.
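
The entry structure described above can be read with a small parser
along the following lines.  This is a sketch, not a full HTK label
reader: it assumes the usual HTK conventions of a "#!MLF!#" header
line, a quoted file-name line opening each entry, and a line holding a
single period closing it, and it assumes each label line holds only
the label name with no time stamps.  The sample text is invented for
illustration.

```python
def parse_mlf(text):
    """Map each quoted entry name in an MLF to its list of labels."""
    entries = {}
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line == "#!MLF!#":
            continue  # skip blank lines and the file header
        if line.startswith('"') and line.endswith('"'):
            current = line.strip('"')  # start of a new entry
            entries[current] = []
        elif line == ".":
            current = None  # end of the current entry
        elif current is not None:
            entries[current].append(line)  # one label per line (assumed)
    return entries


if __name__ == "__main__":
    # Invented sample, not taken from the corpus label files.
    sample = '#!MLF!#\n"*/s1_001.lab"\nsil\nw\nae\nsil\n.\n'
    print(parse_mlf(sample))
```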

Note: The label data without the "short pause" (sp) segment represents
a direct phonemic transcription of the prompting text, replacing each
Arabic orthographic form with the exact phoneme sequence provided for
the word in the dictionary file, whereas the "+sp" version involves
the addition of "short pause" segments and hand labeling of some
utterances.  For example, a phonological rule that deletes a word
initial glottal stop and coalesces the preceding and following vowels
into a single phone was applied in some cases.

          / iy # Q ah l / -> / ih l /

That is, the sequence of word-final high-front tense vowel followed by
the definite article "al-" is pronounced as a single syllable with a
high-front lax vowel.

This hand labeling is not standardized and was applied in some
instances and not in others.
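
For illustration only, the rule above can be written as a rewrite over
a phone sequence.  The sketch assumes a flat list of phone symbols in
which "#" marks a word boundary, as in the rule notation; the label
files themselves are not known to contain such a marker, and the
sample sequence is invented.  Because the hand labeling was not
applied consistently, this does not reproduce the "+sp" label files.

```python
RULE_INPUT = ["iy", "#", "Q", "ah", "l"]
RULE_OUTPUT = ["ih", "l"]


def coalesce_article(phones):
    """Rewrite every / iy # Q ah l / span as / ih l /."""
    out = []
    i = 0
    while i < len(phones):
        if phones[i:i + len(RULE_INPUT)] == RULE_INPUT:
            out.extend(RULE_OUTPUT)
            i += len(RULE_INPUT)
        else:
            out.append(phones[i])
            i += 1
    return out


if __name__ == "__main__":
    # Invented phone sequence, with "#" as an assumed boundary marker.
    print(coalesce_article(["f", "iy", "#", "Q", "ah", "l", "b"]))
```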

The Phones

symbol             description
-----------------------------------------------------
  C        	voiced pharyngeal fricative
  D         	velarized voiced alveolar stop
  G         	voiced velar fricative
  H        	voiceless pharyngeal fricative
  Q        	voiceless glottal stop
  S         	velarized voiceless alveolar fricative
  T         	velarized voiceless alveolar stop
  TH        	velarized voiced interdental fricative
  Z         	voiced interdental fricative
  ae        	low front vowel
  ah       	low back vowel
  aw        	back upgliding diphthong
  ay       	front upgliding diphthong
  b         	bilabial voiced stop
  d         	voiced alveolar stop
  ey        	upper mid front vowel
  f         	voiceless labiodental fricative
  g         	voiced velar stop
  h         	voiceless glottal fricative
  ih        	high front lax vowel
  iy        	high front tense vowel
  j         	voiced palato-alveolar fricative
  k         	voiceless velar stop
  l         	voiced alveolar lateral
  m         	voiced bilabial nasal
  n         	voiced alveolar nasal
  q         	voiceless uvular stop
  r         	voiced alveolar flap
  s         	voiceless alveolar fricative
  sh        	voiceless palato-alveolar fricative
  sil       	silence
  sp        	short pause
  t         	voiceless alveolar stop
  th        	voiceless interdental fricative
  uw        	high back rounded vowel
  w         	voiced bilabial approximant
  x         	voiceless velar fricative
  y         	voiced palatal approximant
  z         	voiced alveolar fricative


The Data

The "speech" directory on each CD-ROM contains a set of
sub-directories, one for each speaker in the collection.  The names of
the speaker directories indicate the speaker's sex, identification
number, native language, and the prompting script that was used.  For
example:

 m01arabic1  : male, ID#01, native Arabic speaker, reading script 1
 f26english2 : female, ID#26, native English speaker, reading script 2
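
A directory name of this form can be split into its fields with a
regular expression.  The sketch below assumes that every speaker
directory follows the pattern of the two examples (one letter for sex,
a numeric ID, a lowercase language name, and a single script digit);
the function name parse_speaker_dir is a hypothetical helper.

```python
import re

# Pattern inferred from the two example names; assumed to hold for
# all speaker directories.
SPEAKER_DIR = re.compile(
    r"^(?P<sex>[mf])(?P<id>\d+)(?P<language>[a-z]+)(?P<script>\d)$"
)


def parse_speaker_dir(name):
    """Return the sex, ID, native language and script of a speaker."""
    match = SPEAKER_DIR.match(name)
    if match is None:
        raise ValueError(f"unrecognized speaker directory: {name!r}")
    return {
        "sex": "male" if match["sex"] == "m" else "female",
        "id": match["id"],
        "language": match["language"],
        "script": int(match["script"]),
    }


if __name__ == "__main__":
    print(parse_speaker_dir("m01arabic1"))
    print(parse_speaker_dir("f26english2"))
```

Because file names repeat across speakers, the directory name is
needed together with the file name to identify a recording uniquely.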

Each directory contains the SPHERE-formatted speech files, with one
recorded utterance in each file.  Note that individual speech file
names are _not_ unique across speaker directories; e.g. many of the
"script 1" directories contain files named "s1_001.sph", "s1_002.sph",
etc.

The data was collected between July 1997 and August 2001 at 5
different sites.  Native Arabic speech was collected at the Defense
Language Institute English Language School in San Antonio, Texas and
an Arabic community in Toronto, Canada.  Army linguists from Fort
Bragg, North Carolina and the George C. Marshall Center in Garmisch,
Germany and West Point cadets enrolled in advanced Arabic courses
contributed to the non-native speech corpus.

Speech data for Scripts 1, 2 and 3 were collected using Pentium 133
MHz laptop computers running Windows NT.  Recordings were captured as
16-bit samples at a sampling rate of 22050 Hz using a Shure SM10A
microphone and a RANE Model MS1 pre-amplifier.  The WinCALIS script
presented a visual display in ArabTeX of the sentence to be recorded,
along with a digital recording of the sentence as read by a native
speaker.  The informant pressed the Enter key to record their
utterance.  The informant's recording was played back for review, and
the utterance was re-recorded, if necessary.  Several different
versions of the WinCALIS script, corresponding to the 3 data
collection scripts, were used.  Some native informants read all 155
sentences from Script 1; most, however, read a 90-sentence subset of
this script.  The non-native informants attempted to read all of the
prompts from either Script 2 (40 prompts) or Script 3 (41 prompts).
The speech data for Script 4 was recorded on Pentium 166 MHz desktop
computers running Red Hat Linux 6.2.  Recordings were captured at the
same sampling rate with the same microphone as Scripts 1, 2 and 3.
Informants for Script 4 attempted to read all 22 sentences in the
script.

The corpus consists of 8,516 speech files.  Of these, 7,270 are from
native informants and 1,246 are from non-native informants.  The
following tables show the breakdown of corpus content in terms of
male, female, native and non-native speakers.

     Table 1: number of speakers

                male   female   total
 native:         41      34       75
 non-native:     25      10       35
 totals:         66      44      110


     Table 2: hours of data

               male   female   total
 native:       6.0     4.4     10.4
 non-native:   0.74    0.28     1.02
 totals:       6.74    4.68    11.42


     Table 3: megabytes of data

                male   female    total
 native:        913      663     1576
 non-native:    111       42.4    153.4
 totals:       1024      705.4   1729.4


     Table 4: number of speech files

                male   female   total
 native:        4107    3163     7270
 non-native:     883     363     1246
 totals:        4990    3526     8516
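
As a rough cross-check, the hours in Table 2 can be approximated from
the megabytes in Table 3.  The sketch below rests on assumptions that
this document does not state: single-channel audio, 2 bytes per
sample at 22050 Hz, a megabyte of 2**20 bytes, and negligible file
header overhead.  Under those assumptions the totals agree to within
a few hundredths of an hour.

```python
# 22050 samples/s * 2 bytes/sample * 1 channel (mono is assumed).
BYTES_PER_SECOND = 22050 * 2


def megabytes_to_hours(megabytes):
    """Convert a data size to audio duration; 1 MB = 2**20 bytes (assumed)."""
    return megabytes * 2**20 / BYTES_PER_SECOND / 3600


# "total" columns copied from Table 3 (megabytes) and Table 2 (hours).
TABLE3_MB = {"native": 1576, "non-native": 153.4, "totals": 1729.4}
TABLE2_HOURS = {"native": 10.4, "non-native": 1.02, "totals": 11.42}

if __name__ == "__main__":
    for row, megabytes in TABLE3_MB.items():
        estimate = megabytes_to_hours(megabytes)
        print(f"{row:>11}: estimated {estimate:.2f} h, "
              f"Table 2 lists {TABLE2_HOURS[row]} h")
```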

Many of the recording sessions include a handful of utterances that
were cut short, due to pronunciation mistakes or unexpected
interruptions (e.g. phones ringing, doors slamming, etc.); these
partial utterances have been retained in the waveform directories, and
are distinguished from the full-sentence recordings by having "-u"
appended to the file names, just before the ".sph" file extension
(e.g. "s1_102-u.sph" instead of "s1_102.sph").  More information on
the waveform data is available in the file "recording-summary.pdf".
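
Full and partial utterances can be told apart from the file name
alone.  The following sketch assumes that every speech file matches
the pattern shown in this document (script digit, underscore, prompt
number, optional "-u", ".sph" extension); parse_utterance_name is a
hypothetical helper, and any file named differently would be rejected
by it.

```python
import re

# Pattern inferred from the file names shown in this document.
UTTERANCE = re.compile(
    r"^s(?P<script>[1-4])_(?P<prompt>\d+)(?P<partial>-u)?\.sph$"
)


def parse_utterance_name(filename):
    """Return (script number, prompt number string, is_partial)."""
    match = UTTERANCE.match(filename)
    if match is None:
        raise ValueError(f"unrecognized speech file name: {filename!r}")
    is_partial = match["partial"] is not None
    return int(match["script"]), match["prompt"], is_partial


if __name__ == "__main__":
    print(parse_utterance_name("s1_102.sph"))
    print(parse_utterance_name("s1_102-u.sph"))
```

Filtering on the third field gives a simple way to exclude the
cut-short recordings when assembling training data.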


The Acoustic Models

Acoustic models were created using the Hidden Markov Model Toolkit
(HTK) developed at Cambridge University:

       http://htk.eng.cam.ac.uk

The file "21m_0001.mmf", located in the doc directory, contains the
HTK binary-formatted mixture model data.  Except for the short pause
(sp), all of the phones described above were modeled with the standard
3-emitting-state Hidden Markov Model (HMM).  The mean vectors and
covariance matrices have 39 dimensions: 12 Mel Frequency Cepstral
Coefficients plus energy (13 static components), together with their
delta and delta-delta components (13 x 3 = 39), extracted from the
waveform data.  The training procedure was stopped after incrementing
to 21 mixtures.  The sp model is a 1-emitting-state HTK tee-model
whose center state is tied to the center state of the silence (sil)
model.



Acknowledgments:

Dr. Kathleen Egan, Department of Defense, provided encouragement and
support for this and other projects concerning the use of speech
technologies for language learning.

Lieutenant General (Retired) Claudia Kennedy, while serving as the US
Army's Deputy Chief of Staff - Intelligence, provided encouragement
and support for this project.

Dr. Jonathan Kaplan and Dr. Robert Siedel of the Army Research
Institute sponsored the project and used the USMA acoustic models in
the Military Language Tutor (MILT).

Planning, execution and development of the SANTIAGO corpus were
performed by the following members of the Center for Technology
Enhanced Language Learning:

John J. Morgan, COL Stephen A. LaRocca, LTC Kevin Kenny, Maj David
Resendez, Charles Ruscelli, Linda Asmann and Sherri Bellinger.

Colonel Terrence M. Potter, Department of Foreign Languages, provided
invaluable linguistic insight for the design and development of this
corpus.

The following members of the Department Of Foreign Languages at West
Point are acknowledged: Rajaa Chouairi, LTC David and Noelle Jesmer,
LTC David Bartlett, and LTC Mark English.

The Linguistic Data Consortium provided encouragement, continued
assistance and technical guidance in the publication of this corpus.
In particular, Chris Cieri, Dave Graff and Andy Cole are acknowledged.

John Morgan thanks Dr. T.V. Raman for his audio interface to emacs
called emacspeak.

The author of ArabTeX is Prof. Klaus Lagally, Institut fuer
Informatik, Universitaet Stuttgart, Breitwiesenstrasse 20-22, D-70565
Stuttgart, Germany.  Email:  lagally@informatik.uni-stuttgart.de
ArabTeX is Copyright (c) 1990 - 1998, Klaus Lagally