West Point Arabic Speech

Item Name:	West Point Arabic Speech
Author(s):	Stephen A. LaRocca, Rajaa Chouairi
LDC Catalog No.:	LDC2002S02
ISBN:	1-58563-199-x
ISLRN:	223-969-897-944-9
DOI:	https://doi.org/10.35111/b12f-w956
Release Date:	August 20, 2002
Member Year(s):	2002
DCMI Type(s):	Sound
Sample Type:	1-channel pcm
Sample Rate:	22050
Data Source(s):	microphone speech
Application(s):	speech recognition
Language(s):	Arabic
Language ID(s):	ara
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2002S02 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	LaRocca, Stephen A., and Rajaa Chouairi. West Point Arabic Speech LDC2002S02. Web Download. Philadelphia: Linguistic Data Consortium, 2002.
Related Works:	LDC2003S05 West Point Russian Speech; LDC2005S28 West Point Croatian Speech; LDC2005S30 West Point Company G3 American English Speech; LDC2006S37 West Point Heroico Spanish Speech; LDC2006S36 West Point Korean Speech; LDC2008S04 West Point Brazilian Portuguese Speech

Introduction

West Point Arabic Speech was produced by the Linguistic Data Consortium (LDC) under catalog number LDC2002S02 and ISBN 1-58563-199-x.

West Point Arabic Speech contains speech data that was collected and processed by members of the Department of Foreign Languages at the United States Military Academy at West Point and the Center For Technology Enhanced Language Learning (CTELL) as part of an effort called "Project Santiago." The original purpose of this corpus was to train acoustic models for automatic speech recognition that could be used as an aid in teaching Arabic to West Point cadets.

Data

The corpus consists of 8,516 speech files, totaling 1.7 gigabytes or 11.42 hours of speech data. Each speech file represents one person reciting one prompt from one of four prompt scripts. The utterances were recorded using a Shure SM10A microphone and a RANE Model MS1 pre-amplifier. The files were recorded as 16-bit PCM low-byte-first ("little-endian") raw audio files, with a sampling rate of 22.05 kHz. They were then converted to NIST SPHERE format.
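As a rough illustration of the raw format described above (16-bit, single-channel, little-endian PCM at 22,050 samples per second), the following sketch decodes raw sample bytes and derives a duration from a byte count. The function names are hypothetical and are not part of the corpus documentation. The sketch assumes header-less raw audio, so it does not apply directly to the distributed SPHERE files, whose header bytes would have to be skipped first.

```python
import struct

# Recording parameters as stated in the corpus description.
SAMPLE_RATE_HZ = 22050   # 22.05 kHz
BYTES_PER_SAMPLE = 2     # 16-bit PCM
CHANNELS = 1             # single channel


def pcm_duration_seconds(num_bytes: int) -> float:
    """Duration of header-less raw PCM audio with the parameters above."""
    frame_bytes = BYTES_PER_SAMPLE * CHANNELS
    if num_bytes % frame_bytes != 0:
        raise ValueError("byte count is not a whole number of samples")
    return num_bytes / (frame_bytes * SAMPLE_RATE_HZ)


def decode_samples(raw: bytes) -> list:
    """Decode little-endian signed 16-bit samples into Python integers."""
    count = len(raw) // BYTES_PER_SAMPLE
    # "<" selects little-endian byte order, "h" a signed 16-bit integer.
    return list(struct.unpack(f"<{count}h", raw[: count * BYTES_PER_SAMPLE]))


if __name__ == "__main__":
    # One second of audio occupies 22050 samples * 2 bytes = 44100 bytes.
    print(pcm_duration_seconds(44100))
    print(decode_samples(b"\x01\x00\xff\xff"))
```

At these settings one hour of audio occupies 44,100 × 3,600 = 158,760,000 bytes, which is a quick sanity check when comparing file sizes with durations.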

Approximately 7,200 of the recordings are from native informants and approximately 1,200 are from non-native informants. The following tables show the breakdown of corpus content in terms of male, female, native and non-native speakers.

number of speakers

	male	female	total
native:	41	34	75
non-native:	25	10	35
totals:	66	44	110

hours of data

	male	female	total
native:	6.0	4.4	10.4
non-native:	0.74	0.28	1.02
totals:	6.74	4.68	11.42

megabytes of data

	male	female	total
native:	918	667	1585
non-native:	111.9	42.8	154.7
totals:	1029.9	709.8	1739.7

number of speech files

	male	female	total
native:	4107	3163	7270
non-native:	883	363	1246
totals:	4990	3526	8516
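The row and column totals in the speech-file table can be checked mechanically. The snippet below is only an illustrative consistency check: the dictionary layout and function name are assumptions made for this sketch, while the counts are copied from the table above.

```python
# Speech-file counts copied from the "number of speech files" table.
FILE_COUNTS = {
    "native": {"male": 4107, "female": 3163},
    "non-native": {"male": 883, "female": 363},
}


def summarize(counts: dict) -> dict:
    """Compute per-row, per-column, and grand totals for a 2x2 table."""
    row_totals = {row: sum(cols.values()) for row, cols in counts.items()}
    col_totals = {}
    for cols in counts.values():
        for name, value in cols.items():
            col_totals[name] = col_totals.get(name, 0) + value
    return {
        "rows": row_totals,
        "columns": col_totals,
        "grand_total": sum(row_totals.values()),
    }


if __name__ == "__main__":
    print(summarize(FILE_COUNTS))
```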

Some of the recording sessions include a handful of utterances that were cut short due to pronunciation mistakes or unexpected interruptions (e.g., phones ringing or doors slamming). These partial utterances have been retained in the waveform directories and are distinguished from the full-sentence recordings by a trailing "-u" in the filename, before the extension (e.g., "s1_080-u.sph" instead of "s1_080.sph"). The tables above cover all data, including both complete and partial utterances. Of the 8,516 speech files, 168 are partial utterances and the remaining 8,348 are complete.
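The "-u" naming convention makes it straightforward to separate partial from complete utterances when building a training list. The following is a minimal sketch, assuming only the filename pattern described above; the function names are hypothetical and the example filenames are the two quoted in the text.

```python
from pathlib import PurePosixPath


def is_partial_utterance(filename: str) -> bool:
    """Return True if the name ends in "-u" before the file extension."""
    return PurePosixPath(filename).stem.endswith("-u")


def split_utterances(filenames):
    """Split filenames into (complete, partial) lists, preserving order."""
    complete, partial = [], []
    for name in filenames:
        (partial if is_partial_utterance(name) else complete).append(name)
    return complete, partial


if __name__ == "__main__":
    complete, partial = split_utterances(["s1_080.sph", "s1_080-u.sph"])
    print(complete)
    print(partial)
```

Whether partial utterances should be excluded from acoustic model training depends on the application; the split simply keeps both options open.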

Updates

There are no updates at this time.

Copyright

Portions © 2002 United States Military Academy, © 2002 Trustees of the University of Pennsylvania. The SANTIAGO Arabic corpus was developed at the United States Military Academy. All information contained herein is the sole and exclusive property of the United States Military Academy.
