West Point Heroico Spanish Speech

Item Name: West Point Heroico Spanish Speech
Author(s): John Morgan
LDC Catalog No.: LDC2006S37
ISBN: 1-58563-391-7
ISLRN: 331-222-724-302-4
DOI: https://doi.org/10.35111/6nac-6589
Release Date: October 25, 2006
Member Year(s): 2006
DCMI Type(s): Sound, Text
Sample Type: pcm
Sample Rate: 22050
Data Source(s): microphone speech
Application(s): speech recognition
Language(s): Spanish
Language ID(s): spa
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2006S37 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Morgan, John. West Point Heroico Spanish Speech LDC2006S37. Web Download. Philadelphia: Linguistic Data Consortium, 2006.


West Point Heroico Spanish Speech was developed by the Department of Foreign Languages (DFL) and the Center for Technology Enhanced Language Learning (CTELL) and contains approximately 19,000 audio files of prompted Spanish speech with associated transcripts.

This corpus was designed and collected by staff and faculty of DFL and CTELL to develop acoustic models for speech recognition systems. The U.S. government uses these systems to provide language learning courseware enhanced with speech recognition to government linguists and students enrolled in various government language programs. Additionally, parts of this corpus were designed to model question/answer dialogues for use in domain-specific speech-to-speech translation systems. The corpus consists of two subcorpora: one collected in September 2001 at El Heroico Colegio Militar (HEROICO), the Mexican Military Academy in Mexico City, and the other collected at various times since 1997 at the United States Military Academy (USMA), also known as West Point. The USMA subcorpus includes data from non-native speakers and data collected through a throat microphone.


Two kinds of prompt scripts were used: one to elicit read speech and one to elicit free-response answers to questions. The scripts used to record read speech contain a total of 724 distinct sentences: 205 short, simple sentences used in typical language learning scenarios, and 519 sentences extracted from lecture notes used at USMA in a military readings course. The script used to elicit free-response answers contains 143 questions. The corpus includes .txt files of all the read sentences, the questions, and transcriptions of the subjects' answers. The files are separated by recording location and named accordingly.

Speech data was collected at HEROICO using Pentium 450 MHz laptop computers running Windows 2000, with 16-bit samples and a sampling rate of 22,050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was then played back for review, allowing the utterance to be re-recorded if necessary. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment.

The data from USMA was collected using several different microphones and formats. Most of the data was recorded on Pentium computers running Linux through a Shure SM10 head-mounted microphone. Entropic ESPS programs were used in most cases, especially when both head-mounted and throat microphones were used.
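Because the USMA data was collected in several formats, it may be useful to check each audio file's header against the 16-bit, 22,050 Hz parameters given for the HEROICO recordings before training on it. The following is a minimal sketch using Python's standard wave module. It assumes the files are RIFF WAV files containing uncompressed PCM; the function names and the returned field names are illustrative and are not part of the corpus documentation.

```python
import wave


def read_wav_format(path):
    """Read basic format information from a PCM WAV file header.

    Raises wave.Error if the file is not a RIFF WAV file that the
    standard-library wave module can parse.
    """
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()    # samples per second
        width = wav.getsampwidth()   # bytes per sample (2 means 16-bit)
        channels = wav.getnchannels()
        frames = wav.getnframes()    # number of sample frames in the file
    return {
        "sample_rate_hz": rate,
        "sample_width_bytes": width,
        "channels": channels,
        "duration_s": frames / rate if rate else 0.0,
    }


def matches_expected_format(path, expected_rate=22050, expected_width=2):
    """Return True if the file uses the expected rate and sample width.

    The defaults correspond to the 22,050 Hz, 16-bit format described
    for the HEROICO recordings; treating them as the expected values
    for every file in the corpus is an assumption.
    """
    info = read_wav_format(path)
    return (
        info["sample_rate_hz"] == expected_rate
        and info["sample_width_bytes"] == expected_width
    )
```

Calling matches_expected_format on each audio file (for example, on a hypothetical path such as "speech/sample.wav") and listing the files for which it returns False gives a quick inventory of recordings that may need resampling or conversion.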


For an example of the data in this corpus, please listen to this audio sample (WAV).


Updates: None at this time.
