West Point Heroico Spanish Speech

Author(s): John Morgan
LDC Catalog No.: LDC2006S37
ISBN: 1-58563-391-7
ISLRN: 331-222-724-302-4
Release Date: October 25, 2006
Member Year(s): 2006
DCMI Type(s): Sound
Sample Type: pcm
Sample Rate: 22050
Data Source(s): microphone speech
Application(s): speech recognition
Language(s): Spanish
Language ID(s): spa
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2006S37 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Morgan, John. West Point Heroico Spanish Speech LDC2006S37. Web Download. Philadelphia: Linguistic Data Consortium, 2006.


This file contains documentation on West Point Heroico Spanish Speech, Linguistic Data Consortium (LDC) catalog number LDC2006S37 and ISBN 1-58563-391-7.

West Point Heroico Spanish Speech is a database of digital recordings of spoken Spanish. It was designed and collected by staff and faculty of the Department of Foreign Languages (DFL) and Center for Technology Enhanced Language Learning (CTELL) to develop acoustic models for speech recognition systems. The U.S. government uses these systems to provide speech-recognition-enhanced language learning courseware to government linguists and students enrolled in various government language programs. Additionally, parts of this corpus were designed to model question/answer dialogues for use in domain-specific speech-to-speech translation systems. The corpus consists of two subcorpora: one collected in September 2001 at El Heroico Colegio Militar (HEROICO), the Mexican Military Academy in Mexico City, and the other collected at the United States Military Academy (USMA) at various times since 1997. The USMA subcorpus includes data from non-native speakers and data collected through a throat microphone.


Two kinds of prompt scripts were used, one to elicit read speech and one for free-response answers to questions. The read speech prompts are also divided into two groups, one designed to elicit speech typical of language learning scenarios and the other for speech from educated native speakers. The scripts used to record read speech have a total of 724 distinct sentences. This number includes 205 short, simple sentences used in typical language learning scenarios. The other 519 sentences were extracted from lecture notes used at USMA in a military readings course. All of the read speech prompts are listed in two files in the transcripts directory: HEROICO-Recordings.txt and USMA-prompts.txt, containing the sentences read by informants at the Mexican Military Academy and USMA, respectively. Each line of these files has two fields separated by a tab: the first gives the base name of the waveform file, and the second gives the prompt used in recording the utterance.
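
The two-field layout described above can be read with a few lines of Python. This is a minimal sketch, not part of the corpus distribution: the function name parse_prompts and the sample base names are hypothetical, and the character encoding of the prompt files is not stated in this documentation, so it must be checked before opening them.

```python
from typing import Dict, Iterable


def parse_prompts(lines: Iterable[str]) -> Dict[str, str]:
    """Map waveform base name -> prompt text from tab-separated lines.

    Hypothetical helper: assumes exactly the two-field, tab-separated
    layout described in this documentation.
    """
    prompts: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        base, sep, prompt = line.partition("\t")
        if not sep:
            raise ValueError(f"expected two tab-separated fields: {line!r}")
        # If a base name repeats, the later line overwrites the earlier one.
        prompts[base.strip()] = prompt.strip()
    return prompts
```

The function accepts any iterable of lines, so an open file handle for HEROICO-Recordings.txt or USMA-prompts.txt can be passed directly once the correct encoding is known.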

The read speech data collected from informants at HEROICO are stored in the HEROICO/Recordings Spanish directory.

The script used to elicit free-response answers contains 143 questions. The text that was actually presented to the informants is in the file named questions.txt in the transcripts directory. Data recorded from these prompts are stored in the HEROICO/Answers Spanish directory. The human-performed transcriptions of the informants' answers are listed in the HEROICO-Answers.txt file in the transcripts directory. Again, each line of this file has two fields separated by a tab. The first field contains two numbers separated by a slash: the first number is an identification index for the speaker, and the second is an index to the question. The second field contains a word-level transcription of the informant's answer to that question. For example, in the line

    100/10 no ella no tiene barba ni bigote

"no ella no tiene barba ni bigote" is a transcription of the response speaker 100 gave to question 10. The corresponding waveform is stored in the file 10.wav in the directory HEROICO/Answers Spanish/100.

Each speaker in the HEROICO subcorpus attempted to record 100 utterances by reading 75 sentences and giving 25 free-response answers to questions.
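
As an illustration of the HEROICO-Answers.txt layout, the sketch below splits one line into speaker index, question index, and transcription, and then builds the location of the matching waveform file. The names Answer, parse_answer_line, and answer_wav_path are hypothetical, and the path layout (a per-speaker directory under HEROICO/Answers Spanish) is an assumption based on the example in this documentation.

```python
from pathlib import PurePosixPath
from typing import NamedTuple


class Answer(NamedTuple):
    speaker: int      # identification index for the speaker
    question: int     # index to the question
    transcript: str   # word-level transcription of the answer


def parse_answer_line(line: str) -> Answer:
    """Parse one 'speaker/question<TAB>transcription' line."""
    key, tab, transcript = line.rstrip("\r\n").partition("\t")
    if not tab:
        raise ValueError(f"missing tab separator: {line!r}")
    speaker, slash, question = key.partition("/")
    if not slash:
        raise ValueError(f"missing slash in first field: {line!r}")
    return Answer(int(speaker), int(question), transcript.strip())


def answer_wav_path(answer: Answer,
                    root: str = "HEROICO/Answers Spanish") -> PurePosixPath:
    """Assumed layout: <root>/<speaker>/<question>.wav"""
    return PurePosixPath(root) / str(answer.speaker) / f"{answer.question}.wav"
```

For the example line for speaker 100 and question 10, answer_wav_path yields HEROICO/Answers Spanish/100/10.wav.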

Both native and non-native USMA informants read from the list of 205 simple sentences. The prompts used in the USMA subcorpus are listed in the file USMA-prompts.txt in the transcripts directory. This file has the same two-field format as the above transcription files. Some of the USMA informants wore an additional throat microphone. That data was recorded in a separate stream and stored in files whose names begin with the letter t. Data collected at USMA are stored under the USMA directory. The names of the directories under the USMA directory indicate whether the speaker was native or non-native. The speaker's native country is also indicated in the case of native speakers.
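
The file-naming convention for throat-microphone data can be used to separate the two recording streams. The following is a rough sketch only: the function names and the example paths are hypothetical, and it assumes that every file whose name begins with a lowercase t is a throat-microphone recording and that no other files do.

```python
from pathlib import PurePosixPath
from typing import Iterable, List, Tuple


def is_throat_recording(path: str) -> bool:
    """True if the file name (not the directory) begins with 't'."""
    return PurePosixPath(path).name.startswith("t")


def split_by_microphone(paths: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Return (other_recordings, throat_recordings), preserving input order."""
    other: List[str] = []
    throat: List[str] = []
    for p in paths:
        (throat if is_throat_recording(p) else other).append(p)
    return other, throat
```

Only the final path component is tested, so a directory whose name happens to begin with t does not affect the result.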

Speech data was collected at HEROICO on 450 MHz Pentium laptop computers running Windows 2000, using 16-bit samples and a sampling rate of 22,050 Hz. The recording script presented a visual display of the sentence to be recorded. The informant pressed a key and spoke the sentence. The recording was then played back for review, allowing the utterance to be re-recorded if necessary. A member of the data collection team was on hand during the recording session to verify recordings and provide technical assistance in case of malfunctioning equipment.
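
The stated recording parameters can be checked against individual files. The sketch below uses Python's standard wave module and assumes the waveform files are RIFF/WAVE files; this documentation does not confirm the container format, and the wave module cannot read other formats such as NIST SPHERE. The function name describe_wav is hypothetical.

```python
import wave

EXPECTED_RATE = 22050       # Hz, as stated for the HEROICO recordings
EXPECTED_SAMPLE_WIDTH = 2   # bytes per sample, i.e. 16-bit samples


def describe_wav(source):
    """Return (matches_expected_format, duration_in_seconds).

    `source` may be a file name or a binary file object.
    Assumes a RIFF/WAVE file; other containers raise wave.Error.
    """
    with wave.open(source, "rb") as wav:
        matches = (
            wav.getframerate() == EXPECTED_RATE
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH
        )
        duration = wav.getnframes() / wav.getframerate()
    return matches, duration
```

At 16 bits and 22,050 Hz, one second of single-channel audio occupies 2 x 22,050 = 44,100 bytes of sample data, which gives a quick sanity check on file sizes. Because the USMA data were recorded in several formats, this check is only expected to hold for the HEROICO subcorpus.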

The data from USMA was collected using several different microphones and formats. Most of the data were recorded on Pentium computers running Linux through an m-10 Shure head-mounted microphone. Entropic ESPS programs were used in most cases, especially when both head-mounted and throat microphones were used.


