VOICE ACROSS HISPANIC AMERICA
-----------------------------

Corpus Documentation

The Voice Across Hispanic America corpus consists of 38,740 utterances
from 570 female and 345 male native speakers of American Spanish. Each
speaker provided between 5 and 45 utterances. There were a total of
31,066 read utterances and 7,674 spontaneous utterances (including 3,468
yes/no responses). Details of the corpus design, collection and
development are given in the final report.

Each utterance is stored in a separate waveform file; all the files from
each speaker are in a separate directory. Speaker directories are
arranged in "train" and "test" directories, where the "test" set is a
representative sample of 100 calls drawn from the overall collection.

Each speaker is identified by a 4-digit number, based on the order in
which he/she called the data collection system. This number ranges from
0001 to 1446. There are several gaps in the speaker numbers, because
some speakers had to be discarded from the corpus, either because they
hung up after one or two utterances or because they did not provide
valid speech.

Speaker directory names are of the form:

    spk_<4_digit_speaker_number>

Within each speaker directory are all the speech waveform files in NIST
SPHERE format. The header of the speech file contains important speaker
and utterance information, as well as information about the sample data
format. (The sample data are stored in single-channel 8-bit mu-law form,
starting at byte offset 1024 in each file.) See the final report and the
transcription conventions document for a description of the header
items; also, refer to the documentation in the "sphere" directory for
information about the file header format.
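As an illustration of the file layout described above (a text header
followed by single-channel 8-bit mu-law samples at byte offset 1024),
the following Python sketch reads one waveform file. It is not part of
the corpus tools. The function names are hypothetical, the header is
assumed to follow the usual NIST_1A text layout ("name -type value"
lines ending with "end_head"), and the sample data are assumed to be
stored uncompressed. The mu-law expansion is the standard G.711 one.

```python
HEADER_SIZE = 1024  # sample data start at this byte offset, per the text above


def parse_sphere_header(raw: bytes) -> dict[str, str]:
    """Parse a NIST_1A text header into a {field_name: value} dict."""
    lines = raw.decode("ascii", errors="replace").splitlines()
    if not lines or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    # lines[1] holds the header size; the field lines follow it.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # name, type flag (e.g. -i, -s4), value
        if len(parts) == 3:
            fields[parts[0]] = parts[2]
    return fields


def ulaw_to_linear(byte: int) -> int:
    """Expand one 8-bit mu-law code to a signed 16-bit linear sample (G.711)."""
    u = ~byte & 0xFF
    exponent = (u >> 4) & 0x07
    mantissa = u & 0x0F
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if u & 0x80 else magnitude


def decode_payload(data: bytes) -> list[int]:
    """Decode a run of mu-law bytes into linear samples."""
    return [ulaw_to_linear(b) for b in data]


def read_utterance(path: str) -> tuple[dict[str, str], list[int]]:
    """Return (header fields, linear samples) for one waveform file."""
    with open(path, "rb") as f:
        header = parse_sphere_header(f.read(HEADER_SIZE))
        samples = decode_payload(f.read())
    return header, samples
```

Calling read_utterance() on a waveform file path would return the header
fields as strings together with the decoded samples. Which header fields
are present is described in the final report and the transcription
conventions document, not here.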
The waveform file names consist of the following four fields, in this
order, followed by the extension ".sph":

    - 2-digit utterance number from 01 through 45
    - 'r' for read speech, 's' for spontaneous speech
    - 1-letter code indicating the utterance type (explained below)
    - 4-digit speaker number from 0001 through 1446

Examples: 01sy0053.sph, 15rw0972.sph, etc.

The list below shows the order and numbering of the utterances and the
description of the utterance types elicited from each speaker.

Utt. #   Mode   Type   Description
------   ----   ----   -----------
  01      s      y     yes/no
  02      r      i     5-digit Caller Id Number (CIN)
  03      r      p     phone number
  04      r      w     application word
  05      r      r     phonetically rich sentence
  06      r      c     credit-card number
  07      r      w     application word
  08      r      r     phonetically rich sentence
  09      r      w     application word
  10      r      r     phonetically rich sentence
  11      r      m     money item (dollar amount)
  12      r      w     application word
  13      r      r     phonetically rich sentence
  14      r      q     quantity item
  15      r      w     application word
  16      r      r     phonetically rich sentence
  17      r      u     unsegmented 8-digit string
  18      r      w     application word
  19      r      r     phonetically rich sentence
  20      r      a     unsegmented 8-character alphanumeric string
  21      r      w     application word
  22      r      c     credit-card number
  23      r      w     application word
  24      r      r     phonetically rich sentence
  25      r      p     phone number
  26      r      n     "name-at-dept." phrase
  27      r      w     application word
  28      r      p     phone number
  29      r      w     application word
  30      r      p     phonetically rich sentence
  31      r      w     application word
  32      r      n     "name-at-dept." phrase
  33      r      d     date item
  34      r      w     application word
  35      r      p     phone number
  36      r      o     spelled word
  37      r      l     list of 6 digits
  38      s      t     spontaneous time item
  39      s      p     spontaneous phone number
  40      s      y     yes/no
  41      s      o     spontaneous spelled word
  42      s      s     spontaneous speech
  43      s      s     spontaneous speech
  44      s      y     yes/no
  45      s      y     yes/no
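The file naming scheme above can be unpacked mechanically. The Python
sketch below is only an illustration: the names UtteranceName,
parse_waveform_name and speaker_directory are hypothetical, and the
type-code descriptions are copied from the list above.

```python
import re
from typing import NamedTuple

# Utterance type codes and their descriptions, taken from the list above.
TYPE_DESCRIPTIONS = {
    "y": "yes/no", "i": "5-digit Caller Id Number (CIN)", "p": "phone number",
    "w": "application word", "r": "phonetically rich sentence",
    "c": "credit-card number", "m": "money item (dollar amount)",
    "q": "quantity item", "u": "unsegmented 8-digit string",
    "a": "unsegmented 8-character alphanumeric string",
    "n": '"name-at-dept." phrase', "d": "date item", "o": "spelled word",
    "l": "list of 6 digits", "t": "time item", "s": "spontaneous speech",
}


class UtteranceName(NamedTuple):
    utt_number: int  # 1 through 45
    mode: str        # 'r' (read) or 's' (spontaneous)
    utt_type: str    # 1-letter utterance type code
    speaker: str     # 4-digit speaker number, kept as a string


_NAME_RE = re.compile(r"^(\d{2})([rs])([a-z])(\d{4})\.sph$")


def parse_waveform_name(filename: str) -> UtteranceName:
    """Split a waveform file name into its four fields, with range checks."""
    m = _NAME_RE.match(filename)
    if m is None:
        raise ValueError(f"unexpected file name: {filename!r}")
    utt, mode, utt_type, speaker = m.groups()
    if not 1 <= int(utt) <= 45:
        raise ValueError(f"utterance number out of range: {utt}")
    if not 1 <= int(speaker) <= 1446:
        raise ValueError(f"speaker number out of range: {speaker}")
    if utt_type not in TYPE_DESCRIPTIONS:
        raise ValueError(f"unknown utterance type: {utt_type!r}")
    return UtteranceName(int(utt), mode, utt_type, speaker)


def speaker_directory(name: UtteranceName) -> str:
    """Return the speaker directory name, e.g. 'spk_0053'."""
    return f"spk_{name.speaker}"


info = parse_waveform_name("01sy0053.sph")
print(info.utt_number, info.mode, TYPE_DESCRIPTIONS[info.utt_type])  # 1 s yes/no
print(speaker_directory(info))  # spk_0053
```

The speaker number is kept as a string so that its leading zeros survive
when the directory name is rebuilt. Whether a given speaker directory
sits under "train" or "test" cannot be derived from the file name.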