SYLLABLE-FINAL /S/ LENITION IN THE LDC'S CALLHOME SPANISH CORPUS
Michelle A. Fox
For the current data corpus, all occurrences of syllable-final /s/ were coded. All occurrences of word-final /s/ are treated as syllable-final, even though a word-final /s/ that is immediately followed by a vowel may be resyllabified in fast speech. In addition, surface /z/ in Spanish is actually an underlying /s/, so all syllable-final instances of /z/ in the LDC Spanish Lexicon were treated as /s/.
Two students at the University of Pennsylvania performed the coding. The first coder is a female native speaker of English who is proficient in Spanish and a linguistics student. The second coder is a male bilingual speaker of English and Puerto Rican Spanish not familiar with linguistics. Both were familiar with the -/s/ lenition phenomenon before beginning the project.
For each token of -/s/ to be coded, the coder was shown the orthographic transcription of the entire sentence, along with an indication of which -/s/ to code. An automatic alignment of the speech files was used to determine the approximate start and end times of the given word; from this alignment, a window of speech starting 20 ms before the hypothesized beginning of the word and ending 20 ms after its hypothesized end was played. The coder was able to replay the speech and to change the window of speech as needed. Spectrograms were not used during the coding process. The coders were instructed to make a selection for each occurrence of -/s/ unless the recording quality was poor. However, when the coders felt uncertain about a classification, they were able to indicate that the classification had low confidence.
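The playback window described above can be sketched as follows. This is an illustrative sketch only, not the tool the coders used: the function name `playback_window`, the use of seconds as the time unit, and the clamping of the start time at zero are all assumptions.

```python
PAD_SECONDS = 0.020  # 20 ms of padding on each side, as described above


def playback_window(word_start, word_end, pad=PAD_SECONDS):
    """Return the (start, end) times, in seconds, of the audio to play.

    word_start and word_end are the hypothesized word boundaries from the
    automatic alignment. Clamping at 0.0 is an assumption, made for words
    that begin within the first 20 ms of a recording.
    """
    start = max(0.0, word_start - pad)
    end = word_end + pad
    return start, end


# Hypothetical word boundaries, chosen only for illustration.
start, end = playback_window(312.99, 313.20)
print(f"{start:.2f} {end:.2f}")
```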
The coding categories available were:
· s: the /s/ was retained;
· z: the /s/ was retained and voiced;
· h: the /s/ was retained, but only as aspiration;
· Ø: the /s/ was deleted;
· R: the recording was distorted and so analysis could not be made;
· f: the following segment was also /s/, so the -/s/ in question could not be categorized;
· t: the entire syllable was truncated;
· T: the original transcript was incorrect and there was no word with a syllable-final /s/.
Token id
Each different occurrence of syllable-final /s/ in the CallHome Spanish corpus has a unique token id. Two codings of the same syllable-final /s/ have the same token id, so that it is easy to identify those tokens that were coded more than once.
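Because repeated codings share a token id, the tokens that were coded more than once can be found by grouping on that field. The sketch below assumes the data have already been read into (token id, coder, code) tuples; that record layout, the function name, and the sample token ids are hypothetical.

```python
from collections import defaultdict


def multiply_coded(records):
    """Group codings by token id and keep tokens coded more than once.

    records is an iterable of (token_id, coder, code) tuples. This layout
    is an assumption standing in for the fields read from the data.
    """
    by_token = defaultdict(list)
    for token_id, coder, code in records:
        by_token[token_id].append((coder, code))
    return {tid: codings for tid, codings in by_token.items() if len(codings) > 1}


# Made-up records, for illustration only.
sample = [
    ("t0001", "f", "s"),
    ("t0002", "m", "h"),
    ("t0001", "m", "h"),
]
repeats = multiply_coded(sample)
```

Here `repeats` holds only the token that appears twice, together with both of its codings, which makes it straightforward to compare the two coders' decisions.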
Code
Each token was given one of the following codes:
-s: the /s/ was retained;
-z: the /s/ was retained and voiced;
-h: the /s/ was retained, but only as aspiration;
-o: the /s/ was deleted;
-R: the recording was distorted and so analysis could not be made;
-f: the following segment was also /s/, so the -/s/ in question could not be categorized;
-t: the entire syllable was truncated;
-T: the original transcript was incorrect and there was no word with a syllable-final /s/.
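The code inventory above can be written as a lookup table. In this sketch the names `CODE_MEANINGS`, `REALIZATION_CODES`, and `is_realization` are hypothetical, and the split between codes that record a realization of /s/ and codes that record why no classification was made is an interpretation of the list, not something stated in the data.

```python
# Codes are case-sensitive: "t" and "T" mean different things.
CODE_MEANINGS = {
    "s": "retained",
    "z": "retained and voiced",
    "h": "retained, but only as aspiration",
    "o": "deleted",
    "R": "recording distorted; analysis could not be made",
    "f": "following segment also /s/; could not be categorized",
    "t": "entire syllable truncated",
    "T": "transcript incorrect; no word with a syllable-final /s/",
}

# Assumption: only these four codes describe how the /s/ was realized.
REALIZATION_CODES = {"s", "z", "h", "o"}


def is_realization(code):
    """Return True if the code records a realization of /s/."""
    if code not in CODE_MEANINGS:
        raise ValueError(f"unknown code: {code!r}")
    return code in REALIZATION_CODES
```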
Confidence level
The coding task was a difficult one. The coders were instructed to make a selection for each occurrence of -/s/ unless the recording quality was poor. However, when the coders felt uncertain about a classification, they were able to indicate that the classification had low confidence. "Normal" classifications have a confidence value of 1, while classifications about which the coder felt unsure have a confidence value of 0.
Speaker id
Each speaker in the corpus has a unique speaker id. The speaker id consists of the number of the speech/transcript file in the CallHome Spanish corpus followed by the channel (A/B). Where there is more than one speaker on one of the channels in a speech file, the channel letter is followed by a number indicating which of the speakers on that channel is meant (this follows the numbering given in the CallHome Spanish corpus).
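A speaker id of this shape could be split into its parts roughly as shown below. The exact surface format is not spelled out here, so the pattern (digits, then A or B, then an optional number, with no separators) is an assumption, and the ids in the usage lines are invented.

```python
import re

# Assumed layout: <file number><channel A or B><optional speaker number>.
SPEAKER_ID = re.compile(r"(\d+)([AB])(\d*)")


def parse_speaker_id(speaker_id):
    """Return (file_number, channel, speaker_number or None)."""
    match = SPEAKER_ID.fullmatch(speaker_id)
    if match is None:
        raise ValueError(f"unrecognized speaker id: {speaker_id!r}")
    file_number, channel, index = match.groups()
    # The file number stays a string so that leading zeros are preserved.
    return file_number, channel, int(index) if index else None


# Invented ids, for illustration only.
single = parse_speaker_id("0123A")
shared = parse_speaker_id("0123B2")
```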
Header of the line in the transcript
This identifies the line in the transcript. All information preceding the colon of the turn is included. For example, for the line
312.99 314.36 A: Y cómo están por allá.
the header of the line is “312.99 314.36 A”.
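Extracting the header amounts to taking everything before the first colon. The sketch below also splits the header into three whitespace-separated parts and reads the two numbers as start and end times; that reading is an assumption based on the example line, and the function name is hypothetical.

```python
def split_transcript_line(line):
    """Split a transcript line into its header, header parts, and text."""
    header, colon, text = line.partition(":")
    if not colon:
        raise ValueError("line has no colon-terminated header")
    header = header.strip()
    # Assumption: the header is "<start> <end> <speaker label>".
    start, end, speaker = header.split()
    return header, float(start), float(end), speaker, text.strip()


parts = split_transcript_line("312.99 314.36 A: Y cómo están por allá.")
print(parts[0])
```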
Words from the transcript
This includes the two words preceding, the word coded, and the two words following. The word coded is in capital letters and the other words are lower case, so the word in question can be identified even if there are no preceding/following words in the speaker's turn. Note that since the identification of all syllable-final -/s/ in the CallHome Spanish corpus was done on a previous release of the corpus, there may be some discrepancies from the current release.
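Since the coded word is the only one written in capital letters, it can be located without knowing how many context words surround it. The sketch below assumes the field is a whitespace-separated string; the function name and the sample strings are illustrative.

```python
def find_coded_word(context):
    """Return (index, word) for the single all-capitals word in the field."""
    words = context.split()
    matches = [i for i, word in enumerate(words) if word.isupper()]
    if len(matches) != 1:
        raise ValueError(f"expected exactly one coded word in {context!r}")
    index = matches[0]
    return index, words[index].lower()


# Two preceding and two following words.
full = find_coded_word("y cómo ESTÁN por allá")
# No preceding words: the coded word opens the speaker's turn.
turn_initial = find_coded_word("ESTÁN por allá")
```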
Location of word in the speaker's turn
Indicates the word number in the speaker's turn (if it is the first word in the turn, this is "1", if the second word, "2", etc.). Note that since the identification of all syllable-final -/s/ in the CallHome Spanish corpus was done on a previous release of the corpus, there may be some discrepancies from the current release. However, when combined with the previous two fields (“header of the line in the transcript” and “words from the transcript”), the proper occurrence of the word should be easily identified.
Location of /s/ in the word
Indicates whether the -/s/ that was coded is word-final ("final") or word-internal ("nonfinal"). When a word contains more than one syllable-final or word-final /s/, this information is needed to determine which /s/ is coded. In the rare cases where there are two syllable-final word-internal -/s/, the second one is coded "nonfinal2". For example, in the word "estadísticas", indexed here as es1tadís2ticas3, s1 is “nonfinal”, s2 is “nonfinal2”, and s3 is “final”.
Preceding segment
The segment preceding the -/s/, using the same phone set as that used in the Spanish lexicon. The preceding segment was determined from the canonical pronunciation of the word from the Spanish lexicon. The CallHome Spanish corpus contains several loan words that begin with a consonant cluster starting with /s/ ("Smith"); this field is empty for such words.
Following segment
The segment immediately following the -/s/. If the /s/ is word-internal, this was determined from the canonical pronunciation of the word from the Spanish lexicon. If the /s/ is word-final and the word is immediately followed by another word, the following segment was determined from the canonical pronunciation of the following word.
Word stress pattern
Stress pattern of the word in question, from the Spanish lexicon.
Following word stress pattern
The stress of the following word, from the Spanish lexicon, if the /s/ is word-final and is immediately followed by another word.
Word start time
Approximate starting time of the word, as determined by automatic alignment of the speech, if the automatic alignment was deemed to be adequate by the person coding the syllable-final /s/. If the word didn't seem to be aligned properly when the window of speech was played, the coders were instructed to indicate that the alignment was incorrect, and then this field in the data is blank.
Word end time
Approximate ending time of the word, as determined by automatic alignment of the speech, if the automatic alignment was deemed to be adequate by the person coding the syllable-final /s/. If the automatic alignment was determined to be incorrect, this field is blank.
Length of pause following word
Amount of time following the word before the beginning of the next word, as determined by the automatic alignment. If the word was final in the speaker's turn, this value is -1. A value of 0.01 is the smallest value possible, and indicates that there was no pause between the word and the following word.
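The two special values of this field need to be handled before treating it as a duration. The sketch below is one way to do so; the function name and the returned labels are hypothetical, and it assumes the field has already been converted to a float.

```python
import math

TURN_FINAL = -1   # the word was final in the speaker's turn
NO_PAUSE = 0.01   # smallest possible value: no pause before the next word


def classify_pause(value):
    """Label a pause value as 'turn-final', 'no pause', or 'pause'."""
    if value == TURN_FINAL:
        return "turn-final"
    # Compare with a tolerance, since the value is a parsed float.
    if math.isclose(value, NO_PAUSE, abs_tol=1e-9):
        return "no pause"
    return "pause"
```

Treating -1 as a sentinel rather than as a number matters in practice: averaging the raw field would otherwise pull mean pause length down for every turn-final word.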
Coder
Indication of the person doing the coding: "m" for the male coder or "f" for the female coder.
Speaker's Dialect
Dialect information for the speaker as listed in the CallHome Spanish corpus (often just the country).
Speaker's Sex
Sex (female/male) of the speaker as listed in the CallHome Spanish corpus.
Speaker's Age
Age information (elderly/juvenile/adult) as listed in the CallHome Spanish corpus.
Corrected following word
The correct following word, if the transcript was deemed to be incorrect by the coder.
Comment
Any comment entered by the coder (there are only a handful of these).
Morphological information
Morphological information for the word, taken from the Spanish lexicon.