SYLLABLE-FINAL /S/ LENITION IN THE LDC'S CALLHOME SPANISH CORPUS

Michelle A. Fox

1. INTRODUCTION

This data corpus codes lenition of syllable-final /s/ in Latin American Spanish in the LDC's CallHome Spanish corpus. Syllable-final /s/ is well known to undergo lenition in many Latin American Spanish dialects. Lenition of -/s/ is a variable phonological process in which an -/s/ may be aspirated (pronounced [h]) or deleted altogether (Ø). Lenition of -/s/ has been widely studied by sociolinguists, who have identified various linguistic and extralinguistic factors that favor the process. Since syllable-final /s/ is frequent in Spanish, lenition has a great effect on overall pronunciation.

2. SPEECH DATA THAT HAS BEEN CODED

The speech data used as the basis for this syllable-final /s/ corpus is from the CallHome Spanish corpus published by the Linguistic Data Consortium (LDC), which contains 120 telephone conversations between native speakers of Spanish. This corpus is especially well-suited to the task of studying variation in -/s/ lenition because it contains informal speech by a large number of speakers from many different dialects. General information regarding each of the speakers, including dialect, is identified, so that dialectal studies can be performed with the data.

Each of the telephone calls in the CallHome Spanish corpus was transcribed orthographically, with no pronunciation information, so instances of underlying -/s/ were easily identified by searching through the transcriptions using the pronunciations given in the LDC Spanish Lexicon [1]. This lexicon includes the canonical pronunciation, stress pattern, and morphological information for each word. Although syllabification is not explicitly given in the lexicon, each vowel in Spanish heads a syllable, and all instances of word-internal /s/ followed by a consonant are syllable-final.

For the current data corpus, all occurrences of syllable-final /s/ were coded. All occurrences of word-final /s/ are treated as syllable-final, even though a particular -/s/ may be re-syllabified in fast speech when it is immediately followed by a vowel. In addition, surface /z/ in Spanish is actually an underlying /s/, so all syllable-final instances of /z/ in the LDC Spanish Lexicon were treated as /s/.
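The identification rule described above can be sketched in a few lines of Python. This is only an illustration, not the procedure actually used to build the corpus: the vowel set, the one-symbol-per-phone representation, and the function name are all assumptions, and the real LDC Spanish Lexicon phone set may differ (for example in how glides are written).

```python
# Hypothetical vowel inventory; the real lexicon phone set may differ.
VOWELS = set("aeiou")


def syllable_final_s_positions(phones):
    """Return indices of /s/ (or surface /z/) that are word-final or precede a consonant."""
    positions = []
    for i, phone in enumerate(phones):
        if phone not in ("s", "z"):
            continue
        is_word_final = i == len(phones) - 1
        # Assumption: any symbol that is not a vowel counts as a consonant.
        before_consonant = not is_word_final and phones[i + 1] not in VOWELS
        if is_word_final or before_consonant:
            positions.append(i)
    return positions
```

Under these assumptions, a phone sequence spelled out as `list("estadistikas")` would yield three positions, one for each /s/ in the word, while the intervocalic /s/ in `list("kasa")` would be skipped.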

3. CODING PROCEDURE

First, a list of the occurrences of syllable-final -/s/ was made from the orthographic transcript and the LDC Spanish Lexicon. Once this list was compiled, a large amount of redundancy was added in order to measure the repeatability of coding. Since the task is a difficult one, it is important to measure each coder’s consistency in coding, and to see whether the two coders used the same criteria. A total of 24,473 different instances of -/s/ from the training and development test files of the CallHome Spanish corpus were coded. 4,727 of these tokens were coded twice, and 843 of these were coded three times.

The list of tokens was then randomized to prevent the coders from being affected by listening to multiple tokens by the same speaker, either (1) by expecting the speaker to retain or delete an -/s/, and hearing what they expected, or (2) by adjusting the coding criteria to the speaker (e.g. if a particular speaker pronounced /s/ very strongly in most cases, a weaker /s/ might be mis-coded as a deletion). In a further attempt to retain constant criteria, samples of -/s/ pronounced as [s], [h], and Ø were presented to the coders at regular intervals during the coding process.

Two students at the University of Pennsylvania performed the coding. The first coder is a female native speaker of English who is proficient in Spanish and a linguistics student. The second coder is a male bilingual speaker of English and Puerto Rican Spanish not familiar with linguistics. Both were familiar with the -/s/ lenition phenomenon before beginning the project.

For each token of -/s/ to be coded, the coder was shown the orthographic transcription of the entire sentence, along with an indication of which -/s/ to code. An automatic alignment of the speech files was used to determine the approximate start and end times of the given word; from this alignment a window of speech starting 20ms before the hypothesized beginning of the word and ending 20ms after the end of the word was played. The coder was able to replay the speech and to change the window of speech as needed. Spectrograms were not used during the coding process. The coders were instructed to make a selection for each occurrence of -/s/ unless the recording quality was poor. However, when the coders felt uncertain about a classification, they were able to indicate that the classification had low confidence.
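The default playback window described above amounts to simple arithmetic on the aligned word times. A minimal sketch follows, assuming times are expressed in seconds; the function name and the clamping at zero are illustrative choices, not details taken from the coding tool.

```python
def playback_window(word_start, word_end, padding=0.020):
    """Return (start, end) in seconds, padded by 20 ms on each side and clamped at 0."""
    return max(0.0, word_start - padding), word_end + padding
```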

The coding categories available were:

·s: the /s/ was retained;

·z: the /s/ was retained and voiced;

·h: the /s/ was retained, but only as aspiration;

·Ø: the /s/ was deleted;

·R: the recording was distorted and so analysis could not be made;

·f: the following segment was also /s/, so the -/s/ in question could not be categorized;

·t: the entire syllable was truncated;

·T: the original transcript was incorrect and there was no word with a syllable-final /s/.

4. DATA FORMAT

Each individual coding is contained on one line in the file, with the fields tab delimited. The fields are as follows:

Token id

Each different occurrence of syllable-final /s/ in the CallHome Spanish corpus has a unique token id. Two codings of the same syllable-final /s/ have the same token id, so that it is easy to identify those tokens that were coded more than once.
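Because repeated codings share a token id, grouping by that field is enough to find the tokens that were coded more than once and to get a rough agreement figure. The sketch below assumes the rows have already been reduced to (token id, code) pairs; the function names are hypothetical, and exact-match agreement is only one of several possible consistency measures.

```python
from collections import Counter, defaultdict


def codings_per_token(rows):
    """Group an iterable of (token_id, code) pairs into {token_id: [codes]}."""
    grouped = defaultdict(list)
    for token_id, code in rows:
        grouped[token_id].append(code)
    return dict(grouped)


def repeat_summary(grouped):
    """Count how many tokens were coded once, twice, three times, and so on."""
    return Counter(len(codes) for codes in grouped.values())


def agreement_rate(grouped):
    """Share of multiply-coded tokens whose codings all carry the same code."""
    repeated = [codes for codes in grouped.values() if len(codes) > 1]
    if not repeated:
        return None
    return sum(len(set(codes)) == 1 for codes in repeated) / len(repeated)
```

Restricting the input rows to one coder gives a within-coder consistency figure; passing rows from both coders gives a between-coder one.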

Code

Each token was given one of the following codes:

-s: the /s/ was retained;

-z: the /s/ was retained and voiced;

-h: the /s/ was retained, but only as aspiration;

-o: the /s/ was deleted;

-R: the recording was distorted and so analysis could not be made;

-f: the following segment was also /s/, so the -/s/ in question could not be categorized;

-t: the entire syllable was truncated;

-T: the original transcript was incorrect and there was no word with a syllable-final /s/.

Confidence level

The coding task was a difficult one. The coders were instructed to make a selection for each occurrence of -/s/ unless the recording quality was poor. However, when the coders felt uncertain about a classification, they were able to indicate that the classification had low confidence. "Normal" classifications have a confidence value of 1, while classifications about which the coder felt unsure have a confidence value of 0.

Speaker id

Each speaker in the corpus has a unique speaker id. The speaker id consists of the number of the speech/transcript files in the CallHome Spanish corpus followed by the channel (A/B). In cases where there is more than one speaker on one of the channels in a speech file, the channel letter is followed by a number indicating which of the speakers on that channel is meant (this follows the numbering given in the CallHome Spanish corpus).

Header of the line in the transcript

This identifies the line in the transcript. All information preceding the colon of the turn is included. For example, for the line

312.99 314.36 A: Y cómo están por allá.

the header of the line is “312.99 314.36 A”.
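A header of this shape can be split into its parts with a few lines of Python. This is a sketch under the assumption that every header consists of two numeric time stamps followed by a speaker label, as in the example above; the function name is illustrative.

```python
def parse_header(header):
    """Split a header such as '312.99 314.36 A' into (start, end, speaker label)."""
    start, end, speaker = header.split(maxsplit=2)
    return float(start), float(end), speaker
```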

Words from the transcript

This includes the two words preceding, the word coded, and the two words following. The word coded is in capital letters and the other words are lower case, so the word in question can be identified even if there are no preceding/following words in the speaker's turn. Note that since the identification of all syllable-final -/s/ in the CallHome Spanish corpus was done on a previous release of the corpus, there may be some discrepancies from the current release.
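Since the coded word is the only one written in capitals, it can be located in this field without knowing how many context words surround it. The following sketch assumes the words are separated by whitespace and that exactly one word is fully capitalized; the function name is hypothetical.

```python
def find_coded_word(words_field):
    """Return (index, word) for the single all-capitals word in the context field."""
    words = words_field.split()
    hits = [(i, w) for i, w in enumerate(words) if w.isupper()]
    if len(hits) != 1:
        raise ValueError(f"expected exactly one capitalized word, found {len(hits)}")
    return hits[0]
```

The index returned is relative to the context field, not to the speaker's turn, so it is 0 whenever the coded word has no preceding words in the turn.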

Location of word in the speaker's turn

Indicates the word number in the speaker's turn (if it is the first word in the turn, this is "1", if the second word, "2", etc.). Note that since the identification of all syllable-final -/s/ in the CallHome Spanish corpus was done on a previous release of the corpus, there may be some discrepancies from the current release. However, when combined with the previous two fields (“header of the line in the transcript” and “words from the transcript”), the proper occurrence of the word should be easily identified.

Location of /s/ in the word

Indicates whether the -/s/ that was coded is word-final ("final") or word-internal ("nonfinal"). When a word contains more than one syllable-final or word-final /s/, this information is needed to determine which /s/ is coded. In the rare cases where there are two syllable-final word-internal -/s/, the second one is coded "nonfinal2". For example, for the word es₁tadís₂ticas₃, s₁ is “nonfinal”, s₂ is “nonfinal2”, and s₃ is “final”.
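The labeling scheme can be expressed as a small function over the positions of the syllable-final /s/ in a word. This is a sketch only: the function name and the use of zero-based positions are assumptions, and since no label is documented for a third word-internal -/s/, that case raises an error rather than inventing one.

```python
def label_s_positions(positions, word_length):
    """Map positions of syllable-final /s/ in a word to the labels used in this field."""
    labels = {}
    internal_seen = 0
    for pos in sorted(positions):
        if pos == word_length - 1:
            labels[pos] = "final"
        else:
            internal_seen += 1
            if internal_seen > 2:
                raise ValueError("no label is documented for a third word-internal /s/")
            labels[pos] = "nonfinal" if internal_seen == 1 else "nonfinal2"
    return labels
```

For the twelve letters of "estadisticas", with /s/ at zero-based positions 1, 6, and 11, this gives "nonfinal", "nonfinal2", and "final" respectively, matching the example above.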

Preceding segment

The segment preceding the -/s/, using the same phone set as that used in the Spanish lexicon. The preceding segment was determined from the canonical pronunciation of the word from the Spanish lexicon. The CallHome Spanish corpus contains several loan words that begin with a consonant cluster starting with /s/ ("Smith"); this field is empty for such words.

Following segment

The segment immediately following the -/s/. If the /s/ is word-internal, this was determined from the canonical pronunciation of the word from the Spanish lexicon. If the /s/ is word-final and the word is immediately followed by another word, the following segment was determined from the canonical pronunciation of the following word.

Word stress pattern

Stress pattern of the word in question, from the Spanish lexicon.

Following word stress pattern

The stress of the following word, from the Spanish lexicon, if the /s/ is word-final and is immediately followed by another word.

Word start time

Approximate starting time of the word, as determined by automatic alignment of the speech, if the automatic alignment was deemed to be adequate by the person coding the syllable-final /s/. If the word didn't seem to be aligned properly when the window of speech was played, the coders were instructed to indicate that the alignment was incorrect, and then this field in the data is blank.

Word end time

Approximate ending time of the word, as determined by automatic alignment of the speech, if the automatic alignment was deemed to be adequate by the person coding the syllable-final /s/. If the automatic alignment was determined to be incorrect, this field is blank.

Length of pause following word

Amount of time following the word before the beginning of the next word, as determined by the automatic alignment. If the word was final in the speaker's turn, this value is -1. A value of 0.01 is the smallest value possible, and indicates that there was no pause between the word and the following word.
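Because this field mixes two sentinel values with real durations, it helps to decode it before using it as a numeric predictor. A minimal sketch, assuming the field arrives as text; the function name and the returned labels are illustrative.

```python
def describe_pause(value):
    """Interpret the pause field: -1 marks a turn-final word, 0.01 means no pause."""
    pause = float(value)
    if pause == -1:
        return "turn-final"
    if pause <= 0.01:
        return "no pause"
    return "pause"
```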

Coder

Indication of the person doing the coding: "m" for the male coder or "f" for the female coder.

Speaker's Dialect

Dialect information for the speaker as listed in the CallHome Spanish corpus (often just the country).

Speaker's Sex

Sex (female/male) of the speaker as listed in the CallHome Spanish corpus.

Speaker's Age

Age information (elderly/juvenile/adult) as listed in the CallHome Spanish corpus.

Corrected following word

The correct following word, if the transcript was deemed to be incorrect by the coder.

Comment

Any comment entered by the coder (there are only a handful of these).

Morphological information

Morphological information for the word, taken from the Spanish lexicon.
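Taken together, the field descriptions above suggest a simple reader for one line of the data file. The sketch below assumes that the fields appear in the file in the same order in which they are described here, which should be checked against the data before use. The field names are hypothetical identifiers chosen for this example, and padding short lines with empty strings is a design choice to cope with trailing blank fields.

```python
# Hypothetical field names, listed in the order the fields are described above.
FIELDS = [
    "token_id", "code", "confidence", "speaker_id", "header", "words",
    "word_position", "s_location", "preceding_segment", "following_segment",
    "word_stress", "following_word_stress", "word_start", "word_end",
    "pause", "coder", "dialect", "sex", "age",
    "corrected_following_word", "comment", "morphology",
]


def parse_line(line):
    """Split one tab-delimited coding line into a dict keyed by field name."""
    values = line.rstrip("\r\n").split("\t")
    if len(values) > len(FIELDS):
        raise ValueError(f"expected at most {len(FIELDS)} fields, got {len(values)}")
    # Pad with empty strings in case trailing blank fields were dropped.
    values += [""] * (len(FIELDS) - len(values))
    return dict(zip(FIELDS, values))
```

All values are kept as strings so that blank fields, such as the word times of a token whose alignment was rejected, remain distinguishable from zero.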

5. REFERENCES

1. Garrett, S., Morton, T. and McLemore, C. LDC Spanish Lexicon. Linguistic Data Consortium, University of Pennsylvania, Philadelphia, 1997.