VOICE ACROSS HISPANIC AMERICA TRANSCRIPTION
===========================================

Yeshwant Muthusamy, Barb Wheatley and Joseph Picone
Personal Systems Laboratory, Texas Instruments

INTRODUCTION
------------

This document describes the conventions used to validate and transcribe
Spanish speech data collected as part of the Voice Across Hispanic America
project. Validation and transcription consist of checking whether each
utterance conforms to specifications, making subjective judgments about the
quality of the speech, transcribing the speech, and noting all events that
co-occur with the speech. This occurs after the utterances have been passed
through a preliminary validation stage described in the final report.
Utterances with the following characteristics are not included in the
official release of the corpus:

  - truncated utterances
  - extremely noisy files
  - empty or silent files
  - inappropriate responses (e.g., singing, prank responses, speech that is
    not in response to the prompt)

The transcription is done using an interactive tool, based on UNIX curses,
that, for each utterance, allows the transcriber to:

  - listen to the utterance as many times as needed
  - see the prompting text (for read utterances)
  - type in or change the transcription
  - enter speaker and utterance judgments into designated information
    fields (speaker rate, signal quality, etc.)

For read items, the prompting text is the default transcription; the
transcriber only needs to modify it if necessary.

The information fields are not part of the transcription per se. Rather,
they provide additional information about the speaker or the speech. The
values for these fields were determined by the supervisors before
transcription began and will be modified as the project progresses to
handle new cases. There are 3 speaker-specific fields, i.e., fields that
have the same value for all utterances of a speaker, and 6
utterance-specific fields. Allowable values for each field are described
below.

Speaker Information Fields
--------------------------

s1) Speaker sex:
      unidentified (default value)
      female
      male

s2) Speaker age:
      unidentified (default value)
      juvenile
      adult
      elderly

s3) Speaker accent:
      spanish (default value)
      non_native
      unidentified

Utterance Information Fields
----------------------------

u1) Signal Condition:
      unidentified (default value)
      complete
      partial
      void
      truncated
      unintelligible response
      unintelligible word
      inappropriate

    Each of these items is explained below.

    complete  - The caller gave a complete response to the prompt.
    partial   - The caller gave a partial response to the prompt. E.g., if
                asked to say 3 numbers, the caller says only one or two.
    void      - Silence (empty file), a response in another language, or
                another invalid response (e.g., song, laugh).
    truncated - The caller was cut off while speaking.
    unintelligible response - The caller's speech cannot be made out at all.
    unintelligible word     - A specific word is unintelligible.
    inappropriate - The caller did not follow instructions, but said
                something else in Spanish.

u2) Speaker Effort:
      unidentified (default value)
      normal
      high
      low

    where
      normal - normal loudness
      high   - very loud
      low    - soft

u3) Speaker Articulation:
      unidentified (default value)
      normal
      deliberate
      poor

u4) Speaker Rate:
      unidentified (default value)
      normal
      fast
      slow

u5) Speaker Quality:
      unidentified (default value)
      normal
      abnormal

u6) Signal Quality:
      unidentified (default value)
      normal
      echo
      distortion
      line noise
      mouth noise
      background noise
      background speech
      intelligible background speech

These values are explained in detail elsewhere in this document. These
fields are in the header of the NIST format speech files. A sample NIST
file header is given in the final report.
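For corpus users who want to check field values programmatically, the
field names and allowable values listed above can be collected in a small
lookup table. The sketch below (in Python) is illustrative only: the key
names such as "speaker_sex" and "signal_quality" are assumptions made for
the example, not necessarily the exact field names used in the NIST
headers.

    # Illustrative sketch only: the field keys below are assumptions, not
    # necessarily the exact key names used in the NIST file headers.
    ALLOWED_VALUES = {
        # speaker-specific fields (same value for all utterances of a speaker)
        "speaker_sex":    {"unidentified", "female", "male"},
        "speaker_age":    {"unidentified", "juvenile", "adult", "elderly"},
        "speaker_accent": {"spanish", "non_native", "unidentified"},
        # utterance-specific fields
        "signal_condition": {"unidentified", "complete", "partial", "void",
                             "truncated", "unintelligible response",
                             "unintelligible word", "inappropriate"},
        "speaker_effort":       {"unidentified", "normal", "high", "low"},
        "speaker_articulation": {"unidentified", "normal", "deliberate", "poor"},
        "speaker_rate":         {"unidentified", "normal", "fast", "slow"},
        "speaker_quality":      {"unidentified", "normal", "abnormal"},
        "signal_quality": {"unidentified", "normal", "echo", "distortion",
                           "line noise", "mouth noise", "background noise",
                           "background speech",
                           "intelligible background speech"},
    }

    def check_field(field, value):
        """Return True if 'value' is allowable for 'field'.  A Signal
        Quality field may hold several comma-separated values (see
        'Multiple sources of noise' below)."""
        allowed = ALLOWED_VALUES[field]
        return all(v.strip() in allowed for v in value.split(","))

For example, check_field("signal_quality", "line noise, mouth noise")
returns True, matching the multiple-value convention described at the end
of this document.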
TRANSCRIPTION CONVENTIONS
-------------------------

For the sake of consistency among POLYPHONE corpora, we have attempted to
follow the conventions used in the Macrophone American English corpus as
far as possible. Wherever appropriate, examples are given in Spanish, with
English translations.

CASE: All transcription is done in lexical case (i.e., proper names begin
with uppercase letters).

PUNCTUATION: No punctuation is used, except for apostrophes. Periods are
not used. Hyphenated words are not common in Spanish. If English hyphenated
words occur, they are either transcribed as a single compound word, if
appropriate, or split into two separate words.

DIACRITICS: The following conventions are followed for transcribing
diacritic marks and special characters unique to Spanish:

    Spanish   Transcription
    -------   -------------
      á          a'
      é          e'
      í          i'
      ó          o'
      ú          u'
      ñ          n~
      ¿          ??
      ¡          !!
      ü          u"

ABBREVIATIONS: No abbreviations are used. Titles such as sen~or, sen~ora,
and sen~orita are spelled out fully rather than being transcribed as sr,
sra and srta respectively. Words such as doctor and saint (Spanish: santo -
masculine, santa - feminine) are spelled out as complete words, rather than
being transcribed as dr or sto or sta.

ACRONYMS: Acronyms are transcribed as words if they are said as words. They
are spelled out (see below) if the subject says the names of the letters in
the acronym rather than saying it as a word. For example, "NIST", if
pronounced as /nist/, would be transcribed as a word:

    nist

SPELLED WORDS: Spelled words and acronyms where the subject says the letter
names, such as "IBM", are transcribed by leaving spaces around each letter.
For example, "IBM" would be transcribed as:

    i b m

And "U. S. A." would be transcribed:

    u s a

SPECIAL NOTE ABOUT "Q" and "W": Many of the read spelled words and some of
the spontaneous sentences contain the letters "w" and "q". Since there is
no "w" in Spanish, there is variability among speakers in how it is read:
some say it as "doble u" (double u) and others say it in some other
arbitrary manner. As for "q", when it is followed by "u", Spanish assigns
it the phoneme "k", but when "q" occurs alone (as in a spelled word), its
pronunciation, again, can be arbitrary. As a result, files containing these
letters may be poor candidates for use in training acoustic models without
closer attention to their actual spoken content. The files affected by this
issue can be identified from the transcription table (the file transcrp.tbl
in the doc directory) by searching for lines containing " q " or " w " --
that is, the isolated letters, bounded on both sides by spaces. (The
transcription table has been prepared to ensure that each full utterance is
bounded by space characters, so a search for space-bounded patterns will
find occurrences in initial and final positions, as well as medial
positions within the utterance.) For convenience, a list of affected file
names has been prepared in advance, called "q_and_w.lst".
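The space-bounded search described above can be done with standard tools
(e.g., grep ' q ' transcrp.tbl). As a rough sketch, the Python fragment
below collects the names of the affected files. It assumes, purely for
illustration, that each line of transcrp.tbl begins with a file name
followed by the utterance text and that the table is Latin-1 encoded;
check the actual table layout in the doc directory before relying on it.

    # Illustrative sketch, not part of the corpus tools.  Assumes each
    # line of transcrp.tbl holds a file name followed by the utterance
    # text, with the utterance bounded by spaces as described above.
    def find_q_and_w(table_path="transcrp.tbl"):
        affected = []
        with open(table_path, encoding="latin-1") as table:
            for line in table:
                if " q " in line or " w " in line:
                    affected.append(line.split()[0])  # assumed: file name first
        return affected

    if __name__ == "__main__":
        for name in find_q_and_w():
            print(name)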
INSERTIONS, DELETIONS and SUBSTITUTIONS:

If the subject misreads a sentence, producing words that differ from the
prompt text, the transcription is changed to match what the subject
actually said. If the subject leaves out a word, that word is removed from
the transcription. If the subject changes a word, saying a word that is
different from the one in the prompt text, the word that was actually
produced replaces the one from the prompt text. If the subject inserts an
extra word, adding a "the", for example, when there wasn't one in the
prompt text, that word is inserted in the transcription. Note that this
rule applies only to correctly formed words. Also, in the case of
substitutions, if the substituted word does not make sense in the local
context, it is included as long as it appears to be a reasonable
pronunciation of an alternative word. We will not worry about semantic
coherence. Mispronunciations, word fragments and other disfluencies are
handled as described below.

DISFLUENCIES:

Mispronunciations:
------------------

Obviously mispronounced words are marked by placing a "*" both immediately
before and immediately after the word. For example:

    *personalise*

The transcription of the word itself is not modified in any way. There is
no attempt made to produce a phonetic-level transcription. In general,
there is a high degree of tolerance for pronunciation variants: dialectal
variants, such as "doh" for "dos" ('two' in Spanish), are not marked as
mispronunciations. Spanish does not allow multiple pronunciations of a
word, so that is not an issue here. Finally, a fair degree of latitude is
given in judging the pronunciation of less common words and names with
which the subject may not be familiar. If the subject produces a reasonable
pronunciation based on the spelling of the word or name, whether correct or
not, it is not normally marked as a mispronunciation. For instance, any
reasonable attempt at pronouncing "Deng Xiao Ping" would be accepted, while
non-standard pronunciations of "Tierra del Fuego" or "Velasquez" would be
marked as mispronunciations.

Word fragments and stutters:
----------------------------

Partial words are transcribed by entering the portion of the word that was
said, immediately followed by a "=" to show that some portion of the word
is missing. If it is clear what the intended word was, the missing portion
of the word can optionally be shown inside of parentheses, prior to the
"=". If the first part of the word is missing, the "=" appears in front of
the portion of the word that was produced; this is not very common, except
in the case of truncations, which are discarded anyway. For example, the
following would both be legitimate transcriptions for the word fragment
"ame'" when the subject was attempting to say "ame'rica":

    ame'=
    ame'(rica)=

Because the transcription is done on a word level, and not on a phonetic
level, the portion of the word that is shown is that which comes closest to
matching what the subject said given the spelling of the word. Often,
especially with stutters, only a single phone is produced and there is no
way of knowing what the intended word was. In this case the letter that
comes closest to representing that phone is used, followed by the "=", as
in "s=".
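Corpus users who want word lists free of disfluency markup can strip these
markings mechanically. The sketch below illustrates one way to do so; the
function is hypothetical, not part of the corpus tools. Word fragments
ending in "=" are dropped, and the "*" markers around mispronounced words
are removed while the word itself is kept.

    # Illustrative sketch only, not part of the official corpus tools.
    def clean_disfluencies(tokens):
        """Drop word fragments such as "ame'=" or "ame'(rica)=" and strip
        the '*' markers from mispronounced words such as "*personalise*"."""
        cleaned = []
        for tok in tokens:
            if tok.endswith("="):        # word fragment or stutter: discard
                continue
            if tok.startswith("*") and tok.endswith("*") and len(tok) > 2:
                tok = tok[1:-1]          # keep the word, drop the markers
            cleaned.append(tok)
        return cleaned

    # Example: the fragment is dropped, the mispronounced word is kept.
    print(clean_disfluencies(["ame'(rica)=", "ame'rica", "*personalise*"]))
    # -> ["ame'rica", 'personalise']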
Verbal deletions and full word stutters:
----------------------------------------

Verbal deletions of full words and full word stutters are not marked. Each
instance of the word is entered in the transcription. For example, if the
subject stuttered while saying "Nueva York", but produced the full word
"Nueva" as the stutter, the transcription would be:

    Nueva Nueva York

For another example, assume the subject was reading the number "2578" and
misread the "5" as "4", then corrected himself. The transcription would be:

    dos cuatro cinco siete ocho

Or, if he inserted other words or verbal hesitations at the point where he
realized the mistake, the transcription might be something like:

    dos cuatro [eh] no cinco siete ocho

Or:

    dos cuatro no dos cinco siete ocho

Prosodic Markings:
------------------

Pauses: Pauses are not marked in any way.

Emphatic or abnormal stress: Stress is not marked in any way.

Lengthening: Lengthening is marked only in a few extreme cases. The
convention for marking lengthening is to append a ":" immediately following
the lengthened sound (or the letter in the word that most closely
represents that sound). For example, if the subject says "nnnnnno", drawing
the "n" out, the transcription would be:

    n:o

Speech Style:
-------------

Different speaking styles are not annotated in any way.

Unintelligible speech:
----------------------

If the entire speech in a file is unintelligible, it is marked as
[unintelligible] in the transcription. If only one or more WORDS are
unintelligible, then the marker [unintelligible] replaces those words in
the transcription. For example, if the complete utterance was 'modificar
lista' and the speaker said something unintelligible for the first word,
then the transcription would be:

    [unintelligible] lista

Duration is not considered: unintelligible speech of any length is marked
with only one [unintelligible] marker.

NON-SPANISH WORDS OR PHRASES:

If one or more non-Spanish (e.g., English) words occur in the utterance,
the foreign words are enclosed in "<" and ">" along with the abbreviated
name of the language. For example,

    cinco cuatro <eng seven> seis dos

indicates the person lapsed into English because he said 7 instead of 6
while reading the number string 5462. These utterances are still included
in the corpus as they contain useful speech. The foreign word delimiters
will help users of the corpus to set aside such bilingual utterances
whenever they require utterances with only Spanish speech.
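Because the delimiters are plain characters, bilingual utterances are easy
to screen out. The fragment below is a minimal, illustrative filter that
treats any "<...>" span as foreign material; the helper names are
hypothetical and not part of the corpus tools.

    # Illustrative sketch only.  Treats any "<...>" span as foreign
    # material, per the delimiter convention described above.
    def is_spanish_only(transcription):
        """Return True if the transcription contains no foreign-word delimiters."""
        return "<" not in transcription and ">" not in transcription

    # Keep only Spanish-only utterances from (file_name, transcription) pairs.
    def spanish_only(utterances):
        return [(name, text) for name, text in utterances
                if is_spanish_only(text)]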
EXTRANEOUS OR NON-SPEECH EVENTS:

Non-speech events are marked in the Signal Quality information field
mentioned above. Sometimes, when the extraneous event is completely
localized in the speech (i.e., does not co-occur with speech), it is useful
to indicate its location in the transcription as well, so that recognition
systems can model it.

IMPORTANT EXCEPTION to this rule: 'normal', 'echo' and 'distortion' are
only marked in the Signal Quality field. They never appear in the
transcription.

Each of the Signal Quality field values is described below. Where
appropriate, sub-categories are defined. These sub-categories are marked
within the transcription using square brackets.

echo - Used to indicate echoing of the speech (a feature of some bad phone
    lines). Marked only in the Signal Quality field.

distortion - Used to indicate distortion of the speech due to a bad phone
    line. Marked only in the Signal Quality field.

background noise - Used to mark background noise. Includes noise from any
    source. Possible noise sources include but are not limited to: dogs
    barking, bird or other pet noises, phone cord tapping, paper rustling,
    finger tapping, and TV or radio noise (including TV or radio speech).
    [background_noise] is a "catch-all" marker that is used for any noise
    that is notable but does not fall into one of the other categories. If
    the noise source is localized and easily identifiable, the following
    sub-categories may be marked in the transcription:

        [paper_rustle]
        [handset_noise]
        [click]
        [bg_laughter]

    (If there are two clicks, mark them as [click] [click]; if there are
    three clicks, mark them as [click] [click] [click], and so on.) The
    transcribers are instructed not to spend too much effort identifying
    the noise sources. The above sub-categories are to be used only if the
    source is obvious. [bg_laughter] is used to mark laughter produced by
    people in the background; laughter by the subject is dealt with under
    'mouth noise' (see below). Any other source of background noise is
    simply marked as [background_noise].

    Examples: If there was handset noise after the person said "tercer
    nu'mero", then it is transcribed as:

        tercer nu'mero [handset_noise]

    However, if instead of handset noise there was a dog bark after the
    phrase, then it is transcribed as:

        tercer nu'mero [background_noise]

    because a dog bark is not one of the four common sub-categories.

background speech - Used to mark background speech, or cross-talk, that is
    not intelligible enough to be transcribed. Background speech is defined
    as audible speech from other talkers in the area the subject is calling
    from. This DOES NOT include speech by the subject that is directed to
    someone else and is not in response to a prompt from our system; such
    speech is to be transcribed. For example, if the subject turned to
    someone else and said something like:

        "shh estoy en el tele'fono" (shh, I'm on the phone)

    and then responded to the prompt with 'ochenta y siete' (87), the
    transcription would be:

        //shh estoy en el tele'fono// ochenta y siete

    with '//' as markers. Audible speech from other talkers is defined as
    any speech that is loud enough to be identified as speech, but is not
    intelligible.

intelligible background speech - Used to mark audible speech in the
    background that is intelligible enough to be transcribed. Such speech
    is distinguished from the subject's speech by the left and right
    markers '[bg' and 'bg]' respectively. For example, if someone in the
    background said '¿Quién está en el teléfono?' (Who is on the
    telephone?) after the subject finished saying 'quitar lista', then the
    transcription would be:

        quitar lista [bg ??quie'n esta' en el tele'fono bg]

line noise - Used to mark noticeable popping or static from the telephone
    lines. 'line noise' may be used to mark the popping noises that result
    from dropped packets in the transmission of digitized speech over phone
    lines. It is also used to mark files that have noticeable static. The
    'line noise' marker is to be used sparingly, and only when the noise is
    clearly due to the telephone lines. If there is any doubt about the
    noise source, the 'background noise' marker should be used instead.
    Normal levels of telephone noise and low levels of static that would
    probably not be noticed by a telephone user are not marked.

mouth noise - The following self-explanatory sub-categories are used to
    mark mouth or nose noises produced by the subject, excluding verbalized
    hesitations:
        [cough]
        [throat_clear]
        [sniff]
        [sneeze]
        [breath_noise]
        [inhalation]
        [exhalation]
        [lip_smack]
        [tongue_click]
        [laughter]

    For example, if the subject coughed before responding, the
    transcription would be:

        [cough] si me gustari'a      (yes, I would like to)

    It is to be emphasized again that the above sub-categories are provided
    so that recognition systems can model them if they occur often enough.
    However, such detailed marking is time-consuming. If the transcriber is
    not sure which of the above sub-categories the mouth noise falls into,
    she just marks it as 'mouth noise' in the Signal Quality field and
    moves on.

normal - Used when none of the above values apply, i.e., when the utterance
    is OK in all respects. Marked only in the Signal Quality field.

Verbalized Hesitations:
-----------------------

[eh] is used to mark verbal hesitations within the transcription. It has no
corresponding Signal Quality field value. All verbalized hesitations,
whether the subject actually says "eh" or says something different, such as
"um", "mm", "uh", "este", etc., are marked as [eh]. There is no attempt
made to distinguish between the different possible verbalized hesitations,
or to characterize them on a phonetic level. Duration is also not
considered: verbalized hesitations of any length are marked simply with one
[eh] marker.

PLACEMENT OF EVENT DESCRIPTORS:

Events that do not co-occur with speech:
----------------------------------------

Events that do not conflict or co-occur with speech should be marked by
giving the appropriate value to the Signal Quality field and, where
possible, by placing the appropriate sub-category descriptor in the
transcription at the place where the event occurs. By definition, the mouth
noises described above and [eh] should NEVER co-occur with speech. They are
simply inserted at the place where they occur in the utterance.

Events that co-occur with speech:
---------------------------------

[background_noise], [background_speech], and [line_noise] may co-occur with
speech. They may occur as single events that coincide with the subject's
production of a single word, or they may span several words or even the
entire utterance. In either case, they are simply marked in the Signal
Quality field. Since almost all utterances in this corpus are no longer
than 2 seconds, providing detailed locational information for events that
co-occur with speech is considered too costly to be worthwhile.

Multiple sources of noise:
--------------------------

Many utterances will have multiple sources of noise in them, e.g., both
background noise and line noise, or line noise and mouth noise. In such
cases, multiple values are assigned to the Signal Quality field, in
alphabetical order. For example, if there is both line noise and mouth
noise, then the Signal Quality field has the value 'line noise, mouth
noise'. All the original rules for marking them in the transcription also
apply.
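As a summary of the markup described in this document, the sketch below
separates a transcription line into the subject's own words and the
bracketed event markers, after removing '//' asides, '[bg ... bg]'
background speech, and '<...>' foreign-word spans. It is an illustrative
reading of these conventions, not an official corpus tool, and the regular
expressions are assumptions about how the markers appear in the
transcriptions.

    # Illustrative reader for the conventions in this document.
    # Not an official corpus tool; the patterns below are assumptions.
    import re

    MARKER  = re.compile(r"\[[^\]]+\]")     # [eh], [cough], [background_noise], ...
    ASIDE   = re.compile(r"//.*?//")        # speech directed at someone else
    FOREIGN = re.compile(r"<[^>]+>")        # foreign-word spans, e.g. <eng ...>
    BG      = re.compile(r"\[bg .*? bg\]")  # intelligible background speech

    def split_transcription(line):
        """Return (words, events): 'events' are the bracketed noise and
        hesitation markers; 'words' are the subject's own in-prompt words."""
        # Drop background speech, asides to other people, and foreign spans.
        for pattern in (BG, ASIDE, FOREIGN):
            line = pattern.sub(" ", line)
        events = MARKER.findall(line)
        words = MARKER.sub(" ", line).split()
        return words, events

    words, events = split_transcription(
        "//shh estoy en el tele'fono// [eh] ochenta y siete [cough]")
    # words  -> ['ochenta', 'y', 'siete']
    # events -> ['[eh]', '[cough]']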