VOICE ACROSS HISPANIC AMERICA TRANSCRIPTION
===========================================

Yeshwant Muthusamy, Barb Wheatley and Joseph Picone
Personal Systems Laboratory, Texas Instruments

INTRODUCTION
------------

This document describes the conventions used to validate and transcribe
Spanish speech data collected as part of the Voice Across Hispanic America
project. Validation and transcription consist of checking whether each
utterance conforms to specifications, making subjective judgments about the
quality of the speech, transcribing the speech, and noting all events that
co-occur with the speech. This occurs after the utterances have been passed
through a preliminary validation stage described in the final report.
Utterances with the following characteristics are not included in the
official release of the corpus:

  - truncated utterances
  - extremely noisy files
  - empty or silent files
  - inappropriate responses (e.g., singing, prank responses, speech that is
    not in response to the prompt)

The transcription is done using an interactive tool, based on UNIX curses,
that, for each utterance, allows the transcriber to:

  - listen to the utterance as many times as needed
  - see the prompting text (for read utterances)
  - type in or change the transcription
  - enter speaker and utterance judgments into designated information
    fields (speaker rate, signal quality, etc.)

For read items, the prompting text is the default transcription; the
transcriber only needs to modify it if necessary.

The information fields are not part of the transcription per se. Rather,
they provide additional information about the speaker or the speech. The
values for these fields were determined by the supervisors before
transcription began and will be modified as the project progresses to
handle new cases. There are 3 speaker-specific fields, i.e., fields that
have the same value for all utterances of a speaker, and 6
utterance-specific fields. Allowable values for each field are described
below.

Speaker Information Fields
--------------------------

s1) Speaker sex:
      unidentified (default value)
      female
      male

s2) Speaker age:
      unidentified (default value)
      juvenile
      adult
      elderly

s3) Speaker accent:
      spanish (default value)
      non_native
      unidentified

Utterance Information Fields
----------------------------

u1) Signal Condition:
      unidentified (default value)
      complete
      partial
      void
      truncated
      unintelligible response
      unintelligible word
      inappropriate

    Each of these items is explained below.

    complete  - The caller gave a complete response to the prompt.
    partial   - The caller gave a partial response to the prompt. E.g., if
                asked to say 3 numbers, the caller says only one or two.
    void      - Silence (empty file), a response in another language, or
                another invalid response (e.g., song, laugh).
    truncated - The caller was cut off while speaking.
    unintelligible response - The caller's speech cannot be made out at all.
    unintelligible word     - A specific word is unintelligible.
    inappropriate - The caller did not follow instructions, but said
                something else in Spanish.

u2) Speaker Effort:
      unidentified (default value)
      normal
      high
      low

    where
      normal - normal loudness
      high   - very loud
      low    - soft

u3) Speaker Articulation:
      unidentified (default value)
      normal
      deliberate
      poor

u4) Speaker Rate:
      unidentified (default value)
      normal
      fast
      slow

u5) Speaker Quality:
      unidentified (default value)
      normal
      abnormal

u6) Signal Quality:
      unidentified (default value)
      normal
      echo
      distortion
      line noise
      mouth noise
      background noise
      background speech
      intelligible background speech

These values are explained in detail elsewhere in this document. These
fields are in the header of the NIST format speech files. A sample NIST
file header is given in the final report.
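For corpus users who want to check field values programmatically, the
field names and allowable values listed above can be collected in a small
lookup table. The sketch below (in Python) is illustrative only: the key
names such as "speaker_sex" and "signal_quality" are assumptions made for
the example, not necessarily the exact field names used in the NIST
headers.

    # Illustrative sketch only: the field keys below are assumptions, not
    # necessarily the exact key names used in the NIST file headers.
    ALLOWED_VALUES = {
        # speaker-specific fields (same value for all utterances of a speaker)
        "speaker_sex":    {"unidentified", "female", "male"},
        "speaker_age":    {"unidentified", "juvenile", "adult", "elderly"},
        "speaker_accent": {"spanish", "non_native", "unidentified"},
        # utterance-specific fields
        "signal_condition": {"unidentified", "complete", "partial", "void",
                             "truncated", "unintelligible response",
                             "unintelligible word", "inappropriate"},
        "speaker_effort":       {"unidentified", "normal", "high", "low"},
        "speaker_articulation": {"unidentified", "normal", "deliberate", "poor"},
        "speaker_rate":         {"unidentified", "normal", "fast", "slow"},
        "speaker_quality":      {"unidentified", "normal", "abnormal"},
        "signal_quality": {"unidentified", "normal", "echo", "distortion",
                           "line noise", "mouth noise", "background noise",
                           "background speech",
                           "intelligible background speech"},
    }

    def check_field(field, value):
        """Return True if 'value' is allowable for 'field'.  A Signal
        Quality field may hold several comma-separated values (see
        'Multiple sources of noise' below)."""
        allowed = ALLOWED_VALUES[field]
        return all(v.strip() in allowed for v in value.split(","))

For example, check_field("signal_quality", "line noise, mouth noise")
returns True, matching the multiple-value convention described at the end
of this document.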
TRANSCRIPTION CONVENTIONS
-------------------------

For the sake of consistency among POLYPHONE corpora, we have attempted to
follow the conventions used in the Macrophone American English corpus as
far as possible. Wherever appropriate, examples are given in Spanish, with
English translations.

CASE: All transcription is done in lexical case (i.e., proper names begin
with uppercase letters).

PUNCTUATION: No punctuation is used, except for apostrophes. Periods are
not used. Hyphenated words are not common in Spanish. If English hyphenated
words occur, they are either transcribed as a single compound word, if
appropriate, or split into two separate words.

DIACRITICS: The following conventions are followed for transcribing
diacritic marks and special characters unique to Spanish:

    Spanish   Transcription
    -------   -------------
      á          a'
      é          e'
      í          i'
      ó          o'
      ú          u'
      ñ          n~
      ¿          ??
      ¡          !!
      ü          u"

ABBREVIATIONS: No abbreviations are used. Titles such as sen~or, sen~ora,
and sen~orita are spelled out fully rather than being transcribed as sr,
sra and srta respectively. Words such as doctor and saint (Spanish: santo -
masculine, santa - feminine) are spelled out as complete words, rather than
being transcribed as dr or sto or sta.

ACRONYMS: Acronyms are transcribed as words if they are said as words. They
are spelled out (see below) if the subject says the names of the letters in
the acronym rather than saying it as a word. For example, "NIST", if
pronounced as /nist/, would be transcribed as a word:

    nist

SPELLED WORDS: Spelled words and acronyms where the subject says the letter
names, such as "IBM", are transcribed by leaving spaces around each letter.
For example, "IBM" would be transcribed as:

    i b m

And "U. S. A." would be transcribed:

    u s a

SPECIAL NOTE ABOUT "Q" and "W": Many of the read spelled words and some of
the spontaneous sentences contain the letters "w" and "q". Since there is
no "w" in Spanish, there is variability among speakers in how it is read:
some say it as "doble u" (double u) and others say it in some other
arbitrary manner. As for "q", when it is followed by "u", Spanish assigns
it the phoneme "k", but when "q" occurs alone (as in a spelled word), its
pronunciation, again, can be arbitrary. As a result, files containing these
letters may be poor candidates for use in training acoustic models without
closer attention to their actual spoken content. The files affected by this
issue can be identified from the transcription table (the file transcrp.tbl
in the doc directory) by searching for lines containing " q " or " w " --
that is, the isolated letters, bounded on both sides by spaces. (The
transcription table has been prepared to ensure that each full utterance is
bounded by space characters, so a search for space-bounded patterns will
find occurrences in initial and final positions, as well as medial
positions within the utterance.) For convenience, a list of affected file
names has been prepared in advance, called "q_and_w.lst".
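The space-bounded search described above can be done with standard tools
(e.g., grep ' q ' transcrp.tbl). As a rough sketch, the Python fragment
below collects the names of the affected files. It assumes, purely for
illustration, that each line of transcrp.tbl begins with a file name
followed by the utterance text and that the table is Latin-1 encoded;
check the actual table layout in the doc directory before relying on it.

    # Illustrative sketch, not part of the corpus tools.  Assumes each
    # line of transcrp.tbl holds a file name followed by the utterance
    # text, with the utterance bounded by spaces as described above.
    def find_q_and_w(table_path="transcrp.tbl"):
        affected = []
        with open(table_path, encoding="latin-1") as table:
            for line in table:
                if " q " in line or " w " in line:
                    affected.append(line.split()[0])  # assumed: file name first
        return affected

    if __name__ == "__main__":
        for name in find_q_and_w():
            print(name)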
INSERTIONS, DELETIONS and SUBSTITUTIONS:

If the subject misreads a sentence, producing words that differ from the
prompt text, the transcription is changed to match what the subject
actually said. If the subject leaves out a word, that word is removed from
the transcription. If the subject changes a word, saying a word that is
different from the one in the prompt text, the word that was actually
produced replaces the one from the prompt text. If the subject inserts an
extra word, adding a "the", for example, when there wasn't one in the
prompt text, that word is inserted in the transcription. Note that this
rule applies only to correctly formed words. Also, in the case of
substitutions, if the substituted word does not make sense in the local
context, it is included as long as it appears to be a reasonable
pronunciation of an alternative word. We will not worry about semantic
coherence. Mispronunciations, word fragments and other disfluencies are
handled as described below.

DISFLUENCIES:

Mispronunciations:
------------------

Obviously mispronounced words are marked by placing a "*" both immediately
before and immediately after the word. For example:

    *personalise*

The transcription of the word itself is not modified in any way. There is
no attempt made to produce a phonetic-level transcription. In general,
there is a high degree of tolerance for pronunciation variants: dialectal
variants, such as "doh" for "dos" ('two' in Spanish), are not marked as
mispronunciations. Spanish does not allow multiple pronunciations of a
word, so that is not an issue here. Finally, a fair degree of latitude is
given in judging the pronunciation of less common words and names with
which the subject may not be familiar. If the subject produces a reasonable
pronunciation based on the spelling of the word or name, whether correct or
not, it is not normally marked as a mispronunciation. For instance, any
reasonable attempt at pronouncing "Deng Xiao Ping" would be accepted, while
non-standard pronunciations of "Tierra del Fuego" or "Velasquez" would be
marked as mispronunciations.

Word fragments and stutters:
----------------------------

Partial words are transcribed by entering the portion of the word that was
said, immediately followed by a "=" to show that some portion of the word
is missing. If it is clear what the intended word was, the missing portion
of the word can optionally be shown inside of parentheses, prior to the
"=". If the first part of the word is missing, the "=" appears in front of
the portion of the word that was produced; this is not very common, except
in the case of truncations, which are discarded anyway. For example, the
following would both be legitimate transcriptions for the word fragment
"ame'" when the subject was attempting to say "ame'rica":

    ame'=
    ame'(rica)=

Because the transcription is done on a word level, and not on a phonetic
level, the portion of the word that is shown is that which comes closest to
matching what the subject said given the spelling of the word. Often,
especially with stutters, only a single phone is produced and there is no
way of knowing what the intended word was. In this case the letter that
comes closest to representing that phone is used, followed by the "=", as
in "s=".
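Corpus users who want word lists free of disfluency markup can strip these
markings mechanically. The sketch below illustrates one way to do so; the
function is hypothetical, not part of the corpus tools. Word fragments
ending in "=" are dropped, and the "*" markers around mispronounced words
are removed while the word itself is kept.

    # Illustrative sketch only, not part of the official corpus tools.
    def clean_disfluencies(tokens):
        """Drop word fragments such as "ame'=" or "ame'(rica)=" and strip
        the '*' markers from mispronounced words such as "*personalise*"."""
        cleaned = []
        for tok in tokens:
            if tok.endswith("="):        # word fragment or stutter: discard
                continue
            if tok.startswith("*") and tok.endswith("*") and len(tok) > 2:
                tok = tok[1:-1]          # keep the word, drop the markers
            cleaned.append(tok)
        return cleaned

    # Example: the fragment is dropped, the mispronounced word is kept.
    print(clean_disfluencies(["ame'(rica)=", "ame'rica", "*personalise*"]))
    # -> ["ame'rica", 'personalise']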
Verbal deletions and full word stutters:
----------------------------------------

Verbal deletions of full words and full word stutters are not marked. Each
instance of the word is entered in the transcription. For example, if the
subject stuttered while saying "Nueva York", but produced the full word
"Nueva" as the stutter, the transcription would be:

    Nueva Nueva York

For another example, assume the subject was reading the number "2578" and
misread the "5" as "4", then corrected himself. The transcription would be:

    dos cuatro cinco siete ocho

Or, if he inserted other words or verbal hesitations at the point where he
realized the mistake, the transcription might be something like:

    dos cuatro [eh] no cinco siete ocho

Or:

    dos cuatro no dos cinco siete ocho

Prosodic Markings:
------------------

Pauses: Pauses are not marked in any way.

Emphatic or abnormal stress: Stress is not marked in any way.

Lengthening: Lengthening is marked only in a few extreme cases. The
convention for marking lengthening is to append a ":" immediately following
the lengthened sound (or the letter in the word that most closely
represents that sound). For example, if the subject says "nnnnnno", drawing
the "n" out, the transcription would be:

    n:o

Speech Style:
-------------

Different speaking styles are not annotated in any way.

Unintelligible speech:
----------------------

If the entire speech in a file is unintelligible, it is marked as
[unintelligible] in the transcription. If only one or more WORDS are
unintelligible, then the marker [unintelligible] replaces those words in
the transcription. For example, if the complete utterance was 'modificar
lista' and the speaker said something unintelligible for the first word,
then the transcription would be:

    [unintelligible] lista

Duration is not considered: unintelligible speech of any length is marked
with only one [unintelligible] marker.

NON-SPANISH WORDS OR PHRASES:

If one or more non-Spanish (e.g., English) words occur in the utterance,
the foreign words are enclosed in "<" and ">" along with the abbreviated
name of the language. For example,

    cinco cuatro <eng seven> seis dos

indicates the person lapsed into English because he said 7 instead of 6
while reading the number string 5462. These utterances are still included
in the corpus as they contain useful speech. The foreign word delimiters
will help users of the corpus to set aside such bilingual utterances
whenever they require utterances with only Spanish speech.
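Because the delimiters are plain characters, bilingual utterances are easy
to screen out. The fragment below is a minimal, illustrative filter that
treats any "<...>" span as foreign material; the helper names are
hypothetical and not part of the corpus tools.

    # Illustrative sketch only.  Treats any "<...>" span as foreign
    # material, per the delimiter convention described above.
    def is_spanish_only(transcription):
        """Return True if the transcription contains no foreign-word delimiters."""
        return "<" not in transcription and ">" not in transcription

    # Keep only Spanish-only utterances from (file_name, transcription) pairs.
    def spanish_only(utterances):
        return [(name, text) for name, text in utterances
                if is_spanish_only(text)]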
EXTRANEOUS OR NON-SPEECH EVENTS:

Non-speech events are marked in the Signal Quality information field
mentioned above. Sometimes, when the extraneous event is completely
localized in the speech (i.e., does not co-occur with speech), it is useful
to indicate its location in the transcription as well, so that recognition
systems can model it.

IMPORTANT EXCEPTION to this rule: 'normal', 'echo' and 'distortion' are
only marked in the Signal Quality field. They never appear in the
transcription.

Each of the Signal Quality field values is described below. Where
appropriate, sub-categories are defined. These sub-categories are marked
within the transcription using square brackets.

echo - Used to indicate echoing of the speech (a feature of some bad phone
    lines). Marked only in the Signal Quality field.

distortion - Used to indicate distortion of the speech due to a bad phone
    line. Marked only in the Signal Quality field.

background noise - Used to mark background noise. Includes noise from any
    source. Possible noise sources include but are not limited to: dogs
    barking, bird or other pet noises, phone cord tapping, paper rustling,
    finger tapping, and TV or radio noise (including TV or radio speech).
    [background_noise] is a "catch-all" marker that is used for any noise
    that is notable but does not fall into one of the other categories. If
    the noise source is localized and easily identifiable, the following
    sub-categories may be marked in the transcription:

        [paper_rustle]
        [handset_noise]
        [click]
        [bg_laughter]

    (If there are two clicks, mark them as [click] [click]; if there are
    three clicks, mark them as [click] [click] [click], and so on.) The
    transcribers are instructed not to spend too much effort identifying
    the noise sources. The above sub-categories are to be used only if the
    source is obvious. [bg_laughter] is used to mark laughter produced by
    people in the background; laughter by the subject is dealt with under
    'mouth noise' (see below). Any other source of background noise is
    simply marked as [background_noise].

    Examples: If there was handset noise after the person said "tercer
    nu'mero", then it is transcribed as:

        tercer nu'mero [handset_noise]

    However, if instead of handset noise there was a dog bark after the
    phrase, then it is transcribed as:

        tercer nu'mero [background_noise]

    because a dog bark is not one of the four common sub-categories.

background speech - Used to mark background speech, or cross-talk, that is
    not intelligible enough to be transcribed. Background speech is defined
    as audible speech from other talkers in the area the subject is calling
    from. This DOES NOT include speech by the subject that is directed to
    someone else and is not in response to a prompt from our system; such
    speech is to be transcribed. For example, if the subject turned to
    someone else and said something like:

        "shh estoy en el tele'fono" (shh, I'm on the phone)

    and then responded to the prompt with 'ochenta y siete' (87), the
    transcription would be:

        //shh estoy en el tele'fono// ochenta y siete

    with '//' as markers. Audible speech from other talkers is defined as
    any speech that is loud enough to be identified as speech, but is not
    intelligible.

intelligible background speech - Used to mark audible speech in the
    background that is intelligible enough to be transcribed. Such speech
    is distinguished from the subject's speech by the left and right
    markers '[bg' and 'bg]' respectively. For example, if someone in the
    background said '¿Quién está en el teléfono?' (Who is on the
    telephone?) after the subject finished saying 'quitar lista', then the
    transcription would be:

        quitar lista [bg ??quie'n esta' en el tele'fono bg]

line noise - Used to mark noticeable popping or static from the telephone
    lines. 'line noise' may be used to mark the popping noises that result
    from dropped packets in the transmission of digitized speech over phone
    lines. It is also used to mark files that have noticeable static. The
    'line noise' marker is to be used sparingly, and only when the noise is
    clearly due to the telephone lines. If there is any doubt about the
    noise source, the 'background noise' marker should be used instead.
    Normal levels of telephone noise and low levels of static that would
    probably not be noticed by a telephone user are not marked.

mouth noise - The following self-explanatory sub-categories are used to
    mark mouth or nose noises produced by the subject, excluding verbalized
    hesitations:
        [cough]
        [throat_clear]
        [sniff]
        [sneeze]
        [breath_noise]
        [inhalation]
        [exhalation]
        [lip_smack]
        [tongue_click]
        [laughter]

    For example, if the subject coughed before responding, the
    transcription would be:

        [cough] si me gustari'a      (yes, I would like to)

    It is to be emphasized again that the above sub-categories are provided
    so that recognition systems can model them if they occur often enough.
    However, such detailed marking is time-consuming. If the transcriber is
    not sure which of the above sub-categories the mouth noise falls into,
    she just marks it as 'mouth noise' in the Signal Quality field and
    moves on.

normal - Used when none of the above values apply, i.e., when the utterance
    is OK in all respects. Marked only in the Signal Quality field.

Verbalized Hesitations:
-----------------------

[eh] is used to mark verbal hesitations within the transcription. It has no
corresponding Signal Quality field value. All verbalized hesitations,
whether the subject actually says "eh" or says something different, such as
"um", "mm", "uh", "este", etc., are marked as [eh]. There is no attempt
made to distinguish between the different possible verbalized hesitations,
or to characterize them on a phonetic level. Duration is also not
considered: verbalized hesitations of any length are marked simply with one
[eh] marker.

PLACEMENT OF EVENT DESCRIPTORS:

Events that do not co-occur with speech:
----------------------------------------

Events that do not conflict or co-occur with speech should be marked by
giving the appropriate value to the Signal Quality field and, where
possible, by placing the appropriate sub-category descriptor in the
transcription at the place where the event occurs. By definition, the mouth
noises described above and [eh] should NEVER co-occur with speech. They are
simply inserted at the place where they occur in the utterance.

Events that co-occur with speech:
---------------------------------

[background_noise], [background_speech], and [line_noise] may co-occur with
speech. They may occur as single events that coincide with the subject's
production of a single word, or they may span several words or even the
entire utterance. In either case, they are simply marked in the Signal
Quality field. Since almost all utterances in this corpus are no longer
than 2 seconds, providing detailed locational information for events that
co-occur with speech is considered too costly to be worthwhile.

Multiple sources of noise:
--------------------------

Many utterances will have multiple sources of noise in them, e.g., both
background noise and line noise, or line noise and mouth noise. In such
cases, multiple values are assigned to the Signal Quality field, in
alphabetical order. For example, if there is both line noise and mouth
noise, then the Signal Quality field has the value 'line noise, mouth
noise'. All the original rules for marking them in the transcription also
apply.
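As a summary of the markup described in this document, the sketch below
separates a transcription line into the subject's own words and the
bracketed event markers, after removing '//' asides, '[bg ... bg]'
background speech, and '<...>' foreign-word spans. It is an illustrative
reading of these conventions, not an official corpus tool, and the regular
expressions are assumptions about how the markers appear in the
transcriptions.

    # Illustrative reader for the conventions in this document.
    # Not an official corpus tool; the patterns below are assumptions.
    import re

    MARKER  = re.compile(r"\[[^\]]+\]")     # [eh], [cough], [background_noise], ...
    ASIDE   = re.compile(r"//.*?//")        # speech directed at someone else
    FOREIGN = re.compile(r"<[^>]+>")        # foreign-word spans, e.g. <eng ...>
    BG      = re.compile(r"\[bg .*? bg\]")  # intelligible background speech

    def split_transcription(line):
        """Return (words, events): 'events' are the bracketed noise and
        hesitation markers; 'words' are the subject's own in-prompt words."""
        # Drop background speech, asides to other people, and foreign spans.
        for pattern in (BG, ASIDE, FOREIGN):
            line = pattern.sub(" ", line)
        events = MARKER.findall(line)
        words = MARKER.sub(" ", line).split()
        return words, events

    words, events = split_transcription(
        "//shh estoy en el tele'fono// [eh] ochenta y siete [cough]")
    # words  -> ['ochenta', 'y', 'siete']
    # events -> ['[eh]', '[cough]']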