-------------------------------------------------------------
	Description of the Hub-5 telephone speech and transcript 
	           corpus for Spanish, 106 transcripts	
	-------------------------------------------------------------

January 12, 1997


	Project leader:		Jennifer Alabiso

	Programming:		David Graff
				Robert MacIntyre
				Zhibiao Wu

	Personnel:		Jennifer Alabiso

	Transcribers:		Elisa Munoz (lead transcriber)
				Gustavo Gallegos
				Philip Garrison
				Karla Lozano
				Angelica Minero
				Claudia Palmeros
			
CONTENTS

	1. Summary abstract
	2. Data acquisition
	3. Data verification
	4. Speaker demographics
	5. Data transcription - General
	6. Data transcription - Interjections
	7. Data transcription - Non-lexemes
	8. Quality control (QC) procedures


-----------------------------------------------------------------------
1.  Summary abstract

	This corpus consists of 10-30 minute transcriptions from 106
recorded telephone conversations originally collected by the LDC in
support of the project on Language Recognition, sponsored by the
U.S. Department of Defense.  The transcribed data is intended as
additional training data in support of the project on Large Vocabulary
Conversational Speech Recognition (LVCSR), also sponsored by the
U.S. Department of Defense.

	This release of the Hub-5 Spanish corpus consists of 106
unscripted telephone conversations between native speakers of Spanish.
The transcripts cover a contiguous 10-30 minute segment (see section 2
below) taken from a recorded conversation lasting up to 30 minutes.
All speakers were aware that they were being recorded.  They were
given no guidelines concerning what they should talk about.  Once a
caller was recruited to participate, he/she was given a free choice of
whom to call.  Most participants called family members or close
friends.  All calls originated in North America and were placed to
various locations within North America, Puerto Rico or the Dominican
Republic.  The distribution of call destinations can be found in the
file "spkrinfo.tbl".

	The transcripts are timestamped by speaker turn for alignment
with the speech signal, and are provided in standard orthography.

-----------------------------------------------------------------------
2.  Data acquisition

	Speakers were solicited by the LDC to participate in this
telephone speech collection effort via the internet, publications
(advertisements), and personal contacts.  A total of 200 call
originators were found, each of whom placed a telephone call via a
toll-free robot operator maintained by the LDC.  Access to the robot
operator was possible via a unique Personal Identification Number
(PIN) issued by the recruiting staff at the LDC when the caller
enrolled in the project.  The participants were made aware that their
telephone call would be recorded, as were the call recipients.  The
call was allowed only if both parties agreed to being recorded.  Each
caller was allowed to talk up to 30 minutes.  Upon successful
completion of the call, the caller was paid $20 (in addition to making
a free long-distance telephone call).  Each caller was allowed to
place only one telephone call.

	In all, 106 calls were transcribed.  All of these calls are
being designated as additional training data for the LVCSR project in
Spanish.  

-----------------------------------------------------------------------
3.  Data verification

	After a successful call was completed, a human audit of each
telephone call was conducted to verify that the proper language was
spoken, to check the quality of the recording, and to select and
describe the region to be transcribed.  The description of the
transcribed region provides information about channel quality, number
of speakers, their gender, and other attributes.  The information from
this audit may be found in the file "callinfo.tbl".

-----------------------------------------------------------------------
4.  Speaker demographics

	Information on speaker demographics can be found in the file
"spkrinfo.tbl".

-----------------------------------------------------------------------
5.  Data transcription - General

	All Hub-5 telephone conversations were transcribed using
the general conventions described below.  The finite set of
"non-lexemes" (hesitation sounds) used in the transcripts is provided
in section 7 below.

	The transcription was carried out on Sun 4 workstations.  The
transcription was done using the emacs text editor which was linked to
the visual and auditory soundwave from the telephone recording in an
xwaves window.  A program written at the LDC linked the xwaves signal
to the emacs buffer so that a highlighted region of the soundwave
could be brought into the emacs buffer as a timestamp via a simple
keystroke.  Similarly, the transcribers could listen to any timemarked
turn in the transcript, and view the aligned soundwave as well.  Thus,
the transcribers had a visual as well as auditory signal that they
were transcribing.  Both the visual and auditory signal were broken
into two separate channels that could be reviewed separately or
together.

	The transcribers were given the transcription conventions
provided below as a guideline for transcribing the telephone
conversations.


	---------------------------------------------------------------
			LDC Transcription Conventions


What to transcribe

Telephone speech

	For the telephone speech transcription, the goal is to
	transcribe the entire 30 minute conversation. However, you
	should skip over the parts that are "difficult". What does
	that mean? As a rule of thumb, "difficult" means:

	- more than one or two portions of overlapping speech in a row
	- if you have to listen to a passage more than 4 times in order
	   to understand anything, it is probably too difficult to
	   transcribe 
	- heavy distortion or overwhelming background noise
	   over a portion of the conversation 

	If you skip any portion of the conversation, you should
	provide a time-stamp of the skipped speech portion (even if it
	is a minute long), and add the notation "[[skip]]" on the same
	line, separated from the timestamp by a single space:

	323.08 351.19 [[skip]] 


Definition of turns:    Speaker change


	For ease of transcription, turns can be broken up into shorter
	timestamped segments.  These segments should be no longer than
	about 8 seconds in duration.  

Timestamps:             Each speaker turn is marked with a unique timestamp
                        (in seconds). The timestamps mark the beginning and
                        end time of each turn relative to the beginning of the
                        recording. Each timestamp is precise to the 100th of a
                        second, and is in the format: beginning time [space]
                        ending time, followed by the turn.
                        Some samples:

                27.98 28.72 A: You know so

                137.49 139.47 A: yeah {breath} (( )) [distortion]

                284.54 286.79 B: %ah ^Lydia ^Van ^Damme.
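
	The line format described above can be read mechanically.  The
	following is a minimal sketch in Python, not the LDC's own tooling.
	It assumes one turn per line and speaker labels such as "A", "B" or
	"A1"; the names TURN_RE, Turn and parse_turn are hypothetical.

```python
import re
from typing import NamedTuple, Optional

# Assumed line shape: "<start> <end> <speaker>: <text>", with both
# times given to the hundredth of a second.
TURN_RE = re.compile(r"^\s*(\d+\.\d{2}) (\d+\.\d{2}) ([A-Z]\d*): ?(.*)$")


class Turn(NamedTuple):
    start: float   # seconds from the beginning of the recording
    end: float     # seconds from the beginning of the recording
    speaker: str   # e.g. "A", "B", "A1"
    text: str      # the transcribed turn, symbols left untouched


def parse_turn(line: str) -> Optional[Turn]:
    """Return a Turn for a timestamped line, or None if it does not match."""
    m = TURN_RE.match(line)
    if m is None:
        return None
    start, end, speaker, text = m.groups()
    return Turn(float(start), float(end), speaker, text.rstrip())


turn = parse_turn("284.54 286.79 B: %ah ^Lydia ^Van ^Damme.")
```

	Lines that carry no speaker label, such as a "[[skip]]" line,
	return None under this sketch and would need separate handling.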

Timestamps should be included based upon the following guidelines:


                (1) speaker change, e.g.

                        A:  Well I was thinking about that

                        B:  I know I talked to ^Jan about it yesterday

                (2) within one speaker's stretch of talk, a long
                turn should be broken up in terms of what makes
                grammatical/semantic sense, e.g.

                        A: And I told her %um I didn't I wasn't
                        setting you up to be a spiritual director or
                        anything {laugh} but I did say to her that if she
                        were to talk if she felt that she wanted to
                        talk about her prayer experience in Spanish

                        A: that you would probably be able to certainly
                        to understand her but to empathize a little bit
                        with what she was experiencing

                (3) If there is an extra-long pause (more than a half
		second) within a single speaker's turn, break the turn
		up into two sections, e.g.

                        B: When we were fishing out on Lake ^Travis last
                        August I thought I saw, %uh 

                        B: %uh, ^George ^Martin, but I wasn't sure it was him.


Orthography

	For both broadcast speech and telephone speech transcription,
	we are following the general orthographic conventions
	(spelling) for the given language. Words that usually take
	capital letters in the language should be written with capital
	letters, otherwise lowercase should be used.

	In addition, we have a set of clearly defined symbols that
	should be used with items such as proper names, acronyms,
	mispronounced words, and non-lexemes (see below).

	- Capitalization: capitalization in our transcripts is used as
	  an aid for human comprehension of the text. You should follow
	  the accepted standard way to capitalize words, including words
	  at the beginning of a sentence, proper names, and so on.

		He took the car on Saturday. 
		Jane was walking along Walnut Street when I met her. 
 
	- Numerals: write out all numerals, do not use digits: 

		twenty-two 
		nineteen-ninety-five 
		seven thousand two hundred seventy-five 

	- Abbreviations: write out all abbreviations, except those
	  listed as examples in each language, if any (consult your
	  language leader):

		junior 
		doctor 


Punctuation

	The following punctuation marks should be used in the
	transcripts. The punctuation marks are primarily for ease of
	(human) reading. Use only those punctuation marks indicated
	below.

	- periods "." should be added at the end of declarative sentences
	- question marks "?" should be added at the end of interrogative 
	  sentences 
     	- commas "," should be added between clauses as is accepted in 
	  the standard orthography of the language 


Symbols

	- Acronyms I: those that are pronounced as a single word should
	  be written in caps (no spaces) and preceded by a "@" symbol: 

	         @NATO 
	         @DARPA 
	         @AIDS 

     	- Acronyms II: acronyms that are normally written as a single
	  word but pronounced as a sequence of individual letters should
	  be written in all caps (no spaces) and preceded by a "~" symbol: 

	         ~FBI 
	         ~CEO 
	         ~YMCA 

     	- Individual letters: Individual letters that are pronounced
	as such should be written in caps and preceded by a "~" symbol: 

		I got an ~A on the test. 

	- In spelling cases, every individual letter should be written in
	caps, separated by spaces and preceded by a "~" symbol:
        
	 	his name is spelled ~S ~I ~M ~P ~S ~O ~N. 

	
	- Proper names: both proper names and place names should be
	  marked with a "^" symbol. If you encounter a "proper name
	  phrase", mark only those words as proper names that are true
	  proper names on their own. Personal initials are treated as
	  proper names in these transcripts.  They must not have a
	  period after them unless this marks the end of a sentence.

	         ^Frank ^Sinatra 
	         ^Beijing 
	         ^Sony 
        	 ^Maria's Bar and Grill 
        	 
	- Middle Initials or abbreviated first names should be treated as
	individual letters, and thus, should be preceded by a "~" symbol:
		
		^Homer ~L ^Simpson
		he calls himself ~J ~R ^Jones

     	- Partial words: partial words are indicated with a dash
	  (without any spacing between the dash and the word): 

	         absolu- 
	         -tion 

	- Mispronounced words: if a word is mispronounced (such as a
	  slip of the tongue), provide the correct spelling of the word,
	  and place a "+" symbol in front of the word:

	         +probably 
	         +yesterday 

     	- Interjections: in each language, we have a set of
	  standardized spellings for interjections.

		(see the list of interjections below)

	- Non-lexemes: in addition to the interjections (which are
	  considered to be words), we also have a set of standardized
	  spellings for hesitation sounds that speakers make while
	  speaking in each language. Every such "non word" in the
	  transcripts is marked with the "%" symbol. 

		(see the list of non-lexemes below)

     	- Idiosyncratic words: if a speaker uses a "made-up" word
	  which is not used by other speakers (although it may be
	  understandable), place a "*" symbol before the word. Consult
	  your language leader in cases where you are uncertain whether
	  a word fits in this category. Onomatopoeia fits into this
	  category:

         	*poodle-ish 
         	Do you dress like a *schlump yet? 
         	why she said *drr I don't know. 
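
	Because each symbol above is a single leading character, a token
	can be classified by looking at its first character.  The sketch
	below (Python) illustrates this under the assumption that tokens
	are separated by whitespace; the category names and the function
	classify are hypothetical and not part of the conventions.

```python
# Leading symbol -> category, following the symbol list above.
PREFIXES = {
    "@": "acronym_word",      # acronym pronounced as a word, e.g. @NATO
    "~": "letter_sequence",   # spoken letters, e.g. ~FBI or ~A
    "^": "proper_name",       # proper or place name, e.g. ^Beijing
    "+": "mispronounced",     # mispronounced word, e.g. +probably
    "%": "non_lexeme",        # hesitation sound, e.g. %uh
    "*": "idiosyncratic",     # made-up word, e.g. *schlump
}


def classify(token: str) -> str:
    """Classify one whitespace-separated token by its leading symbol."""
    word = token.strip(".,?")          # drop the allowed punctuation marks
    if not word:
        return "punctuation"
    if word[0] in PREFIXES:
        return PREFIXES[word[0]]
    if word.startswith("-") or word.endswith("-"):
        return "partial"               # partial word, e.g. absolu- or -tion
    return "plain"


labels = [classify(t) for t in "his name is ~J ~R ^Jones.".split()]
```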


Noises

	In order to account for sound phenomena such as distortion,
	coughs, breaths, unintelligible speech, foreign words and
	phrases, etc., we utilize a set of unique brackets.

     	- {text}: sound made by the talker. Use only those sounds
	  described below:

	         {laugh} 
	         {cough} 
	         {sneeze} 
	         {breath} 
	         {lipsmack}

	- [text]: sound not made by the talker (usually background or
	  channel). This notation should be used only in those rare
	  cases where the background condition is overwhelming. Use only
	  those descriptions provided below:

	         [distortion] 
        	 [static] -- used for channel noise such as "buzzes", 
			     "pops", etc. 
         	[background] -- used for other noises such as children 
				crying, pots being struck, etc. 

	- [text/] [/text]: marks when sound not made by the talker is
	  non-instantaneous. Place this at the beginning and end of the
	  noisy region. 

         	[distortion/] I am not really sure. [/distortion] 
         	[static/] Sure, she really loved it. [/static] 
         	[background/] Yes, that is my little girl. [/background] 


Other conventions

	- ((text)): unintelligible speech. This is the transcriber's
	  best guess. It should only be used during the first stage of
	  transcription to aid in the recognition of the word. It should
	  be either corroborated or eliminated during the checking stage.

         	((wonderful)) 
         	Well, I ((thought)) that it was fine. 
         	And then she told me that I should ((just leave)). 

	- (( )): unintelligible speech that you cannot even make a
	guess at (with a single space between the parentheses).

         	I went to the (( )) on my way over. 

     	- <language text>: this is used to indicate speech (one or
	  more words) in another language. In place of "language", write
	  the name of the language, if known. If the language is not
	  known, type "?". If you do not know how to transcribe what was
	  said, use the "(( ))" notation. Our rule of thumb for noting a
	  "foreign word" is that these words are not pronounced as
	  native words. For example, the pronunciation of the word
	  "okay" has been nativized and we are writing it as a Spanish word
	  following the standard Spanish spelling: "okey." Moreover,
	  foreign proper names should not be marked with a language tag
	  unless there exists a commonly used translation of that name in
	  Spanish, such as "New York" and "Nueva York." If you have any
	questions, consult your language leader.

		sí, viaja bastante a <English ^New ^Jersey> y a ^Nueva ^York.
        	And then I took all of the <German sachen> to my room. 
         	That type of cheese is called <French fromage rough> 
         	^John told me that (( )) did not like <? olas>. 
         	then there were a couple of <? (( ))> which I tried on.

     	- <as> text </as>: this is used to mark an aside made by the
	  primary talker where the talker is addressing someone in the
	  background. 

         	no, no <as> quit it, I'm talking to your sister, </as>
	        no, I don't know.

     	- <ov> text </ov>: used to indicate overlapping speech on the
	same channel. 

         	121.23 122.98 A: The store on the <ov> corner </ov>.
         	122.50 123.91 A1: <ov> Across from </ov> the ^Wawa
         	                  near your school. 


-----------------------------------------------------------------------
6.  Data transcription - Interjections

	Below is the list of common interjection spellings used in
these transcripts.

	ajá 
     	mhm (meaning "yes") 
     	mm (meaning "no") 
     	auch 
     	guau 
     	okey 
     	chao 


-----------------------------------------------------------------------
7.  Data transcription - Non-lexemes

	For LVCSR purposes, some of the speech sounds uttered by the
conversational participants were deemed to be "non-lexemes" or
periodic sound sequences that are not listed as words in the
pronunciation dictionary.  The "non-lexemes" are distinct from the set
of interjections such as "okey", which are considered to be words
in the lexicon.  The "non-lexemes" can loosely be considered as
hesitation sounds that a speaker makes while speaking.  While the
spelling of these sounds is somewhat arbitrary, the transcribers were
given a finite list from which to choose in order to maintain
orthographic consistency.  

	Below is a histogram of the non-lexeme tokens and their
frequencies in the transcribed portions of these 80 transcripts.

	3122 %ah
	2103 %ay
	1881 %eh
	1735 %mmh
	690 %oh
	70 %uy
	61 %uh
	56 %ey
	49 %oy
	29 %ha
	28 %shh
	24 %pss
	24 %uf
	12 %pff


-----------------------------------------------------------------------
8.  Quality control (QC) procedures

	The transcripts were created in an iterative manner.  The
first step was to transcribe and timestamp the appropriate portion of
each conversation.  Once this was completed, formatting and spelling
were checked and corrected.  A second pass was then made over all of
the transcripts, in which both content and formatting were checked
once more.  Throughout this process, small improvements were
constantly made and re-checked for accuracy.  In most instances, a
third (or even fourth) pass was made over the transcript to verify
its accuracy.

Spelling: 

	As the telephone conversations were being transcribed, the
words found in the transcripts were being compiled for inclusion in
pronunciation dictionaries also being prepared by the LDC.  As the
lexicon workers compiled lists of words, they checked (among other
things) for spelling errors.  The lists of spelling/typo errors found
in the transcripts were compiled, and a program was run over the
transcripts to replace a misspelled word with its correct spelling.
Thus, work on the pronunciation dictionaries of the respective
languages helped to double-check the proper spelling of all words in
the transcripts.  
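
	The replacement step can be pictured with the following Python
sketch.  It is an illustration only; the original program is not
described in this document.  The function apply_corrections and the
one-entry correction table are hypothetical, and the sketch assumes that
only whole words should be replaced.

```python
import re


def apply_corrections(text, corrections):
    """Replace whole-word occurrences of each misspelling in `corrections`.

    `corrections` maps a misspelled form to its correct spelling.
    """
    if not corrections:
        return text
    # Try longer misspellings first so that one entry cannot shadow another.
    keys = sorted(corrections, key=len, reverse=True)
    pattern = re.compile(
        r"(?<!\w)(" + "|".join(re.escape(k) for k in keys) + r")(?!\w)"
    )
    return pattern.sub(lambda m: corrections[m.group(1)], text)


# Hypothetical correction table with a single entry.
fixed = apply_corrections("okay, viaja a ^Nueva ^York, okay", {"okay": "okey"})
```

	The word-boundary guards keep a correction from firing inside a
longer word, so an entry for "okay" leaves a token such as "okays"
untouched.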

Syntax:  

	To check the well-formedness of the bracketing, a program was
written which goes over the transcripts and notes any apparent
irregularities.  This program was later adapted for on-line use by the
transcribers to be used while creating the transcripts.  A final
syntax check was run over all transcripts before the final release.
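
	A bracketing check of this kind might look like the Python sketch
below.  It is not the LDC program: it only verifies that the delimiter
characters used in the conventions ({ }, [ ], ( ) and < >) are balanced
and properly nested within a piece of text, and the function name
unbalanced_delimiters is hypothetical.

```python
# Closing delimiter -> the opening delimiter it must match.
PAIRS = {"}": "{", "]": "[", ")": "(", ">": "<"}
OPENERS = set(PAIRS.values())


def unbalanced_delimiters(text):
    """Return (position, character) for every delimiter lacking a partner."""
    stack = []      # openers seen so far that are still waiting for a closer
    problems = []   # closers that did not match the most recent opener
    for pos, ch in enumerate(text):
        if ch in OPENERS:
            stack.append((pos, ch))
        elif ch in PAIRS:
            if stack and stack[-1][1] == PAIRS[ch]:
                stack.pop()
            else:
                problems.append((pos, ch))
    # Whatever is left on the stack was opened but never closed.
    return sorted(problems + stack)


ok = unbalanced_delimiters("yeah {breath} (( )) [distortion]")
bad = unbalanced_delimiters("Well, I ((thought) that it was fine.")
```

	Here ok is an empty list, while bad reports the opening
parenthesis that was never closed.  A fuller checker would also confirm
that only the listed sound and noise names appear inside the brackets.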

Timestamps:

	To check the well-formedness of timestamps, a program was
developed that checked for (1) overlapping timestamps, (2) start times
that are greater than end times, (3) turns that are missing
timestamps, (4) the proper formatting of a blank line before each
timestamp, (5) proper number of digits in each timestamp, and (6) the
proper marking of the speaker id.  This procedure was folded into the
syntax checking procedure to be used on-line by the transcribers.  
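
	Several of these checks can be sketched in a few lines of Python.
The code below is an illustration under stated assumptions, not the LDC
program: it assumes one turn per line, treats overlap as a turn starting
before the same speaker's previous turn has ended, and covers only
checks (1), (2), (3) and (5).  The names STAMP_RE and check_timestamps
are hypothetical.

```python
import re

# Two times with exactly two decimals, then either "<speaker>:" or [[skip]].
STAMP_RE = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) (?:(\S+):|\[\[skip\]\])")


def check_timestamps(lines):
    """Return a list of (line_number, message) for suspicious lines."""
    problems = []
    last_end = {}   # speaker label -> latest end time seen for that speaker
    for num, line in enumerate(lines, start=1):
        if not line.strip():
            continue                      # blank separator lines are fine
        m = STAMP_RE.match(line)
        if m is None:
            problems.append((num, "missing or malformed timestamp"))
            continue
        start, end, speaker = float(m.group(1)), float(m.group(2)), m.group(3)
        if start > end:
            problems.append((num, "start time after end time"))
        if speaker is not None:           # [[skip]] lines have no speaker
            previous = last_end.get(speaker, 0.0)
            if start < previous:
                problems.append((num, "overlaps previous turn of same speaker"))
            last_end[speaker] = max(end, previous)
    return problems


report = check_timestamps([
    "27.98 28.72 A: You know so",
    "",
    "28.50 29.10 A: yeah",          # starts before A's previous turn ended
    "30.20 30.10 B: okey",          # start is later than end
    "31.5 32.00 B: okey",           # wrong number of digits
    "323.08 351.19 [[skip]]",
])
```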

Content:

	To check that the properly spelled and formatted transcription
actually matched the spoken signal, a second human pass was made over
all of the transcripts.