-------------------------------------------------------------
	Description of the HUB-5 telephone speech and transcript 
	           corpus for Mandarin, 42 transcripts	
	-------------------------------------------------------------

May 30, 1997


	Project leader:		Jennifer Alabiso

	Programming:		David Graff
				Robert MacIntyre
				Zhibiao Wu

	Personnel:		Jennifer Alabiso
				Nii Martey

	Transcribers:		Shudong Huang (lead transcriber)
				Nina H. Jiang
				Jing Liu
				Yongmin Yan
				Zhao-Kai Qin
				Lei Wu
			
CONTENTS

	1. Summary abstract
	2. Data acquisition
	3. Data verification
	4. Speaker demographics
	5. Data transcription - General
	6. Data transcription - Non-lexemes
	7. Quality control (QC) procedures


-----------------------------------------------------------------------
1.  Summary abstract

	This corpus consists of 5-30 minute transcriptions from 42
recorded telephone conversations originally collected by the LDC in
support of the project on Language Recognition, sponsored by the
U.S. Department of Defense.  The transcribed data is intended as
additional training data in support of the project on Large Vocabulary
Conversational Speech Recognition (LVCSR), also sponsored by the
U.S. Department of Defense.

	This release of the HUB-5 Mandarin corpus consists of 42
unscripted telephone conversations between native speakers of
Mandarin.  The transcripts cover a contiguous 5-30 minute segment
taken from a recorded conversation lasting up to 30 minutes.  All
speakers were aware that they were being recorded.  They were given no
guidelines concerning what they should talk about.  Once a caller was
recruited to participate, he/she was given a free choice of whom to
call.  Most participants called family members or close friends.  All
calls originated in North America and were placed to various locations
within North America.  The distribution of call destinations can be
found in the file "spkrinfo.tbl".

	The transcripts are timestamped by speaker turn for alignment
with the speech signal, and are provided in standard orthography.

-----------------------------------------------------------------------
2.  Data acquisition

	Speakers were solicited by the LDC to participate in this
telephone speech collection effort via the internet, publications
(advertisements), and personal contacts.  A total of 200 call
originators were found, each of whom placed a telephone call via a
toll-free robot operator maintained by the LDC.  Access to the robot
operator was possible via a unique Personal Identification Number
(PIN) issued by the recruiting staff at the LDC when the caller
enrolled in the project.  The participants were made aware that their
telephone call would be recorded, as were the call recipients.  The
call was allowed only if both parties agreed to being recorded.  Each
caller was allowed to talk up to 30 minutes.  Upon successful
completion of the call, the caller was paid $20 (in addition to making
a free long-distance telephone call).  Each caller was allowed to
place only one telephone call.

	In all, 42 calls were transcribed.  All of these calls are
being designated as additional training data for the LVCSR project in
Mandarin.  

-----------------------------------------------------------------------
3.  Data verification

	After a successful call was completed, a human audit of each
telephone call was conducted to verify that the proper language was
spoken, to check the quality of the recording, and to select and
describe the region to be transcribed.  The description of the
transcribed region provides information about channel quality, number
of speakers, their gender, and other attributes.  The information from
this audit may be found in the file "callinfo.tbl".

-----------------------------------------------------------------------
4.  Speaker demographics

	Information on speaker demographics can be found in the file
"spkrinfo.tbl".

-----------------------------------------------------------------------
5.  Data transcription - General

	All HUB-5 telephone conversations were transcribed using
the general conventions described below.  The finite set of
"non-lexemes" (hesitation sounds) used in the transcripts is provided
in section 6 below.

	The transcription was carried out on Sun 4 workstations using
the emacs text editor, which was linked to the visual and auditory
display of the telephone recording in an xwaves window.  A program
written at the LDC linked the xwaves signal to the emacs buffer so
that a highlighted region of the waveform could be brought into the
emacs buffer as a timestamp with a single keystroke.  Similarly, the
transcribers could listen to any timestamped turn in the transcript
and view the aligned waveform.  Thus, the transcribers had both a
visual and an auditory signal to work from.  Both signals were split
into two separate channels that could be reviewed separately or
together.

	The transcribers were given the transcription conventions
provided below as guidelines for transcribing the telephone
conversations.


	---------------------------------------------------------------
		LDC Transcription Conventions for Hub-5 Mandarin 1997


What to transcribe

Telephone speech

	For the telephone speech transcription, the goal is to
	transcribe the entire 30 minute conversation. However, you
	should skip over the parts that are "difficult". What does
	that mean? As a rule of thumb, "difficult" means:

	- more than one or two portions of overlapping speech in a row
	- if you have to listen to a passage more than 4 times in order
	   to understand anything, it is probably too difficult to
	   transcribe 
	- heavy distortion or overwhelming background noise
	   over a portion of the conversation 

	If you skip any substantial portion of the conversation, you should
	provide a timestamp of the skipped speech portion (even if it
	is a minute long), and add the notation "[[skip]]" on the same
	line, separated from the timestamp by a single space. NOTE: This
	notation spans both channels.
	

	323.08 351.19 [[skip]] 


Definition of turns:  Speaker change


	For ease of transcription, turns can be broken up into shorter
	timestamped segments.  These segments should be no longer than
	about 8 seconds in duration.  

	Timestamps should be included based upon the following guidelines:


                (1) speaker change, e.g.

                        A:  Well I was thinking about that

                        B:  I know I talked to ^Jan about it yesterday

                
                (2) If there is an extra-long pause (more than a half
		second) within a single speaker's turn, break the turn
		up into two sections, e.g.

                        B: When we were fishing out on Lake ^Travis last
                        August I thought I saw, %uh 

                        B: %uh, ^George ^Martin, but I wasn't sure it was him.


Timestamps:  Each speaker turn is marked with a unique timestamp
             (in seconds). The timestamps mark the beginning and
             end time of each turn relative to the beginning of the
             recording. Each timestamp is precise to the 100th of a
             second, and is in the format: beginning time [space]
             ending time, followed by the turn.
	     A: corresponds to the local channel, 
	     B: corresponds to the remote channel.

             Some samples:

                27.98 28.72 A: You know so

                30.49 32.47 A: yeah {breath} (( )) [distortion]

                31.56 32.79 B: %ah ^Lydia ^Van ^Damme.

	If there are multiple speakers on a single channel, numbers
	are appended to the letters to further distinguish the
	speakers: A1, B2, etc.
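
	The line format described above can be read mechanically. The
	following is a minimal Python sketch, not an LDC tool: the names
	Turn, LINE_RE and parse_turn are hypothetical, and the pattern
	assumes one timestamped turn (or one "[[skip]]" notation) per
	line, with continuation lines left unparsed.

```python
import re
from typing import NamedTuple, Optional


class Turn(NamedTuple):
    start: float            # seconds from the beginning of the recording
    end: float
    speaker: Optional[str]  # "A", "B", "A1", ...; None for a [[skip]] line
    text: str


# Two timestamps with two decimal places, then either the [[skip]]
# notation or a speaker label followed by a colon and the turn text.
LINE_RE = re.compile(
    r"^\s*(\d+\.\d{2})\s+(\d+\.\d{2})\s+"
    r"(?:(\[\[skip\]\])|([AB]\d*):\s*(.*?))\s*$"
)


def parse_turn(line: str) -> Optional[Turn]:
    """Return a Turn for a timestamped line, or None for any other line."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    start, end = float(m.group(1)), float(m.group(2))
    if m.group(3):
        return Turn(start, end, None, "[[skip]]")
    return Turn(start, end, m.group(4), m.group(5))
```

	Blank lines and the continuation lines of a long turn yield
	None, so a caller can attach them to the preceding turn if needed.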

Orthography

	For both broadcast speech and telephone speech transcription,
	we are following the general orthographic conventions
	(spelling) for the given language. Words that usually take
	capital letters in the language should be written with capital
	letters, otherwise lowercase should be used.

	In addition, we have a set of clearly defined symbols that
	should be used with items such as proper names, acronyms,
	mispronounced words, and non-lexemes (see below).

	- Capitalization: capitalization in our transcripts is used as
	  an aid for human comprehension of the text. You should follow
	  the accepted standard way to capitalize words, including words
	  at the beginning of a sentence, proper names, and so on.

		He took the car on Saturday. 
		Jane was walking along Walnut Street when I met her. 
 
	- Numerals: write out all numerals; do not use digits: 

		twenty-two 
		nineteen-ninety-five 
		seven thousand two hundred seventy-five 

	- Abbreviations: Abbreviations as such do not occur in Mandarin,
	and therefore all words, even if they are formed from truncation 
	processes, are treated as "real" words, with no special designation
	as "abbreviations."


	For Mandarin, we indicate word boundaries by including spaces
	between words (sequences of two or more characters).  The
	word division is based upon that found in the LDC Mandarin
	lexicon.  The principles for Mandarin word division used in
	these transcripts can be found in the document:

	"word_division.principles"


Punctuation

	The following punctuation marks should be used in the
	transcripts. The punctuation marks are primarily for ease of
	(human) reading. Use only those punctuation marks indicated
	below.

	- periods "." should be added at the end of declarative sentences
	- question marks "?" should be added at the end of interrogative 
	  sentences 
     	- commas "," should be added between clauses as is accepted in 
	  the standard orthography of the language 


Symbols 
	- Acronyms and single letters: Abbreviations, acronyms and single
	  letters do not occur in Mandarin; therefore, no special symbols
	  are needed.
  
	- Proper names: both proper names and place names should be
	  marked with a "^" symbol. As there is a possibility that a given
	  place name could also be a functional word in Mandarin, only the
	  following names should be tagged as names rather than regular words:

		i) Personal names (Chinese and foreign). 
			Surname and given name are separated.

		ii) Place names in China
			Do not tag names above the level of province and
			provincial capitals. All other place names, which
			are usually less familiar and may not be in the lexicon,
			should be tagged with a ^.

		iii) Foreign place names
			Do not tag continental, regional (e.g. Southeast
			Asia), country, capital, and major city names.
			Do not tag US state and big city (such as
			Philadelphia) names. Tag only small place names. 
			If unsure about a name, tag it. 

		iv) Institution names. 
			If the institution name contains a word that should
			otherwise be tagged under i - iii, tag just that word:
			^Motorola Company

     	- Partial words: 

	In Mandarin, indicate partial syllables (incomplete
	characters) by using Pinyin with a dash "-". Indicate partial
	multi-syllabic words (that have one or more complete syllables
	or characters) by using the Chinese character(s) with a dash
	"-"

	- Mispronounced words: if a word is mispronounced (such as a
	  slip of the tongue), provide the correct spelling of the word,
	  and place a "+" symbol in front of the word:

	         +probably 
	         +yesterday 

     	- Interjections: in each language, we have a set of
	  standardized spellings for interjections.

		(see the list of interjections below)

	- Non-lexemes: in addition to the interjections (which are
	  considered to be words), we also have a set of standardized
	  spellings for hesitation sounds that speakers make while
	  speaking in each language. Every such "non word" in the
	  transcripts is marked with the "%" symbol. 

		(see the list of non-lexemes below)

     	- Idiosyncratic words: if a speaker uses a "made-up" word
	  which is not used by other speakers (although it may be
	  understandable), place a "*" symbol before the word. Consult
	  your language leader in cases where you are uncertain whether
	  a word fits in this category. Onomatopoeia fits into this
	  category:

         	*poodle-ish 
         	Do you dress like a *schlump yet? 
         	why she said *drr I don't know. 


Noises

	In order to account for sound phenomena such as distortion,
	coughs, breaths, unintelligible speech, foreign words and
	phrases, etc., we utilize a set of unique brackets.

     	- {text}: sound made by the talker. Use only those sounds
	  described below:

	         {laugh} 
	         {cough} 
	         {sneeze} 
	         {breath} 
	         {lipsmack}

	- [text]: sound not made by the talker (usually background or
	  channel). This notation should be used only in those rare
	  cases where the background condition is overwhelming. Use only
	  those descriptions provided below 

	         [distortion] 
        	 [static] -- used for channel noise such as "buzzes", 
			     "pops", etc. 
         	[background] -- used for other noises such as children 
				crying, pots being struck, etc. 

	- [text/] [/text]: marks a sound not made by the talker that lasts
	  longer than a single word. Place these at the beginning and end
	  of the noisy region. These insertions are channel specific, and
	  each [text/] insertion indicates that the condition persists
	  until a [/text] is inserted. If the condition occurs on both
	  channels, it must be indicated on each channel.

         	[distortion/] I am not really sure. [/distortion] 
         	[static/] Sure, she really loved it. [/static] 
         	[background/] Yes, that is my little girl. [/background] 


Other conventions

	- ((text)): speech that is difficult to understand. The text
	  is the transcriber's best guess.

         	((wonderful)) 
         	Well, I ((thought)) that it was fine. 
         	And then she told me that I should ((just leave)). 

	- (( )): unintelligible speech that you cannot even make a
	guess at (with a single space between the parentheses).

         	I went to the (( )) on my way over. 

     	- <language text>: this is used to indicate speech (one or
	  more words) in another language. In place of "language", write
	  the name of the language, if known. If the language is not
	  known, type "?". If you do not know how to transcribe what was
	  said, use the "(( ))" notation. Our rule of thumb for noting a
	  "foreign word" is that these words are not pronounced as
	  native words. For example, the pronunciation of the word
	  "okay" has been nativized in Egyptian Colloquial Arabic, and
	  we are writing it as an Arabic word. If you have any
	  questions, consult your language leader.

          	And then I took all of the <German Sachen> to my room. 
         	Oh, <Spanish gracias>, he said. 
         	^John told me that (( )) did not like <? olas>. 
         	then there were a couple of <? (( ))> which I tried on. 

     	- <as> text </as>: this is used to mark an aside made by the
	  primary talker where the talker is addressing someone in the
	  background. 

         	no, no <as> quit it, I'm talking to your sister, </as>
	        no, I don't know.

     	- <ov> text </ov>: used to indicate overlapping speech on the
	same channel. 

         	121.23 122.98 A: The store on the <ov> corner </ov>.
         	122.50 123.91 A1: <ov> Across from </ov> the ^Wawa
         	                  near your school. 


-----------------------------------------------------------------------
6.  Data transcription - Non-lexemes

	For LVCSR purposes, some of the speech sounds uttered by the
conversational participants were deemed to be "non-lexemes" or
periodic sound sequences that are not listed as words in the
pronunciation dictionary.  The "non-lexemes" are distinct from the set
of interjections such as "okey", which is considered as a word
in the lexicon.  The "non-lexemes" can loosely be considered as
hesitation sounds that a speaker makes while speaking.  While the
spelling of these sounds is somewhat arbitrary, the transcribers were
given a finite list from which to choose in order to maintain
orthographic consistency.  

	Below is a histogram of the non-lexeme tokens and their
frequencies in the transcribed portions of these 10 transcripts.


	133 %呃
	64 %唔
	16 %呵
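
	A count of this kind can be reproduced from the transcripts by
tallying every token that carries the "%" marker.  The following Python
sketch is one way to do it; the function names are hypothetical, and it
assumes that a non-lexeme ends at whitespace or at one of the three
punctuation marks allowed by the conventions (period, comma, question
mark).

```python
import re
from collections import Counter
from typing import Iterable, List

# A non-lexeme is "%" followed by characters up to whitespace or punctuation.
NONLEX_RE = re.compile(r"%[^\s.,?]+")


def count_nonlexemes(lines: Iterable[str]) -> Counter:
    """Tally every %-marked token found in the given transcript lines."""
    counts = Counter()
    for line in lines:
        counts.update(NONLEX_RE.findall(line))
    return counts


def format_histogram(counts: Counter) -> List[str]:
    """Render the tally as 'count token' lines, most frequent first."""
    return ["%d %s" % (n, token) for token, n in counts.most_common()]
```

	For example, passing the lines of one transcript file to
count_nonlexemes and printing the result of format_histogram gives one
"count token" line per non-lexeme, in the same layout as the table above.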


-----------------------------------------------------------------------
7.  Quality control (QC) procedures

	The transcripts were created in an iterative manner.  The first
step was to transcribe and timestamp the appropriate portion of each
conversation.  Once this was completed, formatting and spelling were
checked and corrected.  A second pass was then made over all of the
transcripts, in which both content and formatting were checked once
more.  Throughout this process, small improvements were continually
made and re-checked for accuracy.  In most instances, a third (or even
fourth) pass was made over the transcript to verify its accuracy.  

Spelling: 

	As the telephone conversations were being transcribed, the
words found in the transcripts were being compiled for inclusion in
pronunciation dictionaries also being prepared by the LDC.  As the
lexicon workers compiled lists of words, they checked (among other
things) for spelling errors.  The lists of spelling/typo errors found
in the transcripts were compiled, and a program was run over the
transcripts to replace a misspelled word with its correct spelling.
Thus, work on the pronunciation dictionaries of the respective
languages helped to double-check the proper spelling of all words in
the transcripts.  
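
	The replacement step can be pictured as a lookup over the
space-separated words of each line.  The Python sketch below is an
illustration only and does not reproduce the LDC program: the function
name and the shape of the corrections table are assumptions.  It keeps
any leading marker (^, +, *, %) and trailing punctuation in place and
replaces only the word itself.

```python
import re
from typing import Dict

# Optional leading marker, the word itself, optional trailing punctuation.
TOKEN_RE = re.compile(r"([\^+*%]?)(.*?)([.,?]*)", re.DOTALL)


def correct_line(line: str, corrections: Dict[str, str]) -> str:
    """Replace each misspelled word in a line with its listed correction."""
    fixed = []
    for token in line.split(" "):
        # fullmatch always succeeds because every group may be empty.
        marker, word, punct = TOKEN_RE.fullmatch(token).groups()
        fixed.append(marker + corrections.get(word, word) + punct)
    return " ".join(fixed)
```

	Because words are matched whole, a correction never alters part
of a longer word, and timestamps and speaker labels pass through
unchanged unless they appear in the corrections table.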

Syntax:  

	To check the well-formedness of the bracketing, a program was
written which goes over the transcripts and notes any apparent
irregularities.  This program was later adapted for on-line use by the
transcribers to be used while creating the transcripts.  A final
syntax check was run over all transcripts before the final release.
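
	A check of this kind can be approximated in a few lines.  The
Python sketch below is not the LDC program; the function name and the
messages are hypothetical.  It assumes it is given the text of a single
turn, or all of the text of one channel when the [text/] ... [/text]
spans are to be checked, and it covers only the bracket types and sound
names listed in the conventions in section 5.

```python
import re
from typing import List

TALKER_SOUNDS = {"laugh", "cough", "sneeze", "breath", "lipsmack"}
CHANNEL_SOUNDS = {"distortion", "static", "background"}


def check_brackets(text: str) -> List[str]:
    """Return a list of apparent bracketing irregularities in the text."""
    problems = []
    # The [[skip]] notation is legal but would confuse the [text] checks.
    text = text.replace("[[skip]]", " ")

    for name in re.findall(r"\{([^{}]*)\}", text):
        if name not in TALKER_SOUNDS:
            problems.append("unknown talker sound: {%s}" % name)
    # Matches [text], [text/] and [/text]; the slash is not part of the name.
    for name in re.findall(r"\[/?([^\[\]/]*)/?\]", text):
        if name not in CHANNEL_SOUNDS:
            problems.append("unknown channel sound: [%s]" % name)

    # Every opening delimiter should have a matching closing one.
    pairs = [("((", "))"), ("{", "}"), ("[", "]"),
             ("<as>", "</as>"), ("<ov>", "</ov>")]
    pairs += [("[%s/]" % n, "[/%s]" % n) for n in sorted(CHANNEL_SOUNDS)]
    for opener, closer in pairs:
        if text.count(opener) != text.count(closer):
            problems.append("unbalanced %s ... %s" % (opener, closer))
    return problems
```

	Counting delimiters catches missing or extra brackets but not
brackets that are present in the wrong order; a stricter check would
track nesting position by position.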

Timestamps:

	To check the well-formedness of timestamps, a program was
developed that checked for (1) overlapping timestamps, (2) start times
that are greater than end times, (3) turns that are missing
timestamps, (4) the proper formatting of a blank line before each
timestamp, (5) proper number of digits in each timestamp, and (6) the
proper marking of the speaker id.  This procedure was folded into the
syntax checking procedure to be used on-line by the transcribers.  
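
	Several of these checks can be sketched compactly in Python.
The code below is an illustration rather than the LDC program, and its
names are hypothetical.  It covers checks (1), (2), (5) and (6) only,
and it assumes that overlap is tested separately for each speaker
label, since turns on the A and B channels may legitimately overlap in
time.

```python
import re
from typing import Dict, Iterable, List, Tuple

STAMP_RE = re.compile(r"^(\d+\.\d+)\s+(\d+\.\d+)\s+(\S+)")
SPEAKER_RE = re.compile(r"^[AB]\d*:$")


def check_timestamps(lines: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (line number, message) pairs for apparent timestamp errors."""
    problems = []
    last_end: Dict[str, float] = {}  # latest end time seen per speaker label
    for n, line in enumerate(lines, 1):
        m = STAMP_RE.match(line.strip())
        if m is None:
            continue  # blank line or continuation of a turn
        start_s, end_s, label = m.groups()
        start, end = float(start_s), float(end_s)
        for stamp in (start_s, end_s):
            if len(stamp.split(".")[1]) != 2:
                problems.append((n, "timestamp %s is not in 1/100 s" % stamp))
        if start > end:
            problems.append((n, "start time is greater than end time"))
        if label == "[[skip]]":
            continue  # skip notations span both channels
        if not SPEAKER_RE.match(label):
            problems.append((n, "malformed speaker id %s" % label))
            continue
        if start < last_end.get(label, 0.0):
            problems.append((n, "overlaps previous turn of %s" % label))
        last_end[label] = max(end, last_end.get(label, 0.0))
    return problems
```

	Checks (3) and (4) depend on the blank-line layout of the file
and are left out of this sketch.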

Content:

	To check that the properly spelled and formatted transcription
actually matched the spoken signal, a second human pass was made over
all of the transcripts.