-------------------------------------------------------------
Description of the HUB-5 telephone speech and transcript
corpus for Mandarin, 42 transcripts
-------------------------------------------------------------

May 30, 1997

Project leader:  Jennifer Alabiso

Programming:     David Graff
                 Robert MacIntyre
                 Zhibiao Wu

Personnel:       Jennifer Alabiso
                 Nii Martey

Transcribers:    Shudong Huang (lead transcriber)
                 Nina H. Jiang
                 Jing Liu
                 Yongmin Yan
                 Zhao-Kai Qin
                 Lei Wu

CONTENTS

1. Summary abstract
2. Data acquisition
3. Data verification
4. Speaker demographics
5. Data transcription - General
6. Data transcription - Non-lexemes
7. Quality control (QC) procedures

-----------------------------------------------------------------------
1. Summary abstract

This corpus consists of 5-30 minute transcriptions from 42 recorded
telephone conversations originally collected by the LDC in support of
the project on Language Recognition, sponsored by the U.S. Department
of Defense. The transcribed data is intended as additional training
data in support of the project on Large Vocabulary Conversational
Speech Recognition (LVCSR), also sponsored by the U.S. Department of
Defense.

This release of the HUB-5 Mandarin corpus consists of 42 unscripted
telephone conversations between native speakers of Mandarin. The
transcripts cover a contiguous 5-30 minute segment taken from a
recorded conversation lasting up to 30 minutes.

All speakers were aware that they were being recorded. They were given
no guidelines concerning what they should talk about. Once a caller was
recruited to participate, he/she was given a free choice of whom to
call. Most participants called family members or close friends. All
calls originated in North America and were placed to various locations
within North America. The distribution of call destinations can be
found in the file "spkrinfo.tbl".

The transcripts are timestamped by speaker turn for alignment with the
speech signal, and are provided in standard orthography.

-----------------------------------------------------------------------
2. Data acquisition

Speakers were solicited by the LDC to participate in this telephone
speech collection effort via the internet, publications
(advertisements), and personal contacts. A total of 200 call
originators were found, each of whom placed a telephone call via a
toll-free robot operator maintained by the LDC. Access to the robot
operator was possible via a unique Personal Identification Number (PIN)
issued by the recruiting staff at the LDC when the caller enrolled in
the project. The participants were made aware that their telephone call
would be recorded, as were the call recipients. The call was allowed
only if both parties agreed to being recorded. Each caller was allowed
to talk up to 30 minutes. Upon successful completion of the call, the
caller was paid $20 (in addition to making a free long-distance
telephone call). Each caller was allowed to place only one telephone
call.

In all, 42 calls were transcribed. All of these calls are being
designated as additional training data for the LVCSR project in
Mandarin.

-----------------------------------------------------------------------
3. Data verification

After a successful call was completed, a human audit of each telephone
call was conducted to verify that the proper language was spoken, to
check the quality of the recording, and to select and describe the
region to be transcribed. The description of the transcribed region
provides information about channel quality, number of speakers, their
gender, and other attributes. The information from this audit may be
found in the file "callinfo.tbl".

-----------------------------------------------------------------------
4. Speaker demographics

Information on speaker demographics can be found in the file
"spkrinfo.tbl".

-----------------------------------------------------------------------
5. Data transcription - General

All HUB-5 telephone conversations were transcribed using the general
conventions described below. The finite set of "non-lexemes"
(hesitation sounds) used in the transcripts is provided in section 6
below.

The transcription was carried out on Sun 4 workstations. The
transcription was done using the emacs text editor, which was linked to
the visual and auditory soundwave from the telephone recording in an
xwaves window. A program written at the LDC linked the xwaves signal to
the emacs buffer so that a highlighted region of the soundwave could be
brought into the emacs buffer as a timestamp via a simple keystroke.
Similarly, the transcribers could listen to any timemarked turn in the
transcript, and view the aligned soundwave as well. Thus, the
transcribers had a visual as well as an auditory signal that they were
transcribing. Both the visual and auditory signals were broken into two
separate channels that could be reviewed separately or together.

The transcribers were given the transcription conventions provided
below as guidelines for transcribing the telephone conversations.

---------------------------------------------------------------
LDC Transcription Conventions for Hub-5 Mandarin 1997

What to transcribe

Telephone speech

For the telephone speech transcription, the goal is to transcribe the
entire 30 minute conversation. However, you should skip over the parts
that are "difficult". What does that mean?
As a rule of thumb, "difficult" means:

   - more than one or two portions of overlapping speech in a row
   - if you have to listen to a passage more than 4 times in order to
     understand anything, it is probably too difficult to transcribe
   - heavy distortion or overwhelming background noise over a portion
     of the conversation

If you skip any substantial portion of the conversation, you should
provide a time-stamp of the skipped speech portion (even if it is a
minute long), and add the notation "[[skip]]" on the line following the
timestamp with a single space. NOTE: This notation spans both channels.

      323.08 351.19 [[skip]]

Definition of turns: Speaker change

For ease of transcription, turns can be broken up into shorter
timestamped segments. These segments should be no longer than about 8
seconds in duration. Timestamps should be included based upon the
following guidelines:

   (1) speaker change, e.g.

      A: Well I was thinking about that
      B: I know I talked to ^Jan about it yesterday

   (2) If there is an extra-long pause (more than a half second) within
       a single speaker's turn, break the turn up into two sections,
       e.g.

      B: When we were fishing out on Lake ^Travis last August I thought I saw, %uh
      B: %uh, ^George ^Martin, but I wasn't sure it was him.

Timestamps:

Each speaker turn is marked with a unique timestamp (in seconds). The
timestamps mark the beginning and end time of each turn relative to the
beginning of the recording. Each timestamp is precise to the 100th of a
second, and is in the format: beginning time [space] ending time,
followed by the turn. A: corresponds to the local channel, B:
corresponds to the remote channel. Some samples:

      27.98 28.72 A: You know so
      30.49 32.47 A: yeah {breath} (( )) [distortion]
      31.56 32.79 B: %ah ^Lydia ^Van ^Damme.

If there are multiple speakers on a single channel, appended numbers
are to be added to the letters to further distinguish speakers: A1, B2,
etc.

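For readers who want to load the transcripts programmatically, the
timestamp format described above can be parsed along the following
lines. This is a minimal sketch only: the names TURN_RE, Turn, and
parse_turn are hypothetical, and the exact line layout (single spaces,
two decimal digits, speaker labels of the form A, B, A1, B2) is an
assumption drawn from the samples above, not a description of any LDC
tool.

```python
import re
from typing import NamedTuple, Optional

# Assumed line layout, taken from the samples above:
#   "<start> <end> <speaker>: <text>"   e.g. "27.98 28.72 A: You know so"
#   "<start> <end> [[skip]]"            a skipped region spanning both channels
TURN_RE = re.compile(
    r"^(?P<start>\d+\.\d{2}) (?P<end>\d+\.\d{2}) "
    r"(?:(?P<speaker>[AB]\d*): ?(?P<text>.*)|(?P<skip>\[\[skip\]\]))$"
)


class Turn(NamedTuple):
    start: float             # seconds from the beginning of the recording
    end: float
    speaker: Optional[str]   # "A", "B", "A1", ...; None for a [[skip]] region
    text: str


def parse_turn(line: str) -> Optional[Turn]:
    """Parse one timestamped line; return None if it does not fit the layout."""
    m = TURN_RE.match(line.strip())
    if m is None:
        return None
    start, end = float(m.group("start")), float(m.group("end"))
    if m.group("skip"):
        return Turn(start, end, None, "[[skip]]")
    return Turn(start, end, m.group("speaker"), m.group("text"))
```

Calling parse_turn on a line that does not match (for example, a turn
with no timestamp) returns None, which lets a caller collect such lines
for manual review instead of failing outright.
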
Orthography

For both broadcast speech and telephone speech transcription, we are
following the general orthographic conventions (spelling) for the given
language. Words that usually take capital letters in the language
should be written with capital letters, otherwise lowercase should be
used. In addition, we have a set of clearly defined symbols that should
be used with items such as proper names, acronyms, mispronounced words,
and non-lexemes (see below).

   - Capitalization: capitalization in our transcripts is used as an
     aid for human comprehension of the text. You should follow the
     accepted standard way to capitalize words, including words at the
     beginning of a sentence, proper names, and so on.

        He took the car on Saturday.
        Jane was walking along Walnut Street when I met her.

   - Numerals: write out all numerals, do not use digits:

        twenty-two
        nineteen-ninety-five
        seven thousand two hundred seventy-five

   - Abbreviations: Abbreviations as such do not occur in Mandarin, and
     therefore all words, even if they are formed from truncation
     processes, are treated as "real" words, with no special
     designation as "abbreviations."

For Mandarin, we indicate word boundaries by including spaces between
words (sequences of two or more characters). The word division is based
upon that found in the LDC Mandarin lexicon. The principles for
Mandarin word division used in these transcripts can be found in the
document: "word_division.principles"

Punctuation

The following punctuation marks should be used in the transcripts. The
punctuation marks are primarily for ease of (human) reading. Use only
those punctuation marks indicated below.

   - periods "." should be added at the end of declarative sentences
   - question marks "?"
     should be added at the end of interrogative sentences
   - commas "," should be added between clauses as is accepted in the
     standard orthography of the language

Symbols

   - Acronyms and single letters: Abbreviations, acronyms and single
     letters do not occur in Mandarin; therefore, no special symbols
     are needed.

   - Proper names: both proper names and place names should be marked
     with a "^" symbol. As there is a possibility that a given place
     name could also be a functional word in Mandarin, only the
     following names should be tagged as names rather than regular
     words:

        i)   Personal names (Chinese and foreign). Surname and given
             name are separated.

        ii)  Place names in China. Do not tag names above the level of
             province and provincial capitals. All other place names,
             which are usually less familiar and may not be in the
             lexicon, should be tagged with a ^.

        iii) Foreign place names. Do not tag continental, regional
             (e.g. Southeast Asia), country, capital, and major city
             names. Do not tag US state and big city (such as
             Philadelphia) names. Tag only small place names. If unsure
             about a name, tag it.

        iv)  Institution names. If the institution has a word that
             should otherwise be tagged under i - iii, just tag that
             word.

                ^Motorola Company

   - Partial words: In Mandarin, indicate partial syllables (incomplete
     characters) by using Pinyin with a dash "-". Indicate partial
     multi-syllabic words (that have one or more complete syllables or
     characters) by using the Chinese character(s) with a dash "-".

   - Mispronounced words: if a word is mispronounced (such as a slip of
     the tongue), provide the correct spelling of the word, and place a
     "+" symbol in front of the word:

        +probably
        +yesterday

   - Interjections: in each language, we have a set of standardized
     spellings for interjections.
     (see the list of interjections below)

   - Non-lexemes: in addition to the interjections (which are
     considered to be words), we also have a set of standardized
     spellings for hesitation sounds that speakers make while speaking
     in each language. Every such "non word" in the transcripts is
     marked with the "%" symbol. (see the list of non-lexemes below)

   - Idiosyncratic words: if a speaker uses a "made-up" word which is
     not used by other speakers (although it may be understandable),
     place a "*" symbol before the word. Consult your language leader
     in cases where you are uncertain whether a word fits in this
     category. Onomatopoeia fits into this category:

        *poodle-ish
        Do you dress like a *schlump yet?
        why she said *drr I don't know.

Noises

In order to account for sound phenomena such as distortion, coughs,
breaths, unintelligible speech, foreign words and phrases, etc., we
utilize a set of unique brackets.

   - {text}: sound made by the talker. Use only those sounds described
     below:

        {laugh} {cough} {sneeze} {breath} {lipsmack}

   - [text]: sound not made by the talker (usually background or
     channel). This notation should be used only in those rare cases
     where the background condition is overwhelming. Use only those
     descriptions provided below:

        [distortion]
        [static]      -- used for channel noise such as "buzzes",
                         "pops", etc.
        [background]  -- used for other noises such as children crying,
                         pots being struck, etc.

   - [text/] [/text]: marks when [sound] not made by the talker lasts
     for a duration longer than a word. Place this at the beginning and
     end of the noisy region. These insertions are channel specific,
     and each [text/] insertion will indicate that the condition exists
     until the point when a [/text] is inserted. If the condition
     occurs on both channels, it must be indicated on each channel.

        [distortion/] I am not really sure. [/distortion]
        [static/] Sure, she really loved it. [/static]
        [background/] Yes, that is my little girl.
        [/background]

Other conventions

   - ((text)): unintelligible speech. This is the transcriber's best
     guess.

        ((wonderful))
        Well, I ((thought)) that it was fine.
        And then she told me that I should ((just leave)).

   - (( )): unintelligible speech that you cannot even make a guess at
     (with a single space between the parentheses).

        I went to the (( )) on my way over.

   - : this is used to indicate speech (one or more words) in another
     language. In place of "language", write the name of the language,
     if known. If the language is not known, type "?". If you do not
     know how to transcribe what was said, use the "(( ))" notation.
     Our rule of thumb for noting a "foreign word" is that these words
     are not pronounced as native words. For example, the pronunciation
     of the word "okay" has been nativized in Egyptian Colloquial
     Arabic, and we are writing it as an Arabic word. If you have any
     questions, consult your language leader.

        And then I took all of the to my room.
        Oh, , he said.
        ^John told me that (( )) did not like .
        then there were a couple of which I tried on.

   - text : this is used to mark an aside made by the primary talker
     where the talker is addressing someone in the background.

        no, no quit it, I'm talking to your sister, no, I don't know.

   - text : used to indicate overlapping speech on the same channel.

        121.23 122.98 A: The store on the corner .
        122.50 123.91 A1: Across from the ^Wawa near your school.

-----------------------------------------------------------------------
6. Data transcription - Non-lexemes

For LVCSR purposes, some of the speech sounds uttered by the
conversational participants were deemed to be "non-lexemes" or periodic
sound sequences that are not listed as words in the pronunciation
dictionary. The "non-lexemes" are distinct from the set of
interjections such as "okey", which is considered as a word in the
lexicon. The "non-lexemes" can loosely be considered as hesitation
sounds that a speaker makes while speaking.
While the spelling of these sounds is somewhat arbitrary, the
transcribers were given a finite list from which to choose in order to
maintain orthographic consistency.

Below is the histogram of the tokens and frequencies of non-lexemes
occurring in the transcribed portions of these 10 transcripts.

      133 %呃
       64 %唔
       16 %呵

-----------------------------------------------------------------------
7. Quality control (QC) procedures

The transcripts were created in an iterative manner. The first step was
to transcribe and timestamp the appropriate portion of each
conversation. Once this was completed, proper formatting and spelling
were checked and corrected. Once this was completed, a second pass over
all of the transcripts was made, where both content and formatting were
checked once more. Throughout this process, small improvements were
constantly made and re-checked for accuracy. In most instances, a third
(or even fourth) pass was made over the transcript to verify its
accuracy.

Spelling: As the telephone conversations were being transcribed, the
words found in the transcripts were being compiled for inclusion in
pronunciation dictionaries also being prepared by the LDC. As the
lexicon workers compiled lists of words, they checked (among other
things) for spelling errors. The lists of spelling/typo errors found in
the transcripts were compiled, and a program was run over the
transcripts to replace a misspelled word with its correct spelling.
Thus, work on the pronunciation dictionaries of the respective
languages helped to double-check the proper spelling of all words in
the transcripts.

Syntax: To check the well-formedness of the bracketing, a program was
written which goes over the transcripts and notes any apparent
irregularities. This program was later adapted for on-line use by the
transcribers to be used while creating the transcripts. A final syntax
check was run over all transcripts before the final release.

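The LDC's bracket-checking program is not described in detail here. As
an illustration only, a check of the bracketing conventions from
section 5 might look like the sketch below. The function name
check_brackets and all of its rules are assumptions: it takes the text
of one channel (with timestamps and speaker labels already removed),
accepts only the closed sets of {sound} and [noise] names listed in
section 5, and requires every [text/] region to be closed by a matching
[/text] on the same channel.

```python
import re

# Closed sets taken from the conventions in section 5.
SPEAKER_SOUNDS = {"laugh", "cough", "sneeze", "breath", "lipsmack"}
BACKGROUND_NOISES = {"distortion", "static", "background"}

# One alternative per bracket type: [[skip]], {sound}, [noise] / [noise/] /
# [/noise], and ((guess)) or (( )).
TOKEN_RE = re.compile(
    r"\[\[skip\]\]"
    r"|\{(?P<sound>[^{}]*)\}"
    r"|\[(?P<close>/)?(?P<noise>[a-z]+)(?P<open>/)?\]"
    r"|\(\((?P<guess>[^()]*)\)\)"
)


def check_brackets(channel_text):
    """Return a list of problem descriptions for one channel's text."""
    problems = []
    open_regions = []  # names of [noise/] regions not yet closed
    for m in TOKEN_RE.finditer(channel_text):
        sound, noise = m.group("sound"), m.group("noise")
        if sound is not None and sound not in SPEAKER_SOUNDS:
            problems.append("unknown speaker sound: " + m.group(0))
        if noise is None:
            continue
        if noise not in BACKGROUND_NOISES or (m.group("open") and m.group("close")):
            problems.append("unknown or malformed noise tag: " + m.group(0))
        elif m.group("open"):
            open_regions.append(noise)
        elif m.group("close"):
            if noise in open_regions:
                open_regions.remove(noise)
            else:
                problems.append("close without open: " + m.group(0))
    # Any bracket character that survives token removal is stray or unbalanced.
    remainder = TOKEN_RE.sub(" ", channel_text)
    if re.search(r"[{}\[\]]|\(\(|\)\)", remainder):
        problems.append("stray or unbalanced bracket")
    problems.extend("unclosed region: [" + name + "/]" for name in open_regions)
    return problems
```

An empty list means no irregularity was found under these assumed
rules; each returned string describes one apparent irregularity for a
transcriber to inspect.
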
Timestamps: To check the well-formedness of timestamps, a program was
developed that checked for (1) overlapping timestamps, (2) start times
that are greater than end times, (3) turns that are missing timestamps,
(4) the proper formatting of a blank line before each timestamp, (5)
proper number of digits in each timestamp, and (6) the proper marking
of the speaker id. This procedure was folded into the syntax checking
procedure to be used on-line by the transcribers.

Content: To check that the properly spelled and formatted transcription
actually matched the spoken signal, a second human pass was made over
all of the transcripts.
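
Several of the timestamp checks listed above can be illustrated with a
short sketch. It is not the LDC program: the function name
check_timestamps is hypothetical, and the sketch assumes that
"overlapping" means a turn starting before the previous turn with the
same speaker label has ended (turns on different channels, or from
different speakers on one channel, may legitimately overlap in time).
It covers checks (1), (2), (3), (5), and (6), and leaves out the
blank-line check (4).

```python
import re

LINE_RE = re.compile(r"^(\S+) (\S+) (\S+)(?: (.*))?$")
TIME_RE = re.compile(r"^\d+\.\d{2}$")      # precise to the 100th of a second
SPEAKER_RE = re.compile(r"^[AB]\d*:$")     # A:, B:, A1:, B2:, ...


def check_timestamps(lines):
    """Return (line_number, message) pairs for apparent timestamp problems."""
    problems = []
    last_end = {}  # speaker label -> latest end time seen so far
    for n, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # blank separator lines are ignored in this sketch
        m = LINE_RE.match(line)
        if m is None or not (TIME_RE.match(m.group(1)) and TIME_RE.match(m.group(2))):
            problems.append((n, "missing or malformed timestamp"))
            continue
        start, end, label = float(m.group(1)), float(m.group(2)), m.group(3)
        if start > end:
            problems.append((n, "start time greater than end time"))
        if label == "[[skip]]":
            continue  # skipped regions span both channels and carry no speaker
        if not SPEAKER_RE.match(label):
            problems.append((n, "bad speaker id"))
            continue
        prev = last_end.get(label)
        if prev is not None and start < prev:
            problems.append((n, "overlaps previous turn of same speaker"))
        last_end[label] = max(end, prev or 0.0)
    return problems
```

The line numbers in the result are 1-based positions in the input list,
so they can be reported back against the transcript file directly.
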