-------------------------------------------------------------
Description of the Hub-5 telephone speech and transcript
corpus for Spanish, 106 transcripts
-------------------------------------------------------------

January 12, 1997

Project leader:  Jennifer Alabiso

Programming:     David Graff
                 Robert MacIntyre
                 Zhibiao Wu

Personnel:       Jennifer Alabiso

Transcribers:    Elisa Munoz (lead transcriber)
                 Gustavo Gallegos
                 Philip Garrison
                 Karla Lozano
                 Angelica Minero
                 Claudia Palmeros

CONTENTS

1. Summary abstract
2. Data acquisition
3. Data verification
4. Speaker demographics
5. Data transcription - General
6. Data transcription - Interjections
7. Data transcription - Non-lexemes
8. Quality control (QC) procedures

-----------------------------------------------------------------------

1. Summary abstract

This corpus consists of 10-30 minute transcriptions from 106 recorded telephone conversations originally collected by the LDC in support of the project on Language Recognition, sponsored by the U.S. Department of Defense. The transcribed data is intended as additional training data in support of the project on Large Vocabulary Conversational Speech Recognition (LVCSR), also sponsored by the U.S. Department of Defense.

This release of the Hub-5 Spanish corpus consists of 106 unscripted telephone conversations between native speakers of Spanish. The transcripts cover a contiguous 10-30 minute segment (see section 2 below) taken from a recorded conversation lasting up to 30 minutes.

All speakers were aware that they were being recorded. They were given no guidelines concerning what they should talk about. Once a caller was recruited to participate, he/she was given a free choice of whom to call. Most participants called family members or close friends. All calls originated in North America and were placed to various locations within North America, Puerto Rico or the Dominican Republic. The distribution of call destinations can be found in the file "spkrinfo.tbl".
The transcripts are timestamped by speaker turn for alignment with the speech signal, and are provided in standard orthography.

-----------------------------------------------------------------------

2. Data acquisition

Speakers were solicited by the LDC to participate in this telephone speech collection effort via the internet, publications (advertisements), and personal contacts. A total of 200 call originators were found, each of whom placed a telephone call via a toll-free robot operator maintained by the LDC. Access to the robot operator was possible via a unique Personal Identification Number (PIN) issued by the recruiting staff at the LDC when the caller enrolled in the project.

The participants were made aware that their telephone call would be recorded, as were the call recipients. The call was allowed only if both parties agreed to being recorded. Each caller was allowed to talk up to 30 minutes. Upon successful completion of the call, the caller was paid $20 (in addition to making a free long-distance telephone call). Each caller was allowed to place only one telephone call.

In all, 106 calls were transcribed. All of these calls are being designated as additional training data for the LVCSR project in Spanish.

-----------------------------------------------------------------------

3. Data verification

After a successful call was completed, a human audit of each telephone call was conducted to verify that the proper language was spoken, to check the quality of the recording, and to select and describe the region to be transcribed. The description of the transcribed region provides information about channel quality, number of speakers, their gender, and other attributes. The information from this audit may be found in the file "callinfo.tbl".

-----------------------------------------------------------------------

4. Speaker demographics

Information on speaker demographics can be found in the file "spkrinfo.tbl".
-----------------------------------------------------------------------

5. Data transcription - General

All Hub-5 telephone conversations were transcribed using the general conventions described below. The finite set of "non-lexemes" (hesitation sounds) used in the transcripts is provided in section 7 below.

The transcription was carried out on Sun 4 workstations. The transcription was done using the emacs text editor, which was linked to the visual and auditory soundwave from the telephone recording in an xwaves window. A program written at the LDC linked the xwaves signal to the emacs buffer so that a highlighted region of the soundwave could be brought into the emacs buffer as a timestamp via a simple keystroke. Similarly, the transcribers could listen to any timemarked turn in the transcript, and view the aligned soundwave as well. Thus, the transcribers had a visual as well as auditory signal that they were transcribing. Both the visual and auditory signals were broken into two separate channels that could be reviewed separately or together.

The transcribers were given the transcription conventions provided below as a guideline for how to transcribe the telephone conversations.

---------------------------------------------------------------

LDC Transcription Conventions

What to transcribe

Telephone speech

For the telephone speech transcription, the goal is to transcribe the entire 30-minute conversation. However, you should skip over the parts that are "difficult". What does that mean?
As a rule of thumb, "difficult" means:

  - more than one or two portions of overlapping speech in a row

  - if you have to listen to a passage more than 4 times in order to understand anything, it is probably too difficult to transcribe

  - heavy distortion or overwhelming background noise over a portion of the conversation

If you skip any portion of the conversation, you should provide a time-stamp of the skipped speech portion (even if it is a minute long), and add the notation "[[skip]]" on the line following the timestamp with a single space:

    323.08 351.19 [[skip]]

Definition of turns: Speaker change

For ease of transcription, turns can be broken up into shorter timestamped segments. These segments should be no longer than about 8 seconds in duration.

Timestamps:

Each speaker turn is marked with a unique timestamp (in seconds). The timestamps mark the beginning and end time of each turn relative to the beginning of the recording. Each timestamp is precise to the 100th of a second, and is in the format: beginning time [space] ending time, followed by the turn. Some samples:

    27.98 28.72 A: You know so

    137.49 139.47 A: yeah {breath} (( )) [distortion]

    284.54 286.79 B: %ah ^Lydia ^Van ^Damme.

Timestamps should be included based upon the following guidelines:

(1) speaker change, e.g.

    A: Well I was thinking about that

    B: I know I talked to ^Jan about it yesterday

(2) within one speaker's stretch of talk, a long turn should be broken up in terms of what makes grammatical/semantic sense, e.g.

    A: And I told her %um I didn't I wasn't setting you up to be a spiritual director or anything {laugh} but I did say to her that if she were to talk if she felt that she wanted to talk about her prayer experience in Spanish

    A: that you would probably be able to certainly to understand her but to empathize a little bit with what she was experiencing

(3) If there is an extra-long pause (more than a half second) within a single speaker's turn, break the turn up into two sections, e.g.

    B: When we were fishing out on Lake ^Travis last August I thought I saw, %uh

    B: %uh, ^George ^Martin, but I wasn't sure it was him.

Orthography

For both broadcast speech and telephone speech transcription, we are following the general orthographic conventions (spelling) for the given language. Words that usually take capital letters in the language should be written with capital letters, otherwise lowercase should be used. In addition, we have a set of clearly defined symbols that should be used with items such as proper names, acronyms, mispronounced words, and non-lexemes (see below).

  - Capitalization: capitalization in our transcripts is used as an aid for human comprehension of the text. You should follow the accepted standard way to capitalize words, including words at the beginning of a sentence, proper names, and so on.

        He took the car on Saturday.
        Jane was walking along Walnut Street when I met her.

  - Numerals: write out all numerals, do not use digits:

        twenty-two
        nineteen-ninety-five
        seven thousand two hundred seventy-five

  - Abbreviations: write out all abbreviations (except those listed as examples in each language, if any. Consult your language leader):

        junior
        doctor

Punctuation

The following punctuation marks should be used in the transcripts. The punctuation marks are primarily for ease of (human) reading. Use only those punctuation marks indicated below.

  - periods "." should be added at the end of declarative sentences

  - question marks "?" should be added at the end of interrogative sentences

  - commas "," should be added between clauses as is accepted in the standard orthography of the language

Symbols

  - Acronyms I: those that are pronounced as a single word should be written in caps (no spaces) and preceded by a "@" symbol:

        @NATO
        @DARPA
        @AIDS

  - Acronyms II: acronyms that are normally written as a single word but pronounced as a sequence of individual letters should be written in all caps (no spaces) and preceded by a "~" symbol:

        ~FBI
        ~CEO
        ~YMCA

  - Individual letters: Individual letters that are pronounced as such should be written in caps and preceded by a "~" symbol:

        I got an ~A on the test.

  - In spelling cases, every individual letter should be written in caps, separated by spaces and preceded by a "~" symbol:

        his name is spelled ~S ~I ~M ~P ~S ~O ~N.

  - Proper names: both proper names and place names should be marked with a "^" symbol. If you encounter a "proper name phrase", mark only those words as proper names that are true proper names on their own. Personal initials are treated as proper names in these transcripts. They must not have a period after them unless this marks the end of a sentence.

        ^Frank ^Sinatra
        ^Beijing
        ^Sony
        ^Maria's Bar and Grill

  - Middle Initials or abbreviated first names should be treated as individual letters, and thus, should be preceded by a "~" symbol:

        ^Homer ~L ^Simpson
        he calls himself ~J ~R ^Jones

  - Partial words: partial words are indicated with a dash (without any spacing between the dash and the word):

        absolu-
        -tion

  - Mispronounced words: if a word is mispronounced (such as a slip of the tongue), provide the correct spelling of the word, and place a "+" symbol in front of the word:

        +probably
        +yesterday

  - Interjections: in each language, we have a set of standardized spellings for interjections.
    (see the list of interjections below)

  - Non-lexemes: in addition to the interjections (which are considered to be words), we also have a set of standardized spellings for hesitation sounds that speakers make while speaking in each language. Every such "non word" in the transcripts is marked with the "%" symbol. (see the list of non-lexemes below)

  - Idiosyncratic words: if a speaker uses a "made-up" word which is not used by other speakers (although it may be understandable), place a "*" symbol before the word. Consult your language leader in cases where you are uncertain whether a word fits in this category. Onomatopoeia fits into this category:

        *poodle-ish
        Do you dress like a *schlump yet?
        why she said *drr I don't know.

Noises

In order to account for sound phenomena such as distortion, coughs, breaths, unintelligible speech, foreign words and phrases, etc., we utilize a set of unique brackets.

  - {text}: sound made by the talker. Use only those sounds described below:

        {laugh}
        {cough}
        {sneeze}
        {breath}
        {lipsmack}

  - [text]: sound not made by the talker (usually background or channel). This notation should be used only in those rare cases where the background condition is overwhelming. Use only those descriptions provided below:

        [distortion]
        [static]     -- used for channel noise such as "buzzes", "pops", etc.
        [background] -- used for other noises such as children crying, pots being struck, etc.

  - [text/] [/text]: marks when sound not made by the talker is non-instantaneous. Place this at the beginning and end of the noisy region.

        [distortion/] I am not really sure. [/distortion]
        [static/] Sure, she really loved it. [/static]
        [background/] Yes, that is my little girl. [/background]

Other conventions

  - ((text)): unintelligible speech. This is the transcriber's best guess. It should only be used during the first stage of transcription to aid in the recognition of the word. It should be either corroborated or eliminated during the checking stage.

        ((wonderful))
        Well, I ((thought)) that it was fine.
        And then she told me that I should ((just leave)).

  - (( )): unintelligible speech that you cannot even make a guess at (with a single space between the parentheses).

        I went to the (( )) on my way over.

  - : this is used to indicate speech (one or more words) in another language. In place of "language", write the name of the language, if known. If the language is not known, type "?". If you do not know how to transcribe what was said, use the "(( ))" notation. Our rule of thumb for noting a "foreign word" is that these words are not pronounced as native words. For example, the pronunciation of the word "okay" has been nativized and we are writing it as a Spanish word following the standard Spanish spelling: "okey." Moreover, foreign proper names should not be marked with a language tag unless there exists a commonly used translation of that name in Spanish, such as "New York" and "Nueva York." If you have any questions, consult your language leader.

        sí, viaja bastante a y a ^Nueva ^York.
        And then I took all of the to my room.
        That type of cheese is called
        ^John told me that (( )) did not like .
        then there were a couple of which I tried on.

  - text : this is used to mark an aside made by the primary talker where the talker is addressing someone in the background.

        no, no quit it, I'm talking to your sister, no, I don't know.

  - text : used to indicate overlapping speech on the same channel.

        121.23 122.98 A: The store on the corner .

        122.50 123.91 A1: Across from the ^Wawa near your school.

-----------------------------------------------------------------------

6. Data transcription - Interjections

Below is the list of common interjection spellings used in these transcripts.

    ajá
    mhm (meaning "yes")
    mm (meaning "no")
    auch
    guau
    okey
    chao

-----------------------------------------------------------------------

7. Data transcription - Non-lexemes

For LVCSR purposes, some of the speech sounds uttered by the conversational participants were deemed to be "non-lexemes" or periodic sound sequences that are not listed as words in the pronunciation dictionary. The "non-lexemes" are distinct from the set of interjections such as "okey", which is considered a word in the lexicon. The "non-lexemes" can loosely be considered as hesitation sounds that a speaker makes while speaking. While the spelling of these sounds is somewhat arbitrary, the transcribers were given a finite list from which to choose in order to maintain orthographic consistency.

Below is a histogram of the non-lexeme tokens and their frequencies in the transcribed portions of these 80 transcripts.

    Count  Token
    -----  -----
     3122  %ah
     2103  %ay
     1881  %eh
     1735  %mmh
      690  %oh
       70  %uy
       61  %uh
       56  %ey
       49  %oy
       29  %ha
       28  %shh
       24  %pss
       24  %uf
       12  %pff

-----------------------------------------------------------------------

8. Quality control (QC) procedures

The transcripts were created in an iterative manner. The first step was to transcribe and timestamp the appropriate portion of each conversation. Once this was completed, proper formatting and spelling were checked and corrected. Once this was completed, a second pass over all of the transcripts was made, where both content and formatting were checked once more. Throughout this process, small improvements were constantly made and re-checked for accuracy. In most instances, a third (or even fourth) pass was made over the transcript to verify its accuracy.

Spelling: As the telephone conversations were being transcribed, the words found in the transcripts were being compiled for inclusion in pronunciation dictionaries also being prepared by the LDC. As the lexicon workers compiled lists of words, they checked (among other things) for spelling errors.
The lists of spelling/typo errors found in the transcripts were compiled, and a program was run over the transcripts to replace a misspelled word with its correct spelling. Thus, work on the pronunciation dictionaries of the respective languages helped to double-check the proper spelling of all words in the transcripts.

Syntax: To check the well-formedness of the bracketing, a program was written which goes over the transcripts and notes any apparent irregularities. This program was later adapted for on-line use by the transcribers to be used while creating the transcripts. A final syntax check was run over all transcripts before the final release.

Timestamps: To check the well-formedness of timestamps, a program was developed that checked for

    (1) overlapping timestamps,
    (2) start times that are greater than end times,
    (3) turns that are missing timestamps,
    (4) the proper formatting of a blank line before each timestamp,
    (5) proper number of digits in each timestamp, and
    (6) the proper marking of the speaker id.

This procedure was folded into the syntax checking procedure to be used on-line by the transcribers.

Content: To check that the properly spelled and formatted transcription actually matched the spoken signal, a second human pass was made over all of the transcripts.
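
Illustration (not part of the original LDC tools): the timestamp checks listed in section 8 could be sketched in Python roughly as follows. This is a minimal sketch under stated assumptions, not the program used at the LDC. The names TURN_RE, SKIP_RE, to_centiseconds and check_turns are hypothetical, and the turn layout and speaker-id shape (A, B, A1, ...) are inferred from the sample turns in section 5. Overlap is tested per speaker id only, on the assumption that different speakers may legitimately overlap. Checks (3) and (4) are only approximated: any non-blank line that does not look like a timestamped turn is reported as malformed, and blank lines are ignored.

```python
import re
from collections import defaultdict

# Assumed turn layout, inferred from the samples: "start end SPEAKER: text".
# Two decimal digits match the stated precision (100th of a second).
# The speaker-id shape (A, B, A1, ...) is a guess based on the examples.
TURN_RE = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) ([AB]\d*): ?(.*)$")
# Skipped regions carry a timestamp but no speaker id.
SKIP_RE = re.compile(r"^(\d+\.\d{2}) (\d+\.\d{2}) \[\[skip\]\]$")


def to_centiseconds(stamp):
    """Convert a string such as '27.98' to an integer count of 1/100 s."""
    return int(stamp.replace(".", ""))


def check_turns(lines):
    """Return (line_number, message) pairs for problems found in turn lines."""
    problems = []
    last_end = defaultdict(int)  # latest end time seen so far, per speaker
    for number, raw in enumerate(lines, start=1):
        line = raw.strip()
        if not line or SKIP_RE.match(line):
            continue  # blank lines and skipped regions are not checked here
        match = TURN_RE.match(line)
        if match is None:
            problems.append((number, "malformed timestamp or speaker id"))
            continue
        start = to_centiseconds(match.group(1))
        end = to_centiseconds(match.group(2))
        speaker = match.group(3)
        if start > end:
            problems.append((number, "start time is greater than end time"))
        if start < last_end[speaker]:
            problems.append((number, "overlaps an earlier turn by the same speaker"))
        last_end[speaker] = max(last_end[speaker], end)
    return problems


if __name__ == "__main__":
    sample = [
        "27.98 28.72 A: You know so",
        "",
        "28.50 29.10 A: yeah",
        "323.08 351.19 [[skip]]",
    ]
    for number, message in check_turns(sample):
        print(number, message)
```

Times are converted to integer hundredths of a second so that comparisons do not depend on floating-point rounding. A stricter checker could also require the blank line before each timestamp, which this sketch deliberately leaves out.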