MACROPHONE TRANSCRIPTION

Overview

The goal of the Macrophone transcription effort is to provide an accurate word-level transcription of what the caller said, with minimal markings for extraneous events and disfluencies. Because of the volume of data being transcribed, providing a detailed transcription of every utterance would require a prohibitive level of effort at this point. An accurate word-level transcription is the minimum necessary to make the data useful for the majority of researchers.

Default transcriptions are generated for the majority of the data from each subject's response to one of the items, which serves as a key identifying the unique prompt sheet that the caller is using. The default transcriptions are derived from the text on the caller's prompt sheet. Because the data is solicited from unidentified callers over the telephone, each call must be listened to at least once to verify that the caller did in fact say what was expected. The majority of callers respond appropriately to most prompts, but there are enough misreadings, alternative readings of numbers and dollar amounts, etc., that the default transcriptions alone are not reliable enough for most potential uses of this data.

Transcribers for the bulk of the data are not necessarily linguists, but are individuals trained in the requirements for this particular task. They are instructed to handle those utterances that are relatively straightforward, that is, to transcribe them according to a reduced set of transcription guidelines. They are asked to set aside any utterances that they cannot transcribe so that a linguist can review them.

Transcribers use a tool which brings up one utterance at a time from a list of utterances. Each utterance is played out once automatically. The waveform and default transcription, if there is one, are displayed. The transcriber has a window in which they may edit the default transcription or enter a new transcription. The tool allows transcribers to re-play the utterance or any portion of it as many times as they like. If the transcriber is unable to transcribe the utterance, they can click on a button that sets that utterance aside for later review by a supervisor.

To the extent that non-speech events and disfluencies are noted, the conventions for marking have been borrowed from the guidelines used to transcribe the CSR/Wall Street Journal database.

The following types of utterances are sorted out and are NOT being delivered as part of the Macrophone database:

All of these files will be saved and can be transcribed and delivered at a later date if there is a need for the data.

For all remaining data, the priority is to transcribe and deliver the "clean" data first, i.e., data that is relatively free of extraneous events and disfluencies, and which can therefore be handled by our trained transcribers. To the extent possible, data set aside by the transcribers because they weren't sure how to handle it will be transcribed by a linguist and included in the delivered database.

The transcribers are asked to set aside utterances containing any of the following:

Transcription Conventions

CASE:
All transcription is done in lowercase.

PUNCTUATION:
No punctuation is used, except for apostrophes.

No periods are used, even in abbreviations.

Hyphenated words are either transcribed as a single compound word, if appropriate, or split into two separate words.
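The case and punctuation conventions above can be sketched as a small text normalizer. This is only an illustrative sketch, not the tool used by the project: the function name `normalize_prompt_text` is hypothetical, it assumes plain ASCII input, and it always splits hyphenated words in two, whereas the convention leaves the choice between a compound and two words to the transcriber.

```python
import re


def normalize_prompt_text(text: str) -> str:
    """Lowercase the text and remove punctuation other than apostrophes."""
    text = text.lower()
    # Assumption: hyphenated words are always split into separate words.
    # Deciding when a single compound word is appropriate needs a human.
    text = text.replace("-", " ")
    # Replace every character that is not a lowercase ASCII letter, a digit,
    # an apostrophe, or whitespace with a space.
    text = re.sub(r"[^a-z0-9'\s]", " ", text)
    # Collapse the runs of whitespace left behind by the removals.
    return " ".join(text.split())
```

For instance, under these assumptions the input "Yes, I think so." becomes "yes i think so", and the apostrophe in "Smith's" is preserved.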

ABBREVIATIONS:
The only abbreviations used are "mr", "mrs" and "ms", all transcribed in lowercase without a period. Words such as "doctor" and "saint" are spelled out as complete words, rather than being transcribed as "dr" or "st".

ACRONYMS:
Acronyms are transcribed as words, if they are said as words. They are spelled out (see below) if the subject says the names of the letters in the acronym rather than saying it as a word. For example, "NIST", if pronounced as /nist/, would be transcribed as a word: nist
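The distinction between an acronym said as a word and one said letter by letter can be sketched as follows. The helper `transcribe_acronym` and its `said_as_word` flag are hypothetical names introduced for illustration; whether the caller said a word or the letter names is a judgment made by the listener and is simply passed in here.

```python
def transcribe_acronym(acronym: str, said_as_word: bool) -> str:
    """Render an acronym as one lowercase word or as spaced letters."""
    # Keep only the letters, so periods and spaces in the prompt are ignored.
    letters = [ch.lower() for ch in acronym if ch.isalpha()]
    if said_as_word:
        # Pronounced as a word: transcribe as a single lowercase word.
        return "".join(letters)
    # Letter names were said: separate each letter with a space.
    return " ".join(letters)
```

With this sketch, "NIST" said as a word gives "nist", while an acronym said as letter names gives the letters separated by spaces.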

SPELLED WORDS:
Spelled words and acronyms where the subject says the letter names, such as "IBM", are transcribed by leaving spaces around each letter. For example, "IBM" would be transcribed as:

i b m

And "U. S. A." would be transcribed:

u s a

The Macrophone database contains several spelled words, where the subject was presented with a word in capital letters, separated by dashes, and was asked to "spell the word". Several subjects responded by first saying the word, then spelling it out, or by saying the words "capital" or "dash". In all cases, the transcription is a word for word transcription of what was actually said. For example, if the word was C-A-T, possible transcriptions might be:

INSERTIONS, DELETIONS and SUBSTITUTIONS:
If the subject misreads a sentence, producing words that differ from the prompt text, the transcription is changed to match what the subject actually said. If the subject leaves out a word, that word is removed from the transcription. If the subject changes a word, saying a word that is different from the one in the prompt text, the word that was actually produced replaces the one from the prompt text. If the subject inserts an extra word, adding a "the", for example, when there wasn't one in the prompt text, that word is inserted in the transcription.

Note that this rule applies only to correctly formed words. Also, in the case of substitutions, the substituted word should make sense in the local context, that is, at least within the context of the few surrounding words. If it does not make sense, it should instead be treated as a mispronunciation of the intended word.

Mispronunciations, word fragments and other disfluencies are handled separately.
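For users of the data who want to recover which words a caller inserted, deleted or substituted, one possible approach is to align the prompt text with the final transcription. The sketch below uses Python's standard `difflib.SequenceMatcher` for the alignment; the function name `classify_edits` and the shape of its return value are assumptions made for this example, and it is not part of the transcription procedure itself.

```python
from difflib import SequenceMatcher


def classify_edits(prompt: str, transcription: str):
    """List word-level differences between the prompt and what was said."""
    prompt_words = prompt.split()
    spoken_words = transcription.split()
    matcher = SequenceMatcher(a=prompt_words, b=spoken_words, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "delete":
            # Words in the prompt that the subject left out.
            edits.append(("deletion", prompt_words[i1:i2], []))
        elif tag == "insert":
            # Extra words the subject added.
            edits.append(("insertion", [], spoken_words[j1:j2]))
        elif tag == "replace":
            # Prompt words replaced by different spoken words.
            edits.append(("substitution", prompt_words[i1:i2], spoken_words[j1:j2]))
    return edits
```

Bracketed markers such as [uh] would show up as insertions in this comparison, so they would probably need to be removed from the transcription first.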

DISFLUENCIES:
Mispronunciations:

Obviously mispronounced words are marked by placing a "*" both immediately before and immediately after the word.

For example: *explicitly*

The transcription of the word itself is not modified in any way. There is no attempt made to produce a phonetic level transcription.
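Because the word inside the asterisks is left unmodified, the markers can be stripped mechanically to recover the plain word sequence. The following is a minimal sketch under that assumption; `split_mispronunciations` is a hypothetical helper name, and it assumes each marked word is a single whitespace-delimited token.

```python
import re

# A token that starts and ends with "*" is a marked mispronunciation.
_MISPRONOUNCED = re.compile(r"^\*(.+)\*$")


def split_mispronunciations(transcription: str):
    """Return the transcription without "*" markers, plus the flagged words."""
    words, flagged = [], []
    for token in transcription.split():
        match = _MISPRONOUNCED.match(token)
        if match:
            # Keep the word itself and remember that it was flagged.
            words.append(match.group(1))
            flagged.append(match.group(1))
        else:
            words.append(token)
    return " ".join(words), flagged
```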

In general, there is a high degree of tolerance for pronunciation variants:

Dialectal variants, such as "aks" for "ask", are not marked as mispronunciations.

Common mispronunciations, such as "nucular" for "nuclear", are not marked.

Words that have multiple pronunciations that are commonly accepted, such as "harassment", are not marked (unless the production of the word does not match any of the accepted pronunciations).

Finally, a fair degree of latitude is given in judging the pronunciation of less common words and names with which the subject may not be familiar. If the subject produces a reasonable pronunciation based on the spelling of the word or name, whether correct or not, it is not normally marked as a mispronunciation. For instance, any reasonable attempt at pronouncing "Deng Xiao Ping" would be accepted, while non-standard pronunciations of "Bush" or "Clinton" would be marked as mispronunciations.

Word fragments and stutters:

Partial words are transcribed by entering the portion of the word that was said, immediately followed by a "-" to show that some portion of the word is missing. If it is clear what the intended word was, the missing portion of the word can optionally be shown inside of parentheses, prior to the "-".

If the first part of the word is missing, the "-" would appear in front of the portion of the word that was produced. This is not very common, except in the case of truncations, which are discarded anyway.

For example, the following would both be legitimate transcriptions for the word fragment "ame" when the subject was attempting to say "america":

ame-

ame(rica)-

Because the transcription is done on a word level, and not on a phonetic level, the portion of the word that is shown is that which comes closest to matching what the subject said given the spelling of the word.

Often, especially with stutters, only a single phone is produced and there is no way of knowing what the intended word was. In this case the letter that comes closest to representing that phone is used, followed by the "-", as in "s-".
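A user of the transcriptions might want to detect fragment tokens and separate the spoken portion from the optional missing portion. The sketch below is one reading of the rules above and rests on assumptions: the names `Fragment` and `parse_fragment` are hypothetical, the parenthesized form is taken to be the spoken part followed by "(missing)" and then "-", and a fragment missing its first part is assumed to carry no parenthesized portion.

```python
import re
from typing import NamedTuple, Optional


class Fragment(NamedTuple):
    spoken: str             # the portion of the word that was said
    missing: Optional[str]  # the portion shown in parentheses, if any
    missing_side: str       # "end" or "start"


# Spoken part, optional "(missing)" part, then the trailing "-".
_MISSING_END = re.compile(r"^([a-z']+)(?:\(([a-z']+)\))?-$")
# Leading "-" followed by the spoken part.
_MISSING_START = re.compile(r"^-([a-z']+)$")


def parse_fragment(token: str) -> Optional[Fragment]:
    """Return a Fragment for a word-fragment token, or None for other tokens."""
    match = _MISSING_END.match(token)
    if match:
        return Fragment(match.group(1), match.group(2), "end")
    match = _MISSING_START.match(token)
    if match:
        return Fragment(match.group(1), None, "start")
    return None
```

A single-letter stutter such as "s-" is handled by the same pattern as any other fragment whose end is missing.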

Verbal deletions and full word stutters:

Verbal deletions of full words and full word stutters are not marked. Each instance of the word is entered in the transcription.

For example, if the subject stuttered while saying "New York", but produced the full word "New" as the stutter, the transcription would be:

new new york

For another example, assume the subject was reading the number "2578" and misread the "5" as "4", then corrected himself. The transcription would be:

two four five seven eight

Or, if he inserted other words or verbal hesitations at the point where he realized the mistake, the transcription might be something like:

two four [uh] no five seven eight

Or: two four no two five seven eight

Prosodic Markings

Pauses: Pauses are not marked in any way.

Emphatic or abnormal stress: Stress is not marked in any way.

Lengthening: Lengthening is marked only in a few extreme cases. Users of this database should not count on lengthened sounds being marked, as the convention was not applied consistently. However, users of the data can be assured that any sound that is marked as lengthened is in fact an abnormally drawn out sound. The convention for marking lengthening is to append a ":" immediately following the lengthened sound (or the letter in the word that most closely represents that sound).

For example, if the subject says: "nnnnnno", drawing the "n" out, the transcription would be: "n:o"
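Since the ":" is appended directly after the lengthened letter, the plain word can be recovered by removing it. A minimal sketch, assuming ":" never appears in a token for any other reason; the name `strip_lengthening` is hypothetical.

```python
def strip_lengthening(token: str):
    """Return the plain word and the positions of its lengthened letters."""
    plain = []
    lengthened = []
    for ch in token:
        if ch == ":":
            # The colon marks the letter just before it as lengthened.
            if plain:
                lengthened.append(len(plain) - 1)
        else:
            plain.append(ch)
    return "".join(plain), lengthened
```

Positions are zero-based indices into the plain word, so index 0 refers to its first letter.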

Speech Style

Different speaking styles are not annotated in any way.

Unintelligible speech

Unintelligible speech is marked using square bracket notation like that used for extraneous events, as described below.

The marker: [unintelligible] is entered in the transcription in place of whatever speech was present, regardless of the length of the unintelligible segment.

EXTRANEOUS OR NON-SPEECH EVENTS:
Extraneous events are marked using square brackets enclosing a descriptor for the type of event. Only 5 different event descriptors are used for this database. They are: [bg_noise], [bg_speech], [line_noise], [mouth_noise], and [uh].

Each of these is defined in detail below:

[bg_noise] - Used to mark background noise.
Background noise from any source is transcribed as [bg_noise]. Possible noise sources include but are not limited to: dogs barking, bird or other pet noises, phone cord tapping, finger tapping, and TV or radio noise (including TV or radio speech). [bg_noise] is a "catch-all" marker that is used for any noise that is notable but does not fall into one of the other categories.

[bg_speech] - Used to mark background speech, or cross-talk.
Background speech is defined as audible speech from other talkers in the area where the subject is calling from, or speech by the subject that is clearly directed at someone else and is not in response to a prompt from our system. This would include, for example, the subject turning to someone else and saying something like: "shh i'm on the phone". The [bg_speech] marker replaces what the subject said in this case (so the words "shh i'm on the phone" would NOT appear in the transcription). Audible speech from other talkers is to include any speech that is loud enough to be identified as speech, whether the words can be made out or not. It may also include laughter that is part of a normal conversation. Isolated laughter (from people in the background) should be marked as [bg_noise].

Note that laughter from the subject is covered by the marker [mouth_noise], as described below.

[line_noise] - Used to mark noticeable popping or static from the telephone lines.
This marker was added part way through the project as a result of inquiries from transcribers about how they should handle popping sounds and static. We initially asked them to set those files aside, until we defined this marker.

[line_noise] may be used to mark the popping noises that result from dropped packets in transmission of digitized speech over phone lines. It is also used to mark files that have noticeable static.

The [line_noise] marker is to be used sparingly, and only when the noise is clearly due to telephone lines. If there is a doubt as to the noise source, the [bg_noise] marker should be used instead. Normal levels of telephone noise and low levels of static that would probably not be noticed by a telephone user are not marked.

[mouth_noise] - Used to mark any mouth or nose noises on the part of the subject.
Mouth noises are defined as any non-speech noises from the subject, excluding verbalized hesitations. Included in this category are breath noises (inhale and exhale), tongue clicks, lip smacks, throat clearing, snorts, sniffs, sneezes, coughs, and laughs, and any combination of the above. Breaths, tongue clicks and lip smacks that are significantly lower in amplitude than the subject's speech do not need to be marked.

[uh] - Used to mark all verbal hesitations.
All verbalized hesitations, whether the subject actually says "uh", or says something different, such as "um", "mm", "eh", etc., are marked as [uh]. There is no attempt made to distinguish between the different possible verbalized hesitations, or to characterize them on a phonetic level. Duration is also not considered: verbalized hesitations of any length are marked simply with one [uh] marker.
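Because the set of bracketed markers is closed, a transcription can be checked mechanically for markers outside that set. The sketch below is an illustration only: `KNOWN_MARKERS` and `tally_markers` are hypothetical names, [unintelligible] is included alongside the five event descriptors, and the placement symbols "<", ">" and "/" (covered under PLACEMENT OF EVENT DESCRIPTORS) are stripped before the name is looked up.

```python
import re
from collections import Counter

# The five event descriptors, plus the marker for unintelligible speech.
KNOWN_MARKERS = {
    "bg_noise", "bg_speech", "line_noise", "mouth_noise", "uh",
    "unintelligible",
}

# Anything enclosed in one pair of square brackets.
_BRACKETED = re.compile(r"\[([^\[\]]*)\]")


def tally_markers(transcription: str):
    """Count known markers and collect the bodies of unrecognized ones."""
    known = Counter()
    unknown = []
    for body in _BRACKETED.findall(transcription):
        # Drop placement symbols so "[bg_noise>]" counts as "bg_noise".
        name = body.strip("<>/")
        if name in KNOWN_MARKERS:
            known[name] += 1
        else:
            unknown.append(body)
    return known, unknown
```

This counts marker tokens, so an event marked at both its onset and its offset contributes two to the tally. A hesitation entered as [um] rather than [uh] would be reported as unknown.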

PLACEMENT OF EVENT DESCRIPTORS:
Events that do not co-occur with speech

Events that do not conflict or co-occur with speech should be marked by placing the appropriate descriptor in the transcription at the place where the event occurs.

For example, the following transcription would indicate that there was some sort of mouth noise preceding the speech:

"[mouth_noise] yes i think so"

By definition, [mouth_noise] and [uh] should NEVER co-occur with speech. They should simply be inserted at the place where they occur in the utterance.

Events that co-occur with speech

[bg_noise], [bg_speech], and [line_noise] may co-occur with speech. They may occur as single events that coincide with the subject's production of a single word, or they may span several words or even the entire utterance.

Events that co-occur with one word

Extraneous events that co-occur with a single word should be transcribed by placing the appropriate marker either immediately before or immediately after the word in question, and inserting an arrow, "<" or ">", inside the square brackets so that it points toward the word.

For example, if the subject was reading the sentence
"put the table outside"
and a dog barked right in the middle of the word "table" either of the following would be a correct transcription:

"put the [bg_noise>] table outside"

"put the table [<bg_noise] outside"
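Either placement identifies the same word, which a reader of the transcriptions could recover as sketched below. The names `CAN_OVERLAP_SPEECH` and `single_word_events` are hypothetical, tokens are assumed to be separated by whitespace, and the surrounding quotation marks shown in the examples are assumed not to be part of the stored transcription.

```python
import re

# Only these descriptors may co-occur with speech.
CAN_OVERLAP_SPEECH = {"bg_noise", "bg_speech", "line_noise"}

_POINTS_RIGHT = re.compile(r"^\[(\w+)>\]$")  # e.g. [bg_noise>]
_POINTS_LEFT = re.compile(r"^\[<(\w+)\]$")   # e.g. [<bg_noise]


def single_word_events(transcription: str):
    """Return (descriptor, word) pairs for events overlapping a single word."""
    tokens = transcription.split()
    pairs = []
    for i, token in enumerate(tokens):
        right = _POINTS_RIGHT.match(token)
        left = _POINTS_LEFT.match(token)
        if right and i + 1 < len(tokens):
            # The arrow points at the following token.
            name, word = right.group(1), tokens[i + 1]
        elif left and i > 0:
            # The arrow points at the preceding token.
            name, word = left.group(1), tokens[i - 1]
        else:
            continue
        if name not in CAN_OVERLAP_SPEECH:
            raise ValueError(f"[{name}] may not co-occur with speech")
        pairs.append((name, word))
    return pairs
```

Both example transcriptions above yield the pair ("bg_noise", "table"), while an arrow on [uh] or [mouth_noise] raises an error in this sketch.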

Events that co-occur with two or more words

Extraneous events that span multiple words, such as TV noise, or a dog that is barking steadily throughout an utterance, or a conversation that is taking place in the background, should be marked by placing the appropriate descriptor both prior to the first word and after the last word during which the event is heard. A "/" should be inserted inside the square brackets to indicate onset and offset of the event.

For example, in the sentence shown above, if the dog started barking during the word "table" and kept barking repeatedly throughout the rest of the utterance, the transcription would be:

"put the [bg_noise/] table outside [/bg_noise]"

As another example, if there is telephone line static throughout the entire utterance, the descriptors [line_noise/] and [/line_noise] would be placed at either end of the utterance. Similarly, if there is TV noise throughout, the transcription would be surrounded by the descriptors [bg_noise/] and [/bg_noise].
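The onset and offset descriptors form matched pairs, so a transcription can be checked for balance and the covered words extracted. The following is a sketch under stated assumptions: `event_spans` is a hypothetical name, tokens are whitespace-delimited, spans of different event types are allowed to overlap (the guidelines do not say either way), and any other bracketed marker inside a span is skipped rather than counted as a word.

```python
import re

_ONSET = re.compile(r"^\[(\w+)/\]$")   # e.g. [bg_noise/]
_OFFSET = re.compile(r"^\[/(\w+)\]$")  # e.g. [/bg_noise]


def event_spans(transcription: str):
    """Return (descriptor, words) for each onset/offset pair, in closing order."""
    open_spans = {}  # descriptor name -> words heard since its onset
    finished = []
    for token in transcription.split():
        onset = _ONSET.match(token)
        offset = _OFFSET.match(token)
        if onset:
            name = onset.group(1)
            if name in open_spans:
                raise ValueError(f"[{name}/] opened twice")
            open_spans[name] = []
        elif offset:
            name = offset.group(1)
            if name not in open_spans:
                raise ValueError(f"[/{name}] has no matching onset")
            finished.append((name, open_spans.pop(name)))
        elif not token.startswith("["):
            # An ordinary word belongs to every span that is currently open.
            for words in open_spans.values():
                words.append(token)
    if open_spans:
        raise ValueError(f"unclosed spans: {sorted(open_spans)}")
    return finished
```

For the dog-barking example above, this returns a single bg_noise span covering the words "table" and "outside".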