Overview
The goal of the Macrophone transcription effort is to provide an accurate word-level transcription of what the caller said, with minimal markings for extraneous events and disfluencies. Because of the volume of data being transcribed, providing a detailed transcription of every utterance would require a prohibitive level of effort at this point. An accurate word-level transcription is the minimum necessary to make the data useful for the majority of researchers.
Default transcriptions are generated for the majority of the data from the subject's response to one of the items, which serves as a key identifying the unique prompt sheet that the caller is using. The default transcriptions are derived from the text on the caller's prompt sheet. Because the data is solicited from unidentified callers over the telephone, each call must be listened to at least once to verify that the caller did in fact say what was expected. The majority of callers respond appropriately to most prompts, but there are enough misreadings, different potential readings of numbers and dollar amounts, etc., that the default transcriptions alone are not reliable enough for most potential uses of this data.
Transcribers for the bulk of the data are not necessarily linguists, but are individuals trained in the requirements for this particular task. They are instructed to handle those utterances that are relatively straightforward, i.e., to transcribe them according to a reduced set of transcription guidelines. They are asked to set aside any utterances that they cannot transcribe so that a linguist can review them.
Transcribers use a tool which brings up one utterance at a time from a list of utterances. Each utterance is played out once automatically. The waveform and default transcription, if there is one, are displayed. The transcriber has a window in which they may edit the default transcription or enter a new transcription. The tool allows transcribers to re-play the utterance or any portion of it as many times as they like. If the transcriber is unable to transcribe the utterance, they can click on a button that sets that utterance aside for later review by a supervisor.
To the extent that non-speech events and disfluencies are noted, the conventions for marking have been borrowed from the guidelines used to transcribe the CSR/Wall Street Journal database.
The following types of utterances are sorted out and are NOT being delivered as part of the Macrophone database:
For example, if the prompt was a question like: "Approximately how many people live in your home town?" and the subject, instead of answering, turned to someone else and said something like "They want to know how many people live in this town, what do I tell them?", the utterance would not be delivered.
For all remaining data, the priority is to transcribe and deliver the "clean" data first, i.e., data that is relatively free of extraneous events and disfluencies, and which can therefore be handled by our trained transcribers. To the extent possible, data set aside by the transcribers because they weren't sure how to handle it will be transcribed by a linguist and included in the delivered database.
The transcribers are asked to set aside utterances containing any of the following:
Transcription Conventions
No periods are used, even in abbreviations.
Hyphenated words are either transcribed as a single compound word, if appropriate, or split into two separate words.
Note that this rule applies only to correctly formed words. Also, in the case of substitutions, the substituted word should make sense in the local context, that is, at least within the context of the few surrounding words. If it does not make sense, it should instead be treated as a mispronunciation of the intended word.
Mispronunciations, word fragments and other disfluencies are handled separately.
Obviously mispronounced words are marked by placing a "*" both immediately before and immediately after the word.
For example: *explicitly*
The transcription of the word itself is not modified in any way. There is no attempt made to produce a phonetic level transcription.
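As an illustration of how a user of the transcriptions might read this marking, the following minimal sketch separates the "*" markers from the word itself. The function name and the regular expression are assumptions made for this example only; they are not part of any Macrophone tool.

```python
import re

# Assumed token shape: one "*" immediately before and one immediately after the word.
MISPRONOUNCED = re.compile(r"^\*(?P<word>[^*\s]+)\*$")


def split_mispronunciation(token):
    """Return (word, is_mispronounced) for one whitespace-delimited token."""
    match = MISPRONOUNCED.match(token)
    if match:
        # The spelling inside the markers is the intended word, unmodified.
        return match.group("word"), True
    return token, False


print(split_mispronunciation("*explicitly*"))
print(split_mispronunciation("explicitly"))
```

Because the spelling of a marked word is never altered, removing the two "*" characters recovers the intended word.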
In general, there is a high degree of tolerance for pronunciation variants:
Dialectal variants, such as "aks" for "ask", are not marked as mispronunciations.
Common mispronunciations, such as "nucular" for "nuclear", are not marked.
Words that have multiple commonly accepted pronunciations, such as "harassment", are not marked (unless the production of the word does not match any of the accepted pronunciations).
Finally, a fair degree of latitude is given in judging the pronunciation of less common words and names with which the subject may not be familiar. If the subject produces a reasonable pronunciation based on the spelling of the word or name, whether correct or not, it is not normally marked as a mispronunciation. For instance, any reasonable attempt at pronouncing "Deng Xiao Ping" would be accepted, while non-standard pronunciations of "Bush" or "Clinton" would be marked as mispronunciations.
Word fragments and stutters:
Partial words are transcribed by entering the portion of the word that was said, immediately followed by a "-" to show that some portion of the word is missing. If it is clear what the intended word was, the missing portion of the word can optionally be shown inside of parentheses, prior to the "-".
If the first part of the word is missing, the "-" would appear in front of the portion of the word that was produced. This is not very common, except in the case of truncations, which are discarded anyway.
For example, the following would both be legitimate transcriptions for the word fragment "ame" when the subject was attempting to say "america":
ame-
ame(rica)-
Often, especially with stutters, only a single phone is produced and there is no way of knowing what the intended word was. In this case the letter that comes closest to representing that phone is used, followed by the "-", as in "s-".
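The fragment notation can be read mechanically. The sketch below is one possible reading, assuming lowercase tokens such as "ame-", "ame(rica)-", "s-" and "-rica"; the function name, the patterns, and the returned dictionary keys are hypothetical and chosen only for this example. Parenthesized completions on fragments with a leading "-" are not described above, so the sketch does not handle them.

```python
import re

# Spoken part, optional "(missing part)", then "-": e.g. "ame-" or "ame(rica)-".
END_MISSING = re.compile(r"^(?P<spoken>[a-z']+)(?:\((?P<missing>[a-z']+)\))?-$")
# "-" followed by the spoken part: the start of the word is missing, e.g. "-rica".
START_MISSING = re.compile(r"^-(?P<spoken>[a-z']+)$")


def parse_fragment(token):
    """Return a dict describing a word fragment, or None for any other token."""
    match = END_MISSING.match(token)
    if match:
        missing = match.group("missing")
        # The intended word is known only when the transcriber supplied the rest.
        intended = match.group("spoken") + missing if missing else None
        return {"spoken": match.group("spoken"), "intended": intended, "cut": "end"}
    match = START_MISSING.match(token)
    if match:
        return {"spoken": match.group("spoken"), "intended": None, "cut": "start"}
    return None


print(parse_fragment("ame(rica)-"))
print(parse_fragment("s-"))
```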
Verbal deletions and full word stutters:
Verbal deletions of full words and full word stutters are not marked. Each instance of the word is entered in the transcription.
For example, if the subject stuttered while saying "New York", but produced the full word "New" as the stutter, the transcription would be:
new new york
For another example, assume the subject was reading the number "2578" and misread the "5" as "4", then corrected himself. The transcription would be:
two four five seven eight
Or, if he inserted other words or verbal hesitations at the point where he realized the mistake, the transcription might be something like:
two four [uh] no five seven eight
Or: two four no two five seven eight
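Since repeated and corrected words are kept in the transcription, the word sequence a user recovers will contain them. The sketch below shows one way to extract the lexical words while dropping square-bracket markers such as [uh]; the function name and the marker pattern are assumptions made for illustration.

```python
import re

# Assumed marker shape: a single token enclosed in square brackets, e.g. "[uh]".
MARKER = re.compile(r"\[[^\]\s]+\]")


def lexical_words(transcription):
    """Return the words of a transcription, leaving out bracketed markers."""
    return [tok for tok in transcription.split() if not MARKER.fullmatch(tok)]


print(lexical_words("two four [uh] no five seven eight"))
print(lexical_words("new new york"))
```

Note that the misread "four", the inserted "no", and the repeated "new" all remain in the output, as the convention requires.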
Prosodic Markings
Pauses: Pauses are not marked in any way.
Emphatic or abnormal stress: Stress is not marked in any way.
Lengthening: Lengthening is marked only in a few extreme cases. Users of this database should not count on lengthened sounds being marked, as the convention was not applied consistently. However, users of the data can be assured that any sound that is marked as lengthened is in fact an abnormally drawn out sound. The convention for marking lengthening is to append a ":" immediately following the lengthened sound (or the letter in the word that most closely represents that sound).
For example, if the subject says: "nnnnnno", drawing the "n" out, the transcription would be: "n:o"
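A user who needs the plain spelling of a word can remove the ":" and, if desired, keep track of which letters were marked. The following is a minimal sketch under the assumption that ":" appears only as a lengthening marker; the function name is hypothetical.

```python
def strip_lengthening(word):
    """Return (plain_word, positions), where positions are the indexes in
    plain_word of the letters that were marked as lengthened."""
    plain = []
    positions = []
    for ch in word:
        if ch == ":":
            # The marker refers to the letter immediately before it.
            if plain:
                positions.append(len(plain) - 1)
        else:
            plain.append(ch)
    return "".join(plain), positions


print(strip_lengthening("n:o"))
```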
Speech Style
Different speaking styles are not annotated in any way.
Unintelligible speech
Unintelligible speech is marked using square bracket notation like that used for extraneous events, as described below.
The marker: [unintelligible] is entered in the transcription in place of whatever speech was present, regardless of the length of the unintelligible segment.
Each of these is defined in detail below:
Note that laughter from the subject is covered by the marker [mouth_noise], as described below.
[line_noise] may be used to mark the popping noises that result from dropped packets in transmission of digitized speech over phone lines. It is also used to mark files that have noticeable static.
The [line_noise] marker is to be used sparingly, and only when the noise is clearly due to telephone lines. If there is a doubt as to the noise source, the [bg_noise] marker should be used instead. Normal levels of telephone noise and low levels of static that would probably not be noticed by a telephone user are not marked.
Events that do not conflict or co-occur with speech should be marked by placing the appropriate descriptor in the transcription at the place where the event occurs.
For example, the following transcription would indicate that there was some sort of mouth noise preceding the speech:
"[mouth_noise] yes i think so"
By definition, [mouth_noise] and [uh] should NEVER co-occur with speech. They should simply be inserted at the place where they occur in the utterance.
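This constraint lends itself to an automatic check. The sketch below flags any [mouth_noise] or [uh] marker that carries an arrow or slash, which would wrongly indicate overlap with speech. The marker pattern is inferred from the examples in this document, and the function and constant names are hypothetical.

```python
import re

# Assumed marker shapes: [name], [name>], [<name], [name/], [/name].
MARKER = re.compile(r"^\[(?P<pre>[</]?)(?P<name>[a-z_]+)(?P<post>[>/]?)\]$")

# Markers that, by definition, never co-occur with speech.
STANDALONE_ONLY = {"mouth_noise", "uh"}


def misused_standalone_markers(transcription):
    """Return the tokens where a standalone-only marker carries "<", ">" or "/"."""
    bad = []
    for tok in transcription.split():
        match = MARKER.match(tok)
        if match is None:
            continue  # an ordinary word
        decorated = bool(match.group("pre") or match.group("post"))
        if match.group("name") in STANDALONE_ONLY and decorated:
            bad.append(tok)
    return bad


print(misused_standalone_markers("[mouth_noise] yes i think so"))
print(misused_standalone_markers("yes [uh>] i think so"))
```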
Events that co-occur with speech
[bg_noise], [bg_speech], and [line_noise] may co-occur with speech. They may occur as single events that coincide with the subject's production of a single word, or they may span several words or even the entire utterance.
Events that co-occur with one word
Extraneous events that co-occur with a single word should be transcribed by placing the appropriate marker either immediately before or immediately after the word in question, and inserting an arrow, "<" or ">", inside the square brackets so that it points toward the word.
For example, if the subject was reading the sentence
"put the table outside"
and a dog barked right in the middle of the word "table"
either of the following would be a correct transcription:
"put the [bg_noise>] table outside"
"put the table [<bg_noise] outside"
Events that co-occur with two or more words
Extraneous events that span multiple words, such as TV noise, or a dog that is barking steadily throughout an utterance, or a conversation that is taking place in the background, should be marked by placing the appropriate descriptor both prior to the first word and after the last word during which the event is heard. A "/" should be inserted inside the square brackets to indicate onset and offset of the event.
For example, in the sentence shown above, if the dog started barking during the word "table" and kept barking repeatedly throughout the rest of the utterance, the transcription would be:
"put the [bg_noise/] table outside [/bg_noise]"
As another example, if there is telephone line static throughout the entire utterance, the descriptors [line_noise/] and [/line_noise] would be placed at either end of the utterance. Similarly, if there is TV noise throughout, the transcription would be surrounded by the descriptors [bg_noise/] and [/bg_noise].
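The arrow and slash notations for events that co-occur with speech can both be resolved to word positions. The sketch below is one possible reading, not a tool supplied with the database: the function name, the marker pattern, and the (name, first word index, last word index) output format are assumptions made for this example.

```python
import re

# Assumed marker shapes, inferred from the examples in this section:
#   [name>]  overlaps the next word       [<name]  overlaps the previous word
#   [name/]  begins at the next word      [/name]  ended with the previous word
MARKER = re.compile(r"^\[(?P<pre>[</]?)(?P<name>[a-z_]+)(?P<post>[>/]?)\]$")


def overlapping_events(transcription):
    """Return (words, events); each event is (name, first_index, last_index)."""
    words = []
    events = []
    pending = []   # names from [name>] markers waiting for the next word
    open_at = {}   # name -> index of the first word inside an open [name/] span
    for tok in transcription.split():
        match = MARKER.match(tok)
        if match is None:
            # An ordinary word: attach any pending [name>] markers to it.
            index = len(words)
            words.append(tok)
            events.extend((name, index, index) for name in pending)
            pending = []
            continue
        pre, name, post = match.group("pre"), match.group("name"), match.group("post")
        if post == ">" and not pre:
            pending.append(name)
        elif pre == "<" and not post:
            if not words:
                raise ValueError("no preceding word for " + tok)
            events.append((name, len(words) - 1, len(words) - 1))
        elif post == "/" and not pre:
            if name in open_at:
                raise ValueError("span opened twice: " + tok)
            open_at[name] = len(words)
        elif pre == "/" and not post:
            if name not in open_at:
                raise ValueError("span closed but never opened: " + tok)
            events.append((name, open_at.pop(name), len(words) - 1))
        # Plain markers such as [mouth_noise], and any other shape, are skipped.
    if open_at or pending:
        raise ValueError("marker left without a closing marker or a following word")
    return words, events


print(overlapping_events("put the [bg_noise>] table outside"))
print(overlapping_events("put the [bg_noise/] table outside [/bg_noise]"))
```

Under this reading, the two alternative single-word transcriptions of the barking-dog example resolve to the same event on "table", and the spanning example resolves to one event covering "table" through "outside".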