The transcriptions include the following information in the specified format:
- coughing, laughter, breath noise, inhaling, marked as <cough>, <laughter>, <breath>, <inhale>
- clicks and beeps in the message marked as <click>, <beep>
- other disfluencies marked as one of the following categories: <AH>,
<AHA>, <ERR>, <HMM>, <HUM>, <HA>, <HO>, <HUH>, <MMM>, <OH>, <UH>, <UM>
- if someone stammers and says 'thir-thirty', the corresponding transcription would be 'thir- thirty'
- punctuation is used in the transcription (unlike the Voicemail Corpus Part I); however, all punctuation marks (, ; . etc.) are preceded and followed by a space
- scripts are cased (unlike the Voicemail Corpus Part I)
- when transcribing times, the convention is A.M. and P.M. instead of A M and
P M (this would be the only case where a punctuation mark is used)
- names in the message are preceded with a ! sign. This includes proper
names, names of companies, days of the week, and months. For instance, Apple Computer would be transcribed as !Apple !Computer
- numbers are spelled out; for instance, 1997 would be transcribed as nineteen ninety-seven
- when someone spells out the letters in a word, for example I B M, the
letters are transcribed as !I.B.M., i.e. with a "." after every letter that is spelled out
- if a segment of the audio is incomprehensible, we made up a spelling that
sounds like the audio, rather than marking it as a mumble word. These
non-words are marked with a ? or @ sign at the start of the spelling.
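
As a rough illustration of how the conventions above might be handled when reading a transcription, here is a minimal Python sketch that splits a line on whitespace and labels each token. It is not part of the corpus distribution. The function names `classify` and `tag_line`, the label strings, and the sample line are hypothetical, and the punctuation set beyond , ; . is an assumption, since the list above only says "etc.".

```python
import re

# Marker sets copied from the conventions listed above.
NOISE_TAGS = {"<cough>", "<laughter>", "<breath>", "<inhale>", "<click>", "<beep>"}
DISFLUENCY_TAGS = {"<AH>", "<AHA>", "<ERR>", "<HMM>", "<HUM>", "<HA>", "<HO>",
                   "<HUH>", "<MMM>", "<OH>", "<UH>", "<UM>"}
# Assumption: only , ; . are named in the text; the rest are guesses at "etc.".
PUNCTUATION = {",", ";", ".", ":", "?", "!"}
# Spelled-out letters such as !I.B.M. (a "!" followed by letter-plus-period groups).
SPELLED = re.compile(r"!(?:[A-Za-z]\.)+")


def classify(token):
    """Return a hypothetical label for one whitespace-delimited token."""
    if token in NOISE_TAGS:
        return "noise"
    if token in DISFLUENCY_TAGS:
        return "disfluency"
    if token in PUNCTUATION:  # stand-alone marks, surrounded by spaces
        return "punctuation"
    if token in ("A.M.", "P.M."):
        return "time"
    if SPELLED.fullmatch(token):
        return "spelled"
    if token.startswith("!") and len(token) > 1:
        return "name"
    if token.startswith(("?", "@")) and len(token) > 1:
        return "nonword"  # made-up spelling for incomprehensible audio
    if token.endswith("-") and len(token) > 1:
        return "fragment"  # stammer such as 'thir-'
    return "word"


def tag_line(line):
    """Split a transcription line on whitespace and label each token."""
    return [(tok, classify(tok)) for tok in line.split()]


if __name__ == "__main__":
    sample = "<UH> hi !John , this is !I.B.M. at nine A.M. thir- thirty ?grmf <click>"
    for tok, label in tag_line(sample):
        print(f"{tok}\t{label}")
```

Because every punctuation mark is surrounded by spaces, a plain whitespace split is enough to separate tokens; the order of the checks matters only so that a lone ? or ! is treated as punctuation rather than as a non-word or name prefix.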