CSR WSJ0 Detailed Orthographic Transcription (.dot) Specification

CCCC Transcription Subcommittee: John Garofolo, Doug Paul, and Mike Phillips, with help from Jon Fiscus and Bill Fisher

12/12/91

Revised 01/05/93 by John Garofolo to relax rules requiring prosodic markings and capitalization per the CCCC conference call 11/24/92.

Specification for CSR transcription conventions using extended SRO notation:

The following specification is written in an .sro-conformance approach but adds notations for the following:

* Please note that the SRO additions may not be compliant with additions being developed simultaneously by MADCOW.

The Detailed Orthographic Transcription (.dot) file will contain a case-sensitive transcription consisting of markings for an utterance's orthography, some prosodics and disfluencies, and non-speech events.

1. Orthography:

The lexical tokens in the transcription will be generated without special regard to case and capitalization. Appropriate capitalization is encouraged but not required. Grammatical (non-verbalized) punctuation will be excluded, except for apostrophes and for periods (.) used specifically in abbreviations. Non-alphanumeric characters which are part of a lexical item will be prefaced by the escape character, "\".

1.1 Read Speech

In the case of read speech, normal lexical items will be represented as they are in the truth text which corresponds to the prompt used to elicit the speech.

1.2 Spontaneous Speech

In the case of spontaneous speech, the following rules will be used in transcribing lexical items:

2. Disfluencies:

2.1 Mispronunciations

Obviously mispronounced but intelligible words should be delimited with a "*". When in doubt, if possible, the subject should be allowed to decide him/herself if he/she mispronounced a word. This construct should be used sparingly.

e.g.

If the prompt read, "He grew up in Belair." and the subject said, "He grew up in Blair." then the utterance should be transcribed:

he grew up in *belair*

2.2 Verbal Deletions

Words which are verbally deleted - replaced with other words by the subject later in the utterance - are to be enclosed in angle brackets, "<>":

e.g.

The plane dropped <quickly> <uh> precipitously into the boiling ocean below
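As a rough illustration of how the angle-bracket convention might be processed, the following Python sketch separates verbally deleted words from the words the subject kept. It assumes that each deleted word is bracketed individually and delimited by whitespace, as in the example above. The function name `split_verbal_deletions` is hypothetical and is not part of this specification.

```python
def split_verbal_deletions(transcription):
    """Return (kept_tokens, deleted_words) for one transcription string.

    Assumption: every verbally deleted word is its own whitespace-delimited
    token of the form <word>, as in the Section 2.2 example.
    """
    kept, deleted = [], []
    for token in transcription.split():
        if len(token) > 2 and token.startswith("<") and token.endswith(">"):
            deleted.append(token[1:-1])  # drop the surrounding angle brackets
        else:
            kept.append(token)
    return kept, deleted


kept, deleted = split_verbal_deletions(
    "The plane dropped <quickly> <uh> precipitously into the boiling ocean below"
)
print(" ".join(kept))
print(deleted)
```

Because the check is made on whole tokens, a concurrent-event descriptor such as [<door_slam] is not mistaken for a deletion.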

2.3 False Starts and Spoken Word Fragments

Incompletely spoken words will be transcribed using the following notation:

3. Prosodic Markings

3.1 Pauses

Only conspicuous pauses are to be marked, each with a single "." indicating the location of the pause.

3.2 Emphatic Stress

Emphatic stress is indicated by prepending a "!" to the word or syllable which was stressed. This only includes stress which would not normally occur due to lexical and syntactic factors.

3.3 Lengthening

Lengthening is transcribed by appending a ":" to the lengthened sound. This only includes lengthening which would not normally occur due to lexical and syntactic factors.
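A scoring tool will often want the plain word sequence without the prosodic marks of this section. The Python sketch below shows one way that could be done. It assumes that a pause is written as a standalone, whitespace-delimited "." token and that a "!" or ":" preceded by the escape character "\" belongs to the lexical item and must be kept. The function name `strip_prosody` and the sample line are made up for illustration and are not taken from the corpus.

```python
import re

# Unescaped "!" (emphatic stress) or ":" (lengthening). A preceding "\"
# marks the character as part of the lexical item (Section 1).
_PROSODY = re.compile(r"(?<!\\)[!:]")


def strip_prosody(transcription):
    """Remove pause, stress, and lengthening marks from a transcription."""
    words = []
    for token in transcription.split():
        if token == ".":  # assumed: a pause is a standalone "." token
            continue
        words.append(_PROSODY.sub("", token))
    return " ".join(words)


print(strip_prosody("he was . !very late: again"))
```

Periods inside abbreviations are untouched, because only a token consisting of a single "." is treated as a pause.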

4. Descriptive Markings of Speech and Non-Speech Events

4.1 Non-speech Events

Non-speech events will be indicated by a descriptor enclosed in square brackets. The descriptor is to contain only alphabetic characters and underscores and, if possible, should be drawn from the following list:

ah
chair_squeak
cough
cross_talk
door_slam
er
grunt
laughter
lip_smack
loud_breath
mm
paper_rustle
phone_ring
sigh
throat_clear
tongue_click
uh
um
unintelligible

e.g.

The doctor said \"double-quote [throat_clear] open wide \"double-quote

4.2 Descriptor Placement and Concurrent Events

A descriptor is to be placed in the orthography at the point at which it occurs. If a non-speech event overlaps with a spoken lexical item, the descriptor should be placed next to the lexical item it co-occurred with, and the character ">" or "<" should be appended or prepended to the descriptor, respectively, depending on whether the descriptor is placed to the left or right of the co-occurring lexical item, so that the character points toward that item.

e.g.

the escaped convict [<door_slam] ran for his life
and
the escaped [door_slam>] convict ran for his life
are roughly equivalent.

If a phenomenon is noted throughout, or co-occurs with, more than one lexical item, then the phenomenon's descriptor is to be used in the following notation to bound the lexical items it spans:

[descriptor/] word word ... word [/descriptor]

The "/" appended to the start descriptor and prepended to the end descriptor indicates that the phenomenon spans the bracketed lexical items.

e.g.

[cross_talk/] The plane narrowly escaped disaster [/cross_talk] as it took off

4.3 Speech Style

A marked change in speaking style should be transcribed using the same notation as in Section 4.2 and the following descriptors:

loud
soft
whisper

4.4 Bad Recording

If the recording quality of an utterance is so bad that it defies transcription, then the flag, "[bad_recording]", can be substituted for the transcription in the .dot file and the utterance will be viewed as unusable.

e.g.

[bad_recording] (500c302b)

Please note that this convention should be used very sparingly.
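The bracketed descriptor forms used in Sections 4.1 through 4.4 can be told apart mechanically. The Python sketch below is one possible way to do so and is not a reference implementation. The function name `classify_descriptor` and the kind labels it returns ("isolated", "overlaps_previous", and so on) are hypothetical names chosen for this example. It assumes descriptors are whitespace-delimited tokens containing only letters and underscores, as required in Section 4.1.

```python
import re

# [name], [<name], [name>], [name/], [/name]
_DESCRIPTOR = re.compile(r"\[([</]?)([A-Za-z_]+)([>/]?)\]")

# (prefix, suffix) -> hypothetical label for the form
_KINDS = {
    ("", ""): "isolated",            # [cough], [bad_recording]
    ("<", ""): "overlaps_previous",  # word [<door_slam]
    ("", ">"): "overlaps_next",      # [door_slam>] word
    ("", "/"): "span_start",         # [cross_talk/]
    ("/", ""): "span_end",           # [/cross_talk]
}


def classify_descriptor(token):
    """Return (name, kind) for a bracketed descriptor token, else None."""
    match = _DESCRIPTOR.fullmatch(token)
    if match is None:
        return None
    prefix, name, suffix = match.groups()
    kind = _KINDS.get((prefix, suffix))
    return None if kind is None else (name, kind)


line = "[cross_talk/] The plane narrowly escaped disaster [/cross_talk] as it took off"
for token in line.split():
    result = classify_descriptor(token)
    if result is not None:
        print(token, result)
```

Tokens that combine markers in a way the sections above do not describe, such as a descriptor with both "<" and ">", are rejected by returning None.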

5. Waveform Truncation

If a waveform file is truncated due to a recording error by the system or by the failure of the subject to press/depress the push-to-talk button at the proper times, the following notation in the corresponding transcription file is to be used:

* In the CSR corpus, null waveforms should probably be discarded, so these would not exist in the distributed data.

6. Utterance Identification

The 8-character utterance ID from the filename (minus extension) is to be placed at the end of each transcription string in parentheses immediately followed by a new-line character. The parenthesized utterance ID is to be separated from the transcription string by one space character.

text text text (utterance-ID)<new-line>

e.g.

Los Angeles based Government Funding is used to picking up where banks leave off (400c2001)
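To show how the line layout above might be read back, the following Python sketch splits a .dot line into its transcription string and utterance ID. It assumes the 8-character ID is alphanumeric, which matches the examples in this document but is not stated as a rule. The function name `parse_dot_line` is hypothetical.

```python
import re

# Transcription text, one space, then the parenthesized 8-character ID.
# Assumption: the ID consists of letters and digits only.
_DOT_LINE = re.compile(r"(?P<text>.*) \((?P<utt_id>[0-9A-Za-z]{8})\)")


def parse_dot_line(line):
    """Split one .dot line into (transcription, utterance_id)."""
    match = _DOT_LINE.fullmatch(line.rstrip("\r\n"))
    if match is None:
        raise ValueError("not a valid .dot line: %r" % line)
    return match.group("text"), match.group("utt_id")


text, utt_id = parse_dot_line(
    "Los Angeles based Government Funding is used to picking up "
    "where banks leave off (400c2001)\n"
)
print(utt_id)
print(text)
```

A line that lacks the trailing parenthesized ID raises ValueError, which makes malformed transcription files easy to spot.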