File dot_spec.doc

CSR WSJ0 Detailed Orthographic Transcription (.dot) Specification

CCCC Transcription Subcommittee: John Garofolo, Doug Paul, and Mike Phillips, with help from Jon Fiscus and Bill Fisher

12/12/91

Revised 01/05/93 by John Garofolo to relax rules requiring prosodic markings and capitalization per the CCCC conference call 11/24/92.

Specification for CSR transcription conventions using extended SRO notation:

This specification follows the .sro conventions but adds notations for the following:

Inclusion of non-alpha-numeric characters in lexical items.
Rules for generating proper lexical forms
Descriptors for additional non-speech events
Format for transcribing co-occurrence of speech and non-speech phenomena
Format for bracketing phenomena across lexical items
Descriptors and format for transcribing speech style changes
Inclusion of within-transcription utterance ID

* Please note that the SRO additions may not be compliant with additions being developed simultaneously by MADCOW.

The Detailed Orthographic Transcription (.dot) file will contain a case-sensitive transcription consisting of markings for an utterance's orthography, some prosodics and disfluencies, and non-speech events.

1. Orthography:

The lexical tokens in the transcription will be generated without special regard to case and capitalization. Appropriate capitalization is encouraged but not required. Grammatical (non verbalized) punctuation will be excluded except for periods (.) used specifically in abbreviations and apostrophes. Non-alpha-numeric characters which are part of a lexical item will be prefaced by the escape character, "\".

1.1 Read Speech

In the case of read speech, normal lexical items will be represented as they are in the truth text which corresponds to the prompt used to elicit the speech.

1.2 Spontaneous Speech

In the case of spontaneous speech, the following rules will be used in transcribing lexical items:

Verbalized punctuation:

		- punctuation marks represented by:
			,COMMA
			.PERIOD
			"DOUBLE-QUOTE
			-HYPHEN
			.POINT
			%PERCENT
			--DASH
			&ERSAND
			:COLON
			)RIGHT-PAREN
			(LEFT-PAREN
			;SEMI-COLON
			?QUESTION-MARK
			'SINGLE-QUOTE
			...ELLIPSIS
			/SLASH
			}RIGHT-BRACE
			{LEFT-BRACE
			!EXCLAMATION-POINT
			+PLUS
			=EQUALS
			#SHARP-SIGN
			-MINUS
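
For tooling that needs to recognize these tokens, the list above can be held in a lookup table. The following is a minimal Python sketch; the names `VERBALIZED_PUNCTUATION`, `split_token`, and `SYMBOL_BY_NAME` are illustrative and not part of this specification.

```python
# Verbalized-punctuation tokens, copied from the list above.
VERBALIZED_PUNCTUATION = [
    ",COMMA", ".PERIOD", '"DOUBLE-QUOTE', "-HYPHEN", ".POINT", "%PERCENT",
    "--DASH", "&AMPERSAND", ":COLON", ")RIGHT-PAREN", "(LEFT-PAREN",
    ";SEMI-COLON", "?QUESTION-MARK", "'SINGLE-QUOTE", "...ELLIPSIS", "/SLASH",
    "}RIGHT-BRACE", "{LEFT-BRACE", "!EXCLAMATION-POINT", "+PLUS", "=EQUALS",
    "#SHARP-SIGN", "-MINUS",
]


def split_token(token):
    """Split a token such as '--DASH' into (symbol, name)."""
    # The name starts at the first alphabetic character.
    for i, ch in enumerate(token):
        if ch.isalpha():
            return token[:i], token[i:]
    raise ValueError(f"no name part in {token!r}")


# Map each spoken name to the punctuation symbol it stands for.
SYMBOL_BY_NAME = {name: symbol
                  for symbol, name in map(split_token, VERBALIZED_PUNCTUATION)}
```

Note that two pairs of names share a symbol: HYPHEN and MINUS both map to "-", and PERIOD and POINT both map to ".".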

Non-verbalized punctuation: Transcribe what the speaker said. The following notations were used in the read texts:
```
		/ -> slash	eg. and/or -> and slash or
		% -> percent
		& -> and	eg. AT&T -> A. T. and T.
		. (decimal point) -> point
```

Letters:

		Normal (append a .): eg. IBM -> I. B. M.
		Plural (append .s): eg. IBMs -> I. B. M.s
		Possessive (append .'s): eg IBM's -> I. B. M.'s

Acronyms:

		- if pronounced as letters, spell out
			eg. IBM -> I. B. M., USAir -> U. S. Air
		- if pronounced as a word, leave it as a word
			eg. DARPA, NASDAQ
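
The letter rules above can be sketched as a small Python helper. The function name `spell_letters` and its parameters are illustrative only. Mixed forms such as USAir, and the decision whether an acronym was pronounced as letters or as a word, are left to the transcriber.

```python
def spell_letters(letters, plural=False, possessive=False):
    """Spell a letter sequence per the Letters rules: IBM -> I. B. M."""
    # Each letter gets a trailing period; letters are separated by spaces.
    spelled = " ".join(f"{ch.upper()}." for ch in letters)
    if possessive:
        return spelled + "'s"   # IBM's -> I. B. M.'s
    if plural:
        return spelled + "s"    # IBMs  -> I. B. M.s
    return spelled
```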

Numbers (incl Roman numerals): write out orthographic representation of what was said
```
		eg. 1935 -> nineteen thirty five
		    $123 -> one hundred twenty three dollars
```
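
The transcriber always writes what was actually said, so no program can choose the reading. As an aid only, the following Python sketch generates two candidate readings that match the examples above. All function names are illustrative, and the sketch covers only quantities from 0 to 999 and years read as two two-digit pairs.

```python
ONES = [
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen",
]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def under_100(n):
    """Words for 0..99, written without hyphens."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] if ones == 0 else f"{TENS[tens]} {ONES[ones]}"


def cardinal(n):
    """Words for 0..999 read as a quantity."""
    if not 0 <= n <= 999:
        raise ValueError("sketch only covers 0..999")
    if n < 100:
        return under_100(n)
    hundreds, rest = divmod(n, 100)
    words = f"{ONES[hundreds]} hundred"
    return words if rest == 0 else f"{words} {under_100(rest)}"


def year_reading(n):
    """Words for a four-digit year read as two pairs of digits."""
    first, second = divmod(n, 100)
    if not (10 <= first <= 99 and 10 <= second <= 99):
        raise ValueError("sketch only covers years read as two pairs")
    return f"{under_100(first)} {under_100(second)}"
```

For example, `year_reading(1935)` gives "nineteen thirty five" and `cardinal(123) + " dollars"` gives "one hundred twenty three dollars".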

All abbreviations spelled out EXCEPT FOR:

		Mr., Mrs., Ms., and Messrs.
		(There are NO English equivalents for Mrs. and Messrs.)

Hyphenated words--none in transcription (except for verbalized punctuation)
		- remove the hyphen (if normal usage), or expand into 2 words (non-verbalized punctuation) or 3 words (verbalized punctuation)
		- check file wfl-64 to see whether the word occurs without the hyphen; otherwise break it into separate words, eg.
			- compound in wfl-64: NON-STOP -> NONSTOP
			- compound not in wfl-64:
				- non-verbalized punctuation: hard-headed -> hard headed
				- verbalized punctuation: hard-headed -> hard -HYPHEN headed
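
The hyphenation rule can be sketched in Python as follows. The format of wfl-64 is not described here, so the sketch assumes its entries have already been loaded into an in-memory set of upper-case words. The parameter `lexicon` and the function name are hypothetical.

```python
def transcribe_hyphenated(word, lexicon, verbalized=False):
    """Apply the hyphenated-word rule to a single hyphenated word.

    lexicon    -- set of upper-case words assumed to come from wfl-64
    verbalized -- True if the speaker said the word "hyphen" aloud
    """
    joined = word.replace("-", "")
    if joined.upper() in lexicon:
        # Compound occurs without the hyphen: write it as one word.
        return joined
    # Otherwise break into separate words, keeping a spoken hyphen.
    separator = " -HYPHEN " if verbalized else " "
    return separator.join(word.split("-"))
```

With `lexicon = {"NONSTOP"}`, the word NON-STOP becomes NONSTOP, while hard-headed becomes "hard headed" or, with verbalized punctuation, "hard -HYPHEN headed".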

2. Disfluencies:

2.1 Mispronunciations

Obviously mispronounced but intelligible words should be delimited with a "*". When in doubt, if possible, the subject should be allowed to decide him/herself if he/she mispronounced a word. This construct should be used sparingly.

i.e.

If the prompt read, "He grew up in Belair." and the subject said, "He grew up in Blair." then the utterance should be transcribed: he grew up in *belair*
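
A minimal Python sketch for locating such markings in a transcription string is given below. The function names are illustrative, and the pattern assumes a marked word contains no whitespace.

```python
import re

# A word delimited by "*" on both sides, with no whitespace inside.
MISPRONOUNCED = re.compile(r"\*([^*\s]+)\*")


def mispronounced_words(transcription):
    """Return the words flagged as mispronounced."""
    return MISPRONOUNCED.findall(transcription)


def strip_mispronunciation_marks(transcription):
    """Return the transcription with the "*" delimiters removed."""
    return MISPRONOUNCED.sub(r"\1", transcription)
```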

2.2 Verbal Deletions

Words which are verbally deleted - replaced with other words by the subject later in the utterance - are to be enclosed in angle brackets, "<>":

i.e.

The plane dropped <quickly> <uh> precipitously into the boiling ocean below
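
As an illustration, the following Python sketch separates verbally deleted words from the words the speaker finally intended. The function names are illustrative. The pattern excludes square brackets so that the "<" and ">" characters used inside descriptors (Section 4.2) are not mistaken for deletions.

```python
import re

# Text enclosed in angle brackets, not crossing a square bracket.
DELETED = re.compile(r"<([^<>\[\]]*)>")


def deleted_words(transcription):
    """Return the verbally deleted words, in order of occurrence."""
    return DELETED.findall(transcription)


def intended_text(transcription):
    """Return the transcription with verbal deletions removed."""
    # Replace each deletion with a space, then normalize whitespace.
    return " ".join(DELETED.sub(" ", transcription).split())
```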

2.3 False Starts and Spoken Word Fragments

Incompletely spoken words will be transcribed using the following notation:

Beginning of word truncation (missing fragment known):
		-(missing_fragment)spoken_fragment
Beginning of word truncation (missing fragment unknown):
		-spoken_fragment
End of word truncation (missing fragment known):
		spoken_fragment(missing_fragment)-
End of word truncation (missing fragment unknown):
		spoken_fragment-
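
The four fragment forms can be recognized with two patterns, as in the Python sketch below. The names `Fragment` and `parse_fragment` are illustrative. Note that a verbalized punctuation token such as -HYPHEN has the same shape as a beginning truncation, so the sketch assumes such tokens are filtered out by the caller first.

```python
import re
from typing import NamedTuple, Optional


class Fragment(NamedTuple):
    spoken: str             # the part of the word that was spoken
    missing: Optional[str]  # the missing part, or None if unknown
    truncated: str          # "beginning" or "end"


# -(missing)spoken  or  -spoken
_BEGIN = re.compile(r"^-(?:\((?P<missing>[^()]+)\))?(?P<spoken>[^()\-]+)$")
# spoken(missing)-  or  spoken-
_END = re.compile(r"^(?P<spoken>[^()\-]+)(?:\((?P<missing>[^()]+)\))?-$")


def parse_fragment(token):
    """Return a Fragment for a word-fragment token, else None."""
    match = _BEGIN.match(token)
    if match:
        return Fragment(match["spoken"], match["missing"], "beginning")
    match = _END.match(token)
    if match:
        return Fragment(match["spoken"], match["missing"], "end")
    return None
```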

3. Prosodic Markings

3.1 Pauses

Only conspicuous pauses are to be marked with a single "." indicating the location of the pause.

3.2 Emphatic Stress

Emphatic stress is indicated by prepending a "!" to the word or syllable which was stressed. This only includes stress which would not normally occur due to lexical and syntactic factors.

3.3 Lengthening

Lengthening is transcribed by appending a ":" to the lengthened sound. This only includes lengthening which would not normally occur due to lexical and syntactic factors.
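
A scoring tool may need to remove the prosodic markings of this section before comparing words. The Python sketch below is one possible approach, not a prescribed one. It assumes a pause is a stand-alone "." token, and it relies on the escape rule of Section 1: a "!" or ":" preceded by "\" belongs to a lexical item and is left alone. The function name is illustrative.

```python
import re

# "!" (stress) or ":" (lengthening) that is not escaped with a backslash.
_PROSODIC_MARK = re.compile(r"(?<!\\)[!:]")


def strip_prosody(transcription):
    """Remove pause, stress, and lengthening marks from a transcription."""
    words = []
    for token in transcription.split():
        if token == ".":
            continue  # a stand-alone period marks a pause
        words.append(_PROSODIC_MARK.sub("", token))
    return " ".join(words)
```

For a made-up input such as "I . !never said tha:t", the function returns "I never said that". Periods attached to abbreviations or spelled letters (Mr., I. B. M.) are not stand-alone tokens and are kept.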

4. Descriptive Markings of Speech and Non-Speech Events

4.1 Non-speech Events

Non-speech events will be indicated by a descriptor enclosed in square brackets. The descriptor is to contain only alphabetic characters and underscores and, if possible, should be drawn from the following list:

ah
chair_squeak
cough
cross_talk
door_slam
er
grunt
laughter
lip_smack
loud_breath
mm
paper_rustle
phone_ring
sigh
throat_clear
tongue_click
uh
um
unintelligible

i.e.

The doctor said \"double-quote [throat_clear] open wide \"double-quote
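
A transcription checker might extract the plain descriptors and flag those not drawn from the list above, as in this Python sketch. The names are illustrative. Only the simple bracketed form of this section is handled; the "<", ">", and "/" variants of Section 4.2 are not matched, and the speech style and bad recording descriptors of Sections 4.3 and 4.4 would be reported as unlisted.

```python
import re

# Descriptors recommended in Section 4.1.
KNOWN_DESCRIPTORS = {
    "ah", "chair_squeak", "cough", "cross_talk", "door_slam", "er", "grunt",
    "laughter", "lip_smack", "loud_breath", "mm", "paper_rustle",
    "phone_ring", "sigh", "throat_clear", "tongue_click", "uh", "um",
    "unintelligible",
}

# Square brackets enclosing only alphabetic characters and underscores.
_DESCRIPTOR = re.compile(r"\[([A-Za-z_]+)\]")


def find_descriptors(transcription):
    """Return every plain [descriptor] in order of occurrence."""
    return _DESCRIPTOR.findall(transcription)


def unlisted_descriptors(transcription):
    """Return descriptors that are well formed but not in the list."""
    return [d for d in find_descriptors(transcription)
            if d not in KNOWN_DESCRIPTORS]
```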

4.2 Descriptor Placement and Concurrent Events

A descriptor is to be placed in the orthography at the point at which it occurs. If a non-speech event overlaps with a spoken lexical item, the descriptor should be placed next to the lexical item it co-occurred with, and the character ">" or "<" should be appended or prepended to the descriptor, respectively, depending on whether the descriptor is placed to the left or right of the co-occurring lexical item.

i.e.

the escaped convict [<door_slam] ran for his life

and

the escaped [door_slam>] convict ran for his life

are roughly equivalent
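
The equivalence of the two forms can be made concrete with a short Python sketch that pairs each overlapping descriptor with the lexical item it points to. The function name is illustrative, and tokens are assumed to be separated by whitespace.

```python
import re

_POINTS_LEFT = re.compile(r"^\[<([A-Za-z_]+)\]$")   # [<descriptor]
_POINTS_RIGHT = re.compile(r"^\[([A-Za-z_]+)>\]$")  # [descriptor>]


def cooccurrences(transcription):
    """Return (descriptor, lexical_item) pairs for overlapping events."""
    tokens = transcription.split()
    pairs = []
    for i, token in enumerate(tokens):
        match = _POINTS_LEFT.match(token)
        if match and i > 0:
            pairs.append((match[1], tokens[i - 1]))  # item on the left
        match = _POINTS_RIGHT.match(token)
        if match and i + 1 < len(tokens):
            pairs.append((match[1], tokens[i + 1]))  # item on the right
    return pairs
```

Both example transcriptions above yield the same pair, door_slam with convict.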

If a phenomenon is noted throughout, or co-occurs with, more than one lexical item, then the phenomenon's descriptor is to be used in the following notation to bound the lexical items it spans:

[descriptor/] word word ... word [/descriptor]

The "/" appended to the start descriptor and prepended to the end descriptor indicates that the phenomenon spans the bracketed lexical items.

i.e.

[cross_talk/] The plane narrowly escaped disaster [/cross_talk] as it took off
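
The bracketing notation can be checked and unpacked with a sketch such as the following Python function. The name `spans` is illustrative. This specification does not say whether spans of different descriptors may overlap; the sketch assumes they may, and it rejects only an unmatched or repeated start or end marker.

```python
import re

_START = re.compile(r"^\[([A-Za-z_]+)/\]$")  # [descriptor/]
_END = re.compile(r"^\[/([A-Za-z_]+)\]$")    # [/descriptor]


def spans(transcription):
    """Return (descriptor, tokens) for each bracketed span, in closing order."""
    open_spans = {}  # descriptor -> tokens collected so far
    closed = []
    for token in transcription.split():
        match = _START.match(token)
        if match:
            if match[1] in open_spans:
                raise ValueError(f"span {match[1]!r} opened twice")
            open_spans[match[1]] = []
            continue
        match = _END.match(token)
        if match:
            if match[1] not in open_spans:
                raise ValueError(f"span {match[1]!r} closed but never opened")
            closed.append((match[1], open_spans.pop(match[1])))
            continue
        # Any other token belongs to every span that is currently open.
        for collected in open_spans.values():
            collected.append(token)
    if open_spans:
        raise ValueError(f"unclosed spans: {sorted(open_spans)}")
    return closed
```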

4.3 Speech Style

A marked change in speaking style should be transcribed using the same notation as in Section 4.2 and the following descriptors:

loud
soft
whisper

4.4 Bad Recording

If the recording quality of an utterance is so bad that it defies transcription, then the flag, "[bad_recording]", can be substituted for the transcription in the .dot file and the utterance will be viewed as unusable.

i.e.

[bad_recording] (500c302b)

Please note that this convention should be used very sparingly.

5. Waveform Truncation

If a waveform file is truncated due to a recording error by the system or by the failure of the subject to press/depress the push-to-talk button at the proper times, the following notation in the corresponding transcription file is to be used:

Beginning of utterance truncation:
		~ transcription
End of utterance truncation:
		transcription ~
Beginning and end of utterance truncation:
		~ transcription ~
*Null waveform:
		~~

*In the CSR corpus, null waveforms should probably be discarded, so these would not exist in the distributed data.
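
The truncation notation can be read back with a sketch like the Python function below. The function name and the returned dictionary keys are illustrative. The input is assumed to be the transcription string alone, with the utterance ID of Section 6 already removed.

```python
def parse_truncation(transcription):
    """Report waveform truncation marks and return the remaining text."""
    tokens = transcription.split()
    if tokens == ["~~"]:
        # Null waveform: nothing was recorded.
        return {"null": True, "begin": False, "end": False, "text": ""}
    begin = bool(tokens) and tokens[0] == "~"
    if begin:
        tokens = tokens[1:]
    end = bool(tokens) and tokens[-1] == "~"
    if end:
        tokens = tokens[:-1]
    return {"null": False, "begin": begin, "end": end,
            "text": " ".join(tokens)}
```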

6. Utterance Identification

The 8-character utterance ID from the filename (minus extension) is to be placed at the end of each transcription string in parentheses immediately followed by a new-line character. The parenthesized utterance ID is to be separated from the transcription string by one space character.

text text text (utterance-ID)<new-line>

i.e.

Los Angeles based Government Funding is used to picking up where banks leave off (400c2001)
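
A line of a .dot file can be split into its transcription string and utterance ID with a sketch such as the following. The function name is illustrative, and the pattern assumes the 8-character ID contains no whitespace or parentheses.

```python
import re

# Transcription, one space, then an 8-character ID in parentheses.
_DOT_LINE = re.compile(r"^(?P<text>.*) \((?P<utt_id>[^()\s]{8})\)$")


def split_utterance(line):
    """Return (transcription, utterance_id) for one line of a .dot file."""
    match = _DOT_LINE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError(f"no utterance ID found in {line!r}")
    return match["text"], match["utt_id"]
```

Applied to the example above, the function returns the transcription text together with the ID 400c2001. The bad recording example of Section 4.4 yields "[bad_recording]" and 500c302b.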