CCCC Transcription Subcommittee John Garofolo, Doug Paul, and Mike Phillips with help from Jon Fiscus and Bill Fisher
12/12/91
Revised 01/05/93 by John Garofolo to relax rules requiring prosodic markings and capitalization per the CCCC conference call 11/24/92.
Specification for CSR transcription conventions using extended SRO notation:
The following specification is written in an .sro-conformance approach but adds notations for the following:
The Detailed Orthographic Transcription (.dot) file will contain a case-sensitive transcription consisting of markings for an utterance's orthography, some prosodics and disfluencies, and non-speech events.
1. Orthography:
The lexical tokens in the transcription will be generated without special regard to case and capitalization. Appropriate capitalization is encouraged but not required. Grammatical (non verbalized) punctuation will be excluded except for periods (.) used specifically in abbreviations and apostrophes. Non-alpha-numeric characters which are part of a lexical item will be prefaced by the escape character, "\".
1.1 Read Speech
In the case of read speech, normal lexical items will be represented as they are in the truth text which corresponds to the prompt used to elicit the speech.
1.2 Spontaneous Speech
In the case of spontaneous speech, the following rules will be used in transcribing lexical items:
- punctuation marks represented by:
    ,COMMA                .PERIOD           "DOUBLE-QUOTE
    -HYPHEN               .POINT            %PERCENT
    --DASH                &AMPERSAND        :COLON
    )RIGHT-PAREN          (LEFT-PAREN       ;SEMI-COLON
    ?QUESTION-MARK        'SINGLE-QUOTE     ...ELLIPSIS
    /SLASH                }RIGHT-BRACE      {LEFT-BRACE
    !EXCLAMATION-POINT    +PLUS             =EQUALS
    #SHARP-SIGN           -MINUS
    /  -> slash      eg. and/or -> and slash or
    %  -> percent
    &  -> and        eg. AT&T -> A. T. and T.
    .  (decimal point) -> point
    Normal (append a .):        eg. IBM -> I. B. M.
    Plural (append .s):         eg. IBMs -> I. B. M.s
    Possessive (append .'s):    eg. IBM's -> I. B. M.'s
    - if pronounced as letters, spell out            eg. IBM -> I. B. M., USAir -> U. S. Air
    - if pronounced as a word, leave it as a word    eg. DARPA, NASDAQ
    eg. 1935 -> nineteen thirty five
        $123 -> one hundred twenty three dollars
Mr., Mrs., Ms., and Messrs. (There are NO English equivalents for Mrs. and Messrs.)
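The abbreviation rules above (normal, plural, and possessive forms of an abbreviation pronounced as letters) can be sketched in code. This is an illustrative sketch only, not part of the specification: the function name spell_abbreviation is hypothetical, and it assumes the caller has already decided that the token is pronounced letter by letter (words such as DARPA must not be passed in). Mixed tokens such as USAir or AT&T are not handled.

```python
def spell_abbreviation(token: str) -> str:
    """Spell out a letter-pronounced abbreviation, e.g. IBM -> I. B. M.

    Hypothetical helper: assumes the token is all capitals, optionally
    followed by a plural "s" or a possessive "'s".
    """
    suffix = ""
    if token.endswith("'s"):
        # Possessive: append .'s to the last letter
        token, suffix = token[:-2], "'s"
    elif token.endswith("s") and token[:-1].isupper():
        # Plural: append .s to the last letter
        token, suffix = token[:-1], "s"
    # Each letter is followed by a period; letters are separated by spaces
    return " ".join(f"{ch}." for ch in token) + suffix


if __name__ == "__main__":
    for example in ("IBM", "IBMs", "IBM's"):
        print(example, "->", spell_abbreviation(example))
```

Keeping the suffix attached to the final period, rather than emitting it as a separate token, matches the three example forms given in the rules.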
2.1 Mispronunciations
Obviously mispronounced but intelligible words should be delimited with a "*". When in doubt, if possible, the subject should be allowed to decide him/herself if he/she mispronounced a word. This construct should be used sparingly.
i.e.
If the prompt read, "He grew up in Belair." and the subject said, "He grew up in Blair." then the utterance should be transcribed:
he grew up in *belair*
2.2 Verbal Deletions
Words which are verbally deleted - replaced with other words by the subject later in the utterance - are to be enclosed in angle brackets, "<>":
i.e.
The plane dropped <quickly> <uh> precipitously into the boiling ocean below
2.3 False Starts and Spoken Word Fragments
Incompletely spoken words will be transcribed using the following notation:
-(missing_fragment)spoken_fragment
-spoken_fragment
spoken_fragment(missing_fragment)-
spoken_fragment-
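As an illustration of the four fragment notations above, the following sketch recognizes a fragment token and separates the spoken part from the optional missing part. It is not part of the specification: the name parse_fragment, the returned dictionary keys, and the sample tokens in the usage section are all hypothetical. It also assumes that a leading hyphen means the start of the word is missing and a trailing hyphen means the end is missing, which is inferred from where the parenthesized missing fragment appears in the notation.

```python
import re

# Either "-(missing)spoken" / "-spoken" or "spoken(missing)-" / "spoken-".
_FRAGMENT = re.compile(
    r"^(?:-(?:\((?P<lead_missing>[^()]+)\))?(?P<lead_spoken>[^()\-]+)"
    r"|(?P<trail_spoken>[^()\-]+)(?:\((?P<trail_missing>[^()]+)\))?-)$"
)


def parse_fragment(token: str):
    """Return a dict describing a word fragment, or None if not a fragment."""
    match = _FRAGMENT.match(token)
    if match is None:
        return None
    if match.group("lead_spoken") is not None:
        # Leading hyphen: the missing material precedes the spoken part
        return {
            "spoken": match.group("lead_spoken"),
            "missing": match.group("lead_missing"),
            "missing_side": "start",
        }
    # Trailing hyphen: the missing material follows the spoken part
    return {
        "spoken": match.group("trail_spoken"),
        "missing": match.group("trail_missing"),
        "missing_side": "end",
    }


if __name__ == "__main__":
    for token in ("precip(itously)-", "-(pre)cipitously", "quick-", "quickly"):
        print(token, "->", parse_fragment(token))
```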
3.1 Pauses
Only conspicuous pauses are to be marked with a single "." indicating the location of the pause.
3.2 Emphatic Stress
Emphatic stress is indicated by prepending a "!" to the word or syllable which was stressed. This only includes stress which would not normally occur due to lexical and syntactic factors.
3.3 Lengthening
Lengthening is transcribed by appending a ":" to the lengthened sound. This only includes lengthening which would not normally occur due to lexical and syntactic factors.
4.1 Non-speech Events
Non-speech events will be indicated by a descriptor enclosed in square brackets. The descriptor is to contain only alphabetic characters and underscores and, if possible, should be drawn from the following list:
    ah              chair_squeak    cough           cross_talk
    door_slam       er              grunt           laughter
    lip_smack       loud_breath     mm              paper_rustle
    phone_ring      sigh            throat_clear    tongue_click
    uh              um              unintelligible
i.e.
The doctor said \"double-quote [throat_clear] open wide \"double-quote
4.2 Descriptor Placement and Concurrent Events
A descriptor is to be placed in the orthography at the point at which it occurs. If a non-speech event overlaps with a spoken lexical item, the descriptor should be placed next to the lexical item it co-occurred with and the character, ">" or "<" should be appended or prepended to the descriptor depending on whether it is placed to the left or right of the co-occurring lexical item.
i.e.
the escaped convict [<door_slam] ran for his life
and
the escaped [door_slam>] convict ran for his life
are roughly equivalent
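The single-token descriptor forms described so far (a plain descriptor, and a descriptor marked with "<" or ">" to show overlap with a neighboring word) can be checked mechanically. The sketch below is illustrative and not part of the specification: parse_descriptor, KNOWN_DESCRIPTORS, and the returned keys are hypothetical names. The set of known descriptors is copied from the list in Section 4.1, and since that list is only a recommendation, an unlisted name is reported rather than rejected.

```python
import re

# Suggested descriptors from Section 4.1 (recommended, not mandatory).
KNOWN_DESCRIPTORS = frozenset(
    "ah chair_squeak cough cross_talk door_slam er grunt laughter "
    "lip_smack loud_breath mm paper_rustle phone_ring sigh throat_clear "
    "tongue_click uh um unintelligible".split()
)

# "[name]", "[<name]" or "[name>]"; names are letters and underscores only.
_DESCRIPTOR = re.compile(r"^\[(?P<left><)?(?P<name>[A-Za-z_]+)(?P<right>>)?\]$")


def parse_descriptor(token: str):
    """Return a dict for a single descriptor token, or None if malformed."""
    match = _DESCRIPTOR.match(token)
    if match is None or (match.group("left") and match.group("right")):
        return None
    if match.group("left"):
        overlaps = "previous"  # "<" points back at the word on the left
    elif match.group("right"):
        overlaps = "next"      # ">" points forward at the word on the right
    else:
        overlaps = None        # event does not overlap a lexical item
    name = match.group("name")
    return {"name": name, "overlaps": overlaps, "known": name in KNOWN_DESCRIPTORS}


if __name__ == "__main__":
    for token in ("[throat_clear]", "[<door_slam]", "[door_slam>]", "[door-slam]"):
        print(token, "->", parse_descriptor(token))
```

With this reading, both example transcriptions above attach the door slam to the word "convict", which is why they are described as roughly equivalent.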
If a phenomenon is noted throughout, or co-occurs with, more than one lexical item, then the phenomenon's descriptor is to be used in the following notation to bound the lexical items it spans:
[descriptor/] word word ... word [/descriptor]
The "/" appended to the start descriptor and prepended to the end descriptor indicates that the phenomenon spans the bracketed lexical items.
i.e.
[cross_talk/] The plane narrowly escaped disaster [/cross_talk] as it took off
4.3 Speech Style
A marked change in speaking style should be transcribed using the same notation as in Section 4.2 and the following descriptors:
loud soft whisper
4.4 Bad Recording
If the recording quality of an utterance is so bad that it defies transcription, then the flag, "[bad_recording]", can be substituted for the transcription in the .dot file and the utterance will be viewed as unusable.
i.e.
[bad_recording] (500c302b)
Please note that this convention should be used very sparingly.
If a waveform file is truncated due to a recording error by the system or by the failure of the subject to press/depress the push-to-talk button at the proper times, the following notation in the corresponding transcription file is to be used:
~ transcription
transcription ~
~ transcription ~
~~
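The four truncation forms above differ only in where the "~" marker sits relative to the transcription. The following sketch reports the marker positions and the remaining text without interpreting them further. It is illustrative only: split_truncation_marks is a hypothetical name, and it assumes the marker is separated from the transcription by a single space, as in the forms shown.

```python
def split_truncation_marks(transcription: str):
    """Return (leading_mark, trailing_mark, body) for a transcription string.

    Assumes the "~" marker is set off from the text by one space and that
    "~~" stands alone with no transcription text.
    """
    text = transcription.strip()
    if text == "~~":
        return True, True, ""
    leading = text.startswith("~ ")
    trailing = text.endswith(" ~")
    body = text[2:] if leading else text
    body = body[:-2] if trailing else body
    return leading, trailing, body.strip()


if __name__ == "__main__":
    for sample in ("~ took off", "the plane ~", "~ plane took ~", "~~"):
        print(repr(sample), "->", split_truncation_marks(sample))
```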
The 8-character utterance ID from the filename (minus extension) is to be placed at the end of each transcription string in parentheses immediately followed by a new-line character. The parenthesized utterance ID is to be separated from the transcription string by one space character.
text text text (utterance-ID)<new-line>
i.e.
Los Angeles based Government Funding is used to picking up where banks leave off (400c2001)
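The line format described above (transcription string, one space, the parenthesized 8-character utterance ID, then a new-line) lends itself to a simple check. The sketch below is illustrative and not part of the specification: parse_dot_line is a hypothetical name, and the pattern assumes the utterance ID contains no whitespace or parentheses.

```python
import re

# transcription, one space, "(" + 8-character ID + ")", then a new-line
_DOT_LINE = re.compile(r"^(?P<text>.*) \((?P<utt_id>[^\s()]{8})\)\n$")


def parse_dot_line(line: str):
    """Split one .dot line into (transcription, utterance_id).

    Raises ValueError if the line does not follow the expected layout.
    """
    match = _DOT_LINE.match(line)
    if match is None:
        raise ValueError(f"malformed .dot line: {line!r}")
    return match.group("text"), match.group("utt_id")


if __name__ == "__main__":
    sample = (
        "Los Angeles based Government Funding is used to picking up "
        "where banks leave off (400c2001)\n"
    )
    print(parse_dot_line(sample))
    print(parse_dot_line("[bad_recording] (500c302b)\n"))
```

Anchoring the pattern at the end of the line keeps parentheses used for word fragments inside the transcription from being mistaken for the utterance ID.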