File: csr-dot-spec.doc.930105

CSR WSJ0 Detailed Orthographic Transcription (.dot) Specification

CCCC Transcription Subcommittee
John Garofolo, Doug Paul, and Mike Phillips
with help from Jon Fiscus and Bill Fisher

12/12/91

Revised 01/05/93 by John Garofolo to relax rules requiring prosodic
markings and capitalization per the CCCC conference call 11/24/92.

Specification for CSR transcription conventions using extended SRO notation:

The following specification is written in an .sro-conformance approach
but adds notations for the following:

    - Inclusion of non-alpha-numeric characters in lexical items.
    - Rules for generating proper lexical forms
    - Descriptors for additional non-speech events
    - Format for transcribing co-occurrence of speech and non-speech
      phenomena
    - Format for bracketing phenomena across lexical items
    - Descriptors and format for transcribing speech style changes
    - Inclusion of within-transcription utterance ID

    * Please note that the SRO additions may not be compliant with
      additions being developed simultaneously by MADCOW.

The Detailed Orthographic Transcription (.dot) file will contain a
case-sensitive transcription consisting of markings for an utterance's
orthography, some prosodics and disfluencies, and non-speech events.

1. Orthography:

   The lexical tokens in the transcription will be generated without
   special regard to case and capitalization. Appropriate capitalization
   is encouraged but not required. Grammatical (non verbalized)
   punctuation will be excluded except for periods (.) used specifically
   in abbreviations and apostrophes. Non-alpha-numeric characters which
   are part of a lexical item will be prefaced by the escape character,
   "\".

   1.1 Read Speech

       In the case of read speech, normal lexical items will be
       represented as they are in the truth text which corresponds to
       the prompt used to elicit the speech.
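The escape rule in Section 1 can be illustrated with a minimal sketch that recovers a lexical item from an escaped token such as \"double-quote. The function name unescape_lexical_item is hypothetical and not part of this specification, and the handling of a trailing lone escape character is an assumption.

```python
def unescape_lexical_item(token, escape="\\"):
    """Remove escape characters from a .dot lexical token (illustrative sketch).

    Assumption: an escape character always protects exactly the one
    character that follows it; a trailing lone escape is kept as-is.
    """
    out = []
    chars = iter(token)
    for ch in chars:
        if ch == escape:
            # Keep the escaped character literally.
            out.append(next(chars, escape))
        else:
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(unescape_lexical_item('\\"double-quote'))
    print(unescape_lexical_item("\\%percent"))
```

The sketch only strips escapes; it does not decide which characters must be escaped when a transcription is written.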
   1.2 Spontaneous Speech

       In the case of spontaneous speech, the following rules will be
       used in transcribing lexical items:

       - Verbalized punctuation:
           - punctuation marks represented by:

               ,COMMA                  .PERIOD
               "DOUBLE-QUOTE           -HYPHEN
               .POINT                  %PERCENT
               --DASH                  &AMPERSAND
               :COLON                  )RIGHT-PAREN
               (LEFT-PAREN             ;SEMI-COLON
               ?QUESTION-MARK          'SINGLE-QUOTE
               ...ELLIPSIS             /SLASH
               }RIGHT-BRACE            {LEFT-BRACE
               !EXCLAMATION-POINT      +PLUS
               =EQUALS                 #SHARP-SIGN
               -MINUS

       - Non-verbalized punctuation:
           Transcribe what the speaker said. The following notations
           were used in the read texts:

               /  -> slash       eg. and/or -> and slash or
               %  -> percent
               &  -> and         eg. AT&T -> A. T. and T.
               .  (decimal point) -> point

       - Letters:
           Normal (append a .):        eg. IBM   -> I. B. M.
           Plural (append .s):         eg. IBMs  -> I. B. M.s
           Possessive (append .'s):    eg. IBM's -> I. B. M.'s

       - Acronyms:
           - if pronounced as letters, spell out
               eg. IBM -> I. B. M., USAir -> U. S. Air
           - if pronounced as a word, leave it as a word
               eg. DARPA, NASDAQ

       - Numbers (incl Roman numerals): write out orthographic
         representation of what was said
               eg. 1935 -> nineteen thirty five
                   $123 -> one hundred twenty three dollars

       - All abbreviations spelled out EXCEPT FOR:
           Mr., Mrs., Ms., and Messrs.
           (There are NO English equivalents for Mrs. and Messrs.)

       - Hyphenated words--none in transcription (except for verbalized
         punctuation)
           - remove hyphen (if normal usage) or can be expanded as
             2 words (nonverbalized punct) or 3 words (verbalized punct)
           - Check file wfl-64 to see whether the word occurs without
             the hyphen. Otherwise break into separate words.
               eg:
               - compound in wfl-64:
                     NON-STOP -> NONSTOP
               - compound not in wfl-64
                   - non-verbalized punctuation:
                     hard-headed -> hard headed
                   - verbalized punctuation:
                     hard-headed -> hard -HYPHEN headed

2. Disfluencies:

   2.1 Mispronunciations

       Obviously mispronounced but intelligible words should be
       delimited with a "*". When in doubt, if possible, the subject
       should be allowed to decide him/herself if he/she mispronounced
       a word. This construct should be used sparingly.

       i.e.
           If the prompt read, "He grew up in Belair." and the subject
           said, "He grew up in Blair." then the utterance should be
           transcribed:

               he grew up in *belair*

   2.2 Verbal Deletions

       Words which are verbally deleted - replaced with other words by
       the subject later in the utterance - are to be enclosed in angle
       brackets, "<>":

       i.e. The plane dropped precipitously into the boiling ocean below

   2.3 False Starts and Spoken Word Fragments

       Incompletely spoken words will be transcribed using the
       following notation:

       - Beginning of word truncation (missing fragment known)
             -(missing_fragment)spoken_fragment
       - Beginning of word truncation (missing fragment unknown)
             -spoken_fragment
       - End of word truncation (missing fragment known)
             spoken_fragment(missing_fragment)-
       - End of word truncation (missing fragment unknown)
             spoken_fragment-

3. Prosodic Markings

   3.1 Pauses

       Only conspicuous pauses are to be marked, each with a single "."
       indicating the location of the pause.

   3.2 Emphatic Stress

       Emphatic stress is indicated by prepending a "!" to the word or
       syllable which was stressed. This only includes stress which
       would not normally occur due to lexical and syntactic factors.

   3.3 Lengthening

       Lengthening is transcribed by appending a ":" to the lengthened
       sound. This only includes lengthening which would not normally
       occur due to lexical and syntactic factors.

4. Descriptive Markings of Speech and Non-Speech Events

   4.1 Non-speech Events

       Non-speech events will be indicated by a descriptor enclosed in
       square brackets. The descriptor is to contain only alphabetic
       characters and underscores and, if possible, should be drawn
       from the following list:

           ah              chair_squeak    cough
           cross_talk      door_slam       er
           grunt           laughter        lip_smack
           loud_breath     mm              paper_rustle
           phone_ring      sigh            throat_clear
           tongue_click    uh              um
           unintelligible

       i.e.
           The doctor said \"double-quote [throat_clear] open wide \"double-quote

   4.2 Descriptor Placement and Concurrent Events

       A descriptor is to be placed in the orthography at the point at
       which it occurs. If a non-speech event overlaps with a spoken
       lexical item, the descriptor should be placed next to the
       lexical item it co-occurred with, and the character ">" or "<"
       should be appended or prepended to the descriptor, depending on
       whether it is placed to the left or right of the co-occurring
       lexical item.

       i.e. the escaped convict [<descriptor] ran for his life
            the escaped [descriptor>] convict ran for his life
                        are roughly equivalent

       If a phenomenon is noted throughout, or co-occurs with, more
       than one lexical item, then the phenomenon's descriptor is to be
       used in the following notation to bound the lexical items it
       spans:

           [descriptor/] word word ... word [/descriptor]

       The "/" appended to the start descriptor and prepended to the
       end descriptor indicates that the phenomenon spans the bracketed
       lexical items.

       i.e. [cross_talk/] The plane narrowly escaped disaster [/cross_talk] as it took off

   4.3 Speech Style

       A marked change in speaking style should be transcribed using
       the same notation as in Section 4.2 and the following
       descriptors:

           loud    soft    whisper

   4.4 Bad Recording

       If the recording quality of an utterance is so bad that it
       defies transcription, then the flag, "[bad_recording]", can be
       substituted for the transcription in the .dot file and the
       utterance will be viewed as unusable.

       i.e. [bad_recording] (500c302b)

       Please note that this convention should be used very sparingly.

5. Waveform Truncation

   If a waveform file is truncated due to a recording error by the
   system or by the failure of the subject to press/depress the
   push-to-talk button at the proper times, the following notation in
   the corresponding transcription file is to be used:

       - Beginning of utterance truncation:
             ~ transcription
       - End of utterance truncation:
             transcription ~
       - Beginning and end of utterance truncation:
             ~ transcription ~
       - *Null waveform
             ~~

       *In the CSR corpus, null waveforms should probably be discarded,
        so these would not exist in the distributed data.

6. Utterance Identification

   The 8-character utterance ID from the filename (minus extension) is
   to be placed at the end of each transcription string in parentheses
   immediately followed by a new-line character. The parenthesized
   utterance ID is to be separated from the transcription string by one
   space character.

       text text text (utterance-ID)

   i.e. Los Angeles based Government Funding is used to picking up where banks leave off (400c2001)

--------------------------------------------------------------------------

Deferred proposal for CSR transcription conventions using extended SGML
notation:

Adopt-SGML:

The current MADCOW .sro specs are incompatible with the WSJ data.
Furthermore, since they use normal punctuation marks in abnormal ways,
there are also likely to be conflicts with any new tasks (such as the
planned adjunct tasks for the full database). A simple SGML (Standard
Generalized Markup Language) notation should be used to delimit the
markers for non-lexical phenomena in the CSR corpus. In addition, the
SRO specs are currently being changed by MADCOW and would likely be
incompatible with a CSR SRO implementation eventually anyway.

The original WSJ data is marked with SGML: begin sentence: <s>, end
sentence: </s>, begin paragraph: <p>, end paragraph: </p>, etc. (The
paragraph and sentence id's in the processed texts are modified begin
marks.)

The following constructs are the basis for the SGML-based transcription
approach:

    <function> words </function>      (function operates on these words)
    <function arg> words </function>  (function operates on these words
                                       using arg)
    <*ns_function>                    (function does not have a
                                       corresponding end mark)
    <*ns_function arg>                (function does not have a
                                       corresponding end mark)
    <paren> ... </paren>              (some group of SGML operators)

SGML will also make parsing of the .dot files easier--anything outside
of the SGML marks is a SNOR word. Anything inside of the SGML marks has
a special meaning. A misparsed mark in the MADCOW .sro format becomes a
new word to the recognizer. SGML is a robust standardized document
preparation language and is well suited for our purposes. If the
function names and descriptors are thought to be unwieldy, we can
either come up with shorter names or create editor macros for the
transcribers.

In the SGML version, SGML statements are always enclosed in "<>" and
can easily be separated from the lexical items. The character, "<",
must be escaped using a "\" if it is part of a lexical item.

2. Disfluencies:

   2.1 Mispronunciations

       If the prompt read, "He grew up in Belair." and the subject
       said, "He grew up in Blair."

       SGML: he grew up in belair

   2.2 Verbal Deletions

       SGML: The plane dropped quickly uh precipitously into the boiling ocean below

   2.3 False Starts and Spoken Word Fragments

       - Beginning of word truncation (missing fragment known)
         SGML: <*ns_missing_fragment missing-fragment> <*ns_spoken_fragment spoken_fragment>
       - Beginning of word truncation (missing fragment unknown)
         SGML: <*ns_missing_fragment> <*ns_spoken_fragment spoken_fragment>
       - End of word truncation (missing fragment known)
         SGML: <*ns_spoken_fragment spoken-fragment> <*ns_missing_fragment missing_fragment>
       - End of word truncation (missing fragment unknown)
         SGML: <*ns_spoken_fragment spoken-fragment> <*ns_missing_fragment>

       *Note: Word fragments are always grouped using the "paren" construct.
        Grouped fragments will be considered to constitute a complete
        word unless a "missing_fragment" contains a null argument.

3. Prosodic Markings

   3.1 Pauses

       SGML: text <*ns_insert pause> text

   3.2 Emphatic Stress

       SGML: text

       The text can also be part of a word: aspiration

   3.3 Lengthening

       SGML: text

       The text can also be part of a word: aspiration

4. Descriptive Markings of Speech and Non-Speech Events

   4.1 Non-speech Events

       SGML: The doctor said "double-quote <*ns_insert throat_clear> open wide "double-quote

   4.2 Descriptor Placement and Concurrent Events

       SGML: word word ... word
       SGML: the escaped convict ran for his life
       SGML: The plane narrowly escaped disaster as it took off

   4.3 Speech Style

       SGML: The town cryer screamed seven o'clock and all's well

5. Waveform Truncation

   SGML: the following mark need only be placed at the site of the
         truncation
   SGML: <*ns_waveform_truncation>

       - *Null waveform
             transcription would only contain the SGML segment

6. Utterance Identification

       text text text <*utterance_id utterance-ID>

   i.e. Los Angeles based Government Funding is used to picking up where banks leave off <*utterance_id 400c2001>

-----
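For the bracketed .dot notation of the main specification (Sections 4 through 6, not the deferred SGML proposal), a reader might sketch a line parser along the following lines. The function name parse_dot_line and the returned dictionary keys are hypothetical. The sketch assumes that descriptors, "~" marks, and words are separated by white space, and it does not interpret disfluency or prosodic marks.

```python
import re

# Trailing "(utterance-ID)": one space, then an 8-character ID in
# parentheses (Section 6).
UTT_ID_RE = re.compile(r"^(?P<text>.*) \((?P<utt_id>[^\s()]{8})\)$")

# Bracketed descriptor with optional "<", ">" or "/" marks
# (Sections 4.1 through 4.4).
DESCRIPTOR_RE = re.compile(r"\[<?/?(?P<name>[A-Za-z_]+)/?>?\]")


def parse_dot_line(line):
    """Split one .dot line into utterance ID, words, and descriptors."""
    match = UTT_ID_RE.match(line.rstrip("\n"))
    if match is None:
        raise ValueError("no trailing (utterance-ID) found")
    tokens = match.group("text").split()

    events = []  # descriptor names, in order of appearance
    words = []   # everything that is neither a descriptor nor a "~" mark
    for token in tokens:
        descriptor = DESCRIPTOR_RE.fullmatch(token)
        if descriptor:
            events.append(descriptor.group("name"))
        elif token not in ("~", "~~"):
            words.append(token)

    # "~~" alone marks a null waveform (Section 5); it is treated here
    # as truncated at both ends, which is an assumption of this sketch.
    null_waveform = tokens == ["~~"]
    return {
        "utt_id": match.group("utt_id"),
        "words": words,
        "events": events,
        "truncated_start": null_waveform or tokens[:1] == ["~"],
        "truncated_end": null_waveform or (len(tokens) > 1 and tokens[-1] == "~"),
    }


if __name__ == "__main__":
    example = ("[cross_talk/] The plane narrowly escaped disaster "
               "[/cross_talk] as it took off (400c2001)\n")
    print(parse_dot_line(example))
```

Start and end descriptors such as [cross_talk/] and [/cross_talk] are both reported under the same name, so a caller that needs the bracketed span would have to pair them itself.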