File sro_spec.doc, Revised 10/20/93.

ATIS SR Output (".sro") Transcription Conventions

The transcription is intended to be an orthographic, lexical transcription with a few details included that represent audible acoustic events (speech and nonspeech) present in the corresponding waveform files. The SRO transcriptions will be automatically mapped to lexical SNOR conventions for scoring of recognition systems. The extra marks contained in the SRO transcription aid in interpreting the text form of the utterance. The SRO transcription will be stored in the query's auxiliary file of type ".sro".

The transcription is intended to be quick and broad; transcribers should not agonize over decisions, but should treat their transcription as a rough guide that others may examine further for details. Transcriptions should be made in two passes: one pass in which words are transcribed, and a second in which the additional details (extraneous noises and prosodic marks) are added. Many phenomena (silences, noises, "uh"s) are easy to miss unless specifically attended to. It is recommended that transcribers have some background in phonetics and in linguistics, or that their training and preparation for the transcription task cover some basics in acoustic phonetics and in dialect and style variation.

1. Markings Required for Scoring.

1.1 Case

Transcriptions are case insensitive and all case information will be lost in the translation to the all uppercase SNOR conventions. Using all lower case for SRO conventions is recommended so that SRO files are immediately recognizable from SNOR and lexical SNOR files.

1.2 Spelling

Normal lexical items will be represented by their spellings in the normal way. NIST maintains a common lexicon of spellings of words used in the ATIS corpus. It is available via remote FTP to ssi.ncsl.nist.gov and should be consulted when in doubt about the spelling of a word. The file is located in the directory "madcow/logs" and is named "lexicon.doc.DATE", where DATE is the date of the latest update of the file.

Spellings which cannot be predicted from .sro conventions:

1.2.1 Number sequences

Number sequences (flight numbers, times, dates, aircraft types, dollar amounts, etc.) will be spelled out to reflect what was said ("flight six one three"; "seven thirty"; "august twenty first"; "seven forty seven"; "four hundred and ten dollars".)

Reminder: No hyphens will be used ("seven forty seven", not "seven forty-seven".)

Note: care should be taken to transcribe the digit "0" as "zero" or "oh", depending on what the speaker said.
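As a rough illustration of the rules in this subsection, the following Python sketch flags tokens that still contain digits or that join two words with a hyphen. The function name and the messages it reports are hypothetical and are not part of any NIST or MADCOW tool. It is only a checking aid: it cannot decide whether "0" was spoken as "zero" or "oh".

```python
import re

# Hypothetical checker, not an official tool.
DIGIT = re.compile(r"\d")
# A hyphen with letters on both sides, e.g. "forty-seven".
# Word fragments such as "fli-" or "-las" do not match this pattern.
INTERNAL_HYPHEN = re.compile(r"^[a-z']+-[a-z']+$")


def number_spelling_problems(transcription):
    """Return (token, reason) pairs for tokens that break section 1.2.1."""
    problems = []
    for token in transcription.split():
        if DIGIT.search(token):
            problems.append((token, "digit not spelled out"))
        elif INTERNAL_HYPHEN.match(token):
            problems.append((token, "hyphen between words"))
    return problems
```

For example, "flight 613 at seven forty-seven" yields two reports (for "613" and "forty-seven"), while "flight six one three" yields none.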

1.2.2 Letter sequences

Letter sequences occur in acronyms and abbreviations ("d f w"; "a p slash eighty"; "p m"; "c o"; etc.) Letters should be in lower case, separated by a space. Note that the determiner "a" and the letter "a" in "t w a" are not distinguished in these conventions.

Previous conventions indicated an exception to the above rule for "washington dc" in which there was no space between the "d" and "c". This exception never made sense and has not been used consistently in practice. In all future transcriptions it should NOT be treated as an exception and should always be transcribed as "washington d c".

[NIST has changed all occurrences of "dc" to "d c" in the MADCOW data they have distributed, so the "dc" form has never been used in official MADCOW data. It may, however, exist in the ATIS0 data.]

The AM and PM of times (e.g., "five thirty p m") will be treated as examples of letter sequences, i.e., lower case and separated by a space, with no periods.

If a speaker pronounces an acronym or abbreviation as a word, for example "den" or "bos", then these should be spelled out as words, rather than as "d e n" and "b o s".

1.2.3 Contractions

When a standard orthographic form exists for a contraction, it can be used to represent a contracted pronunciation.

Some pairs of words are commonly run together or contracted, but standard orthography doesn't include a representation of the contraction. In these cases, the two or more component words should be transcribed separately. Some examples are:

       Contracted   Transcribed
         wanna       want to
         wanna       want a
         gonna       going to
         hafta       have to
         useta       used to
         oughta      ought to
         sonova      son of a
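The table above could be kept as a lookup for checking drafts. The Python sketch below is only an illustration; the names CONTRACTION_EXPANSIONS and expansion_candidates are hypothetical. Because "wanna" has two possible expansions, the lookup returns candidates and leaves the choice to the transcriber.

```python
# Hypothetical lookup built from the examples in the table above.
CONTRACTION_EXPANSIONS = {
    "wanna": ["want to", "want a"],
    "gonna": ["going to"],
    "hafta": ["have to"],
    "useta": ["used to"],
    "oughta": ["ought to"],
    "sonova": ["son of a"],
}


def expansion_candidates(token):
    """Return the possible separate-word transcriptions for a run-together form."""
    return CONTRACTION_EXPANSIONS.get(token.lower(), [])
```

A token that is not in the table returns an empty list, meaning no expansion is suggested.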

1.2.4 Compound words

A compound word that might be written without a hyphen, such as "dinnertime" or "dinner time", should be transcribed as separate words unless it occurs in a standard dictionary without spaces. The existence of such a compound word in a dictionary doesn't preclude the use of its component words as a syntactic phrase, though; in this example the two words are used both ways:

seven p m is after noon, but it is not afternoon

1.3 Hyphenation

Hyphens will not generally be used; if the items on either side of a potential hyphen are both words, a space will be used instead of a hyphen. If one or both of the items is NOT a lexical item, neither a space nor a hyphen will be used, e.g., "nonstop" should be used, NOT "non-stop" or "non stop"; "round trip" should be used and NOT "round-trip"; "one way" should be used and NOT "one-way" or "oneway"; "nonsmoking" should be used and NOT "non-smoking".
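The preferred forms named above can be collected into a small table and applied mechanically. The Python sketch below is an illustration only; PREFERRED_FORMS and apply_preferred_forms are hypothetical names, and the table holds just the examples given in this section, not a complete list.

```python
import re

# Hypothetical table: variant spelling -> form preferred by section 1.3.
PREFERRED_FORMS = {
    "non-stop": "nonstop",
    "non stop": "nonstop",
    "round-trip": "round trip",
    "one-way": "one way",
    "oneway": "one way",
    "non-smoking": "nonsmoking",
}


def apply_preferred_forms(transcription):
    """Replace each listed variant, matching whole space-delimited items only."""
    for variant, preferred in PREFERRED_FORMS.items():
        pattern = r"(?<!\S)" + re.escape(variant) + r"(?!\S)"
        transcription = re.sub(pattern, preferred, transcription)
    return transcription
```

Matching on whitespace boundaries keeps the substitution from touching longer tokens that merely contain one of the variants.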

1.4 Punctuation

This transcription will not contain normal English punctuation and will consist of lowercase characters except for proper nouns and individual letters. Conventional punctuation, including commas, periods, and question marks, will not be used. Periods will be used to indicate silent pauses (see 2.2) within an utterance, and should only occur following a space. Commas are used to indicate intonational separation; exclamation points are used to indicate emphatic stress.

Periods, question marks or exclamation points should NOT be used to indicate the end of a sentence.

1.5 Mispronunciations

Obviously mispronounced words that are nevertheless intelligible will be marked with asterisks (e.g., *transportation* for ``transportetation''). Asterisks should not be used to indicate pronunciations of words that represent normal dialectal variation (e.g., "warshed" for "washed" or "cah" for "car") or stylistic variation (e.g., "bout" for "about" or "wanna" for "want a" or for "want to"). If the speaker would not consider the pronunciation an error, the asterisk notation should not be used. Obviously, there may be some clear and some unclear cases; transcribers should use their best judgment. A background in phonetics is helpful for transcribers.

Similarly, glottalization at the onset or offset of a vowel is not transcribed.

If some of the word is left off, resulting in a word fragment (q.v.), and that word fragment is also mispronounced, then both the fragment symbol (-) and the mispronunciation symbols (*) should be used, with the "*" outside the "-".

Example: "flights from *den(ver)-* denver to dallas"

1.6 Verbal Deletions

Words verbally deleted by the subject will be enclosed in angle brackets. Verbal deletion means words that were spoken by the user but that, in the opinion of the transcriber, are superseded by subsequent speech explicitly (e.g., "show <flights> <i> <mean> fares") or implicitly (e.g., "show me the <fares> flights to Boston").

Verbal deletions occur any time there is a repetition or restart. In repetitions, one or more words are repeated, and there may or may not be extra material inserted into the repetition, for example:

show me <the> <flights> the flights to boston
show me <the> <flights> the nonstop flights to boston

In restarts, words are not repeated, but the speaker changes direction, as in:

<show> <me> <the> how many flights go to boston

Note that EACH word in a verbal deletion should be enclosed in angle brackets.
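Because each deleted word carries its own pair of angle brackets, the deleted words can be separated from the retained words token by token. The Python sketch below illustrates this; the function name is hypothetical, and this is not the official mapping used for scoring.

```python
import re

# One whole token of the form <word>, with no spaces or nested brackets.
DELETED = re.compile(r"^<[^<>\s]+>$")


def split_verbal_deletions(transcription):
    """Return (text without deleted words, list of deleted words)."""
    kept, deleted = [], []
    for token in transcription.split():
        if DELETED.match(token):
            deleted.append(token[1:-1])  # drop the angle brackets
        else:
            kept.append(token)
    return " ".join(kept), deleted
```

A span bracketed as a unit, such as "<show me the>", does not match the per-word pattern and is left in the retained text, where the stray brackets make the error easy to spot.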

1.7 Word Fragments

Word fragments, i.e. instances in which the speaker did not complete a word, will be marked with a hyphen. As much of the word as is audible will be transcribed, followed immediately by the hyphen:

please show fli- flights from dallas

Though these represent "verbal deletions" as described above, the hyphen occurring before (or after) a space is sufficient to cue this fact. Fragments should therefore not be enclosed in angle brackets, as this just adds work for the transcribers. That is, the above example should NOT be "please show <fli-> flights from dallas".

Fragments include cases in which only an initial consonant or vowel is heard:

please show f- flights from dallas

This may sometimes be a judgement call on the part of the transcriber. Within-word hesitations may be transcribed as:

dall:as (indicating lengthening of the "l") (see section 2.4)
dal- [um] -las (indicating a within-word interruption; rare)
dal- . -as (indicating a silence interrupting a word)

The transcription will specify the intended word if such is obvious to the transcribers and is NOT obvious from context (this is of course a judgement call on the part of the transcriber). The completion of the presumed intended word will be enclosed in parentheses, BEFORE the hyphen, as in:

please show flights <from> de(nver)- from dallas

If the word fragment is a mispronounced attempt at a word, then both the fragment symbol (-) and the mispronunciation symbols (*) should be used, with the "*" outside the "-".

Example: "flights from *den(ver)-* denver to dallas"
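The fragment notation is regular enough to be picked apart with a single pattern. The Python sketch below is an illustration under stated assumptions: the function name and the returned keys are hypothetical, and only fragments that end in a hyphen are handled (continuations such as "-las" are not).

```python
import re

# optional "*", the audible part, optional "(completion)", "-", optional "*"
FRAGMENT = re.compile(r"^(\*?)([a-z']+)(?:\(([a-z']+)\))?-(\*?)$")


def parse_fragment(token):
    """Describe a word-fragment token, or return None if it is not one."""
    match = FRAGMENT.match(token)
    if match is None:
        return None
    open_star, heard, completion, close_star = match.groups()
    return {
        "heard": heard,
        # The intended word is known only when a completion was transcribed.
        "intended": heard + completion if completion else None,
        "mispronounced": bool(open_star and close_star),
    }
```

For instance, "de(nver)-" is read as the audible part "de" with intended word "denver", and "*den(ver)-*" is additionally reported as mispronounced.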

1.8 Non-Speech Acoustic Events

Acoustic events enclosed in square brackets can come from the following set:

-Filled Pause ([uh], [um], [er], [ah], [mm])
-Speaker other ([laughter], [cough], [grunt], [throat_clear], [mumbling], [unintelligible])
-Nonspeaker other ([phone], [paper_rustle], [door_slam])

Note that while the exact specification of the type of acoustic event is subjective, these events MUST be marked in the correct location in a transcribed utterance. It is often difficult to localize these events; transcribing the utterance first, and listening for these events in a second pass is the correct procedure.

Note that any term can be used inside the brackets, but there should be no spaces inside brackets; use an underscore to connect words.

Note that the filled pauses represent acoustic events similar acoustically and phonetically to speech. If possible, try to limit these to the set on the list, so that those interested in these events can find them easily. If others occur, contact the MADCOW committee via your MADCOW representative.

For noise events that occur over a span of one or more words, the transcriber should:

These guidelines are compatible with those used for the DOT files associated with the Wall Street Journal task. That task specifies the following set of non-speech markers, which for compatibility, transcribers of .sro files are encouraged to use:
      [chair_squeak]
      [cough]
      [cross_talk]
      [door_slam]
      [grunt]
      [laughter]
      [lip_smack] (use ONLY if EXCEPTIONALLY loud!)
      [loud_breath] (do NOT mark audible but low-level breath noises)
      [paper_rustle]
      [phone_ring]
      [sigh] (only if the amplitude is comparable to surrounding speech)
      [throat_clear]
      [tongue_click] (use ONLY if EXCEPTIONALLY loud!)
      [unintelligible]
      [sniff]
      [tap]
      [noise]
(The following speech sounds are also transcribed in the CSR and in the .sro transcriptions)
      [er]
      [mm]
      [uh]
      [um]
(The following speech style markers are used in CSR, and considered optional in the .sro transcriptions).
      [loud]
      [soft]
      [whisper]
Note: Acoustic events such as inhalation, exhalation, tongue clicks, lip smacks, and breath noise will not be transcribed if they are low level and non-intrusive.
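A draft transcription can be checked against the bracket rules in this section. The Python sketch below is an illustration only; the names are hypothetical, and RECOMMENDED_EVENTS is simply copied from the compatibility list above. Since any term may be used inside brackets, a label outside the list is reported for review and is not treated as an error.

```python
import re

# Labels copied from the compatibility list in this section.
RECOMMENDED_EVENTS = {
    "chair_squeak", "cough", "cross_talk", "door_slam", "grunt", "laughter",
    "lip_smack", "loud_breath", "paper_rustle", "phone_ring", "sigh",
    "throat_clear", "tongue_click", "unintelligible", "sniff", "tap", "noise",
    "er", "mm", "uh", "um",
    "loud", "soft", "whisper",
}

EVENT = re.compile(r"\[([^\[\]]*)\]")


def check_events(transcription):
    """Return (label, remark) pairs for bracketed events that need a second look."""
    report = []
    for label in EVENT.findall(transcription):
        if not label or any(ch.isspace() for ch in label):
            report.append((label, "malformed"))  # empty, or spaces inside brackets
        elif label not in RECOMMENDED_EVENTS:
            report.append((label, "not in recommended set"))
    return report
```

With this check, "[door slam]" is reported as malformed because of the space, and the fix is to write "[door_slam]".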

2. Markings Helpful for Interpretation.

These markings should be used when salient; transcribers should not assume that they are optional. However, the transcriber should not agonize over these decisions. If in doubt, leave it out. The transcription is basically at the lexical level, and should be done relatively quickly. The following are intended to be helpful markings that should be used when the phenomena are very clear.

2.1 Intonational Boundaries

A comma will be used to indicate an intonational separation. It is preceded and followed by a space.

i'd like to fly on delta , first class , july second

An intonational separation may be achieved by: Boundary tones at the ends of the transcribed complete utterance or before a significant silence, as indicated in 2.2 and when transcribed, are so often redundant that they need not be transcribed. The use of the comma is intended to disambiguate and to make more interpretable utterances that would otherwise be either difficult or ambiguous, e.g.,

2.2 Silent Pauses

Silent pauses will be marked with a period (``.''). The use of the period indicates a significant silence, i.e., one that is clearly noticeable by listening, and which is significantly longer than a silence associated with a stop consonant closure for the rate of speech used by the speaker. Example:

show me the . flights to boston

Previous SRO conventions dictated that "." be used for a one-second pause, ". ." for a two-second pause, etc. This is no longer in effect: a "." may be used to indicate a significant duration of silence, without giving further information on its duration. Marking duration was hard for transcribers to do, was inconsistently applied, and is more appropriately done by automatic methods. Thus in the above example, the silence could be 400 ms or one minute.
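Transcriptions made under the earlier convention may contain runs of periods. This document does not say whether such files should be converted; assuming a site wants to bring them in line with the current rule, a run could be collapsed to a single period as in the Python sketch below. The function name is hypothetical.

```python
def collapse_pauses(transcription):
    """Reduce each run of consecutive "." tokens to a single "."."""
    out = []
    for token in transcription.split():
        if token == "." and out and out[-1] == ".":
            continue  # drop the repeated pause mark
        out.append(token)
    return " ".join(out)
```

The duration information carried by the old notation is lost in this step, which matches the current rule that a period gives no information about duration.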

2.3 Emphatic Stress

An exclamation mark (``!'') before a word or syllable indicates emphatic stress. This includes stress beyond what might normally occur based on lexical and syntactic factors. This is used sparingly and subjectively. Note that the "!" only precedes a word. Example:

show me only !delta flights

2.4 Lengthening

Lengthening, typically vowel lengthening, will be indicated by a colon (``:'') placed immediately after the lengthened sound. This is used sparingly and subjectively. Note that ":" always follows some sound; if it occurs within a word, it is not followed by a space. Examples:

show me the: flights to boston
which flights ha:ve economy fares

Lengthenings before silences are so common that they are difficult to notice, and marking them would make the transcriber's job much more difficult than it is intended to be. They therefore need not be marked before the end of the utterance or before a transcribed silence.
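The marks described in section 2 (comma, period, "!" and ":") are all single characters, so a reader interested only in the words can remove them mechanically. The Python sketch below illustrates this. It is NOT the NIST mapping to lexical SNOR, whose rules are not given in this document; the function name is hypothetical, and the markings of section 1 (angle brackets, hyphens, asterisks, square brackets) are left untouched.

```python
def strip_prosodic_marks(transcription):
    """Remove the section 2 marks: ',' and '.' tokens, '!' and ':' characters."""
    kept = []
    for token in transcription.split():
        if token in {",", "."}:
            continue  # intonational boundary or silent pause
        token = token.replace(":", "").replace("!", "")
        if token:
            kept.append(token)
    return " ".join(kept)
```

Applied to "show me the: . flights , !delta", this returns "show me the flights delta".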

3. Truncated Waveforms

3.1 Marking of transcription

If a .wav file is truncated due to a recording error by the system or by the failure of the subject to press/depress the push-to-talk button at the proper times, the following notation in the corresponding .sro file is to be used:

* If the wizard responded to a totally truncated utterance with an error message, and this "empty" interchange is retained in the .log file then the .sro transcription should consist of a blank new-line and NOT a "~~". The utterance will then not be annotated as a "trunc-utt". The purpose of this is to distinguish those cases where dialogue coherence has been maintained, from those cases where the system may have gotten out of sync with what has been recorded in the .wav file.

However, the transcribers are typically not looking at the .log files, and hence do not know what the wizard did. Sites that still produce truncated utterances are strongly encouraged to correct the data collection mechanism to avoid this problem. In the meantime, transcribers at these sites may have to consult the .log files to resolve some instances.

For cases in which the user pushed the button and then said nothing, the corresponding .sro file should be a blank line, with no indication of truncation.

4. Speech style.

Speech style is considered a level of detail that need not be included in the SRO transcriptions. However, sites that want to include it should use the conventions for these markers that are described in the documentation for the .dot files for the Wall Street Journal task. (See section 1.8).

5. Autocompletion

Autocompletion files, in conjunction with gnuemacs tools, can greatly increase the transcriber's efficiency. SRI does this via a file that can be maintained and updated by the transcriber. The software can be obtained by requesting it from SRI, via your MADCOW representative.