Design Specifications for the Transcription of Spoken Language


This document describes a convention for formatting and structuring transcripts of spoken language. The convention is intended to apply equally well to all forms of spoken language, and to support varying types and amounts of detail in transcription. It is also designed to support an efficient user interface for the creation of transcript data, as well as clear, stable and unambiguous interpretation by users of transcripts.

In general, it is expected that both creators and users of transcripts will make use of digital recordings of the speech in conjunction with the transcripts, so an important feature of the convention is to provide consistent and reliable references to time offsets within the recordings. But for applications in which accurate reference to the acoustic signal is not required, the time-offset information can readily be ignored or excluded, without any impact on the presentation of relevant data (i.e. the text of utterances and any structural or featural information associated with the utterances).

The convention is based on a simple skeletal structure of SGML markup, allowing extensibility to support varying levels of detail, as well as parsability to ensure both formal adherence to an established specification and overall coherence within each transcript document. The formal specification is simply the SGML Document Type Definition (DTD), which dictates the character encoding to be used, the inventory of units and structures that make up the transcript document, and the organization of these units and structures within the document.

This design specification is organized into two sections: General Properties, and Properties of transcribed speech. The former describes the elements of the design that apply equally to all transcripts, while the latter provides details that depend on such issues as character encoding, orthographic practice, definitions of "word", etc., which differ from one language to another.

General Properties

All transcripts are derived from recordings of speech, so the boundaries of a transcript are, at maximum, the boundaries of a particular recording. For applications that involve use of the acoustic signal with the transcript, identification of the acoustic data (i.e. a file name or other reference) is provided. The transcript may exclude one or more portions of the recording, and the excluded portion(s) may or may not have an explicit representation in the transcript, depending on the needs of particular applications where the transcript is to be used.

All transcripts are based on a fundamental unit of recorded speech behavior, which will be referred to as a "speaker turn", or simply "Turn". The Turn may range in length from a brief conversational interjection to an extended speech or lecture. Whatever its length, its content represents a single act of communication by one individual, bounded either by the limits of the transcript or by other Turns within the transcript.

In transcripts where two or more speakers are present, Turns are labeled to identify each speaker uniquely. The nature of the labeling can range from arbitrary and generic ("A", "B", etc) to featural ("M1", "F1", etc) to specific ("BClinton", "PJennings", etc). In general, the speaker labels applied to Turns should reflect relevant information about the speaker, such as gender; in the case of multichannel recordings with speakers divided by channel, the channel identification is reflected in the speaker labels as well.

For some applications, sequences of Turns may be grouped into larger structural units, such as transactions or stories. In addition, some applications may require that Turns be subdivided according to various events or conditions that occur within Turns. These varying needs can be met by adding levels of structure to the SGML markup above or below the level of the Turn. In all cases, the Turn remains the fundamental unit whose basic representation and function are consistent across all transcripts.

All time offsets in a transcript are given a single, uniform representation, called a "Breakpoint" tag. In order for the correlation of transcripts and recordings to be consistent and reliable across all uses of this specification, each Turn must begin and end with a Breakpoint tag to define the temporal extent of the Turn in the recording. The Turn can contain additional Breakpoint tags, to associate time offsets with other notations within the Turn, or simply to split a long Turn into chunks that are more convenient for auditing, transcribing or processing.
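The Breakpoint requirement can be illustrated with a minimal Python sketch. The `Breakpoint` class, the flat list representation of a Turn, and the function names are hypothetical and not part of the specification; the check that offsets never decrease is an added assumption, since the text only requires that a Turn begin and end with a Breakpoint.

```python
from dataclasses import dataclass
from typing import List, Tuple, Union


@dataclass
class Breakpoint:
    sec: float  # time offset into the recording, in seconds


# Hypothetical model: a Turn's content is a flat list of Breakpoints and text.
TurnItem = Union[Breakpoint, str]


def turn_is_bounded(items: List[TurnItem]) -> bool:
    """True if the Turn begins and ends with a Breakpoint.

    Also requires that Breakpoint offsets never decrease (an assumption
    made for this sketch, not a rule stated in the specification).
    """
    if len(items) < 2:
        return False
    if not isinstance(items[0], Breakpoint):
        return False
    if not isinstance(items[-1], Breakpoint):
        return False
    times = [item.sec for item in items if isinstance(item, Breakpoint)]
    return all(a <= b for a, b in zip(times, times[1:]))


def turn_extent(items: List[TurnItem]) -> Tuple[float, float]:
    """Temporal extent of a bounded Turn: (first offset, last offset)."""
    return items[0].sec, items[-1].sec
```

A Turn that is split into two chunks by an internal Breakpoint would then be modeled as `[Breakpoint(1.0), "first chunk", Breakpoint(4.2), "second chunk", Breakpoint(7.5)]`.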

If portions that are excluded from the transcription process are to be listed explicitly as such in the transcript, these portions must be identified by some alternative SGML tagged unit, whose contents are simply the two Breakpoint tags that define its temporal extent.

In cases where two or more speakers are recorded in conversation, it is expected that their respective Turns will occasionally overlap in time. Since each Turn contains an initial and final Breakpoint time offset, this overlap already has a direct (but minimal) representation without further notation. In applications where it is important to identify overlapping speech in more careful detail (e.g. where two voices overlap on a single audio channel), the portions of each Turn affected by the overlap can be bounded by Overlap tags, and these can be bound to additional Breakpoint tags within each Turn where necessary.
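The minimal representation of overlap described above follows directly from the initial and final Breakpoints of two Turns. The sketch below shows that computation; the function name and the `(start, end)` tuple representation are hypothetical.

```python
from typing import Optional, Tuple

Extent = Tuple[float, float]  # (initial Breakpoint, final Breakpoint), in seconds


def overlap_interval(a: Extent, b: Extent) -> Optional[Extent]:
    """Return the time span shared by two Turns, or None if they do not overlap."""
    start = max(a[0], b[0])
    end = min(a[1], b[1])
    # Turns that merely touch at a single instant are not treated as overlapping.
    return (start, end) if start < end else None
```

This covers only the temporal extent of an overlap. Marking which words fall inside the overlap still requires Overlap tags within each Turn.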

In order to clarify this set of conventions, a few different transcription scenarios will be considered in detail below. In each scenario, there is a brief summary of requirements imposed by the nature of the recording and the needs of an application to be served by the transcript. Following the summary, there is a table that summarizes the transcript specifications in terms of SGML units; these would be implemented by rendering the contents of the table into a corresponding DTD. Following the table, there is an explanation of the table notations, and an example of the resulting transcript format, together with comments about the format.

Scenario 1: Two-channel telephone conversation

Summary of requirements

Each recording represents (part of) a telephone conversation; the microphone signal from each end of the telephone connection is recorded on a separate channel. It is possible that more than one speaker may be heard on a given channel, either because one telephone handset is shared among two or more people, or because two or more handsets are being used simultaneously at one end of the connection. In the latter case, voices may overlap on one channel, and regions of such overlap should be marked in terms of both textual and temporal extent. (It may be that the textual content of overlap regions does not need to be transcribed for one or all of the speakers involved, but Breakpoint and Overlap tags are needed to indicate where the overlap occurs.) When speech on one channel overlaps with speech on the other, no further notation is needed beyond the Breakpoints showing the bounds of the Turns. Speaker labels indicate channel and distinguish among people on the same channel.

Table of Hub5 specifications


SGML Unit   Attributes          Contents
Call        CallID              Turn+
Turn        SpkrID=[AB][1-9]*   Time,((OV | #TRANSDATA?),Time)+
Time        sec=                (empty)

Explanation of Hub5 table

The second and third columns of the table make use of a type of regular expression notation that can be mapped directly to the syntax of the SGML DTD. The first row defines a global unit for the transcript, the "Call"; it has a "CallID" attribute to identify the corresponding acoustic data file, and its contents comprise a set of one or more Turns. The second row indicates that each Turn has a "SpkrID" attribute, whose value indicates the channel the Turn is on (A or B) and, if necessary, a digit to distinguish different speakers on the same channel. Each Turn starts with a Breakpoint (the "Time" unit in the table), followed by one or more pairings, each consisting of either an Overlap (OV) unit or optional transcription text, followed by a Breakpoint. The Overlap unit simply serves to delimit those portions of time and transcription text (#TRANSDATA) that are involved in overlapping speech on the same channel; as formulated above, an OV unit must always be bounded by Time tags, though it may be empty of text content.
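Because the table notation is close to ordinary regular expressions, the Turn row can be checked with a short Python sketch. This is an illustration, not a replacement for parsing against the DTD: the one-letter encoding (T for Time, O for OV, X for transcription text) and the function names are hypothetical.

```python
import re
from typing import List

# SpkrID: channel letter A or B, optionally followed by digits 1-9.
SPKR_ID = re.compile(r"[AB][1-9]*")

# Turn contents: Time,((OV | #TRANSDATA?),Time)+
# encoded as T = Time, O = OV, X = transcription text.
TURN_CONTENT = re.compile(r"T(?:(?:O|X?)T)+")

CODES = {"Time": "T", "OV": "O", "text": "X"}


def encode(kinds: List[str]) -> str:
    """Map a sequence of item kinds to the one-letter encoding."""
    return "".join(CODES[kind] for kind in kinds)


def valid_turn(spkr_id: str, kinds: List[str]) -> bool:
    """True if the SpkrID and the item sequence both fit the Turn row."""
    return bool(SPKR_ID.fullmatch(spkr_id)) and bool(
        TURN_CONTENT.fullmatch(encode(kinds))
    )
```

Under this encoding, two consecutive Time tags with nothing between them are valid, because the transcription text in each pairing is optional.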

The definition of what constitutes the text content of a transcription (i.e. #TRANSDATA) is language dependent. It will be discussed in greater detail under Properties of transcribed speech.

Sample Hub5 transcripts:

Comments on the sample transcripts

Long pauses within a Turn (1 sec or more) have been indicated by means of two consecutive Time tags without intervening text; a single Time tag has been inserted within the text of a Turn where there is a shorter pause (0.5 sec or more). If this level of accuracy in delimiting the speech regions of the signal is not needed, it may still be useful for the transcriber to have Time tags inserted at convenient points within a long Turn (e.g. at intervals of 8 sec if a Turn is much longer than that), so that there are manageable segments defined for auditing and repeated playback.
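As a rough illustration of splitting a long Turn into manageable segments, the sketch below generates evenly spaced time offsets. The function name and the strict even spacing are assumptions made for the example; a transcriber would more likely place Time tags at natural pauses near these points.

```python
from typing import List


def chunk_points(start: float, end: float, max_len: float = 8.0) -> List[float]:
    """Return offsets that split [start, end] into pieces no longer than max_len."""
    points = [start]
    t = start + max_len
    while t < end:
        points.append(t)
        t += max_len
    points.append(end)
    return points
```

A Turn shorter than `max_len` yields only its two bounding offsets, so no extra Time tags would be inserted.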

Words preceded by an initial "/" are to be classified as proper nouns; this is a language-dependent convention for token classification, to be discussed below. No punctuation was used by the transcriber in this case, except for a question mark where appropriate. (This sample was adapted from an excerpt produced in accordance with previous callhome conventions.) It would be possible to define other forms of punctuation in the language-specific portion of the DTD.

Scenario 2: Single-channel broadcast news

Summary of requirements

In recordings of news broadcasts, there are numerous speakers and numerous recording conditions represented. The Turn remains a fundamental unit, but sets of Turns are grouped together into larger topical units of two basic types: "reports" and "filler" (i.e. news stories proper, and verbal exchanges that serve as transitions between stories). Also, it may be necessary for a Turn to be subdivided according to the occurrence of selected background conditions, such as music being played while someone is speaking. Some portions of a broadcast are to be excluded from the transcript (e.g. commercials), and their position and temporal extent are to be noted. Speakers should be identified by name where possible, or by gender and a unique index where the name is not known. We expect some overlapping speech, and both the textual and temporal extent should be noted.

Table of Hub4 specifications

SGML Unit   Attributes                          Contents
Episode     Filename, Program, Language         Section+
Section     Type, startTime, endTime            (Turn*|Comment?)
Turn        Speaker, Sex, startTime, endTime    (Time?,(#PCDATA|Foreign|Unclear|Overlap))+
Overlap     startTime, endTime                  (Time?,(#PCDATA|Foreign|Unclear))+
Foreign     Language                            (#PCDATA|Unclear)+
Unclear     (none)                              (#PCDATA)*
Time        sec                                 EMPTY

Explanation of Hub4 table

The "Section" unit comprises a topically uniform portion of the broadcast; concatenated together, the Sections should represent the entire broadcast, which is called the "Episode". The attribute value "Type=nontrans" on a Section indicates that it contains no transcription, i.e. no Turns, though it may contain an optional Comment tag with some commentary. A Turn consists of one or more items, each of which is character data (#PCDATA), Foreign text, Unclear text or an Overlap, optionally preceded by a Time tag. Regions within a Turn that overlap with speech from another Turn are marked off using the Overlap tag. Note that Overlap tags, like Section and Turn tags, are bounded by startTime and endTime attributes. Optional Time tags can be inserted in a Turn or Overlap to help keep track of the time flow in the transcript file.
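Two of the constraints described here lend themselves to a simple consistency check: Sections should join end to end to cover the Episode, and a Section with Type=nontrans should contain no Turns. The sketch below is a hypothetical illustration; the `Section` class, the tolerance value, and the Type spellings "report" and "filler" are assumptions (only "nontrans" appears in the text as an attribute value).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Section:
    type: str    # e.g. "report", "filler" (assumed spellings) or "nontrans"
    start: float  # startTime in seconds
    end: float    # endTime in seconds
    turns: List[str] = field(default_factory=list)  # placeholder for Turn content


def check_episode(sections: List[Section], tol: float = 1e-3) -> List[str]:
    """Return a list of problems found; an empty list means no problems."""
    problems = []
    # Each Section should start where the previous one ended.
    for prev, cur in zip(sections, sections[1:]):
        if abs(cur.start - prev.end) > tol:
            problems.append(f"gap or overlap between {prev.end} and {cur.start}")
    # Untranscribed Sections must not contain Turns.
    for sec in sections:
        if sec.type == "nontrans" and sec.turns:
            problems.append(f"nontrans section at {sec.start} contains Turns")
    return problems
```

A check of this kind complements DTD validation, since the DTD constrains structure but cannot compare the numeric values of startTime and endTime attributes.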

Sample Hub4 transcripts:

Properties of transcribed speech

There are three basic elements that make up the language-specific (transcription content) specification of a transcript:

  1. a definition of lexical tokens
  2. a definition of token separators (white space and punctuation)
  3. a definition of token classes

These definitions establish the possible character patterns that make up the transcription text proper (i.e. the #PCDATA portions of the DTD specification): which characters make up lexical tokens (as well as non-lexical verbalizations, or "non-lexemes", e.g. "um"), and which characters divide tokens. Among the tokens, it will generally be useful to define particular subsets having peculiar qualities, such as proper nouns, non-lexemes, alphabetic strings ("FBI"), etc. A special set of characters, not otherwise used in rendering the tokens, is applied to identify members of each class.

The list (and explanation) of these special characters is found in the document:

LDC Transcription Conventions