Annotated DTD for ICSI Meeting Corpus Transcripts

The ".mrt" files contain both the transcript of a meeting and metadata about it. This file describes their format and includes examples for most entries.

More information is available at

For questions or comments, email to

<!ELEMENT Meeting  (Preamble,Transcript)>
<!ATTLIST Meeting  Session CDATA #REQUIRED
                   DateTimeStamp CDATA #REQUIRED
                   Version CDATA #IMPLIED
                   VersionDate CDATA #IMPLIED>


<Meeting Session="Bmr012" DateTimeStamp="2001-02-01-1500" Version="1" VersionDate="Dec 11 2002">

The Session attribute specifies the location (B for the ICSI meeting room in Berkeley), the meeting type (mr for a Meeting Recorder meeting), and a unique three-digit number. For a complete description of the naming conventions, see naming.txt or naming.html in the doc directory.
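As an illustration of the three parts of a Session value, the following minimal Python sketch splits "Bmr012" into location, meeting type, and number. The regular expression (one uppercase letter, a lowercase type code, three digits) and the function name parse_session are assumptions based only on the description above; naming.txt remains the authority on the naming conventions.

```python
import re

# Assumed pattern: one uppercase location letter, a lowercase meeting-type
# code, and a three-digit number, e.g. "Bmr012". See naming.txt for the
# authoritative conventions.
SESSION_RE = re.compile(r"^([A-Z])([a-z]+)(\d{3})$")


def parse_session(session):
    """Split a Session value into (location, meeting_type, number)."""
    match = SESSION_RE.match(session)
    if match is None:
        raise ValueError(f"unrecognized Session value: {session!r}")
    location, meeting_type, number = match.groups()
    return location, meeting_type, int(number)


print(parse_session("Bmr012"))
```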

The DateTimeStamp is the start date and time of the meeting. The format is yyyy-mm-dd-hhmm, where hh is the hour on a 24-hour clock.
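The yyyy-mm-dd-hhmm format maps directly onto a strptime pattern. The sketch below uses Python's standard datetime module; the function name parse_stamp is only illustrative, and no time zone is attached because the format does not carry one.

```python
from datetime import datetime


def parse_stamp(stamp):
    """Parse a DateTimeStamp such as "2001-02-01-1500" (yyyy-mm-dd-hhmm)."""
    return datetime.strptime(stamp, "%Y-%m-%d-%H%M")


start = parse_stamp("2001-02-01-1500")
print(start.isoformat())  # 2001-02-01T15:00:00
```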

The Version and VersionDate indicate the version of the transcript.

<!ELEMENT Preamble (Notes|Participants|Channels)*>

The Preamble contains metadata about the participants and the channels, along with free-form notes about the meeting.

<!ELEMENT Notes    (#PCDATA)>

The Notes element contains free-form notes. Often, these include information about problems with the meeting (e.g. dead batteries), participant information (e.g. mn017 left early), and tasks in the meeting (e.g. simultaneous digits).

The names of participants have been replaced with their speaker IDs in the Notes section.

<!ELEMENT Participants (Participant)*>

The Participants element holds information about the participants of the meeting. By "Participant", we simply mean any person who spoke during the meeting. The DB tag names the speaker database that stores more information about the participants.

<!ELEMENT Participant EMPTY>
<!ATTLIST Participant  Name ID  #REQUIRED
                       Channel IDREF #IMPLIED
                       Seat CDATA #IMPLIED>


<Participant Name="me018" Channel="chan0" Seat="9"/>
<Participant Name="mn005" Channel="chanB"/>
<Participant Name="mn014"/>

The Name is the unique user tag. See naming.txt or naming.html in the doc directory for details on the naming convention, and the speaker database (icsi1.spk) for additional information about each speaker.

The Channel is a unique tag indicating the primary near-field channel on which the participant can be heard (see below). If the participant did not use a near-field microphone, then the Channel tag will be absent.

The Seat indicates the seat in which the participant started the meeting. The seats are numbered clockwise starting at the door (see seatingchart.txt in the doc directory). We only began recording seat locations about halfway through the collection process, so many meetings do not have Seat tags.
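Because Channel and Seat are both optional, code that reads the Participant entries has to allow for missing values. The following Python sketch uses the standard xml.etree.ElementTree module to build a lookup table from the three example entries above. The wrapping Participants element in the sample and the function name participant_table are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# The three example Participant entries, wrapped in a Participants element.
SAMPLE = """<Participants>
  <Participant Name="me018" Channel="chan0" Seat="9"/>
  <Participant Name="mn005" Channel="chanB"/>
  <Participant Name="mn014"/>
</Participants>"""


def participant_table(participants_elem):
    """Map each speaker ID to its channel and seat (None when absent)."""
    table = {}
    for p in participants_elem.iter("Participant"):
        table[p.get("Name")] = {
            "channel": p.get("Channel"),  # None: no near-field microphone
            "seat": p.get("Seat"),        # None: seat was not recorded
        }
    return table


table = participant_table(ET.fromstring(SAMPLE))
for name, info in table.items():
    print(name, info)
```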

<!ELEMENT Channels (Channel)*>

<!ELEMENT Channel EMPTY>
<!ATTLIST Channel Name ID #REQUIRED
                  Mic CDATA #IMPLIED
                  Transmission CDATA #IMPLIED
                  Gain CDATA #IMPLIED
                  AudioFile CDATA #IMPLIED>


<Channel Name="chan0" Mic="c1" Transmission="w1" Gain="9" AudioFile="chan0.sph"/>
<Channel Name="chanE" Mic="c2" Transmission="u1" Gain="9" AudioFile="chanE.sph"/>

The Channel contains information about the microphones. The Name is a unique identifier for the channel. The Mic is an identifier for the type of microphone (e.g. c1 is a Crown Headset Microphone, c2 is a Crown PZM desktop microphone). Transmission indicates how the microphone is connected to the recording system (e.g. w1 is a wireless Sony system, j1 is a wired jack). See naming.txt or naming.html for details.

The Gain is a software recording level setting. Note that some early meetings did not indicate the gain setting. In these cases, a default gain (which is probably incorrect) is listed in this tag, and mention is made in the Notes section.

Finally, the AudioFile contains the file name in which the data is stored. For the ICSI meeting corpus, the files are in NIST Sphere/audio format and have been compressed with the "shorten" program.

<!ELEMENT Transcript (Segment)*>
<!ATTLIST Transcript StartTime CDATA #IMPLIED
                     EndTime CDATA #IMPLIED>

<!ELEMENT Segment (#PCDATA |
                   Emphasis |
                   Pronounce | 
                   Foreign | 
                   Pause |
                   Comment |
                   VocalSound |
                   NonVocalSound |
                   Uncertain )*>
<!ATTLIST Segment StartTime CDATA #REQUIRED
                  EndTime CDATA #REQUIRED
                  Participant IDREFS #IMPLIED
                  Channel IDREFS #IMPLIED
                  CloseMic (true | false) 'true'
                  DigitTask (true | false) 'false'>


<Segment StartTime="331.150" EndTime="332.339" Participant="me011">
  On <Emphasis> which </Emphasis> tool?
</Segment>

Each Segment must identify a start time and an end time measured in seconds from the start of the audio file. If the segment represents sounds from a particular participant (e.g. speech), then the Segment must include a Participant attribute. The value of the Participant attribute must match the Name attribute from the Participant tag (see above). The Channel defaults to the Channel specified in the Participant tag. In unusual cases, the Channel may specify a channel different from the default channel (e.g. a participant temporarily uses another participant's microphone).
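The channel fallback described above can be sketched in Python with the standard xml.etree.ElementTree module. The sample document is a cut-down illustration: the channel name "chan1" for me011 is hypothetical, and the sketch assumes each Segment names a single participant and, at most, a single channel.

```python
import xml.etree.ElementTree as ET

# Cut-down illustrative document; "chan1" is a hypothetical channel name.
SAMPLE = """<Meeting Session="Bmr012">
  <Preamble>
    <Participants>
      <Participant Name="me011" Channel="chan1"/>
    </Participants>
  </Preamble>
  <Transcript>
    <Segment StartTime="331.150" EndTime="332.339" Participant="me011">
      On <Emphasis> which </Emphasis> tool?
    </Segment>
  </Transcript>
</Meeting>"""


def segments(root):
    """Yield one dict per Segment, resolving the default channel."""
    default_channel = {
        p.get("Name"): p.get("Channel") for p in root.iter("Participant")
    }
    for seg in root.iter("Segment"):
        speaker = seg.get("Participant")
        # An explicit Channel on the Segment overrides the participant's own.
        channel = seg.get("Channel") or default_channel.get(speaker)
        # Flatten the mixed content and normalize whitespace.
        text = " ".join("".join(seg.itertext()).split())
        yield {
            "start": float(seg.get("StartTime")),
            "end": float(seg.get("EndTime")),
            "participant": speaker,
            "channel": channel,
            "text": text,
        }


segs = list(segments(ET.fromstring(SAMPLE)))
for s in segs:
    print(s)
```

Flattening with itertext() discards the inline tags; code that needs the emphasis or pronunciation markup would have to walk the child elements instead.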

If the participant spoke into a near-field microphone, then the CloseMic attribute will be true (the default). Otherwise, the CloseMic attribute will be false.

During most meetings, we ask the participants to read a set of connected digits. The DigitTask attribute should be true during these segments.
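A parser that does not read the DTD will not fill in the declared defaults for CloseMic and DigitTask, so a reader may need to apply them itself. The sketch below shows one way to do that in Python; the sample segments and the helper names are hypothetical and only illustrate the default handling.

```python
import xml.etree.ElementTree as ET

# Hypothetical segments, used only to exercise the defaults.
SAMPLE = """<Transcript>
  <Segment StartTime="10.000" EndTime="12.500" Participant="me011"> O_K. </Segment>
  <Segment StartTime="12.500" EndTime="15.000" Participant="me011" DigitTask="true"> one two three </Segment>
  <Segment StartTime="15.000" EndTime="16.000" Participant="mn014" CloseMic="false"> Yeah. </Segment>
</Transcript>"""


def flag(seg, name, default):
    """Read a true/false attribute, falling back to the DTD default."""
    value = seg.get(name)
    if value is None:
        return default
    return value == "true"


def is_close_mic(seg):
    return flag(seg, "CloseMic", True)    # DTD default: 'true'


def is_digit_task(seg):
    return flag(seg, "DigitTask", False)  # DTD default: 'false'


root = ET.fromstring(SAMPLE)
flags = [(is_close_mic(s), is_digit_task(s)) for s in root.iter("Segment")]
print(flags)
```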

Note on acronyms: We did not include a separate acronym tag. Instead, if the acronym is pronounced as a word, then the acronym is listed with no special notation. For example, NASA will simply appear as "NASA".

If the letters of the acronym are spoken, such as with "IBM", this is denoted by underscores in the transcription. For example, "IBM" appears in the corpus as "I_B_M".

Mixed cases also appear. For example, "x ray" is transcribed "X_ray".
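The underscore convention makes it straightforward to recover the separately spoken units of a token. The Python sketch below is a minimal illustration; the function name spoken_units is an assumption, and the sketch does not handle acronyms that contain Emphasis tags, since those span several XML nodes.

```python
def spoken_units(token):
    """Split a transcript token on underscores into separately spoken units."""
    return token.split("_")


for token in ("NASA", "I_B_M", "X_ray"):
    print(token, "->", spoken_units(token))
```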

See the section on <Emphasis> below for important information on acronyms with emphasis.

<!ELEMENT Foreign (#PCDATA |
                    Emphasis |
                    Uncertain )*>
<!ATTLIST Foreign  Language CDATA #REQUIRED
                   Description CDATA #IMPLIED>


<Foreign Language="de"> Nein. </Foreign>

Parts of a segment that are spoken in a language other than English are identified with the Foreign tag. The Language attribute is required, and must be the two-letter international language symbol for the language being used.

The Description attribute is optional, but frequently contains the translation.

<!ELEMENT Emphasis (#PCDATA |
                    Pronounce |
                    Uncertain |

When a syllable in a word is emphasized, the word may be marked with the Emphasis tag.

When Emphasis is used within an acronym, the coding is a combination of underscores and the Emphasis tags. For example, "I_<Emphasis> B </Emphasis>_M" indicates that the B is emphasized.

When the emphasis tag is in the leading position, it may indicate that any syllable is emphasized. For example:

I_<Emphasis> B </Emphasis>_M      The B is emphasized
<Emphasis> I_B_M </Emphasis>      Any syllable may be emphasized

<!ELEMENT Pronounce (#PCDATA)>
<!ATTLIST Pronounce Pronunciation CDATA #IMPLIED>


One of <Pronounce Pronunciation="&quot;em&quot;"> them </Pronounce>
linguistic factors <Pronounce Pronunciation="shortened"> too. </Pronounce>
and <Pronounce> partially </Pronounce>

When an obviously non-canonical pronunciation is used, it may be marked with the Pronounce tag. The contained text shows the canonical pronunciation. The Pronunciation attribute, if included, indicates how the word or phrase was actually pronounced, either by specifying the pronunciation (as in the first example) or by describing its properties (as in the second example). If the Pronunciation attribute is missing (as in the third example), the tag indicates only that the word was pronounced in some non-canonical way.

<!ELEMENT Uncertain (#PCDATA |
                     Uncertain | 
                     Pause |
                     Pronounce |
<!ATTLIST Uncertain NumSyllables CDATA #IMPLIED>


<Uncertain> I mean </Uncertain> these are long meetings
<Uncertain NumSyllables="2"> @@ </Uncertain> We should

In cases where the transcriber was uncertain of the transcript, the Uncertain tag may be used. The contained text is the transcriber's best guess, or may be the special symbol @@ to mean unintelligible. If the NumSyllables attribute is given, it is the transcriber's best guess at the number of syllables in the Uncertain section.
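As an illustration, the following Python sketch finds the unintelligible spans in a segment and reports the syllable estimate for each. The enclosing Segment and its times are hypothetical, and the sketch assumes that NumSyllables, when present, holds an integer.

```python
import xml.etree.ElementTree as ET

# Hypothetical segment wrapping the second example above.
SAMPLE = """<Segment StartTime="0.000" EndTime="1.000">
  <Uncertain NumSyllables="2"> @@ </Uncertain> We should
</Segment>"""


def unintelligible_spans(seg):
    """Yield the syllable estimate (or None) for each @@ span."""
    for u in seg.iter("Uncertain"):
        if "".join(u.itertext()).strip() == "@@":
            n = u.get("NumSyllables")
            # Assumes NumSyllables, when given, is an integer.
            yield int(n) if n is not None else None


spans = list(unintelligible_spans(ET.fromstring(SAMPLE)))
print(spans)
```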


<!ELEMENT Comment EMPTY>
<!ATTLIST Comment Description CDATA #REQUIRED>


and so forth. <Comment Description="rising intonation"/>
On - on Friday. <Comment Description="restart at overlap"/>
Johno? <Comment Description="whispered"/>
is that - <Comment Description="while laughing"/>
<Comment Description="very long pause, almost 6 seconds"/>

Miscellaneous comments from transcribers, usually about the preceding utterance.

<!ELEMENT VocalSound EMPTY>
<!ATTLIST VocalSound Description CDATA #IMPLIED>


<VocalSound Description="laugh"/>
<VocalSound Description="breath"/>
<VocalSound Description="clears throat"/>
<VocalSound Description="mouth"/>
<VocalSound Description="sniff"/>

A non-lexical vocalization related to a speaker.

<!ELEMENT NonVocalSound EMPTY>
<!ATTLIST NonVocalSound Description CDATA #IMPLIED>


<NonVocalSound Description="mike noise"/>
<NonVocalSound Description="pages shuffling"/>
<NonVocalSound Description="writing on whiteboard"/>
<NonVocalSound Description="door opens"/>
<NonVocalSound Description="computer beep"/>

A non-vocalized noise.