Top-Level Documentation for HUB-4 Mandarin Transcript Data
----------------------------------------------------------

This distribution contains files of completed transcription data for
the training set of the 1997 DARPA HUB-4 Mandarin Benchmark, together
with supporting files and documentation for using standard SGML
utilities to access the transcript data.

The transcript files themselves have been created in a manner that
directly supports the use of a standard SGML parsing utility to
extract the transcription text and associated information in each
file.  The supporting files provide SGML Document Type Declarations
(DTDs) that fully describe the structure and content of the
transcript files; the DTDs are provided as input to an SGML parser
along with the transcript files in order to verify the format of each
transcript file and extract its contents.

A separate documentation file, "hub4sgml.doc", explains how to obtain
and use an SGML parser utility, and what can be done with each of the
DTD files.  Detailed information about the transcription conventions,
word segmentation principles and SGML structure of the transcripts is
provided as HTML documents.

The remainder of the current documentation file (h4m_tran.doc)
provides overview information about the transcripts and how they
relate to the associated acoustic data (which have been published
separately on CD-ROM).

DATA SOURCES FOR THE 1997 HUB-4 MANDARIN COLLECTION
---------------------------------------------------

This collection will ultimately include materials that have been
recorded from broadcasts by the following sources:

    Voice of America (VOA) -- United States Information Agency radio
    CCTV -- People's Republic of China television
    KAZN -- commercial radio based in Los Angeles, CA
Of these three sources, the first two comprise the bulk of the
collection, and will be represented in roughly equal amounts; only a
relatively small sample of KAZN recordings will be included, owing to
the relatively high proportion of unusable material (commercials,
local traffic reports loaded with California place names, etc).

The following table indicates the relative amounts of data from each
source, in terms of number of files, number of hours of broadcast
recordings (i.e. in the speech files published on CD-ROM), and number
of hours of actual transcribed speech data (i.e. time bounded by turn
tags in the transcripts).

               No. of     Hours        Hours
    Source     Files      Recorded     Transcribed
    =============================================
    CCTV         25         13.0         11.7
    KAZN          9          4.5          2.7
    VOA          24         24.0         15.8
    ---------------------------------------------
    Total:       58         41.5         30.2

(The apparent low yield of VOA recordings was due to the presence of
some non-news program segments in the acoustic files, as well as the
presence of repeated news content across files recorded on the same
date; when a given speaker read the same news text more than once in
the recorded collection, only one of those readings was transcribed.)

ORGANIZATION OF DATA FILES
--------------------------

The names of individual transcript files indicate the language,
source and date of the broadcast, as follows:

    1st character:  language (i.e. "m" for Mandarin)
    2nd character:  source ("v" for VOA, "c" for CCTV, "k" for KAZN)
    3rd-7th chars:  date of broadcast (YYMDD; e.g. "97418")
    8th character:  if present, one of "a", "b", "c", "d"

The seven- or eight-character name matches the name of the
corresponding acoustic data file, which the LDC is providing via
CD-ROM.
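The file-ID layout above can be unpacked mechanically.  The sketch
below is a minimal, illustrative Python example (the function name,
dictionary names and returned field names are my own); the "A"-"C"
month-letter codes and the KAZN "0" serial-number convention it
handles are explained in the paragraphs that follow.

```python
# Minimal sketch of unpacking a HUB-4 Mandarin file ID.  Only the
# field layout comes from this documentation; all names here are
# illustrative.  Months October-December are coded "A"-"C", and KAZN
# files carry "0" plus an arbitrary serial number in the date field
# (both conventions are described below in this document).

SOURCES = {"v": "VOA", "c": "CCTV", "k": "KAZN"}
MONTHS = {str(d): d for d in range(1, 10)}
MONTHS.update({"A": 10, "B": 11, "C": 12})

def parse_fileid(fileid):
    """Split a 7- or 8-character file ID into labeled fields."""
    if len(fileid) not in (7, 8) or fileid[0] != "m":
        raise ValueError("not a HUB-4 Mandarin file ID: %r" % fileid)
    info = {"language": "Mandarin",
            "source": SOURCES[fileid[1]],
            "year": int(fileid[2:4]),
            "part": fileid[7] if len(fileid) == 8 else None}
    if fileid[4] == "0":                    # undated KAZN recording
        info["serial"] = fileid[4:7]        # arbitrary, not a date
    else:
        info["month"] = MONTHS[fileid[4]]
        info["day"] = int(fileid[5:7])
    return info
```

For example, "mv97418" names a VOA broadcast from April 18, 1997,
while "mv97418a" would be the first of several files from that date.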
The transcript and speech files are distinguished by the 3-character
"extension" to the file name, as follows:

    fileid.sph : speech file on CD-ROM
    fileid.sgm : transcript file (in SGML format)

The "date of broadcast" field of the file names uses the letters "A",
"B" and "C" to represent the months October, November and December,
respectively; the digits 1-9 represent January through September in
the normal way.

Note that the KAZN files have a "0" for the month field -- this is
because the recordings received by the LDC from KAZN were not labeled
with date of broadcast, and we have not recovered the broadcast date
from the content of the recordings.  For this material, we
arbitrarily assigned a unique 3-digit number to each 30-minute
recording, starting at "001", and used this in place of the "MDD"
portion of the broadcast date field of the file name.  It is likely
that the sequence implied by these arbitrary numbers DOES NOT
correspond to the actual time sequence of the broadcasts.

In several cases, two or more VOA recordings were made on the same
day (or a single large recording was broken up into two or more
smaller files).  In these cases, the first seven characters of the
file names (the language, source and date fields) are identical, and
an eighth character is added to distinguish the individual files.
This eighth character is one of "a", "b", "c" or "d", and represents
the actual broadcast sequence of the files (i.e. the "a" file will
have been broadcast prior to the "b" file, and so on).

CONTENT OF TRANSCRIPT FILES
---------------------------

The content of the transcript files can be categorized into two types
of character data:

    (1) SGML (ASCII-encoded) markup data
    (2) Mandarin (GB-encoded) text data (with some ASCII notations)

These two types of data never occur together on the same line -- that
is, each line of a transcript file contains either SGML markup or
transcription text, but never both.
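Because markup and text never share a line, a transcript can be
partitioned with a simple per-line test.  The sketch below is my own
illustration; the assumption that markup lines begin with "<" is
mine, and the authoritative tag syntax is defined by the DTDs (see
"hub4sgml.doc").

```python
# Hedged sketch: split a transcript into markup lines and text lines,
# relying on the guarantee that the two never share a line.  The
# leading-"<" test for markup is an assumption; the DTDs define the
# actual tag syntax.

def split_lines(lines):
    """Partition lines into (markup, text) lists."""
    markup, text = [], []
    for line in lines:
        if line.lstrip().startswith("<"):
            markup.append(line)
        else:
            text.append(line)
    return markup, text
```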
The SGML markup provides division of the text data into a
hierarchical structure of "sections" (defined on the basis of topic),
"turns" (defined on the basis of change of speaker), and "overlap"
(regions where two people are speaking at once).  For sections, the
markup indicates the type of section (one of: "nontranscribed",
"filler" or "report"); for turns, it indicates the gender and a
unique identifying string for each speaker.  It also establishes the
timing information, in units of seconds, for correlating the
transcription text to the acoustic data.  The HTML document file
"sgmlspec.html" provides more detail about the structure and meaning
of the markup content.

With regard to identification of speakers in the SGML turn tags, we
have sought to identify speakers by name wherever possible; the
speaker's given name is provided in ASCII (pinyin) form as a single
attribute token within the turn tag (e.g. "speaker=Zeng_Yucheng").
Every speaker whose given name was not determinable from the recorded
broadcast was assigned an anonymous but uniquely indexed string, such
as "spkr_21" or "reporter_37".

In applying these anonymous labels to turns within a file, the
transcribers were instructed as follows: make sure that a given label
is not applied to more than one distinct voice, and try as far as
possible to apply the same label every time the same voice is heard.
No attempt was made to correlate the identity of anonymous voices
across files.  Owing to the nature of the task, it is likely that a
single speaker will appear in different files, or at different points
in the same file, and be identified with different labels (this may
affect some cases of named speakers as well).  While it is also
possible that some mistakes have been made in applying the same label
to different speakers, this type of identification error should be
quite rare.
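As an illustration of using the speaker attribute, here is a small
regex-based sketch.  The tag shape it assumes (<turn ...
speaker=NAME ...>) is inferred only from the "speaker=Zeng_Yucheng"
example above; for real processing, use an SGML parser with the
supplied DTDs, which define the authoritative grammar.

```python
import re

# Hedged sketch: collect speaker labels from turn tags in markup
# lines.  The assumed tag shape <turn ... speaker=NAME ...> is an
# inference from the "speaker=Zeng_Yucheng" example; consult the DTDs
# for the real syntax.

TURN_RE = re.compile(r"<turn\b[^>]*\bspeaker=(\w+)", re.IGNORECASE)

def speakers_in(markup_lines):
    """Return the speaker label of each turn tag, in document order."""
    return [m.group(1)
            for line in markup_lines
            for m in TURN_RE.finditer(line)]
```

A list like this makes it easy to see how often each (possibly
anonymous) label recurs within a single file.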
The Mandarin text data consist of 16-bit GB-encoded characters,
together with space, new-line and punctuation characters; the
punctuation consists of only the following: period, comma and
question mark (using the ASCII codes 0x2e, 0x2c and 0x3f,
respectively).  Spaces are used to indicate word segmentation, which
has been done manually by the Mandarin transcribers, in accordance
with principles described in the HTML document files
"ma_segmentation.html" and "ma_principles.html".

In addition to the Mandarin text and word separators, there is a
small set of curly-brace bracketed tokens to indicate non-speech
sounds made by a speaker (e.g. "{laugh}", "{cough}", etc), and a
small set of "token classifier" characters, which immediately precede
a word token and identify that token as falling into one of the
following categories:

    Character   Token Category
    ---------------------------------------------------------------
        %       non-lexeme (e.g. filled pause or hesitation sound)
        ^       proper name (i.e. a person's given name or surname)
        +       mispronounced word (correct orthography of intended
                word is given)

The curly braces and token classifier characters can be interpreted
as SGML markup by the parser utility, when the appropriate DTD file
is used.  An alternative DTD can also be used to parse the
transcripts while leaving these characters intact (unprocessed).
Please refer to "hub4sgml.doc" for further information and examples
of usage.

The transcripts also include one special notation in double square
brackets: "[[NS]]".  This is used to identify a region between two
consecutive time stamps in which there is no speech.  This typically
occurs within a turn when the speaker pauses for a significant
period of time (two seconds or more), during which there is music,
background noise or silence.

Hyphens are used to indicate word fragments; the hyphen may occur
either at the beginning or the end of the fragment.  (A word-initial
hyphen indicates that noise or transmission problems during the
broadcast obscured or eliminated the beginning of the word.)
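For readers who prefer the non-SGML route (i.e. the alternative DTD
that leaves these characters intact), the text-level notations lend
themselves to a simple tokenizer.  This sketch is my own; the
category names in its output are illustrative, while the "{...}",
"%"/"^"/"+" and "[[NS]]" conventions come from the documentation
above.

```python
# Hedged sketch: label each whitespace-separated token of one line of
# transcription text according to the conventions described above.
# The category strings are illustrative, not an official vocabulary.

CLASSIFIERS = {"%": "non-lexeme", "^": "proper-name", "+": "mispronounced"}

def tokenize_line(line):
    """Return (category, token) pairs for one line of text data."""
    out = []
    for tok in line.split():
        if tok == "[[NS]]":
            out.append(("no-speech", tok))
        elif tok.startswith("{") and tok.endswith("}"):
            out.append(("non-speech-sound", tok[1:-1]))
        elif tok[0] in CLASSIFIERS:
            out.append((CLASSIFIERS[tok[0]], tok[1:]))
        else:
            out.append(("word", tok))
    return out
```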
Hyphens are not given any special treatment by the SGML parser, and are passed through to the parser output unmodified. In some cases, the word fragment is rendered as one or more 16-bit GB characters, while in other cases, it appears as one or more ASCII characters representing a pinyin transcription of the fragment.
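Telling the two renderings apart programmatically is straightforward
at the byte level: GB-2312 encodes each Mandarin character as two
bytes with the high bit set, whereas ASCII (pinyin) bytes stay below
0x80.  A minimal sketch, with a function name of my own choosing:

```python
# Hedged sketch: classify a raw word token as GB-encoded Mandarin or
# pure ASCII (e.g. a pinyin-rendered fragment).  GB-2312 uses two
# bytes per character, both with the high bit set; ASCII bytes are
# always below 0x80.

def is_gb_token(raw):
    """True if the token (bytes) contains any high-bit (GB) byte."""
    return any(b >= 0x80 for b in raw)
```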