Top-Level Documentation for HUB-4 Spanish Transcript Data
---------------------------------------------------------

This distribution contains files of completed transcription data for
the training set of the 1997 DARPA HUB-4 Spanish Benchmark, together
with supporting files and documentation for using standard SGML
utilities to access the transcript data.

The transcript files themselves have been created in a manner that
directly supports the use of a standard SGML parsing utility to
extract the transcription text and associated information in each
file.  The supporting files provide SGML Document Type Declarations
(DTDs) that fully describe the structure and content of the
transcript files; the DTDs are provided as input to an SGML parser
along with the transcript files in order to verify the format of each
transcript file and extract its contents.

A separate documentation file, "hub4sgml.doc", explains how to obtain
and use an SGML parser utility, and what can be done with each of the
DTD files.  Detailed information about the transcription conventions,
word segmentation principles and SGML structure of the transcripts is
provided as HTML documents.

The remainder of the current documentation file (h4s_tran.doc)
provides overview information about the transcripts and how they
relate to the associated acoustic data (which have been published
separately on CD-ROM).


DATA SOURCES FOR THE 1997 HUB-4 SPANISH COLLECTION
--------------------------------------------------

This collection will ultimately include materials that have been
recorded from broadcasts by the following sources:

    Voice of America (VOA) -- United States Information Agency radio
    ECO
    UNIVISION

Of these three sources, the latter two originate in Mexico, but will
tend to include speakers from other regions in Latin America; in the
VOA broadcasts, Cuban (or other Caribbean) speakers will dominate.
The following table indicates the relative amounts of data from each
source, in terms of number of files, number of hours of broadcast
recordings (i.e. in the speech files published on CD-ROM), and number
of hours of actual transcribed speech data (i.e. time bounded by turn
tags in the transcripts):

              No. of    Hours       Hours
    Source    Files     Recorded    Transcribed
    =========================================
    ECO         30       17.5         7.8
    UNI         24       12.0         7.3
    VOA         27       15.5        17.6
    -----------------------------------------
    Total:      81       45.0        32.7

(The apparent low yield for these sources was due primarily to the
presence of significant non-news segments in the acoustic files.)


ORGANIZATION OF DATA FILES
--------------------------

The names of individual transcript files indicate the language,
source and date of the broadcast, as follows:

    1st character:  language (i.e. "s" for Spanish)
    2nd character:  source ("v" for VOA, "e" for ECO, "u" for Univision)
    3rd-7th chars:  date of broadcast (YYMDD; e.g. "97418")
    8th character:  if present, one of "a", "b", "c", "d"

The seven- or eight-character name matches the name of the
corresponding acoustic data file, which the LDC is providing via
CD-ROM.  The transcript and speech files are distinguished by the
3-character "extension" to the file name, as follows:

    fileid.sph : speech file on CD-ROM
    fileid.sgm : transcript file (in SGML format)

The "date of broadcast" field of the file names will use the letters
"A", "B" and "C" to represent the months October, November and
December, respectively; the digits 1-9 represent January through
September in the normal way.

In several cases, two or more VOA recordings were made on the same
day (or a single large recording was broken up into two or more
smaller files).  In these cases, the first seven characters of the
file names (the language, source and date fields) are identical, and
an eighth character is added to distinguish the individual files.
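As an illustration of the naming scheme, the Python sketch below
decodes a fileid into its component fields.  The function name and
the returned dictionary are our own conventions, not part of the
distribution; only the field layout and the month-letter codes come
from the description above.

```python
# Sketch: decode a HUB-4 Spanish fileid (e.g. "sv97418" or "sv97A18a")
# into its component fields, following the naming conventions above.

LANGUAGES = {"s": "Spanish"}
SOURCES = {"v": "VOA", "e": "ECO", "u": "Univision"}
# Months 1-9 are January-September; "A", "B", "C" are October-December.
MONTHS = {str(d): d for d in range(1, 10)}
MONTHS.update({"A": 10, "B": 11, "C": 12})

def decode_fileid(fileid):
    """Split a 7- or 8-character fileid into language, source, date
    and (optional) broadcast-sequence letter."""
    if len(fileid) not in (7, 8):
        raise ValueError("fileid must be 7 or 8 characters: %r" % fileid)
    return {
        "language": LANGUAGES[fileid[0]],
        "source":   SOURCES[fileid[1]],
        "year":     1900 + int(fileid[2:4]),
        "month":    MONTHS[fileid[4].upper()],
        "day":      int(fileid[5:7]),
        "sequence": fileid[7] if len(fileid) == 8 else None,
    }
```

For example, decoding "sv97418" yields a Spanish VOA broadcast
recorded on 18 April 1997, with no sequence letter.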
This eighth character will be one of "a", "b", "c" or "d", and
represents the actual broadcast sequence of the files (i.e. the "a"
file will have been broadcast prior to the "b" file, and so on).


CONTENT OF TRANSCRIPT FILES
---------------------------

The content of the transcript files can be categorized into two types
of character data:

    (1) SGML (ASCII-encoded) markup data
    (2) Spanish (ISOLatin1-encoded) text data (with some ASCII notations)

These two types of data never occur together on the same line -- that
is, each line of a transcript file contains either SGML markup or
transcription text, but never both.

The SGML markup divides the text data into a hierarchical structure
of "sections" (defined on the basis of topic), "turns" (defined on
the basis of change of speaker), and "overlap" (regions where two
people are speaking at once).  For sections, the markup indicates the
type of section (one of "nontranscribed", "filler" or "report"); for
turns, it indicates the gender and a unique identifying string for
each speaker.  It also establishes the timing information, in units
of seconds, for correlating the transcription text to the acoustic
data.  The HTML document file "sgmlspec.html" provides more detail
about the structure and meaning of the markup content.

With regard to identification of speakers in the SGML turn tags, we
have sought to identify speakers by name wherever possible; the
speaker's given name is provided in ASCII form as a single attribute
token within the turn tag (e.g. "speaker=Joaquin_del_Olmo").  Every
speaker whose name was not determinable from the recorded broadcast
was assigned an anonymous but uniquely indexed string, such as
"spkr_21" or "reporter_37".  In applying these anonymous labels to
turns within a file, the transcribers were instructed to make sure
that a given label was not applied to more than one distinct voice,
and to apply, as far as possible, the same label every time the same
voice was heard.
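For quick inspection (as opposed to full DTD-based parsing), turn
tags can also be scanned with a regular expression.  The sketch below
assumes tags of the form shown; the tag name "turn" and the attribute
names "speaker", "stime" and "etime" are illustrative guesses only --
the real names are defined by the DTDs and documented in
"sgmlspec.html", which should be consulted before relying on this.

```python
import re

# Sketch: pull speaker and timing information out of turn tags.
# Tag and attribute names here are assumptions, not the documented
# HUB-4 markup; adjust to match the DTDs.
TURN_RE = re.compile(
    r'<turn\s+speaker=(?P<speaker>\S+)\s+'
    r'stime=(?P<stime>[\d.]+)\s+etime=(?P<etime>[\d.]+)>',
    re.IGNORECASE)

def find_turns(sgml_text):
    """Return (speaker, start_seconds, end_seconds) for each turn tag."""
    return [(m.group("speaker"),
             float(m.group("stime")),
             float(m.group("etime")))
            for m in TURN_RE.finditer(sgml_text)]
```

A DTD-driven SGML parser remains the supported access method; this
regex shortcut does no validation and will miss any tag whose
attribute order or names differ from the assumed form.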
No attempt was made to correlate the identity of anonymous voices
across files.  Owing to the nature of the task, it is likely that a
single speaker will appear in different files, or at different points
in the same file, and be identified with different labels (this may
affect some cases of named speakers as well).  While it is also
possible that some mistakes have been made in applying the same label
to different speakers, this type of identification error should be
quite rare.

Speaker information (as provided in the SGML turn tags) includes sex
and dialect classification of the speakers.  The dialect categories
applied to speakers of Spanish are as follows:

    Attribute label    Meaning
    ----------------------------------------------
    Coastal            Caribbean, Lowland
    Interior           Mainland, Highland
    Peninsular         Spain
    Non-native         (but still speaking Spanish)

In addition, for turns in which the speaker did not utter any
Spanish, the dialect attribute is given as:

    Not-Spanish

The Spanish text data consist of 8-bit ISOLatin1-encoded characters,
together with space, new-line and punctuation characters; the
punctuation consists of only the following: period, comma and
question mark (using the ASCII codes 0x2e, 0x2c and 0x3f,
respectively).

In addition to the Spanish text and word separators, there is a small
set of curly-brace bracketed tokens to indicate non-speech sounds
made by a speaker (e.g. "{laugh}", "{cough}", etc.), and a small set
of "token classifier" characters, which immediately precede a word
token and identify that token as falling into one of the following
categories:

    Character    Token Category
    ------------------------------------------------------------------
    %            non-lexeme (e.g. filled pause or hesitation sound)
    ^            proper name (i.e. a person's given name or surname)
    +            mispronounced word (correct orthography of intended
                 word is given)
    _            alphabet letter (an initial or part of an acronym)

The curly braces and token classifier characters (except for the
underscore) can be interpreted as SGML markup by the parser utility,
when the appropriate DTD file is used.
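For users who prefer to handle the text layer without an SGML parser,
the following Python sketch separates the curly-brace non-speech
tokens and token-classifier characters from the plain words.  The
function and variable names are our own; only the bracket notation
and the four classifier characters come from the conventions above.

```python
import re

# Sketch: split a line of transcription text into (classifier, word)
# pairs, dropping curly-brace non-speech tokens such as {laugh}.
NONSPEECH_RE = re.compile(r'\{[^}]*\}')   # e.g. {laugh}, {cough}
CLASSIFIERS = "%^+_"                      # non-lexeme, name, mispronounced, letter

def classify_tokens(line):
    """Return (classifier_or_None, word) for each word token in line."""
    line = NONSPEECH_RE.sub(" ", line)
    pairs = []
    for token in line.split():
        if token[0] in CLASSIFIERS and len(token) > 1:
            pairs.append((token[0], token[1:]))
        else:
            pairs.append((None, token))
    return pairs
```

This is a plain-text alternative to letting the DTD-aware parser
interpret these characters as markup, and it deliberately leaves
punctuation attached to words.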
An alternative DTD can also be used to parse the transcripts while
leaving these characters intact (unprocessed).  Please refer to
"hub4sgml.doc" for further information and examples of usage.

The transcripts also include one special notation in double square
brackets: "[[NS]]".  This is used to identify a region between two
consecutive time stamps in which there is no speech.  This typically
occurs within a Turn when the speaker pauses for a significant period
of time (two seconds or more), during which there is music,
background noise or silence.

Hyphens are used to indicate word fragments; the hyphen may occur
either at the beginning or the end of the fragment.  (A word-initial
hyphen indicates that noise or transmission problems during the
broadcast obscured or eliminated the beginning of the word.)  Hyphens
are not given any special treatment by the SGML parser, and are
passed through to the parser output unmodified.
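Since "[[NS]]" markers and fragment hyphens pass through the parser
unmodified, downstream tools may want to filter them.  A minimal
sketch, assuming the word list has already been extracted from the
parser output (the helper name is our own):

```python
# Sketch: drop "[[NS]]" no-speech markers from a word list and flag
# hyphen-marked word fragments (word-initial or word-final hyphen).

def clean_words(words):
    """Remove [[NS]] markers; return (word, is_fragment) pairs."""
    out = []
    for w in words:
        if w == "[[NS]]":
            continue
        is_fragment = w.startswith("-") or w.endswith("-")
        out.append((w, is_fragment))
    return out
```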