Documentation for SGML Usage in 1997 HUB-4 Transcript Data
----------------------------------------------------------
This file explains how to obtain and use a standard SGML parser
utility for treating the transcription data provided by the LDC.
AVAILABILITY OF SGML UTILITY SOFTWARE
-------------------------------------
A public domain package of SGML utilities, software library functions
and documentation is available via anonymous ftp as follows:
connect to: ftp.jclark.com
go to directory: pub/sp
In this directory you will find a Gnu-zip'd tar file containing the
latest version of the source code package, as well as platform-
specific subdirectories containing pre-compiled programs and software
libraries (for DOS, OS/2, Macintosh, Windows-95, DEC Alpha,
i386-linux, and various SunOS configurations). Complete documentation
is included in all packages.
Access to the packages, as well as additional information about the
utilities and the standards that they comply with, is also available
via the World Wide Web, at the URL:
http://www.jclark.com/sp/
This package, created and maintained by James Clark, is the one used
by the LDC to verify the transcripts and to parse their contents for
internal use. Other packages are available, both commercial and
public domain, and information about them is available by following
appropriate links from the web page cited above -- in particular, you
will want to follow the link labeled "The SGML Web Page", which
currently points to:
http://www.sil.org/sgml/sgml.html
and browse from there.
VALIDATION OF SGML TRANSCRIPT FILES
-----------------------------------
All of the transcript files in this release have been tested in
combination with the DTD files provided here, using James Clark's
"nsgmls" parser, and the LDC has verified that all files are
syntactically correct and valid with respect to their SGML markup.
The SGML syntax validation consists simply of running the parser on
each transcript file and confirming that it reports no errors. In
particular, the following command serves to report any errors in the
SGML markup of a given transcript file, and will produce no output if
the file is fully conformant with the given DTD (the "first-level"
tokenization in this case):
nsgmls -s h4_level1.dtd trans_file.sgml
We have also checked that the lexical tokenization and token
classification that results from parsing with the "first-level" DTD
(in file "h4_level1.dtd") produces results that are consistent with
the word segmentation and token marking established by the
transcribers.
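The validation check described above can be applied to a whole set of
transcript files with a short script. The following is only a sketch:
the function names and the "run" parameter are illustrative and not
part of any LDC or SP distribution, and it assumes that "nsgmls" is on
the search path and that any parser output or non-zero exit status
signals a validation problem.

```python
import subprocess
from typing import Callable, Iterable, List


def build_validate_command(dtd_file: str, transcript: str) -> List[str]:
    """Return the argument list for the validation command quoted above."""
    # -s suppresses the normal parser output, so only error messages remain.
    return ["nsgmls", "-s", dtd_file, transcript]


def find_invalid(
    transcripts: Iterable[str],
    dtd_file: str = "h4_level1.dtd",
    run: Callable[..., subprocess.CompletedProcess] = subprocess.run,
) -> List[str]:
    """Return the transcripts for which the parser reported a problem."""
    invalid = []
    for path in transcripts:
        result = run(
            build_validate_command(dtd_file, path),
            capture_output=True,
            text=True,
        )
        # Assumption: a conformant file yields no output at all and a zero
        # exit status; anything else is treated as a failure.
        if result.returncode != 0 or result.stdout.strip() or result.stderr.strip():
            invalid.append(path)
    return invalid
```

Calling find_invalid(["a.sgml", "b.sgml"]) returns the subset of files
that did not pass; an empty list means every file validated cleanly.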
DIFFERENT LEVELS OF SGML PARSING
--------------------------------
This release of transcript data includes three different DTD files
for parsing the transcripts. Their behavior is summarized as
follows.
Level 1 (h4_level1.dtd) :
- provides full lexical tokenization, and identifies lexical
tokens with respect to these token classes:
"PNAME" (proper names)
"NONLEXEME" (filled pauses "um", etc)
"NONSPEECH" (cough, laugh, sneeze, breath, etc)
"WORD" (common lexical items)
- translates punctuation and spaces into SGML units:
"PERIOD"
"COMMA"
"QMARK"
"SEPARATOR"
Level 2 (h4_level2.dtd) :
- same as level 1, except that token classification is not
performed; instead, the characters in the transcript that
flag these token classes are left intact in the parser
output -- i.e.:
"^Name" for proper names
"%um" etc for filled pauses
"{cough}" etc for non-speech sounds
These three classes of tokens are simply output as "WORD"
elements, just like common lexical items.
Level 3 (h4_level3.dtd) :
- reproduces the text content of the transcript without
further analysis or processing -- that is, the text content
comes out exactly as it appears in the original SGML file.
All three levels of parsing will have the same effect with regard to
rendering the SGML tags and attribute values of the original SGML
file. The particular output format produced by James Clark's "nsgmls"
parser is described in documentation provided with his distribution
packages, and some brief examples are discussed below.
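Users who parse with the level-2 DTD but still want the level-1 token
classes can recover them from the flag characters. The sketch below
is one possible approach, not LDC code; the function name is
hypothetical, and it assumes that the flag characters ("^", "%", and
the curly braces) appear only in the positions listed above.

```python
from typing import Tuple


def classify_level2_token(text: str) -> Tuple[str, str]:
    """Map the text of a level-2 WORD element to (token_class, bare_text)."""
    if len(text) > 1 and text.startswith("^"):
        return ("PNAME", text[1:])          # "^Name" marks a proper name
    if len(text) > 1 and text.startswith("%"):
        return ("NONLEXEME", text[1:])      # "%um" marks a filled pause
    if len(text) > 2 and text.startswith("{") and text.endswith("}"):
        return ("NONSPEECH", text[1:-1])    # "{cough}" marks a non-speech sound
    return ("WORD", text)                   # everything else is a common word
```

For example, classify_level2_token("^Paco") yields the class "PNAME"
with the text "Paco", matching the level-1 treatment of the same token.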
Please note that the "level-1" DTD represents the LDC's "official
reference" DTD for this collection. The other levels have been
included to give users some examples of alternative processing, and to
serve as possible conveniences; the LDC does not make any specific
recommendations or commitments regarding the use or support of the
level-2 and level-3 DTD files.
USING THE SGML PARSER TO PROCESS TRANSCRIPT FILES
-------------------------------------------------
The following "typical" command will produce parsed transcript data on
stdout:
nsgmls dtd_file trans_file.sgml
In order to illustrate the parser output format, and to demonstrate
the differences among the three levels of parsing (using the three
different DTD files), a brief excerpt of transcript data (sample.sgml)
has been provided, together with the results of each level of
parsing.
The basic structure of the parser output is a series of lines in which
the first character of each line identifies the nature of the
information on that line. The line-initial character is one of:
A -- line contains an attribute name and value
(i.e. derived from an "attrib_name=attrib_value" string
contained within an SGML tag); these attribute lines
always appear PRIOR to the name of the tag that
contains them
( -- line contains an SGML element name, and defines the start
of an instance of the named element (e.g. TURN)
) -- line contains an SGML element name, and defines the end
of an instance of the named element (e.g. TURN)
- -- line contains text data
C -- marks the end of the parser output (nsgmls emits it only when
the document is conforming); has no further content
Note that some SGML elements (such as TIME, PERIOD, COMMA) are defined
as "empty" elements -- that is, they contain no text data. For these
elements, the parser output will have a line with initial "(" followed
immediately by a line with the corresponding initial ")".
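The line-oriented format described above lends itself to a very small
reader. The following sketch turns parser output into a stream of
events, attaching each group of attribute lines to the element start
that follows it. The "Event" type and the function name are
illustrative only; line types other than the five listed above are
simply skipped.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class Event(NamedTuple):
    kind: str              # "start", "end", "data" or "document_end"
    name: str              # element name ("" for data and document_end)
    attrs: Dict[str, str]  # attributes (non-empty only for "start")
    text: str              # text content (non-empty only for "data")


def parse_nsgmls(lines: Iterable[str]) -> Iterator[Event]:
    """Yield one Event per element start, element end, or data line."""
    pending: Dict[str, str] = {}
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        code, rest = line[0], line[1:]
        if code == "A":
            # "name type value": attribute lines precede their element.
            parts = rest.split(" ", 2)
            pending[parts[0]] = parts[2] if len(parts) > 2 else ""
        elif code == "(":
            yield Event("start", rest, pending, "")
            pending = {}
        elif code == ")":
            yield Event("end", rest, {}, "")
        elif code == "-":
            yield Event("data", "", {}, rest)
        elif code == "C":
            yield Event("document_end", "", {}, "")
```

An empty element such as TIME shows up as a "start" event followed
immediately by the matching "end" event, with its SEC attribute
carried on the "start" event.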
As a brief illustration, the following snippet of SGML transcription
represents one speaker turn, which begins with a proper name, and
includes a region of overlapping speech with a subsequent turn from
another speaker -- the words spoken by this speaker during the overlap
period were not transcribed because it was not readily clear what was
said:
Below we present, side by side, the output of the parser using the
level-1 and level-2 DTD files, respectively; note that the only
difference in this case is the classification of the proper name
token (as PNAME versus WORD), and the retention of the token flag
character "^" for the proper name in the level-2 output:
LEVEL-1                     LEVEL-2
----------------------      ----------------------
ASPEAKER CDATA unknown      ASPEAKER CDATA unknown
ASEX TOKEN MALE             ASEX TOKEN MALE
ASTARTTIME CDATA 4.773      ASTARTTIME CDATA 4.773
AENDTIME CDATA 6.841        AENDTIME CDATA 6.841
(TURN                       (TURN
ASEC CDATA 4.773            ASEC CDATA 4.773
(TIME                       (TIME
)TIME                       )TIME
(PNAME                      (WORD
-Paco                       -^Paco
)PNAME                      )WORD
(SEPARATOR                  (SEPARATOR
)SEPARATOR                  )SEPARATOR
(WORD                       (WORD
-pues                       -pues
)WORD                       )WORD
(SEPARATOR                  (SEPARATOR
)SEPARATOR                  )SEPARATOR
(WORD                       (WORD
-ya                         -ya
)WORD                       )WORD
(SEPARATOR                  (SEPARATOR
)SEPARATOR                  )SEPARATOR
(WORD                       (WORD
-el                         -el
)WORD                       )WORD
(SEPARATOR                  (SEPARATOR
)SEPARATOR                  )SEPARATOR
(WORD                       (WORD
-fin                        -fin
)WORD                       )WORD
(SEPARATOR                  (SEPARATOR
)SEPARATOR                  )SEPARATOR
(WORD                       (WORD
-de                         -de
)WORD                       )WORD
(SEPARATOR                  (SEPARATOR
)SEPARATOR                  )SEPARATOR
(WORD                       (WORD
-semana                     -semana
)WORD                       )WORD
(PERIOD                     (PERIOD
)PERIOD                     )PERIOD
ASTARTTIME CDATA 5.966      ASTARTTIME CDATA 5.966
AENDTIME CDATA 6.841        AENDTIME CDATA 6.841
(OVERLAP                    (OVERLAP
ASEC CDATA 5.966            ASEC CDATA 5.966
(TIME                       (TIME
)TIME                       )TIME
(UNCLEAR                    (UNCLEAR
)UNCLEAR                    )UNCLEAR
)OVERLAP                    )OVERLAP
)TURN                       )TURN
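Given output of the kind shown above, the token classification can be
read off by pairing each text line with the element that encloses it.
The sketch below keeps a stack of open elements for this purpose; the
function name is hypothetical, and the approach is only one of several
that would work.

```python
from typing import Iterable, List, Tuple


def collect_tokens(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Return (element_name, text) for every text line in the parser output."""
    open_elements: List[str] = []
    tokens: List[Tuple[str, str]] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("("):
            open_elements.append(line[1:])
        elif line.startswith(")"):
            if open_elements:
                open_elements.pop()
        elif line.startswith("-") and open_elements:
            # The innermost open element gives the token class.
            tokens.append((open_elements[-1], line[1:]))
    return tokens
```

Applied to the level-1 column, the first token comes out with the
class PNAME and the text "Paco"; applied to the level-2 column, the
same token comes out with the class WORD and the text "^Paco".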
Below is the output for the same snippet using the level-3 DTD; note
that, in addition to retaining the token flag character "^", this
version also retains the space and punctuation characters intact, but
translates new-line characters (0x0a) into the character sequence "\n".
(This is a documented feature of the parser, not an artifact of the
DTD. The reason why new-line characters are not treated this way in
the level-1 and level-2 parsing is that those DTDs redefine the
new-line character as one of the forms of the "SEPARATOR" element.)
ASPEAKER CDATA unknown
ASEX TOKEN MALE
ASTARTTIME CDATA 4.773
AENDTIME CDATA 6.841
(TURN
ASEC CDATA 4.773
(TIME
)TIME
-\n ^Paco pues ya el fin de semana.\n
ASTARTTIME CDATA 5.966
AENDTIME CDATA 6.841
(OVERLAP
ASEC CDATA 5.966
(TIME
)TIME
-\n
(UNCLEAR
)UNCLEAR
)OVERLAP
)TURN
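To recover the original text from a level-3 data line, the "\n"
sequences have to be turned back into new-line characters. The
following sketch does this; the function name is illustrative. It
also treats a doubled backslash as a literal backslash, which is an
assumption based on the nsgmls output format documentation rather
than on anything in the sample above, and it leaves any other escape
sequences untouched.

```python
import re

# A backslash followed by either another backslash or the letter "n".
_ESCAPE = re.compile(r"\\(\\|n)")


def unescape_data(line: str) -> str:
    """Return the text of a data line with new-lines restored."""
    if not line.startswith("-"):
        raise ValueError("not a data line: %r" % line)
    return _ESCAPE.sub(
        lambda m: "\n" if m.group(1) == "n" else "\\",
        line[1:],  # drop the line-initial "-"
    )
```

The escapes are replaced in a single left-to-right pass, so a literal
backslash followed by the letter "n" is not mistaken for a new-line.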
While the parser output is generally bulkier in appearance, it is
certainly no more difficult to handle -- and for some purposes is
actually simpler -- than the original SGML text, in terms of automatic
filtering of the transcript data (e.g. eliminating overlap regions or
speech in a non-target language).
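As an illustration of such filtering, the sketch below removes every
OVERLAP region from the parser output, together with the attribute
lines that belong to it. The function name and the "unwanted"
parameter are hypothetical; the code relies only on the fact, noted
earlier, that attribute lines always precede the start of the element
that contains them.

```python
from typing import Iterable, List, Sequence


def drop_elements(lines: Iterable[str],
                  unwanted: Sequence[str] = ("OVERLAP",)) -> List[str]:
    """Return the output lines with all unwanted elements removed."""
    kept: List[str] = []
    pending_attrs: List[str] = []
    depth = 0  # nesting depth inside an unwanted element (0 = outside)
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith("A"):
            # Hold attribute lines until the owning element is known.
            pending_attrs.append(line)
        elif line.startswith("("):
            if depth or line[1:] in unwanted:
                depth += 1
            attrs, pending_attrs = pending_attrs, []
            if depth == 0:
                kept.extend(attrs)
                kept.append(line)
        elif line.startswith(")") and depth:
            depth -= 1
        elif depth == 0:
            kept.append(line)
    return kept
```

The same function could be used for other regions by passing a
different tuple of element names as "unwanted".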
SPECIAL TRANSCRIPT NOTATIONS NOT TREATED BY SGML PARSING
--------------------------------------------------------
As of this release, there are three notations used in the SGML
transcript files that are not treated in any way by the SGML parser --
they are passed through to the parser output "as is". These are:
(1) Extended pauses or breaks within a speaker's turn ("[[NS]]")
(2) Word fragment indicators (hyphens)
(3) Spoken alphabet letters (initials or acronyms: _M, _F_B_I)
An extended pause within a turn is marked whenever a speaker stops
talking for a period of two seconds or more and then begins talking
again, continuing with the same topical "section" of the broadcast
(i.e. a news story or a "filler" of miscellaneous content). The
period of time during which the speaker is not talking may be filled
with background noise, music, or silence. This is noted in the SGML
file by a sequence like the following:
text of what is said before the break
[[NS]]
text of what is said after the break
The [[NS]] notation has been employed as an explicit placeholder so
that extended non-speech regions within a turn will be clearly marked
as such.
Since no treatment is defined for the square-bracket notation in the
current versions of the DTD, the string "[[NS]]" is passed through by
the parser as a lexical token (i.e. it appears as "-[[NS]]" in the
output).
Hyphens may appear at the beginning or the end of a word fragment;
typically, a fragment followed by a hyphen indicates a speaker
disfluency (false start or interruption), whereas a fragment preceded
by a hyphen indicates that the initial part of a word was not audible
for some reason. These word fragments are passed through the parser
as common "WORD" elements, with the hyphens intact (e.g. they appear
as "-frag-" or "--ment" in the output).
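Since the parser does not interpret these notations, any handling of
them has to be done on the text of the output lines. The sketch below
shows one way to recognize the pause marker and the two kinds of word
fragment; the function name and the category labels are hypothetical,
and the input is the text of a data line with the line-initial "-"
already removed.

```python
def classify_notation(text: str) -> str:
    """Label a token as a pause marker, a word fragment, or ordinary text."""
    if text == "[[NS]]":
        return "EXTENDED_PAUSE"       # non-speech region within a turn
    if len(text) > 1 and text.endswith("-"):
        return "FRAGMENT_CUT_OFF"     # e.g. "frag-": false start or interruption
    if len(text) > 1 and text.startswith("-"):
        return "FRAGMENT_NO_ONSET"    # e.g. "-ment": start of word not audible
    return "OTHER"
```

Note that the output line "--ment" carries two hyphens: the first is
the parser's data-line marker and the second belongs to the fragment,
so the text passed to this function is "-ment".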
The treatment of alphabet letters (as in the pronunciation of initials
and acronyms like "FBI", etc) involves the use of the underscore
character "_", which is placed immediately before each letter. (This
serves to distinguish letter "_A" from indefinite article "A", and
letter "_I" from personal pronoun "I" in English, for example.) Like
square brackets, the underscore character is not given any special
status in the DTD, and is passed intact to the parser output.
In the languages that have acronyms and initials, the transcriptions
will show conjoined sequences of underscores and letters for acronyms
and common letter sequences (e.g. "_F_B_I", "_A_B_C", "_X_Y_Z"),
whereas a case of "spelling out" a word or name will be rendered with
spaces between the letters (e.g. "_P _A _L _E, not _P _A _I _L"). In
the former case, each string of letters will appear as a single "word"
in the parser output, whereas the latter case will be a sequence of
"words". The underscores will be present in the output in either
case.
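The underscore convention makes it straightforward to pick out spoken
letters in the parser output. The following sketch returns the
letters of a token such as "_F_B_I", or nothing if the token is not
written in this notation; the function name is illustrative, and the
pattern assumes that each underscore is followed by exactly one
letter.

```python
import re
from typing import List, Optional

# One or more repetitions of "underscore followed by a single letter".
_LETTERS = re.compile(r"(?:_[^\W\d_])+")


def spelled_letters(token: str) -> Optional[List[str]]:
    """Return the letters of an underscore-marked token, else None."""
    if _LETTERS.fullmatch(token):
        return token[1:].split("_")
    return None
```

An acronym arrives as one token with several letters, whereas a
spelled-out word arrives as several tokens of one letter each, so the
length of the returned list distinguishes the two cases.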
Note that the underscore notation does not apply to the Mandarin
transcript data.