Documentation for SGML Usage in 1997 HUB-4 Transcript Data
----------------------------------------------------------

This file explains how to obtain and use a standard SGML parser
utility for treating the transcription data provided by the LDC.


AVAILABILITY OF SGML UTILITY SOFTWARE
-------------------------------------

A public domain package of SGML utilities, software library
functions and documentation is available via anonymous ftp as
follows:

    connect to:       ftp.jclark.com
    go to directory:  pub/sp

In this directory you will find a Gnu-zip'd tar file containing the
latest version of the source code package, as well as
platform-specific subdirectories containing pre-compiled programs
and software libraries (for DOS, OS/2, Macintosh, Windows-95, DEC
Alpha, i386-linux, and various SunOS configurations).  Complete
documentation is included in all packages.

Access to the packages, as well as additional information about the
utilities and the standards that they comply with, is also available
via the World Wide Web, at the URL:

    http://www.jclark.com/sp/

This package, created and maintained by James Clark, is the one used
by the LDC to verify the transcripts and to parse their contents for
internal use.  Other packages are available, both commercial and
public domain, and information about them is available by following
appropriate links from the web page cited above -- in particular,
you will want to follow the link labeled "The SGML Web Page", which
currently points to:

    http://www.sil.org/sgml/sgml.html

and browse from there.


VALIDATION OF SGML TRANSCRIPT FILES
-----------------------------------

All of the transcript files in this release have been tested in
combination with the DTD files provided here, using James Clark's
"nsgmls" parser, and the LDC has verified that all files are
syntactically correct and valid with respect to their SGML markup.
The SGML syntax validation consists simply of running the parser on
each transcript file and confirming that it reports no errors.  In
particular, the following command serves to report any errors in the
SGML markup of a given transcript file, and will produce no output
if the file is fully conformant with the given DTD (the
"first-level" tokenization in this case):

    nsgmls -s h4_level1.dtd trans_file.sgml

We have also checked that the lexical tokenization and token
classification that result from parsing with the "first-level" DTD
(in file "h4_level1.dtd") are consistent with the word segmentation
and token marking established by the transcribers.


DIFFERENT LEVELS OF SGML PARSING
--------------------------------

This release of transcript data includes three different versions of
the DTD for parsing the transcripts.  Their behavior is summarized
as follows.

Level 1 (h4_level1.dtd) :

  - provides full lexical tokenization, and identifies lexical
    tokens with respect to these token classes:

        "PNAME"      (proper names)
        "NONLEXEME"  (filled pauses "um", etc)
        "NONSPEECH"  (cough, laugh, sneeze, breath, etc)
        "WORD"       (common lexical items)

  - translates punctuation and spaces into SGML units:

        "PERIOD"  "COMMA"  "QMARK"  "SEPARATOR"

Level 2 (h4_level2.dtd) :

  - same as level 1, except that token classification is not
    performed; instead, the characters in the transcript that flag
    these token classes are left intact in the parser output --
    i.e.:

        "^Name"        for proper names
        "%um" etc      for filled pauses
        "{cough}" etc  for non-speech sounds

    These three classes of tokens are simply output as "WORD"
    elements, just like common lexical items.

Level 3 (h4_level3.dtd) :

  - reproduces the text content of the transcript without further
    analysis or processing -- that is, the text content comes out
    exactly as it appears in the original SGML file.

All three levels of parsing will have the same effect with regard to
rendering the SGML tags and attribute values of the original SGML
file.
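As a rough illustration of how the level-2 flag characters relate to
the level-1 token classes, the following Python sketch maps a token
string as it appears in level-2 output to the class that level-1
parsing would be expected to assign.  The function name
"classify_token" is hypothetical and is not part of the LDC release.
The rules are taken only from the flag conventions listed above, and
the stripping of the flag characters is an assumption for every
class except proper names.

```python
def classify_token(token):
    """Guess the level-1 class of a token taken from level-2 output.

    Returns a (class_name, bare_text) pair.  The rules follow only the
    flag conventions described in this file; anything unflagged is
    treated as a common lexical item.
    """
    if token.startswith("^"):
        # "^Name" flags a proper name.
        return ("PNAME", token[1:])
    if token.startswith("%"):
        # "%um" flags a filled pause.
        return ("NONLEXEME", token[1:])
    if len(token) >= 2 and token.startswith("{") and token.endswith("}"):
        # "{cough}" flags a non-speech sound.
        return ("NONSPEECH", token[1:-1])
    return ("WORD", token)
```

A caller holding level-2 output could apply this to the text of each
"WORD" element to approximate the level-1 classification without
re-running the parser.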
The particular output format produced by James Clark's "nsgmls"
parser is described in documentation provided with his distribution
packages, and some brief examples are discussed below.

Please note that the "level-1" DTD represents the LDC's "official
reference" DTD for this collection.  The other levels have been
included to give users some examples of alternative processing, and
to serve as possible conveniences; the LDC does not make any
specific recommendations or commitments regarding the use or support
of the level-2 and level-3 DTD files.


USING THE SGML PARSER TO PROCESS TRANSCRIPT FILES
-------------------------------------------------

The following "typical" command will produce parsed transcript data
on stdout:

    nsgmls dtd_file trans_file.sgml

In order to illustrate the parser output format, and to demonstrate
the differences among the three levels of parsing (using the three
different DTD files), a brief excerpt of transcript data
(sample.sgml) has been provided, together with the results of each
level of parsing.

The basic structure of the parser output is a series of lines in
which the first character of each line identifies the nature of the
information on that line.  The line-initial character is one of:

    A -- line contains an attribute name and value (i.e. derived
         from an "attrib_name=attrib_value" string contained within
         an SGML tag); these attribute lines always appear PRIOR to
         the name of the tag that contains them

    ( -- line contains an SGML element name, and defines the start
         of an instance of the named element (e.g. TURN)

    ) -- line contains an SGML element name, and defines the end of
         an instance of the named element (e.g. TURN)

    - -- line contains text data

    C -- appears as the last line of output and indicates that the
         parser found the document to be conforming; it has no
         further content

Note that some SGML elements (such as TIME, PERIOD, COMMA) are
defined as "empty" elements -- that is, they contain no text data.
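The line types listed above can be read with a few lines of Python.
This is only a sketch: "parse_nsgmls_output" is a hypothetical
helper, not something supplied with the parser or with this release.
It assumes the parser output has already been captured as a string,
it handles only the line types described in this file, and it does
not decode any escape sequences in text data.

```python
def parse_nsgmls_output(text):
    """Convert nsgmls output lines into a list of event tuples."""
    events = []
    pending_attrs = {}  # attribute lines seen since the last "(" line
    for line in text.splitlines():
        if not line:
            continue
        kind, rest = line[0], line[1:]
        if kind == "A":
            # e.g. "ASEC CDATA 4.773" -> name "SEC", value "4.773"
            parts = rest.split(" ", 2)
            name = parts[0]
            pending_attrs[name] = parts[2] if len(parts) > 2 else ""
        elif kind == "(":
            # Attributes precede the start line of their element.
            events.append(("start", rest, pending_attrs))
            pending_attrs = {}
        elif kind == ")":
            events.append(("end", rest))
        elif kind == "-":
            events.append(("text", rest))
        elif kind == "C":
            # The C line carries no content and closes the output.
            events.append(("document_end",))
    return events
```

Because attribute lines come before the start line of the element
they belong to, the sketch collects them in "pending_attrs" and
attaches them to the next "(" line.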
For empty elements, the parser output will have a line with initial
"(" followed immediately by a line with the corresponding initial
")".

As a brief illustration, the following snippet of SGML transcription
represents one speaker turn, which begins with a proper name, and
includes a region of overlapping speech with a subsequent turn from
another speaker -- the words spoken by this speaker during the
overlap period were not transcribed because it was not readily clear
what was said:

Below we present, side by side, the output of the parser using the
level-1 and level-2 DTD files, respectively; note that the only
difference in this case is the classification of the proper name
token (as PNAME versus WORD), and the retention of the token flag
character "^" for the proper name in the level-2 output:

    LEVEL-1                     LEVEL-2
    ----------------------      ----------------------
    ASPEAKER CDATA unknown      ASPEAKER CDATA unknown
    ASEX TOKEN MALE             ASEX TOKEN MALE
    ASTARTTIME CDATA 4.773      ASTARTTIME CDATA 4.773
    AENDTIME CDATA 6.841        AENDTIME CDATA 6.841
    (TURN                       (TURN
    ASEC CDATA 4.773            ASEC CDATA 4.773
    (TIME                       (TIME
    )TIME                       )TIME
    (PNAME                      (WORD
    -Paco                       -^Paco
    )PNAME                      )WORD
    (SEPARATOR                  (SEPARATOR
    )SEPARATOR                  )SEPARATOR
    (WORD                       (WORD
    -pues                       -pues
    )WORD                       )WORD
    (SEPARATOR                  (SEPARATOR
    )SEPARATOR                  )SEPARATOR
    (WORD                       (WORD
    -ya                         -ya
    )WORD                       )WORD
    (SEPARATOR                  (SEPARATOR
    )SEPARATOR                  )SEPARATOR
    (WORD                       (WORD
    -el                         -el
    )WORD                       )WORD
    (SEPARATOR                  (SEPARATOR
    )SEPARATOR                  )SEPARATOR
    (WORD                       (WORD
    -fin                        -fin
    )WORD                       )WORD
    (SEPARATOR                  (SEPARATOR
    )SEPARATOR                  )SEPARATOR
    (WORD                       (WORD
    -de                         -de
    )WORD                       )WORD
    (SEPARATOR                  (SEPARATOR
    )SEPARATOR                  )SEPARATOR
    (WORD                       (WORD
    -semana                     -semana
    )WORD                       )WORD
    (PERIOD                     (PERIOD
    )PERIOD                     )PERIOD
    ASTARTTIME CDATA 5.966      ASTARTTIME CDATA 5.966
    AENDTIME CDATA 6.841        AENDTIME CDATA 6.841
    (OVERLAP                    (OVERLAP
    ASEC CDATA 5.966            ASEC CDATA 5.966
    (TIME                       (TIME
    )TIME                       )TIME
    (UNCLEAR                    (UNCLEAR
    )UNCLEAR                    )UNCLEAR
    )OVERLAP                    )OVERLAP
    )TURN                       )TURN

Below is the output for the same snippet using the level-3 DTD; note
that, in addition to retaining the token flag
character "^", this version also retains the space and punctuation
characters intact, but translates new-line characters (0x0a) into
the character sequence "\n".  (This is a documented feature of the
parser, not an artifact of the DTD.  The reason why new-line
characters are not treated this way in the level-1 and level-2
parsing is that those DTDs redefine the new-line character as one of
the forms of the "SEPARATOR" element.)

    ASPEAKER CDATA unknown
    ASEX TOKEN MALE
    ASTARTTIME CDATA 4.773
    AENDTIME CDATA 6.841
    (TURN
    ASEC CDATA 4.773
    (TIME
    )TIME
    -\n ^Paco pues ya el fin de semana.\n
    ASTARTTIME CDATA 5.966
    AENDTIME CDATA 6.841
    (OVERLAP
    ASEC CDATA 5.966
    (TIME
    )TIME
    -\n
    (UNCLEAR
    )UNCLEAR
    )OVERLAP
    )TURN

While the parser output is generally bulkier in appearance, it is
no more difficult to handle than the original SGML text -- and for
some purposes is actually simpler -- when it comes to automatic
filtering of the transcript data (e.g. eliminating overlap regions
or speech in a non-target language).


SPECIAL TRANSCRIPT NOTATIONS NOT TREATED BY SGML PARSING
--------------------------------------------------------

As of this release, there are three notations used in the SGML
transcript files that are not treated in any way by the SGML parser
-- they are passed through to the parser output "as is".  These are:

    (1) Extended pauses or breaks within a speaker's turn ("[[NS]]")
    (2) Word fragment indicators (hyphens)
    (3) Spoken alphabet letters (initials or acronyms: _M, _F_B_I)

An extended pause within a turn is marked whenever a speaker stops
talking for a period of two seconds or more and then begins talking
again, continuing with the same topical "section" of the broadcast
(i.e. a news story or a "filler" of miscellaneous content).  The
period of time during which the speaker is not talking may be filled
with background noise, music, or silence.  This is noted in the SGML
file by a sequence like the following:

    text of what is said before the break