SGML Parsing Documentation for the UN Parallel Texts
Version 1.0

by David Graff
Linguistic Data Consortium
University of Pennsylvania
March 3, 1994

All of the UN document files in this corpus have been run through an SGML parser to verify their compliance with SGML. The parser used for verification was James Clark's program "sgmls". Parsing each document file required two supplemental files, both provided in this directory:

    sgmls.hdr
    unptsgml.dtd

The first of these files is a single line to be included at the beginning of each document file for input to the parser; it provides the name of the DTD file needed to define the particular markup used here for UN texts. The second file is the DTD itself.

The procedure used to run the parser on each document was:

    cat sgmls.hdr document.file | sgmls -s -f logfile

where "document.file" represents the name of the text file to be parsed. The -s flag suppresses output of the parsed data, and the -f flag directs all parsing errors to be written into "logfile". A document was considered "fully compliant" when no errors were reported.

In addition to the DTD, there is a character-set specification, stored in the file "unptsgml.chr". A brief listing of all the SGML markup used in the texts is provided in "sgmltags.lst". A more detailed explanation of the usage, meaning, origins and rationale for the various markup conventions is given in "wang2iso.doc", which also describes the character set (ISO 8859-1 "Latin1") in a more human-readable form.

For those who would prefer the data in a more "raw" form, and have no need for the SGML markup, there is a sample script, suitable for use with the UNIX "sed" utility, that will eliminate (or replace, as appropriate) all the SGML markup in the text files.
So, using a UNIX system (or equivalent), the SGML text can be transformed to "raw" text by means of the following:

    sed -f rm_sgml.sed document.file > raw.file

(Under SunOS 4.1.3, this process also eliminates empty lines.)

For those who would like to do this, but are unable to use sed, the script is repeated below, with an explanation of its effects:

    1,14d               # delete the first 14 lines ("" and initial
                        # "") structure
    /filenam>/d         # delete the content of any remaining ""
    /str.[0-3]>/d       # structures later in the document (some larger
    /nums.[0-3]>/d      # files were split up for storage at the UN; the
                        # 's of each piece have been retained)
    s/<[./a-z]*>//g     # replace all occurrences of "<...>" with nothing
                        # ("..." includes only ".", "/" and lower-case)
    /^/g                # their normal representation
    s/<//d              # delete the last line of the file