File Formats and Processing

This file describes the file formats in more detail, as well as the text
processing operations used to create the data in this release. Users of the
data should note that each stage contains the potential for errors, so some
noise must be expected in the end. In particular, the "body text" may
occasionally include non-text information such as tables and final credits,
and a few nonprinting characters may have slipped through, despite several
steps taken to prevent this.

I. SGML markup and sentence tagging

We used several tools graciously provided by BBN to convert the Primary
Source Media CDROMs to one-month chunks of material tagged with rough SGML
markup, adapting the last stage considerably to handle transcriber comments
and other problems. We produced a similar pipeline for the May/June 1996
material that came to us on floppy disks. Both versions were then run
through a sentence tagger (included with this release) that propagated the
document IDs to the paragraph tags and divided the text at approximate
sentence boundaries. The resulting files were then divided into training
and test (eval) material.

Finally, some common character-level problems (e.g. high-bit or control
characters) were corrected by the "tr-bn-char" script, and a few relatively
infrequent problems were corrected by hand in emacs:

  - fix less-common or unpredictable character-level problems
  - fix dollar signs in odd contexts
  - replace expletives transcribed with underscores with likely words
  - remove a single garbage article from bn9502.199507.testset
  - remove a chunk of malformatted material from the postscript of an
    article in bn9604 (several articles merged together, with bad headers)

The result of this process has the following schematic form:

  [article header]
    network, show name, etc.
    headline, one-sentence summary, subject headings, etc.
  [article body]
    Speaker name:  OR  transcriber comment  OR
      sentence 1.
      sentence 2.
    OR  {graphic} line of text from apparent table  OR  RWM: my comment
  [article trailer]
    copyright, credits, disclaimer, repeat of one-sentence summary, keywords
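The character-level cleanup step mentioned above can be sketched roughly as
follows. This is a hypothetical Python stand-in for the kind of work the
"tr-bn-char" script does, not the script itself; the specific character
mappings shown are illustrative assumptions, and the actual script's
mappings may differ.

```python
import re

def clean_chars(line: str) -> str:
    """Rough stand-in for "tr-bn-char"-style cleanup: map a few common
    high-bit characters to ASCII equivalents, then strip any remaining
    control characters (the real script's mapping table may differ)."""
    # Hypothetical high-bit fixes: curly quotes and dashes -> plain ASCII.
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en/em dashes
    }
    for bad, good in replacements.items():
        line = line.replace(bad, good)
    # Drop control characters other than tab and newline.
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", line)
```

A pass like this cannot catch everything, which is why the release also
required the manual emacs fixes listed above.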
The CD material lacked explicit markings for the types of information
included in the headers and footers of each article, so the material is
only roughly segmented into header, body, and trailer sections. The
May/June 1996 material came with explicit markings for many of these
fields; we retained these while attempting to divide them into the same
SGML categories used for the CD material. Also, the first line of the
header mimics the [network][show_name] found in the CD material, with the
network in canonicalized form ("@Publisher: WNET" -> "PBS", "@Publisher:
Cable News Network" -> "CNN", etc.), to facilitate sorting the articles
into training and test material.

The document IDs are based on the date plus a consecutive integer for each
file (so "960401.54" would be an article dated April 1, 1996 that came as
article number 54 from the file of April 1996 material). The May/June 1996
material came in several overlapping files, so the docIDs for this material
include the original filename, e.g. "bn9622.rpi_segs.txt.960604.1".

Transcriber comments presented something of a problem. Generally they are
enclosed in square brackets; when such a comment appeared at the beginning
or end of a line, we separated it off as a comment element, while mid-line
comments were retained as-is. This combination served to minimize
interference with other parts of the processing, notably extraction of
speaker IDs and sentence tagging.

Two other classes of comments appear in the text. Those starting with
"{graphic}" hold text that belongs with multi-line graphics such as
on-screen tables. Such tables generally lacked explicit boundaries, so
identification of such material is imprecise. Comments starting with "RWM"
were inserted by hand during the manual corrections. Some contain
descriptions of material deleted by hand; most contain short lines of noisy
text that have been removed. These may still contain the control characters
that flagged the text as noisy in the first place.
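The bracket-comment policy described above can be sketched as follows. This
is a hypothetical illustration of the stated rule (line-initial and
line-final bracketed comments are split off; mid-line ones stay in place),
not the actual code used to prepare the release.

```python
import re

def split_comments(line: str):
    """Hypothetical sketch of the transcriber-comment policy: a
    [bracketed] comment at the start or end of a line becomes a separate
    comment piece, while mid-line comments are left in the text."""
    pieces = []
    # Comment at the start of the line.
    m = re.match(r"\s*(\[[^\]]*\])\s*(.*)", line)
    if m:
        pieces.append(("comment", m.group(1)))
        line = m.group(2)
    # Comment at the end of the remaining line.
    m = re.match(r"(.*?)\s*(\[[^\]]*\])\s*$", line)
    if m and m.group(1):
        text, trailing = m.group(1), m.group(2)
        pieces.append(("text", text))       # mid-line brackets stay put
        pieces.append(("comment", trailing))
    elif line:
        pieces.append(("text", line))
    return pieces
```

For example, "[laughs] He said [sic] yes [applause]" would yield the
leading and trailing comments as separate pieces, with "[sic]" left inside
the text, which keeps bracketed asides from disrupting speaker-ID
extraction and sentence tagging.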
(I tried to remove the minimal amount of affected text in each case.)

Because the input text generally lacked explicit markers for such things as
beginning/end of text, speaker IDs, start/end of graphics, etc., some
errors should be expected in the SGML tagging. For example, a speaker ID
may be identified as text, or actual text may be misidentified as a speaker
ID. Or (less frequently) header or trailer information may appear as body
text. Such errors probably occur with almost all the tags, but we believe
the frequency of such errors to be acceptably low.

Finally, note that although the conditioning tools eliminate articles that
are explicitly marked as repetitions of another article, we have observed
several instances of repeated material in the files. Because of time
constraints, we have not attempted to remove such material systematically.

II. Vocabulary conditioning

The "raw" material described above is fed through a pipeline of perl
scripts (adapted from an original set of text conditioning tools developed
by Doug Paul for the CSR WSJ0 LM corpus) which replace numeric strings with
appropriate lexical strings, spell out common abbreviations, and convert
all punctuation and bracketing to a standard set of "punctuation
vocabulary" tokens, allowing the lexical items adjacent to such characters
to stand alone:

  pare-sgml.perl $file | bugproc.perl | numhack.perl | numproc.perl |
    abbrproc.perl | puncproc.perl > lm/$BASENM

In order to simplify matters, this pipeline begins by removing all SGML
markup external to the body component of each article; in the output of
"strip-st-markup" (as in the final ".vp" file), the only SGML markup
consists of tags that mark:

  - the beginning of an article unit
  - the beginning of paragraph n
  - sentence boundaries
  - the end of a paragraph
  - the end of an article unit

As with the output of (I), each sentence, like each instance of the tags
listed above, is placed alone on one line.

The LM pipeline is mostly unchanged from the 1995 CSRLM version. However, a
few changes were necessary:

  - The new "numhack" module handles zip codes and telephone numbers,
    which are much more common in broadcast material than they were in
    newswire text.

  - The pipeline used to change all hyphens in the input to " -HYPHEN "
    by the end. This is a problem for speech transcriptions, where

        word-word    word- word    word -word    word - word

    may all mean different things. So the pipeline was modified to
    preserve most of this distinction: the first three cases pass through
    as-is, while the fourth case is changed to

        word --DASH word

  - Some minor bugs in "numproc" were corrected.

  - The abbreviation processor "abbrproc" was optimized to run about 50%
    faster, without any output changes other than those described above.

The various programs in the conditioning pipeline produce error messages on
stderr to report (portions of) sentences that cannot be handled properly;
such sentences end up with incomplete or incorrect conditioning in the
final ".vp" file (e.g. leaving residual digits or punctuation). We lacked
the time to correct these errors manually.

  Robert MacIntyre
  Linguistic Data Consortium
  August 1996