marks beginning of article unit
marks beginning of paragraph n
marks sentences
marks end of paragraph
marks end of article unit
As with the output of (I), each sentence, like each instance of the
tags shown above, is placed alone on one line.
The LM pipeline is mostly unchanged from the 1995 CSRLM version.
However, a few changes were necessary:
- The new "numhack" module handles zip codes and telephone numbers,
which are much more common in broadcast material than they were in
newswires.
- The pipeline used to change all hyphens in the input to " -HYPHEN " by
the end of processing. This is a problem for speech transcriptions, where
word-word
word- word
word -word
word - word
may all mean different things. So the pipeline was modified to
preserve most of this distinction -- the first three cases pass
through "as is", while the fourth case is changed to
word --DASH word
- Some minor bugs in "numproc" were corrected.
- The abbreviation processor "abbrproc" was optimized to run about 50%
faster, without any output changes other than those described above.
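The hyphen rule described above can be illustrated with a short Python
sketch. This is not the pipeline's actual code; the function name
"condition_hyphens" and the regular expression are assumptions made for
illustration only. It rewrites a hyphen that has whitespace on both sides
as "--DASH" and leaves the other three forms untouched. A hyphen at the
very start or end of a line is also left alone here, which may or may not
match the real pipeline.

```python
import re

# Hypothetical pattern: a single "-" with whitespace on both sides.
# Lookarounds keep the surrounding whitespace intact.
_LONE_HYPHEN = re.compile(r"(?<=\s)-(?=\s)")


def condition_hyphens(line: str) -> str:
    """Rewrite 'word - word' as 'word --DASH word'.

    'word-word', 'word- word' and 'word -word' pass through unchanged.
    """
    return _LONE_HYPHEN.sub("--DASH", line)


if __name__ == "__main__":
    for sample in ("word-word", "word- word", "word -word", "word - word"):
        print(repr(sample), "->", repr(condition_hyphens(sample)))
```

Using lookarounds rather than matching the spaces themselves means that
consecutive cases such as "a - b - c" are both rewritten in one pass.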
The various programs in the conditioning pipeline produce error
messages on stderr to report (portions of) sentences that cannot be
handled properly; such sentences end up with incomplete or incorrect
conditioning in the final "vp" file (e.g. leaving residual digits or
punctuation). We lacked the time to correct these errors manually.
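Users who want to locate incompletely conditioned sentences themselves
could scan the conditioned text for residual digits. The following Python
sketch is one way to do this; it is not part of the distributed pipeline,
and the function name "lines_with_residual_digits" is hypothetical. It
checks only for ASCII digits, since tokens such as "--DASH" legitimately
contain punctuation characters and a punctuation check would need a list
of allowed tokens that is not given here.

```python
import re
from typing import Iterable, Iterator, Tuple

# Any ASCII digit left in conditioned text suggests incomplete processing.
_DIGIT = re.compile(r"[0-9]")


def lines_with_residual_digits(
    lines: Iterable[str],
) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) for each line that still contains a digit."""
    for lineno, line in enumerate(lines, start=1):
        if _DIGIT.search(line):
            yield lineno, line.rstrip("\n")


if __name__ == "__main__":
    import sys

    # Reads conditioned text on stdin and reports suspect lines on stdout.
    for lineno, text in lines_with_residual_digits(sys.stdin):
        print(f"{lineno}: {text}")
```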
Robert MacIntyre
Linguistic Data Consortium
August 1996