File Formats and Processing

This file describes the file formats in more detail, as well as the text
processing operations used to create the data in this release. Users of the
data should note that each stage contains the potential for errors, so some
noise must be expected in the end. In particular, the "body text" may
occasionally include non-text information such as tables and final credits,
and a few nonprinting characters may have slipped through, despite several
steps taken to prevent this.

I. SGML markup and sentence tagging

We used several tools graciously provided by BBN to convert the Primary
Source Media CDROMs to one-month chunks of material tagged with rough SGML
markup, adapting the last stage considerably to handle transcriber comments
and other problems. We produced a similar pipeline for the May/June 1996
material that came to us on floppy disks. Both versions were then run
through a sentence tagger (included with this release) that propagated the
document IDs to the paragraph tags and divided the text at approximate
sentence boundaries. The resulting files were then divided into training
and test (eval) material.

Finally, some common character-level problems (e.g. high-bit or control
characters) were corrected by the "tr-bn-char" script, and a few relatively
infrequent problems were corrected by hand in emacs:

  - fix less-common or unpredictable character-level problems
  - fix dollar signs in odd contexts
  - replace expletives transcribed with underscores with likely words
  - remove a single garbage article from bn9502.199507.testset
  - remove a chunk of malformatted material from the postscript of an
    article in bn9604 (several articles merged together, with bad headers)

The result of this process has the following schematic form:

  [article header]
    network, show name, etc.
    headline, one-sentence summary, subject headings, etc.
  [article body]
    Speaker name:  OR  transcriber comment  OR
      sentence 1.
      sentence 2.
    OR  {graphic} line of text from apparent table  OR  RWM: my comment
  [article trailer]
    copyright, credits, disclaimer, repeat of one-sentence summary, keywords
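The character-level cleanup step mentioned above can be sketched roughly as
follows. This is a hypothetical Python stand-in for the kind of work the
"tr-bn-char" script does, not the script itself; the specific character
mappings shown are illustrative assumptions, and the actual script's
mappings may differ.

```python
import re

def clean_chars(line: str) -> str:
    """Rough stand-in for "tr-bn-char"-style cleanup: map a few common
    high-bit characters to ASCII equivalents, then strip any remaining
    control characters (the real script's mapping table may differ)."""
    # Hypothetical high-bit fixes: curly quotes and dashes -> plain ASCII.
    replacements = {
        "\u2018": "'", "\u2019": "'",   # curly single quotes
        "\u201c": '"', "\u201d": '"',   # curly double quotes
        "\u2013": "-", "\u2014": "-",   # en/em dashes
    }
    for bad, good in replacements.items():
        line = line.replace(bad, good)
    # Drop control characters other than tab and newline.
    return re.sub(r"[\x00-\x08\x0b-\x1f\x7f]", "", line)
```

A pass like this cannot catch everything, which is why the release also
required the manual emacs fixes listed above.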
The CD material lacked explicit markings for the types of information
included in the headers and footers of each article, so the material is
only roughly segmented into header, body, and trailer sections. The
May/June 1996 material came with explicit markings for many of these
fields; we retained these while attempting to divide them into the same
SGML categories used for the CD material. Also, the first line of the
header mimics the [network][show_name] found in the CD material, with the
network in canonicalized form ("@Publisher: WNET" -> "PBS", "@Publisher:
Cable News Network" -> "CNN", etc.), to facilitate sorting the articles
into training and test material.

The document IDs are based on the date plus a consecutive integer for each
file (so "960401.54" would be an article dated April 1, 1996 that came as
article number 54 from the file of April 1996 material). The May/June 1996
material came in several overlapping files, so the docIDs for this material
include the original filename, e.g. "bn9622.rpi_segs.txt.960604.1".

Transcriber comments presented something of a problem. Generally they are
enclosed in square brackets; when such a comment appeared at the beginning
or end of a line, we separated it off as a comment element, while mid-line
comments were retained as-is. This combination served to minimize
interference with other parts of the processing, notably extraction of
speaker IDs and sentence tagging.

Two other classes of comments appear in the text. Those starting with
"{graphic}" hold text that belongs with multi-line graphics such as
on-screen tables. Such tables generally lacked explicit boundaries, so
identification of such material is imprecise. Comments starting with "RWM"
were inserted by hand during the manual corrections. Some contain
descriptions of material deleted by hand; most contain short lines of noisy
text that have been removed. These may still contain the control characters
that flagged the text as noisy in the first place.
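The bracket-comment policy described above can be sketched as follows. This
is a hypothetical illustration of the stated rule (line-initial and
line-final bracketed comments are split off; mid-line ones stay in place),
not the actual code used to prepare the release.

```python
import re

def split_comments(line: str):
    """Hypothetical sketch of the transcriber-comment policy: a
    [bracketed] comment at the start or end of a line becomes a separate
    comment piece, while mid-line comments are left in the text."""
    pieces = []
    # Comment at the start of the line.
    m = re.match(r"\s*(\[[^\]]*\])\s*(.*)", line)
    if m:
        pieces.append(("comment", m.group(1)))
        line = m.group(2)
    # Comment at the end of the remaining line.
    m = re.match(r"(.*?)\s*(\[[^\]]*\])\s*$", line)
    if m and m.group(1):
        text, trailing = m.group(1), m.group(2)
        pieces.append(("text", text))       # mid-line brackets stay put
        pieces.append(("comment", trailing))
    elif line:
        pieces.append(("text", line))
    return pieces
```

For example, "[laughs] He said [sic] yes [applause]" would yield the
leading and trailing comments as separate pieces, with "[sic]" left inside
the text, which keeps bracketed asides from disrupting speaker-ID
extraction and sentence tagging.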
(I tried to remove the minimal amount of affected text in each case.)

Because the input text generally lacked explicit markers for such things as
beginning/end of text, speaker IDs, start/end of graphics, etc., some
errors should be expected in the SGML tagging. For example, a speaker ID
may be identified as text, or actual text may be misidentified as a speaker
ID. Or (less frequently) header or trailer information may appear as body
text. Such errors probably occur with almost all the tags, but we believe
the frequency of such errors to be acceptably low.

Finally, note that although the conditioning tools eliminate articles that
are explicitly marked as repetitions of another article, we have observed
several instances of repeated material in the files. Because of time
constraints, we have not attempted to remove such material systematically.

II. Vocabulary conditioning

The "raw" material described above is fed through a pipeline of perl
scripts (adapted from an original set of text conditioning tools developed
by Doug Paul for the CSR WSJ0 LM corpus) which replace numeric strings with
appropriate lexical strings, spell out common abbreviations, and convert
all punctuation and bracketing to a standard set of "punctuation
vocabulary" tokens, allowing the lexical items adjacent to such characters
to stand alone:

  pare-sgml.perl $file | bugproc.perl | numhack.perl | numproc.perl |
    abbrproc.perl | puncproc.perl > lm/$BASENM

In order to simplify matters, this pipeline begins by removing all SGML
markup external to the body component of each article; in the output of
"strip-st-markup" (as in the final ".vp" file), the only SGML markup
consists of tags that mark:

  - the beginning of an article unit
  - the beginning of paragraph n
  - sentence boundaries
  - the end of a paragraph
  - the end of an article unit

As with the output of (I), each sentence, like each instance of the tags
listed above, is placed alone on one line.

The LM pipeline is mostly unchanged from the 1995 CSRLM version. However, a
few changes were necessary:

  - The new "numhack" module handles zip codes and telephone numbers,
    which are much more common in broadcast material than they were in
    newswire text.

  - The pipeline used to change all hyphens in the input to " -HYPHEN "
    by the end. This is a problem for speech transcriptions, where

        word-word    word- word    word -word    word - word

    may all mean different things. So the pipeline was modified to
    preserve most of this distinction: the first three cases pass through
    as-is, while the fourth case is changed to

        word --DASH word

  - Some minor bugs in "numproc" were corrected.

  - The abbreviation processor "abbrproc" was optimized to run about 50%
    faster, without any output changes other than those described above.

The various programs in the conditioning pipeline produce error messages on
stderr to report (portions of) sentences that cannot be handled properly;
such sentences end up with incomplete or incorrect conditioning in the
final ".vp" file (e.g. leaving residual digits or punctuation). We lacked
the time to correct these errors manually.

  Robert MacIntyre
  Linguistic Data Consortium
  August 1996