Introduction to 1996 CSR LM 1996 Release

This 2-CDROM set contains data from transcribed news broadcasts, designated for use in the baseline language model (LM) for the 1996 CSR Hub 4 Evaluation.

We obtained the bulk of the data from "Broadcast News" CDROMs produced by Primary Source Media, Inc. This portion covers the period from January 1992 to April 1996, and contains approximately 1 gigabyte of data uncompressed. We also received about 36 megabytes of material on floppy disks covering the period from late May through June 1996. This material has a somewhat different internal format and a different file naming scheme (see below).

The text data are present in two forms:

    (1) a relatively unprocessed ("raw" or "sentence-tagged") form, and
    (2) a fully processed ("conditioned", "verbalized-punctuation") form.

The "raw" form includes the header and footer information accompanying the articles, such as network, show name, headline, copyright, credits, and so forth. The text and ancillary data are presented in a fairly consistent (though simple) SGML format. The "processed" form contains only the text content of the articles, together with SGML tags to mark the boundaries of articles, paragraphs and sentences.

Material from reserved eval shows is separated as "test" material, with the remainder identified as "train". This distinction appears both in the directory structure and in the filenames. For example, the file "csrlm96_h4_2/HUB4_LM/st_test/bn9604ts.stZ" contains:

    bn    - Broadcast News data
    9604  - from April 1996
    ts    - test material (also shown by "st_test" dirname)
    st    - "sentence tagged" (or "raw") data (ditto)
    Z     - data compressed with Unix "compress"

Similarly, "csrlm96_h4_1/HUB4_LM/vp_train/bn9604tr.vpZ" contains compressed, conditioned training material from the same month. This naming scheme sacrifices the Unix ".Z" compressed-file extension in favor of including more information within the 8.3 filenames permitted on CDROMs.

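The monthly naming scheme described above can be decoded mechanically. The following is a minimal sketch in Python, not one of the tools shipped on the discs: the function name parse_bn_filename, the BNFile record and its field names are hypothetical, and the code assumes that two-digit years belong to the 1900s (which holds for the 1992-1996 range of this collection) and that "tr" marks training material as in the "bn9604tr.vpZ" example.

```python
import re
from typing import NamedTuple


class BNFile(NamedTuple):
    """Fields decoded from a monthly filename (hypothetical record type)."""
    source: str       # "bn" = Broadcast News
    year: int         # four-digit year, assuming 19xx
    month: int        # 1..12
    partition: str    # "test" or "train"
    form: str         # "sentence-tagged" or "conditioned"
    compressed: bool  # True when the name ends in "Z"


# bn + YYMM + ts|tr + "." + st|vp + optional Z, e.g. bn9604ts.stZ
_MONTHLY = re.compile(r"^(bn)(\d{2})(\d{2})(ts|tr)\.(st|vp)(Z?)$")

_PARTITIONS = {"ts": "test", "tr": "train"}
_FORMS = {"st": "sentence-tagged", "vp": "conditioned"}


def parse_bn_filename(path: str) -> BNFile:
    """Decode a monthly Broadcast News filename; the directory part is ignored."""
    name = path.rsplit("/", 1)[-1]
    match = _MONTHLY.match(name)
    if match is None:
        raise ValueError(f"not a monthly Broadcast News filename: {name!r}")
    source, yy, mm, part, form, z = match.groups()
    month = int(mm)
    if not 1 <= month <= 12:
        raise ValueError(f"invalid month in filename: {name!r}")
    # Assumption: every two-digit year in this collection is 19xx.
    return BNFile(source, 1900 + int(yy), month,
                  _PARTITIONS[part], _FORMS[form], z == "Z")


if __name__ == "__main__":
    print(parse_bn_filename("csrlm96_h4_2/HUB4_LM/st_test/bn9604ts.stZ"))
```

The sketch deliberately rejects names that do not fit the monthly pattern, so the May/June 1996 files, which follow a different scheme, raise ValueError instead of being misread.
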
Since the May/June 1996 material arrived in several files with overlapping time periods, these files are not divided by month. Instead, we have retained the file divisions that we received, translating the filenames as follows:

    bn622btr.stZ  contains compressed, "raw", training material
                  from 9622/rpi_segs.txt
    bn623ats.vpZ  contains compressed, conditioned, test material
                  from 9623/rpi.txt

Thus the "a" files come from files originally called "rpi.txt", while the "b" files come from files called "rpi_segs.txt", in case that distinction is important. The "9622" names look like they may be related to production date or the like, but each file appears to contain an overlapping range of broadcast dates.

The Unix tarfile "utils.tar" contains perl scripts and C source code for various processes that were used in preparing the collection. The text file "process.doc" describes the text preparation process, and "st2lm.log" contains the error reports produced by the conditioning pipeline used to create the "vp" version from the "st" files.

Robert MacIntyre
Linguistic Data Consortium
August 27, 1996