Top-Level README file for CSR LM-1

This file explains the general layout and content of the CSR LM-1
corpus.  This corpus contains the current Standard Language Model (LM)
for the Continuous Speech Recognition (CSR) Project, which is
sponsored by the Advanced Research Projects Agency of the United
States (ARPA).  In addition to the LM, this corpus contains all text
data and software used to create the LM.


GENERAL OVERVIEW OF CONTENTS
----------------------------

The table below summarizes the contents of this release, in terms of
the top-level directories in the cdrom set:
	
 Dir		Content
---------------------------------------------------------------------

 st		sentence-tagged, unprocessed form of all text data
		used to generate the LM

 vp		verbalized-punctuation, processed form of all text
		data used to generate the LM

 lm		the files comprising and supporting the Language Model
		itself

 setaside	text extracted from the LM data sources that has been
		used, or reserved for use, as prompting text for
		acoustic testing data; the text data in this directory
		has not been (and should not be) used in building the
		LM.

 logs		error reports from the preparation of text data into
		"st" and "vp" forms

 tx_utils	software and scripts used to create the "st" and "vp"
		data from original sources, plus documentation;
		developed by Doug Paul (MIT), David Graff (LDC) and
		Roni Rosenfeld (CMU)

 lm_utils	software and scripts used to generate the LM, plus
		documentation and examples; developed by Roni
		Rosenfeld

Three of these directories, "logs", "tx_utils" and "lm_utils", are
replicated on all cdroms in this release, as is this README.1ST file.


SUMMARY OF TEXT DATA IN THIS RELEASE
------------------------------------

 -- The LM training texts

The "st" and "vp" directories have a parallel structure of
subdirectories and files.  The subdirectories divide the data
primarily by source, and secondarily by year of origin.  The table
below lists the subdivisions:

Source  Period    Original Publication(s)
-----------------------------------------

  AP    1989      TIPSTER Vol. 1, version 2 (1994)
        1988      TIPSTER Vol. 2, version 2 (1994)
        1990      TIPSTER Vol. 3, version 2 (1994)

  SJM   1991      TIPSTER Vol. 3, version 2 (1994)

  WSJ   1987-9    ACL/DCI (1991) and CSR WSJ0 (1993)
        1990-2*   TIPSTER Vol. 2, version 2 (1994)
        1992-4*   (to be published)

* The WSJ 1992 collection on TIPSTER covers only the first three
months of the year.  Data from the remaining months of 1992 were made
available to the LDC, along with all of 1993 and the first three
months of 1994, in June 1994; these later materials will be published
by LDC on cdrom, in the common TIPSTER SGML format, in the near
future.

Both the "ap" and "wsj" directories under "st" and "vp" contain
subdirectories whose two-digit names represent the years of origin; in
the case of "wsj", the two sets of 1992 data (one from TIPSTER and the
other from the more recent DJIS archives) are kept distinct as "92"
and "92b".  (Since only one year of SJM data is present, there are no
subdirectories under "sjm".)
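
The layout just described can be sketched as a short Python helper
that lists the subdirectories one would expect under "st" or "vp".
This is only an illustration: the year lists are read off the table
above, the function name is hypothetical, and the actual names on the
discs should be checked against a directory listing.

```python
from pathlib import PurePosixPath

# Year subdirectories per source, as read from the table above.
# ASSUMPTION: these are inferred from this README, not from the discs.
YEAR_DIRS = {
    "ap": ["88", "89", "90"],
    "wsj": ["87", "88", "89", "90", "91", "92", "92b", "93", "94"],
    "sjm": [],  # only one year (1991), so no year subdirectories
}


def expected_dirs(form):
    """Return the expected relative directories for "st" or "vp"."""
    if form not in ("st", "vp"):
        raise ValueError('form must be "st" or "vp"')
    dirs = []
    for source, years in YEAR_DIRS.items():
        base = PurePosixPath(form) / source
        if years:
            # One subdirectory per year of origin ("92b" is the second
            # 1992 WSJ set, from the DJIS archives).
            dirs.extend(str(base / year) for year in years)
        else:
            dirs.append(str(base))
    return dirs
```

Because "st" and "vp" have a parallel structure, calling the helper
with either name yields the same list apart from the leading
directory.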

With regard to the organization and naming of individual text files,
the AP and the WSJ 90-92 files from TIPSTER are partitioned and named
according to the calendar date on which the text data originated.  The
remaining sources have texts grouped arbitrarily into files of about
one megabyte each, with uniform sequence numbers assigned to
successive files.  In this case, there is no assurance that the data
are in chronological order (indeed, most of these texts may be in
reverse chronological order), and the dates associated with articles
are not recoverable from the CSR markup format alone.  It is possible,
however, to use the CSR article tags to trace back to the original
ACL/DCI or TIPSTER format and determine the dates of the articles.

 -- The "set-aside" texts

The "setaside" directory contains the text material that was
previously extracted from the periods listed above for use in acoustic
testing.  These data are divided according to intended use
(development test versus evaluation test), and then subdivided
according to source and period of origin.

These set-asides were imposed only on the following sets:

  WSJ 87-89  (the WSJ0 and WSJ1 test collections from ACL/DCI)
  WSJ 90-92  (part of the WSJ1 test collection from TIPSTER 2)
  SJM 91     (part of the WSJ1 test collection from TIPSTER 3)
  AP  88,90  (material reserved for testing from TIPSTER 2 & 3)

The AP data have never actually been used for acoustic testing, and
only some of the WSJ and SJM data set aside from TIPSTER 2 & 3 were
used in the acoustic tests for WSJ1.  Still, it was deemed useful to
reserve all data that had been so designated, and withhold it from use
in building the LM.

In addition to these materials, the "setaside" directory also contains
the text data that is being used in the 1994 CSR development test set.
This set includes WSJ text from April 1994, as well as text from the
following sources, all from the period of May 25 to June 8, 1994: New
York Times, Washington Post, Los Angeles Times, Reuters North American
Business News Service.
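
For users who copy the release onto a single disk, the rule that
set-aside text must stay out of LM training can be sketched as a
simple path filter in Python.  This is a minimal illustration, not
part of the released software: it assumes the top-level directory
names are preserved under one root, and the function names and the
file paths in the usage comment are hypothetical.

```python
from pathlib import PurePosixPath

# Only these top-level directories hold LM training text.
TRAINING_DIRS = {"st", "vp"}


def is_training_path(rel_path):
    """True if a path relative to the corpus root lies under st or vp."""
    parts = PurePosixPath(rel_path).parts
    return bool(parts) and parts[0] in TRAINING_DIRS


def select_training(paths):
    """Keep training paths only; "setaside" and all else is dropped."""
    return [p for p in paths if is_training_path(p)]


# Hypothetical usage (these file names are invented for illustration):
#   select_training(["vp/wsj/92b/a", "setaside/dev/b", "st/sjm/c"])
# keeps the first and third paths and drops the set-aside one.
```

Filtering on an allow-list of training directories, rather than
excluding "setaside" by name, also keeps "logs", "lm" and the utility
directories out of the selection.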


SUMMARY OF SOFTWARE IN THIS RELEASE
-----------------------------------

The "tx_utils" directory contains a collection of programs written in
perl and C.  The C programs are provided in source code only, and have
only been tested under SunOS 4.1.3, using the native C compiler on
this system.

In addition to the program files, there are some supporting data files
used by the text processing programs, and a collection of specialized
scripts that can be used in conjunction with one of the perl programs
to replicate all the steps of text preparation that went into this
release.  The perl program involved in this usage is called
"cmdloop.prl", and the scripts that drive it all have a ".cml"
extension in their file names.  The README file in that directory,
along with the "Usage" messages from the various programs, explains
the procedures involved.

The "lm_utils" directory contains the CMU Standard Language Model
Toolkit, version 1.0 (a "beta" release).  This is the package used to
create the LM in this release, using the text data in the "vp"
directory.

The "lm_utils" directory contains source code in C, C-shell scripts,
object files, and complete instructions for installation and use of
the package.  This portion of the LM-1 corpus has been mastered onto
cdrom without conversion of file names to the ISO 9660 limitations,
through the use of the "Rock Ridge" extensions to ISO 9660.  Users on
UNIX systems (which typically support the Rock Ridge extensions)
should be able to install and use the package without difficulty.

Users on other systems (e.g. DOS or Macintosh) will have some
difficulty in identifying some file names and using the package.  All
files will be present and uniquely identified, but names that appear
in directory listings may not match the names given in documentation
files, and C-shell scripts may be of little or no use.

-----------
David Graff
Linguistic Data Consortium
August 1994