Top-Level README file for CSR LM-1

This file explains the general layout and content of the CSR LM-1
corpus.  This corpus contains the current Standard Language Model (LM)
for the Continuous Speech Recognition (CSR) Project, which is
sponsored by the Advanced Research Projects Agency of the United
States (ARPA).  In addition to the LM, this corpus contains all text
data and software used to create the LM.


GENERAL OVERVIEW OF CONTENTS
----------------------------

The table below summarizes the contents of this release, in terms of
the top-level directories in the cdrom set:

Dir        Content
---------------------------------------------------------------------
st         sentence-tagged, unprocessed form of all text data used
           to generate the LM

vp         verbalized-punctuation, processed form of all text data
           used to generate the LM

lm         the files comprising and supporting the Language Model
           itself

setaside   text extracted from the LM data sources that has been
           used, or reserved for use, as prompting text for acoustic
           testing data; the text data in this directory has not
           been (and should not be) used in building the LM.

logs       error reports from the preparation of text data into
           "st" and "vp" forms

tx_utils   software and scripts used to create the "st" and "vp"
           data from original sources, plus documentation;
           developed by Doug Paul (MIT), David Graff (LDC) and
           Roni Rosenfeld (CMU)

lm_utils   software and scripts used to generate the LM, plus
           documentation and examples; developed by Roni Rosenfeld

Three of these directories, "logs", "tx_utils" and "lm_utils", are
replicated on all cdroms in this release, as is this README.1ST file.


SUMMARY OF TEXT DATA IN THIS RELEASE
------------------------------------

-- The LM training texts

The "st" and "vp" directories have a parallel structure of
subdirectories and files.  The subdirectories divide the data
primarily by source, and secondarily by year of origin.
The table below lists the subdivisions:

Source  Period   Original Publication(s)
-----------------------------------------
AP      1989     TIPSTER Vol. 1, version 2 (1994)
        1988     TIPSTER Vol. 2, version 2 (1994)
        1990     TIPSTER Vol. 3, version 2 (1994)
SJM     1991     TIPSTER Vol. 3, version 2 (1994)
WSJ     1987-9   ACL/DCI (1991) and CSR WSJ0 (1993)
        1990-2*  TIPSTER Vol. 2, version 2 (1994)
        1992-4*  (to be published)

* The WSJ 1992 collection on TIPSTER covers only the first three
  months of the year.  Data from the remaining months of 1992 were
  made available to the LDC, along with all of 1993 and the first
  three months of 1994, in June of this year; these later materials
  will be published by LDC on cdrom, in the common TIPSTER SGML
  format, in the near future.

Both the "ap" and "wsj" directories under "st" and "vp" contain
subdirectories whose two-digit names represent the years of origin;
in the case of "wsj", the two sets of 1992 data (one from TIPSTER and
the other from the more recent DJIS archives) are kept distinct as
"92" and "92b".  (Since only one year of SJM data is present, there
are no subdirectories under "sjm".)

With regard to the organization and naming of individual text files,
the AP and the WSJ 90-92 files from TIPSTER are partitioned and named
according to the calendar date on which the text data originated.
The remaining sources have texts grouped arbitrarily into files of
about one megabyte each, with uniform sequence numbers assigned to
successive files.  In this case, there is no assurance that the data
are in chronological order (indeed, most of these texts may be in
reverse chronological order), and the dates associated with articles
are not recoverable from the CSR markup format.  It is possible,
however, to use the CSR article tags to trace back to the original
ACL/DCI or TIPSTER format and determine the dates of the articles.
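
As an illustration of the directory layout described above, the
following Python sketch lists the subdirectories one would expect to
find under "st" or "vp".  It is not part of the released software.
The year lists are inferred from the table of subdivisions rather
than read from the discs, and the names YEAR_DIRS and expected_dirs
are hypothetical, introduced here only for the example.

```python
from pathlib import PurePosixPath

# Year subdirectories per source.  ASSUMPTION: inferred from the table
# of subdivisions in this README, not from a listing of the cdroms.
YEAR_DIRS = {
    "ap": ["88", "89", "90"],
    "wsj": ["87", "88", "89", "90", "91", "92", "92b", "93", "94"],
    "sjm": [],  # only one year (1991), so no year subdirectories
}


def expected_dirs(form):
    """Return the relative directories expected under "st" or "vp"."""
    if form not in ("st", "vp"):
        raise ValueError("form must be 'st' or 'vp'")
    dirs = []
    for source, years in YEAR_DIRS.items():
        base = PurePosixPath(form) / source
        if years:
            # One subdirectory per year of origin, e.g. st/wsj/92b
            dirs.extend(str(base / year) for year in years)
        else:
            # No year level: files sit directly under the source directory
            dirs.append(str(base))
    return dirs
```

Calling expected_dirs("st") and expected_dirs("vp") gives two lists
that differ only in the leading directory name, which reflects the
parallel structure of the two forms.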

-- The "set-aside" texts

The "setaside" directory contains the text material that was
previously extracted from the periods listed above for use in
acoustic testing.  These data are divided according to intended use
(development test versus evaluation test), and then subdivided
according to source and period of origin.  These set-asides were
imposed only on the following sets:

   WSJ 87-89   (the WSJ0 and WSJ1 test collections from ACL/DCI)
   WSJ 90-92   (part of the WSJ1 test collection from TIPSTER 2)
   SJM 91      (part of the WSJ1 test collection from TIPSTER 3)
   AP 88,90    (material reserved for testing from TIPSTER 2 & 3)

The AP data has never actually been used for acoustic testing, and
only some of the WSJ and SJM data set aside from TIPSTER 2 & 3 were
used in the acoustic tests for WSJ1.  Still, it was deemed useful to
reserve all data that had been so designated, and withhold it from
use in building the LM.

In addition to these materials, the "setaside" directory also
contains the text data that is being used in the 1994 CSR development
test set.  This set includes WSJ text from April 1994, as well as
text from the following sources, all from the period of May 25 to
June 8, 1994: New York Times, Washington Post, Los Angeles Times,
Reuters North American Business News Service.


SUMMARY OF SOFTWARE IN THIS RELEASE
-----------------------------------

The "tx_utils" directory contains a collection of programs written in
perl and C.  The C programs are provided in source code only, and
have only been tested under SunOS 4.1.3, using the native C compiler
on this system.  In addition to the program files, there are some
supporting data files used by the text processing programs, and a
collection of specialized scripts that can be used in conjunction
with one of the perl programs to replicate all the steps of text
preparation that went into this release.
The perl program involved in this usage is called "cmdloop.prl", and
the scripts that drive it all have a ".cml" extension in their file
names.  The README file in that directory, along with the "Usage"
messages from the various programs, explains the procedures involved.

The "lm_utils" directory contains the CMU Standard Language Model
Toolkit, version 1.0 (a "beta" release).  This is the package used to
create the LM in this release, using the text data in the "vp"
directory.  The "lm_utils" directory contains source code in C,
C-shell scripts, object files, and complete instructions for
installation and use of the package.

This portion of the LM-1 corpus has been mastered onto cdrom without
conversion of file names to the ISO 9660 limitations, through the use
of the "Rock Ridge" extensions to ISO 9660.  Users on UNIX systems
(which typically support the Rock Ridge extensions) should be able to
install and use the package without difficulty.  Users on other
systems (e.g. DOS or Macintosh) will have some difficulty in
identifying some file names and using the package (all files will be
present and uniquely identified, but names that appear in directory
listings may not match the names given in documentation files --
also, C-shell scripts may be of little or no use).

-----------
David Graff
Linguistic Data Consortium
August 1994