CSR-II documentation.

This readme file is taken from one of the disks. CSR-II is a large corpus containing more than 100 Mb of on-line documentation. We hope to have incorporated a representative sample (omitting .ps files and long lists). If you feel we are missing something important please do not hesitate to contact us.
Please send comments to: member-service@ldc.upenn.edu

November 1993 ARPA CSR Hub and Spoke Benchmark Tests Corpora and Instructions

NIST Speech Discs 13-32.1, 13-33.1

Public Release, May, 1994

W A R N I N G

If you intend to implement the protocols for the November '93 ARPA CSR Benchmark Tests, please read the file "csrnov93.doc", in the top-level directory of NIST Speech Disc 13-32.1, in its entirety before proceeding. Do not examine the included transcriptions, calibration recordings, adaptation recordings, or documentation unless such examination is specifically permitted in the guidelines for the test(s) being run. Index files are included that specify the exact data to be used for each test. To avoid testing on erroneous data, please refer to these files when running the tests.

W A R N I N G

The test corpora for the November 1993 ARPA CSR Hub and Spoke Evaluation Test suite are contained on 2 CD-ROMs, NIST Speech Discs 13-32.1 and 13-33.1.

Disc 13-32.1 contains the documentation for the tests and the test corpora for the Spoke 9 tests. With the exception of the primary documentation file, "csrnov93.doc", located in this directory, the documentation for the tests and corpora is located under the "wsj1/doc" directory on Disc 13-32.1. In addition to online documentation, Disc 13-32.1 contains software packages useful in processing the speech corpora and tabulating speech recognition scores.

The top-level directory of Disc 13-32.1 contains the following files and subdirectories:

    13_32_1.dir File containing a listing of all the files and
                directories on this disc.

    13_32_1.txt File containing a summary of the corpora on this disc.

   csrnov93.doc File containing an overview of the tests and corpora
                to implement the November '93 ARPA CSR Hub and Spoke
                Evaluation Tests.  This file also describes the directory
                and file structure of the discs.

   discinfo.dir Same contents as file "13_32_1.txt".

         hgrep/ Utility to search a collated SPHERE header contents file.

     readme.doc This file.

         score/ NIST speech recognition scoring software.  Includes 
	        dynamic programming string-alignment scoring code and
		statistical significance tests.

	sphere/ NIST SPeech HEader REsources toolkit.  Provides command-
	        line and programmer interface to NIST-headered speech 
	        waveform files.  Also provides for automatic decompression
	        of the Shorten-compressed waveform files on these discs.

      tranfilt/ Directory containing a UNIX shell script used to perform
		the post-adjudication transcription filter process.

	  wsj1/ Test corpora for Spoke 9 and documentation.
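
As a rough illustration of the dynamic programming string alignment mentioned under "score/", the following Python sketch aligns a reference transcription with a recognizer hypothesis and counts substitutions, deletions, and insertions. It is not the NIST scoring code: the function names are hypothetical, and the uniform cost of 1 per error is an assumption that may differ from the weights used by the software on the disc.

```python
def align_counts(ref, hyp):
    """Return (substitutions, deletions, insertions) for a minimum-cost
    word alignment of hyp against ref. Assumes a uniform cost of 1 per error."""
    r, h = ref.split(), hyp.split()
    # Each cell holds (total_errors, subs, dels, ins) for r[:i] vs h[:j].
    prev = [(j, 0, 0, j) for j in range(len(h) + 1)]
    for i in range(1, len(r) + 1):
        cur = [(i, 0, i, 0)]
        for j in range(1, len(h) + 1):
            t, s, d, n = prev[j - 1]
            if r[i - 1] == h[j - 1]:
                diag = (t, s, d, n)          # correct word, no added cost
            else:
                diag = (t + 1, s + 1, d, n)  # substitution
            t, s, d, n = prev[j]
            dele = (t + 1, s, d + 1, n)      # reference word missing from hyp
            t, s, d, n = cur[j - 1]
            ins = (t + 1, s, d, n + 1)       # extra word in hyp
            cur.append(min(diag, dele, ins, key=lambda c: c[0]))
        prev = cur
    return prev[-1][1:]


def word_error_rate(ref, hyp):
    """Word error rate: (subs + dels + ins) / number of reference words."""
    n_ref = len(ref.split())
    if n_ref == 0:
        raise ValueError("reference transcription is empty")
    return sum(align_counts(ref, hyp)) / n_ref


if __name__ == "__main__":
    ref = "the cat sat on the mat"
    hyp = "the cat sit on mat"
    print(align_counts(ref, hyp), word_error_rate(ref, hyp))
```

When several alignments have the same total cost, the split between error types can depend on tie-breaking; only the total is guaranteed minimal.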

General information files named "readme.doc" are included in the high-level directories and throughout the documentation directory ("wsj1/doc") on Disc 13-32.1; each describes the contents of its directory.

wsj1/doc/readme.doc:

This directory contains the following subdirectories and files which contain documentation for the November 1993 ARPA CSR Hub and Spoke Benchmark Tests and test corpora:


data_sum.doc	file containing summary of the data collected for each WSJ1
                data type in the tests as well as a by-CD-ROM listing of 
                each data type;

dot_spec.doc    file containing the WSJ1 detailed orthographic transcription
                specifications;

evl_spok/	directory containing documentation on the collection of the
                test data;  includes a collated speaker information file,
                "spkrinfo.txt";

h1newart.ndx    listing of waveforms where article boundaries occur in the H1
                test corpora;

indices/	directory containing indices of corpora to be used for each
                test and training condition;

lng_modl/	directory containing MIT Lincoln Lab's '87-89 WSJ language
                model, source texts, vocabularies, and text selection tools;

nov93_h1/       directory containing the output from, and scored results for, 
		the LIMSI November '93 H1-C1 system;

nov93_h2/	directory containing the output from, and scored results for,
		the Cambridge University-HTK November '93 H2-C1 system;
		
readme.doc	this file;

s1_oov.lex	listing of out-of-vocabulary lexemes in the Spoke 1 test 
                corpora with regard to the CSR-WSJ 20K Open vocabulary;

s1newart.ndx	listing of waveforms where article boundaries occur in the
                Spoke 1 test corpora;

si_et_93.hdr 	collated SPHERE header contents file for all November
                1993 ARPA CSR test corpora;
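
For readers without the SPHERE toolkit at hand, the following Python sketch shows how the plain-text header of a single NIST-headered waveform file can be read. It assumes the usual SPHERE layout: a "NIST_1A" line, a line giving the header size in bytes, then "name -type value" fields terminated by "end_head". The function name and the sample field values are hypothetical, and the layout of the collated file "si_et_93.hdr" is not described here, so the sketch does not attempt to parse it.

```python
def parse_sphere_header(raw):
    """Parse the text header at the start of a NIST SPHERE file.

    raw is a bytes object holding at least the header portion of the file.
    Returns (header_size_in_bytes, dict_of_fields).
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if len(lines) < 2 or lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(lines[1].strip())
    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        # Blank lines are skipped; comment lines are assumed to start with ';'.
        if not line or line.startswith(";"):
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # ignore lines that do not look like "name -type value"
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:
            fields[name] = value  # "-sN" string field of N characters
    return header_size, fields


if __name__ == "__main__":
    # Illustrative header only; these values are not taken from the discs.
    sample = (b"NIST_1A\n   1024\n"
              b"sample_rate -i 16000\n"
              b"channel_count -i 1\n"
              b"sample_coding -s3 pcm\n"
              b"end_head\n")
    print(parse_sphere_header(sample))
```

The header size read from the second line gives the byte offset at which the waveform samples begin, which is why it is returned separately from the fields.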