*************************************************************************
Multi-Language Telephone Speech Corpus Distribution

README.TXT

release January, 1994

Copyright 1994
Center for Spoken Language Understanding
Oregon Graduate Institute of Science & Technology
*************************************************************************


This disc contains the following files and directories:

1) readme.txt - this text file

2) overview.txt - file describing the OGI Multi-language Telephone
	Speech Corpus in detail

3) calls/ - A directory containing the directories for each language,
	which in turn contain the speech files for the respective
	languages.  Therefore, the directory structure for calls/ will
	look like the following:

	english/  french/  hindi/     korean/    spanish/   vietnam/
	farsi/    german/  japanese/  mandarin/  tamil/

	The speech files are organized according to a call number
	system (see 4 below, i.e. data.doc).  For ease of file
	handling, the files are divided in groups of 10 per directory.
	For example, the farsi/ directory contains the directories:

	00/     02/     04/     06/     08/     10/     12/     14/
	01/     03/     05/     07/     09/     11/     13/     15/

	Directory 00 contains calls identified by call numbers 0-9;
	directory 15 contains calls identified by call numbers
	150-159; etc.

4) doc/  - directory containing the following documentation files:
        
	data.doc -- file containing the conventions used for naming
		the data files

        formats.doc -- file containing documentation on the speech
		(.wav) and transcription (.ptlola) file formats.

        header.doc -- file giving details of the NIST SPHERE header
		structure

	ph1_logs.doc -- file describing the contents of the .log files
		created during Phase I of the development process

	ph2_logs.doc -- file describing the contents of the .log2
		files created during Phase II of the development
		process

	mltlngdb.ps -- postscript file containing the article: "The
		OGI Multi-language Telephone Speech Corpus" Y. K.
		Muthusamy, R. A. Cole and B. T. Oshika Proceedings of
		the International Conference on Spoken Language
		Processing, Banff, Alberta, Canada, October 1992.

5) seglola/ - directory containing broad phonetic transcriptions (i.e.
	.seg files --- see overview.txt and doc/formats.doc for more
	information on these files).

6) logs/ - directory containing logfiles of corpus development, Phase
	I, which consisted of preliminary verification, chopping,
	evaluation and broad phonetic transcriptions of each utterance
	  
7) logs2/ - directory containing logfiles of corpus development, Phase
	II, consisting of verification and evaluation of calls by
	native speakers of each individual language

8) trn_test/ - directory containing files describing the training,
	development and test sets used by Yeshwant Muthusamy for his
	Ph.D. Thesis research.

9) sphere/ - directory containing files needed to uncompress the .wav
	files


PLEASE NOTE:
------------

This publication of the OGI Multi-Language Telephone Speech Corpus,
produced on CD-ROM by the Linguistic Data Consortium, contains a few
minor modifications relative to the version distributed on tape by
OGI.  To begin with, directory and file names have been simplified
where necessary to conform to ISO 9660 conventions for file naming.
In addition, we have included the more current SPHERE package (version
2.0) from NIST, and have applied a more effective waveform compression
algorithm (the "shorten" compression method developed by Tony Robinson
of Cambridge University, as implemented in the current release of
SPHERE).  In performing this conversion of the waveform data, we also
supplemented the information in each file's SPHERE header to include
common header fields that were missing from the original files (sample
min & max, sample coding).  Relevant changes to the various log and
documentation files have been made as necessary.  Finally, the tape
distributed by OGI contained numerous log files for which there were
no corresponding speech data, as well as a few empty log files; these
have been eliminated from this release.