Documentation for Speech in Noisy Environments (SPINE) Training Corpora

Introduction

This publication contains the Speech in Noisy Environments (SPINE) Training Audio, created for the Department of Defense (DoD) Digital Voice Processing Consortium (DDVPC) by ARCON Corp. and produced by the Linguistic Data Consortium (LDC) as catalog number LDC2000S87, ISBN 1-58563-173-6. A companion corpus, Speech in Noisy Environments (SPINE) Training Transcripts, was also produced by the LDC as catalog number LDC2000T49, ISBN 1-58563-174-4. These corpora support the 2000 Speech in Noisy Environments (SPINE1) evaluation.

The 2000 Speech in Noisy Environments Evaluation (SPINE1) is a first attempt to assess the state of the art and practice in speech recognition technology in noisy military environments, and to exchange information on innovative speech recognition technology in the context of fully implemented systems that perform realistic tasks. It is intended to be of interest to all university, industrial, and commercial speech system developers working on the problem of robust speech recognition. The evaluation is flexible, so that participants can take part in a way suited to their development needs and abilities.

Technical Objective

The SPINE1 evaluation focuses on the task of transcribing speech produced in noisy environments with emphasis on noisy military environments. The evaluation is designed to promote research progress in this area, to provide the opportunity for participants to try out new ideas for developing robust speech recognition systems that are of both scientific and practical interest, and to measure the performance of this technology.

Task

The evaluation task is to transcribe speech produced in noisy environments. The training and test speech data to be used for this evaluation were generated by ARCON Corp. for the DoD Digital Voice Processing Consortium (DDVPC) under controlled conditions. The speech data consists of conversations between two communicators working on a collaborative, Battleship-like task in which they seek and shoot at targets (ARCON Communicability Exercise, ACE). Participants may talk freely, but the total vocabulary used is fairly limited. Each person is seated in a sound chamber in which a previously recorded military background noise environment is accurately reproduced. The participants use handsets and transmission channels that are resident to the particular environment. The training data includes ten of twenty available talker pairs with fourteen five-minute conversations per talker pair (about 720 minutes total), which include four scenarios as described below.

The speech data is viewed as a sequence of "turns," where each turn is the period of time when one speaker is speaking. By its nature, the task induces short utterances with relatively long intervening periods of silence. The same speaker may take several turns in a row; that is, a new turn does not necessarily mean that the conversation participants have exchanged speaking and listening roles. The transcription task is to produce the correct transcription for each of the specified turns.

Please see file.tbl for the directory structure of this publication, as well as a complete list of files.

Data Format

The audio files in this corpus are 2-channel, 16 kHz, 16-bit linear SPHERE files.
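
As a rough illustration of how the stated format could be checked, the sketch below parses the plain-text header that begins a NIST SPHERE file. It assumes the standard SPHERE layout (a "NIST_1A" line, a header-size line, then "name -type value" lines ending with "end_head"); the function name parse_sphere_header is hypothetical, and the exact set of header fields present in this corpus has not been verified here.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header at the start of a NIST SPHERE file.

    `data` must hold at least the full header (commonly 1024 bytes).
    Returns a dict mapping header field names to int, float, or str values.
    """
    first_lines = data.split(b"\n", 2)
    if len(first_lines) < 3 or first_lines[0].strip() != b"NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    header_size = int(first_lines[1].strip())  # total header length in bytes
    text = data[:header_size].decode("ascii", errors="replace")

    fields = {}
    for line in text.split("\n")[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue
        name, field_type, value = parts
        if field_type == "-i":
            fields[name] = int(value)
        elif field_type == "-r":
            fields[name] = float(value)
        else:  # "-sN": string of N bytes
            fields[name] = value
    return fields
```

For a file in the format described above, one would expect channel_count to be 2, sample_rate to be 16000, and sample_n_bytes to be 2, assuming those standard field names are used.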

The file train_list.tbl has file information in six tab-separated columns as follows:

File		Pair	Spkr 1	Spkr 2	Scen	Vocoder
----		----	------	------	----	-------
p02_dd_04	pair02	1266=A	1884=B	DoD	Cdr E
p02_dd_09	pair02	1266=A	1884=B	DoD	CELP
p02_dd_10	pair02	1266=A	1884=B	DoD	mnru 07
p02_dd_16	pair02	1266=B	1884=A	DoD	Cdr E

"File" contains the filename, without the .sph or .typ extensions that indicate SPHERE audio files and transcripts, respectively. "Pair" contains the speaker pair number, while "Spkr 1" and "Spkr 2" contain the individual speaker IDs and the environment (A or B) each speaker was in. Scenarios and environments are described below. "Scen" contains the scenario, and "Vocoder" contains the vocoder type. The file name contains information which is either redundant or irrelevant; in the publication of the evaluation data, the "File" column will contain totally arbitrary filenames.

There are ten speaker pairs in the training set, numbered 2-4, 6-11, and 22. The individual speaker IDs can be used to look up speaker information in speaker.tbl, which has seven tab-separated columns, e.g.:

Pair	Speaker	Age	Sex	Educ	Dialect		Native
----	-------	---	---	----	-------		------
02	1266	34	M	16	New England	Y
02	1884	28	M	16	New England	Y
03	9693	32	F	12	New England	Y
03	2788	30	F		New England	Y
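
A data row of speaker.tbl such as those shown above could be read with a sketch like the following. It assumes that fields are separated by single tab characters, that a missing value (like the blank "Educ" in the last sample row) appears as an empty field, and that "Age" and "Educ" are whole numbers. The function name and dictionary keys are hypothetical; the header and dashed lines would need to be skipped before calling it.

```python
SPEAKER_FIELDS = ("pair", "speaker", "age", "sex", "educ", "dialect", "native")

def parse_speaker_row(line: str) -> dict:
    """Split one data row of speaker.tbl into a dict keyed by column name.

    Empty fields become None, since gaps mean the information was not given.
    """
    values = line.rstrip("\r\n").split("\t")
    if len(values) != len(SPEAKER_FIELDS):
        raise ValueError(f"expected {len(SPEAKER_FIELDS)} fields, got {len(values)}")
    record = {
        name: (value.strip() or None)
        for name, value in zip(SPEAKER_FIELDS, values)
    }
    # Assumption: age and years of education are integers when present.
    for key in ("age", "educ"):
        if record[key] is not None:
            record[key] = int(record[key])
    return record
```

Speaker IDs are kept as strings so that they compare equal to the IDs found in train_list.tbl.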

"Pair" contains the pair number, and "Speaker" contains the individual speaker ID. "Age" contains the age, calculated as of 9/1/1995; the recordings were made in the fall of 1995. "Sex" contains the sex, "Educ" contains the years of education, "Dialect" contains the region or dialect, and "Native" indicates whether or not the speaker is a native speaker of English. The information is self-reported; gaps indicate information that was not given.

The following Environment table shows the channel, the two noise environments, and the two handsets used in the recordings for each scenario.

Environment Table

Scenario    Environment A        Handset A  Environment B  Handset B Channel
--------    -------------        ---------  -------------  --------- -------
DoD         Quiet                STU-III    Office         STU-III   POTS/STU-III
Navy        Aircraft Carrier CIC TA840      Office         STU-III   HF
Army        HMMWV                H250       Quiet          STU-III   Satellite Delay
Air Force   E3A AWACS            R215       MCE            EV M87    JTIDS

To determine the noise, handset, and channel for a given speaker in a recording, look up the recording's scenario ("Scen" in train_list.tbl) in the Environment Table and read the columns for that speaker's environment (A or B).
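
The lookup can be sketched as follows: a row of train_list.tbl is split into its six fields, and each speaker's environment letter is resolved against a dictionary transcribed from the Environment Table. The function and variable names are hypothetical. The sketch also assumes that fields are separated by single tabs and that the "Scen" values match the table's scenario names apart from letter case; only "DoD" appears in the sample rows, so the other labels are unconfirmed.

```python
# Transcribed from the Environment Table; keys are lowercased scenario names.
# Each entry maps environment letter -> (noise environment, handset).
ENVIRONMENT_TABLE = {
    "dod": {"A": ("Quiet", "STU-III"), "B": ("Office", "STU-III"),
            "channel": "POTS/STU-III"},
    "navy": {"A": ("Aircraft Carrier CIC", "TA840"), "B": ("Office", "STU-III"),
             "channel": "HF"},
    "army": {"A": ("HMMWV", "H250"), "B": ("Quiet", "STU-III"),
             "channel": "Satellite Delay"},
    "air force": {"A": ("E3A AWACS", "R215"), "B": ("MCE", "EV M87"),
                  "channel": "JTIDS"},
}

def parse_train_row(line: str) -> dict:
    """Split one data row of train_list.tbl into named fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    name, pair, spkr1, spkr2, scenario, vocoder = fields
    speakers = {}
    for item in (spkr1, spkr2):
        speaker_id, _, env = item.partition("=")  # e.g. "1266=A"
        speakers[speaker_id] = env
    return {"file": name, "pair": pair, "speakers": speakers,
            "scenario": scenario, "vocoder": vocoder}

def recording_conditions(row: dict) -> dict:
    """Map each speaker ID in a parsed row to its noise, handset, and channel."""
    table_row = ENVIRONMENT_TABLE[row["scenario"].strip().lower()]
    conditions = {}
    for speaker_id, env in row["speakers"].items():
        noise, handset = table_row[env]
        conditions[speaker_id] = {"environment": env, "noise": noise,
                                  "handset": handset,
                                  "channel": table_row["channel"]}
    return conditions
```

For the sample row p02_dd_16, this places speaker 1266 in environment B (Office) and speaker 1884 in environment A (Quiet), both on STU-III handsets over the POTS/STU-III channel.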

NOTE: After the initial version of this corpus had been sent to evaluation participants, it was discovered that six of the one hundred forty files had been replaced by files that did not match the intended combinations of noises, vocoders, and speaker pairs. The six files are:

p02_nv_09
p03_nv_20
p04_nv_13
p08_nv_03
p08_nv_15
p22_nv_13

These files are technically equivalent to the other one hundred thirty-four files; however, they do not conform to the original selection profile.
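
Users who want to treat the six files separately, for example to exclude them from an analysis that depends on the original selection profile, could partition a file list along these lines. This is only a sketch: the function name is hypothetical, and file names are assumed to be given without the .sph or .typ extension.

```python
# The six files listed in the note above.
NONCONFORMING_FILES = frozenset({
    "p02_nv_09", "p03_nv_20", "p04_nv_13",
    "p08_nv_03", "p08_nv_15", "p22_nv_13",
})

def split_by_selection_profile(file_names):
    """Return (conforming, nonconforming) lists, preserving input order."""
    conforming, nonconforming = [], []
    for name in file_names:
        key = name.strip()  # tolerate stray whitespace around names
        if key in NONCONFORMING_FILES:
            nonconforming.append(key)
        else:
            conforming.append(key)
    return conforming, nonconforming
```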

Updates

Should any additional information, updates or bug fixes become available, they will appear in the LDC catalog entry for this corpus: LDC2000S87.