File: tsidcorp.doc
==================

      Description of the Tactical Speaker Identification Corpus


0.  Introduction
----------------

This corpus was collected by Douglas Reynolds and Gerald C. O'Leary of
MIT Lincoln Labs.  It contains recordings of 35 speakers (4 female, 31
male), using a variety of different radio transmitters and receivers.
The recording sessions were conducted by assembling the speakers into
7 groups of 5, then having each speaker perform the following tasks:

 - read a list of TIMIT sentences
 - read a list of digit strings
 - give directions for traveling from one point to another using a map
	(unscripted map task)

Each speaker performed this set of tasks on each of three transmitters
(xmtr1-3), and the utterances were recorded simultaneously on DAT
recorders attached to each of six receivers (rcvr1-6), which were
located at some distance (well out of ear-shot) from the transmitter.
Recordings were also made at the same time on a DAT recorder near the
speaker, using a head-mounted microphone, to provide a reference
wide-band recording of the speech (refwb).

As a result, the corpus is organized along four dimensions: speaker,
transmitter, receiver, and speaking task; this organization can be
viewed as a four-dimensional matrix, with 35x3x7x3 cells (the receiver
dimension counts the six receivers plus the wide-band reference).  Due
to occasional mishaps and malfunctions during the collection, some
cells in this matrix are either empty or only partially full.
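
As a rough illustration of the size of this matrix, the following
Python sketch multiplies out the four dimensions.  The axis labels are
illustrative only, and the receiver axis is assumed to have seven
entries because the wide-band reference channel is counted alongside
the six receivers.

```python
from math import prod

# Axis sizes as described above; "receiver" counts rcvr1-6 plus refwb
# (an interpretation of the "7" in 35x3x7x3).
AXES = {
    "speaker": 35,
    "transmitter": 3,
    "receiver": 7,
    "task": 3,
}


def matrix_cells(axes=AXES):
    """Return the number of cells in the full (ideal) corpus matrix."""
    return prod(axes.values())


print(matrix_cells())
```

The actual corpus holds fewer filled cells than this ideal count,
for the reasons given above.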

In addition to the tasks listed above, three pairs of speakers also
participated in a two-way map task using xmtr3; in this case, one of
the speakers in the task gives directions to the other for tracing a
route on a map, and both speakers are recorded on a single audio
channel at each of the receivers (except for the "refwb" recording:
the two speakers were separated by some distance, using radio
communication to perform the task, and only one of them used a
head-mounted microphone and local DAT recorder for wide-band
recording).


1.  Corpus Organization
-----------------------

The speech data on each of the 10 CD-ROMs in the corpus is organized
in the following directory structure: 

	tsid/
	  spkrSS/
	    xmtrX/
	      rcvrR/  or  refwb/
		TASK/
		  filename.sph

where:
	SS = speaker number (two digits)
	X  = transmitter number (1,2,3)
	R  = receiver number (1,2,3,4,5,6)
	TASK = one of: "digits", "sentence", "maptask", "maptask2"

The root directory on each of the CD-ROMs contains a "tsid"
directory, and this in turn contains four to six speaker directories;
all data for a given speaker is contained under the one speaker
directory.

In the two digit speaker number, the first digit identifies the group
membership, and the second digit identifies the individual within the
group.

All file names reflect the directory path that contains them, and are
therefore unique across the entire corpus.  The structure of the file
names is:

		sSSXRTTT.sph

where:
	SS = speaker number (same as above)
	X  = transmitter number (same as above)
	R  = receiver number, or "w" for wide-band recording
	TTT = task code plus utterance number

For the digit-string list and sentence list tasks, TTT is "d" or "s"
followed by a two-digit utterance number; each digit-string and
sentence utterance is stored in a separate speech file.

For the map tasks, TTT is "mt1" or "mt2"; each complete map task
session is stored in a single speech file.
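
The naming scheme above can be sketched in Python as follows.  This is
a minimal illustration, not a tool shipped with the corpus: the
function names are hypothetical, and the mapping from task codes to
task directory names (in particular "mt1" to "maptask") is an
assumption based on the directory list in this section.  File names
are assumed to be lower case.

```python
import re

# sSSXRTTT.sph: SS = speaker, X = transmitter, R = receiver or "w",
# TTT = "d"/"s" plus a two-digit utterance number, or "mt1"/"mt2".
NAME_RE = re.compile(
    r"^s(\d\d)([123])([1-6w])(?:([ds])(\d\d)|(mt[12]))\.sph$"
)

# Assumed mapping from task code to task directory name.
TASK_DIRS = {"d": "digits", "s": "sentence",
             "mt1": "maptask", "mt2": "maptask2"}


def parse_name(name):
    """Split a speech file name into its labeled fields."""
    m = NAME_RE.match(name)
    if m is None:
        raise ValueError("not a recognized speech file name: %r" % name)
    spkr, xmtr, rcvr, ds, utt, mt = m.groups()
    return {
        "speaker": spkr,
        "group": int(spkr[0]),      # first digit: group membership
        "member": int(spkr[1]),     # second digit: individual in group
        "transmitter": int(xmtr),
        "channel": "refwb" if rcvr == "w" else "rcvr" + rcvr,
        "task": TASK_DIRS[ds or mt],
        "utterance": int(utt) if utt else None,  # None for map tasks
    }


def corpus_path(name):
    """Rebuild the directory path implied by a speech file name."""
    f = parse_name(name)
    return "/".join(["tsid", "spkr" + f["speaker"],
                     "xmtr%d" % f["transmitter"],
                     f["channel"], f["task"], name])


print(corpus_path("s213wd07.sph"))
```

The example name "s213wd07.sph" is made up for illustration; whether
that particular file exists in the corpus should be checked against
the tables described in section 2.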


2.  Supplementary Tables
------------------------

The "tables" directory on each CD-ROM contains the following table
files:

	filename.tbl : list of all speech file names, including
			CD-ROM volume-ids and directory paths

	spkrinfo.tbl : list of speakers, including gender and
			geographic background information

	xmtrX.tbl    : for each transmitter, list of the number of
			speech files present in the corpus, broken
			down by speaker, receiver and task

	mt2_S1S2.tbl : for each 2-way map task recording, list of time
			stamps for speaker turn boundaries

The "filename.tbl" listing can be used to determine which CD-ROM holds
the data for a given speaker, and to identify all the paths and file
names that are present for any chosen category or subset of data
(e.g. to locate all the files involving a particular combination of
transmitter and receiver).
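
A selection of this kind might be sketched in Python as shown below.
The exact column layout of "filename.tbl" is not described here, so
the sketch assumes only that each relevant line contains a speech file
name of the form given in section 1; the function name and the sample
lines are hypothetical.

```python
import re

# Find a speech file name (sSSXRTTT.sph) anywhere in a line; the
# lookbehind keeps the match from starting inside a longer word.
FILE_RE = re.compile(
    r"(?<![0-9a-z])s(\d\d)([123])([1-6w])[0-9a-z]{3}\.sph",
    re.IGNORECASE,
)


def select_lines(lines, xmtr, rcvr):
    """Yield lines whose file name matches a transmitter/receiver pair.

    xmtr is 1-3; rcvr is 1-6, or "w" for the wide-band recording.
    Lines without a recognizable file name are skipped.
    """
    want = (str(xmtr), str(rcvr).lower())
    for line in lines:
        m = FILE_RE.search(line)
        if m and (m.group(2), m.group(3).lower()) == want:
            yield line.rstrip("\n")


# Made-up sample lines; the real table also lists volume-ids.
sample = [
    "tsid/spkr21/xmtr3/rcvr2/digits/s2132d07.sph\n",
    "tsid/spkr21/xmtr3/refwb/digits/s213wd07.sph\n",
    "tsid/spkr21/xmtr1/rcvr2/digits/s2112d07.sph\n",
]
for hit in select_lines(sample, 3, 2):
    print(hit)
```

In practice the lines would be read from the table file itself rather
than from a list in memory.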

Each "xmtrX.tbl" listing provides an inventory of the number of files
present for the corresponding transmitter.  The inventory is organized
as a table with one row for each speaker and one column for each
receiver (plus a column for the reference wide-band recordings).
Within each cell of the table, there are four numbers, separated by
colons, which indicate the number of speech files present for each of
the four speaking tasks: sentences, digit strings, map task 1, and map
task 2.  Below is a sampling of rows from "xmtr3.tbl":

# File inventory table for XMTR3
# Cell fields are Timit_sentences:Digit_strings:MapTask_1:MapTask_2
# Spkr  RCVR1           RCVR2           RCVR3           RCVR4           RCVR5           RCVR6           REFWB
spkr11  0:0:0:0         0:0:0:0         0:0:0:0         0:0:0:0         0:0:0:0         0:0:0:0         26:25:1:0 
[...]
spkr21  26:25:1:0       26:25:1:0       26:25:1:0       26:25:1:0       26:25:1:0       26:25:1:0       26:25:1:0 
spkr22  21:25:1:0       21:8:0:0        21:25:1:0       21:25:1:0       21:25:1:0       21:25:1:0       21:25:1:0 
spkr23  26:25:1:0       0:0:0:0         26:25:1:0       26:25:1:0       26:25:1:0       26:25:1:0       26:25:1:0 
[...]
spkr73  26:25:1:1       26:25:1:1       26:25:1:1       26:25:1:1       26:25:1:1       26:25:1:1       26:25:1:1 
[...]

Each of these tables has three header lines (with initial "#")
describing the content and providing column headers.  Columns are
separated by tab characters.  In the samples shown above, it's apparent
that only the wide-band recordings were made successfully when
"spkr11" was using "xmtr3"; also, something went wrong with "rcvr2"
while "spkr22" was reading digit strings on this transmitter, and this
affected subsequent recordings from the same group; spkr73 is one of
the few who successfully completed a "map task 2" session.
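
Reading such a table can be sketched in Python as follows.  This is an
illustration under stated assumptions: the function and key names are
hypothetical, the column order is taken from the header line shown
above, and comparing each receiver against the wide-band column is
just one convenient way to spot gaps, not a procedure defined by the
corpus.

```python
COLUMNS = ["rcvr1", "rcvr2", "rcvr3", "rcvr4", "rcvr5", "rcvr6", "refwb"]
TASKS = ["sentences", "digits", "maptask1", "maptask2"]


def parse_inventory(lines):
    """Map speaker -> column -> task -> number of speech files."""
    table = {}
    for line in lines:
        line = line.strip()
        # Skip headers, blank lines, and the "[...]" elision marks
        # used in the extract above.
        if not line or line.startswith("#") or line.startswith("["):
            continue
        fields = line.split()  # columns are tab-separated
        spkr, cells = fields[0], fields[1:]
        if len(cells) != len(COLUMNS):
            raise ValueError("unexpected column count for " + spkr)
        table[spkr] = {
            col: dict(zip(TASKS, (int(n) for n in cell.split(":"))))
            for col, cell in zip(COLUMNS, cells)
        }
    return table


def incomplete_columns(table, spkr):
    """List receiver columns whose counts differ from refwb."""
    ref = table[spkr]["refwb"]
    return [col for col in COLUMNS if table[spkr][col] != ref]


rows = [
    "# File inventory table for XMTR3",
    "spkr22\t21:25:1:0\t21:8:0:0\t21:25:1:0\t21:25:1:0"
    "\t21:25:1:0\t21:25:1:0\t21:25:1:0 ",
]
inventory = parse_inventory(rows)
print(incomplete_columns(inventory, "spkr22"))
```

For the "spkr22" row shown above, only "rcvr2" differs from the
wide-band column, which matches the description of the receiver
failure.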

Each "mt2_S1S2.tbl" file contains a header that identifies the 2-way
map task session that it applies to, followed by a list of labeled
time offsets, which establish the locations of speaker turn boundaries
within the associated waveform files.  The "S1S2" portion of the table
file name identifies the two speakers in the task -- the waveform
files are found under the speaker directory associated with "S1"
(e.g., for "mt2_7374.tbl", the waveform data will be found under
"spkr73", as indicated in the "xmtr3.tbl" extract shown above).
Each time offset record in the table identifies the beginning of a
speaker turn, and which speaker begins a turn at that point.


3.  How the corpus was created
------------------------------

MIT Lincoln Labs arranged the recruitment of speakers and carried out
the field recordings.  The recording sessions ran as follows: the
members of a group took turns performing their three or four speaking
tasks with the first transmitter, then with the next transmitter, and
so on, until all transmitters had been used; as they spoke, seven
independent DAT recorders were (generally) capturing the speech via their
respective receivers (or the reference wide-band microphone).

The DAT cartridges that were recorded in this way were sent to the
LDC, where the digital audio signal was downsampled from the original
DAT sampling frequency to a sample rate of 16 kHz (i.e. 8 kHz
bandwidth), and stored in computer files with NIST SPHERE headers.
This yielded one speech file for each combination of group,
transmitter and receiver.

The wide-band speech files were then manually segmented, using
software to display, play back and time-stamp waveform data, so as to
separate the speakers within each group session, separate the speaking
tasks within each individual speaker session, and separate the digit
strings and sentences within these two speaking tasks.  When a
wide-band speech file was not available for a given session, the best
quality receiver recording was used instead.

Once the time boundaries of speakers, tasks and utterances were known
from the high-quality recording of each session, the same waveform
editing software (the "xwaves" package from Entropic Research Labs)
was used to establish time alignments between the reference (segmented)
speech file and each of the associated receiver-recording files;
initial alignment points were selected visually in the corresponding
speech files, and the time offsets between the reference file and each
receiver file were measured and stored in each of the receiver file
headers.  This allowed the waveform display software to present all
the recordings of a session together on one screen, with proper time
alignment.

With all recordings of a session visible at one time, and with time
alignment established at the start of each recording, it was possible
to scroll through each session, and determine whether the initial time
alignment was correctly sustained throughout the entire session.  In
cases where one of the recordings fell out of alignment, or where one
of the receivers failed at some point during a session, an index was
kept of the last usable utterance in the affected receiver file.

The manual segmentation time stamps and the quality-check index
information were then combined to perform extraction of segments from
the original waveform files, to produce the directory structure and
file inventory for publication of the corpus.

For the lists of sentences and digit strings, the time stamps
established the beginning and ending points of these tasks in each
session, and dividing points between each of the utterances within the
task; when the utterance segments were extracted into separate files
for publication, all the "silences" (i.e. non-speech portions) between
the utterances were retained in the margins of the output files.
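
The way the time stamps divide a task into utterance files can be
sketched in Python as follows.  This is not the software used to
build the corpus; the function name is hypothetical, and the sketch
assumes that consecutive segments share a single dividing point, so
that the extracted files together cover the whole task span and all
inter-utterance silence ends up in the margins of some file.

```python
def utterance_segments(task_start, task_end, dividers):
    """Return (start, end) pairs, one per utterance, tiling the task.

    task_start and task_end bound the whole task; dividers are the
    dividing points between adjacent utterances (all in seconds).
    """
    points = [task_start] + sorted(dividers) + [task_end]
    if points != sorted(points):
        raise ValueError("dividing points must lie within the task span")
    # Each segment ends exactly where the next begins, so no audio
    # between utterances is dropped.
    return list(zip(points, points[1:]))


# Made-up time stamps: one task with two dividing points.
for start, end in utterance_segments(10.0, 20.0, [13.5, 16.0]):
    print(start, end)
```

With N dividing points the task yields N+1 segments, one per
utterance.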

For the map task recordings, the time stamps established the beginning
and ending points of the task, and the entire task was extracted into
a single output file as one segment.  For the three instances of 2-way
map task sessions, additional time stamps were established at speaker
turn boundaries, but the entire 2-way map task session was still
extracted into a single output file; the additional time stamp
information for turn boundaries is presented in tables (see section 2
above).