File: tsidcorp.doc
==================

Description of the Tactical Speaker Identification Corpus


0. Introduction
----------------

This corpus was collected by Douglas Reynolds and Gerald C. O'Leary of
MIT Lincoln Labs.  It contains recordings of 35 speakers (4 female, 31
male), made using a variety of different radio transmitters and
receivers.  The recording sessions were conducted by assembling the
speakers into 7 groups of 5, then having each speaker perform the
following tasks:

  - read a list of TIMIT sentences
  - read a list of digit strings
  - give directions for traveling from one point to another using a
    map (unscripted map task)

Each speaker performed this set of tasks on each of three transmitters
(xmtr1-3), and the utterances were recorded simultaneously on DAT
recorders attached to each of six receivers (rcvr1-6), which were
located at some distance (well out of earshot) from the transmitter.
Recordings were also made at the same time on a DAT recorder near the
speaker, using a head-mounted microphone, to provide a reference
wide-band recording of the speech (refwb).

As a result, the corpus is organized along four dimensions: speaker,
transmitter, receiver, and speaking task.  This organization can be
viewed as a four-dimensional matrix with 35x3x7x3 cells (35 speakers,
3 transmitters, 7 recording channels counting the six receivers plus
"refwb", and 3 tasks).  Due to some occasional mishaps and malfunctions
during the collection, some cells in this matrix are either empty or
only partially full.

In addition to the tasks listed above, three pairs of speakers also
participated in a two-way map task using xmtr3.  In this task, one of
the speakers gives directions to the other for tracing a route on a
map, and both speakers are recorded on a single audio channel at each
of the receivers.  The "refwb" recording is the exception: the two
speakers were separated by some distance, using radio communication to
perform the task, and only one of them used a head-mounted microphone
and local DAT recorder for wide-band recording.


1. Corpus Organization
-----------------------

The speech data on each of the 10 CD-ROMs in the corpus is organized
in the following directory structure:

    tsid/
      spkrSS/
        xmtrX/
          rcvrR/ or refwb/
            TASK/
              filename.sph

  where:
    SS   = speaker number (two digits)
    X    = transmitter number (1,2,3)
    R    = receiver number (1,2,3,4,5,6)
    TASK = one of: "digits", "sentence", "maptask", "maptask2"

The root directory of each CD-ROM contains a "tsid" directory, which
in turn contains four to six speaker directories; all data for a given
speaker is contained under that one speaker directory.  In the
two-digit speaker number, the first digit identifies the group, and
the second digit identifies the individual within the group.

All file names reflect the directory path that contains them, and are
therefore unique across the entire corpus.  The structure of the file
names is:

    sSSXRTTT.sph

  where:
    SS  = speaker number (same as above)
    X   = transmitter number (same as above)
    R   = receiver number, or "w" for wide-band recording
    TTT = task + utterance number

For the digit-string list and sentence list tasks, TTT is "d" or "s"
followed by a two-digit utterance number; each digit-string and
sentence utterance is stored in a separate speech file.  For the map
tasks, TTT is "mt1" or "mt2"; each complete map task session is stored
in a single speech file.


2. Supplementary Tables
------------------------

The "tables" directory on each CD-ROM contains the following table
files:

  filename.tbl : list of all speech file names, including CD-ROM
                 volume-ids and directory paths

  spkrinfo.tbl : list of speakers, including gender and geographic
                 background information

  xmtrX.tbl    : for each transmitter, list of the number of speech
                 files present in the corpus, broken down by speaker,
                 receiver and task

  mt2_S1S2.tbl : for each 2-way map task recording, list of time
                 stamps for speaker turn boundaries

The "filename.tbl" listing can be used to determine which CD-ROM holds
the data for a given speaker, and to identify all the paths and file
names that are present for any chosen category or subset of data
(e.g. to locate all the files involving a particular combination of
transmitter and receiver).

Each "xmtrX.tbl" listing provides an inventory of the number of files
present for the corresponding transmitter.  The inventory is organized
as a table with one row for each speaker and one column for each
receiver (plus a column for the reference wide-band recordings).
Within each cell of the table, there are four numbers, separated by
colons, which indicate the number of speech files present for each of
the four speaking tasks: sentences, digit strings, map task 1, and map
task 2.  Below is a sampling of rows from "xmtr3.tbl":

# File inventory table for XMTR3
# Cell fields are Timit_sentences:Digit_strings:MapTask_1:MapTask_2
# Spkr  RCVR1      RCVR2      RCVR3      RCVR4      RCVR5      RCVR6      REFWB
spkr11  0:0:0:0    0:0:0:0    0:0:0:0    0:0:0:0    0:0:0:0    0:0:0:0    26:25:1:0
[...]
spkr21  26:25:1:0  26:25:1:0  26:25:1:0  26:25:1:0  26:25:1:0  26:25:1:0  26:25:1:0
spkr22  21:25:1:0  21:8:0:0   21:25:1:0  21:25:1:0  21:25:1:0  21:25:1:0  21:25:1:0
spkr23  26:25:1:0  0:0:0:0    26:25:1:0  26:25:1:0  26:25:1:0  26:25:1:0  26:25:1:0
[...]
spkr73  26:25:1:1  26:25:1:1  26:25:1:1  26:25:1:1  26:25:1:1  26:25:1:1  26:25:1:1
[...]
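
As a rough illustration of how such an inventory table might be read
programmatically, the Python sketch below parses rows like those in the
sample into per-cell counts.  It is not part of the corpus tools.  The
names "Counts", "COLUMNS", "parse_inventory" and "empty_cells" are
hypothetical, and the column order is assumed from the "# Spkr" header
line of the sample.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Counts(NamedTuple):
    """File counts in one table cell, in the documented field order."""
    sentences: int
    digits: int
    maptask1: int
    maptask2: int


# Assumption: column order follows the "# Spkr RCVR1 ... REFWB" header line.
COLUMNS = ["rcvr1", "rcvr2", "rcvr3", "rcvr4", "rcvr5", "rcvr6", "refwb"]


def parse_inventory(lines: Iterable[str]) -> Dict[str, Dict[str, Counts]]:
    """Map speaker -> column -> Counts, skipping headers and elisions."""
    table: Dict[str, Dict[str, Counts]] = {}
    for raw in lines:
        line = raw.strip()
        # "#" marks header lines; "[...]" only appears in the excerpt.
        if not line or line.startswith("#") or line == "[...]":
            continue
        fields = line.split()  # splits on tabs as well as spaces
        speaker, cells = fields[0], fields[1:]
        if len(cells) != len(COLUMNS):
            raise ValueError(f"unexpected number of columns: {line!r}")
        table[speaker] = {
            col: Counts(*(int(n) for n in cell.split(":")))
            for col, cell in zip(COLUMNS, cells)
        }
    return table


def empty_cells(table: Dict[str, Dict[str, Counts]]) -> List[Tuple[str, str]]:
    """List (speaker, column) pairs for which no files are present."""
    return [
        (speaker, col)
        for speaker, row in table.items()
        for col, counts in row.items()
        if sum(counts) == 0
    ]


# Two rows copied from the "xmtr3.tbl" sample above.
SAMPLE = """\
# File inventory table for XMTR3
spkr22 21:25:1:0 21:8:0:0 21:25:1:0 21:25:1:0 21:25:1:0 21:25:1:0 21:25:1:0
spkr23 26:25:1:0 0:0:0:0 26:25:1:0 26:25:1:0 26:25:1:0 26:25:1:0 26:25:1:0
"""

inventory = parse_inventory(SAMPLE.splitlines())
```

With these two rows, inventory["spkr22"]["rcvr2"].digits is 8, and
empty_cells(inventory) reports the single empty cell for "spkr23" on
"rcvr2".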

Each of these tables has three header lines (with initial "#")
describing the content and providing column headers.  Columns are
separated by tab characters.  In the samples shown above, it is
apparent that only the wide-band recordings were made successfully
when "spkr11" was using "xmtr3".  Also, something went wrong with
"rcvr2" while "spkr22" was reading digit strings on this transmitter,
and this affected subsequent recordings from the same group.  Finally,
"spkr73" is one of the few who successfully completed a "map task 2"
session.

Each "mt2_S1S2.tbl" file contains a header that identifies the 2-way
map task session that it applies to, followed by a list of labeled
time offsets, which establish the locations of speaker turn boundaries
within the associated waveform files.  The "S1S2" portion of the table
file name identifies the two speakers in the task -- the waveform
files are found under the speaker directory associated with "S1"
(e.g., for "mt2_7374.tbl", the waveform data will be found under
"spkr73", as indicated in the "xmtr3.tbl" extract shown above).  Each
time offset record in the table identifies the beginning of a speaker
turn, and which speaker begins a turn at that point.


3. How the corpus was created
------------------------------

MIT Lincoln Labs arranged the recruitment of speakers and carried out
the field recordings.  The recording sessions ran as follows: the
members of a group took turns performing their three or four speaking
tasks with the first transmitter, then with the next transmitter, and
so on, until all transmitters had been used.  As they spoke, seven
independent DAT recorders were (generally) capturing the speech via
their respective receivers (or the reference wide-band microphone).

The DAT cartridges that were recorded in this way were sent to the
LDC, where the digital audio signal was downsampled from the original
DAT sampling frequency to a sample rate of 16 kHz (i.e. 8 kHz
bandwidth), and stored in computer files with NIST SPHERE headers.
This yielded one speech file for each combination of group,
transmitter and receiver.

The wide-band speech files were then manually segmented, using
software to display, play back and time-stamp waveform data, so as to
separate the speakers within each group session, separate the speaking
tasks within each individual speaker session, and separate the digit
strings and sentences within these two speaking tasks.  When a
wide-band speech file was not available for a given session, the
best-quality receiver recording was used instead.

Once the time boundaries of speakers, tasks and utterances were known
from the high-quality recording of each session, the same waveform
editing software (the "xwaves" package from Entropic Research Labs)
was used to establish time alignments between the reference
(segmented) speech file and each of the associated receiver-recording
files; initial alignment points were selected visually in the
corresponding speech files, and the time offsets between the reference
file and each receiver file were measured and stored in each of the
receiver file headers.  This allowed the waveform display software to
present all the recordings of a session together on one screen, with
proper time alignment.

With all recordings of a session visible at one time, and with time
alignment established at the start of each recording, it was possible
to scroll through each session, and determine whether the initial time
alignment was correctly sustained throughout the entire session.  In
cases where one of the recordings fell out of alignment, or where one
of the receivers failed at some point during a session, an index was
kept of the last usable utterance in the affected receiver file.

The manual segmentation time stamps and the quality-check index
information were then combined to perform extraction of segments from
the original waveform files, to produce the directory structure and
file inventory for publication of the corpus.
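
The published directory structure and file names follow the scheme
given in section 1.  As a minimal sketch of that scheme, the Python
code below decodes a file name of the form "sSSXRTTT.sph" and rebuilds
its directory path.  The helper names ("FileInfo", "parse_name",
"corpus_path") are hypothetical.  The mapping from the task codes
"d", "s", "mt1", "mt2" to the directories "digits", "sentence",
"maptask", "maptask2" is an assumption inferred from section 1, as is
the use of "refwb" for receiver code "w".  The example file names are
constructed from the pattern and are not taken from the corpus
listing.

```python
import re
from typing import NamedTuple, Optional

# s + SS + X + R + TTT + ".sph", where TTT is d/s + two digits, or mt1/mt2.
_NAME = re.compile(r"^s(\d{2})([123])([1-6w])(?:([ds])(\d{2})|(mt[12]))\.sph$")

# Assumption: task codes map onto the TASK directory names this way.
TASK_DIRS = {"d": "digits", "s": "sentence", "mt1": "maptask", "mt2": "maptask2"}


class FileInfo(NamedTuple):
    speaker: str              # two-digit speaker number, e.g. "73"
    transmitter: int          # 1, 2 or 3
    receiver: str             # "1".."6", or "w" for wide-band
    task: str                 # TASK directory name
    utterance: Optional[int]  # None for map task files


def parse_name(name: str) -> FileInfo:
    """Decode a corpus file name; raise ValueError if it does not match."""
    m = _NAME.match(name)
    if m is None:
        raise ValueError(f"not a recognized file name: {name!r}")
    speaker, xmtr, rcvr, kind, utt, maptask = m.groups()
    task = TASK_DIRS[kind or maptask]
    return FileInfo(speaker, int(xmtr), rcvr, task, int(utt) if utt else None)


def corpus_path(name: str) -> str:
    """Rebuild the directory path that should contain the given file."""
    info = parse_name(name)
    rcvr_dir = "refwb" if info.receiver == "w" else f"rcvr{info.receiver}"
    return (f"tsid/spkr{info.speaker}/xmtr{info.transmitter}/"
            f"{rcvr_dir}/{info.task}/{name}")


digit_info = parse_name("s2132d07.sph")
map_path = corpus_path("s733wmt2.sph")
```

Here digit_info describes speaker 21, transmitter 3, receiver 2, digit
string 7, and map_path is
"tsid/spkr73/xmtr3/refwb/maptask2/s733wmt2.sph".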

For the lists of sentences and digit strings, the time stamps
established the beginning and ending points of these tasks in each
session, and dividing points between each of the utterances within the
task; when the utterance segments were extracted into separate files
for publication, all the "silences" (i.e. non-speech portions) between
the utterances were retained in the margins of the output files.

For the map task recordings, the time stamps established the beginning
and ending points of the task, and the entire task was extracted into
a single output file as one segment.

For the three instances of 2-way map task sessions, additional time
stamps were established at speaker turn boundaries, but the entire
2-way map task session was still extracted into a single output file;
the additional time stamp information for turn boundaries is presented
in tables (see section 2 above).
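
The segmentation of the sentence and digit-string lists described
above can be sketched as follows.  This is only an illustration of the
idea that a task's start point, dividing points and end point yield
contiguous segments, so that no inter-utterance silence is discarded.
It is not the software actually used (the "xwaves" package).  The
function names are hypothetical, the time stamps are assumed to be in
seconds, and the numbers in the example are made up.

```python
from typing import List, Sequence, Tuple

SAMPLE_RATE = 16000  # published files are sampled at 16 kHz


def utterance_segments(task_start: float,
                       dividers: Sequence[float],
                       task_end: float) -> List[Tuple[float, float]]:
    """Split [task_start, task_end] at the dividing points.

    Consecutive segments share a boundary, so every non-speech portion
    ends up in the margin of one of the output segments.
    """
    points = [task_start, *dividers, task_end]
    if any(later <= earlier for earlier, later in zip(points, points[1:])):
        raise ValueError("time stamps must be strictly increasing")
    return list(zip(points, points[1:]))


def to_samples(seconds: float, rate: int = SAMPLE_RATE) -> int:
    """Convert a time stamp in seconds to a sample index."""
    return round(seconds * rate)


# Hypothetical task: starts at 10.0 s, ends at 18.0 s, two dividing points.
segments = utterance_segments(10.0, [12.5, 15.25], 18.0)
sample_ranges = [(to_samples(a), to_samples(b)) for a, b in segments]
```

For this made-up task the three segments cover samples 160000-200000,
200000-244000 and 244000-288000, with no gap between them.  A map task
session corresponds to the case with no dividing points, which gives a
single segment.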