README FILE FOR:  RATS Low Speech Density Data (LSDD) Corpus
LDC Catalog-ID:   LDC2024S03

Authors:  Stephanie Strassel, Kevin Walker, Karen Jones


0.0 Overview of README contents

  1.0 Introduction
  2.0 Corpus Structure
    2.1 Organization of Directories
    2.2 File Name Patterns
  3.0 Structure of Documentation Tables
  4.0 Description of Data Files
  5.0 Description of Audio Collection Process
    5.1 Information on Transmitters and Receivers


1.0 Introduction

The Low Speech Density Training, Development, and Progress Data Sets
were generated to focus on the measurement of false alarm performance
in RATS SAD (speech activity detection) systems developed under the
DARPA RATS (Robust Automated Transcription of Speech) program.

The goal of the RATS program was to develop Human Language Technology
(HLT) systems capable of performing speech detection, language
identification, speaker identification and keyword spotting on the
severely degraded audio signals that are typical of various radio
communication channels, especially those employing various types of
handheld portable transceiver systems.

To support that goal, the LDC assembled a specialized system for
transmission, reception and digital capture of audio data, such that a
single source audio signal could be distributed and recorded over
eight distinct transceiver configurations simultaneously. The
relatively clear source audio data was annotated manually to provide
the labels needed for a given HLT task (e.g. speaker identification
labels), and these annotations were then projected onto the
corresponding eight channels of audio that were recorded from the
radio receivers. Further details are provided in later sections of
this README file.

The recordings in each partition have low percentages of speech, and
the selected speech clips are short utterances. For Training and
Development, the speech utterances were extracted from the RATS DEV-2
SAD and KWS Data Sets. For Progress, the speech utterances were
extracted from the RATS Progress SAD Data Set.
Non-speech samples were selected from communications systems sounds,
including telephone network special information tones, radio selective
calling signals, HF/VHF/UHF digital mode radio traffic, radio network
control channel signals, two-way radio traffic containing roger beeps,
and short duration shift-key modulated handset data transmissions.

A total of 405 source audio files (135 in each partition) have been
assembled by concatenating a randomized selection of speech, comms
systems sounds, and silence. The average duration of the assembled
audio files is about 13 minutes; this is equivalent to the average
duration of original RATS SAD and KWS source audio files, each of
which typically comprised one side of a telephone conversation.

The total duration for the Low Density Training, Development, and
Progress Data Sets is 87 hours (29 hours per partition).

Acknowledgments:

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. D10PC20016. The
content does not necessarily reflect the position or the policy of the
Government, and no official endorsement should be inferred.

We would like to express special thanks to Dan Ellis at Columbia
University, and John Hansen at the University of Texas at Dallas, for
their substantial technical assistance during the creation of the RATS
corpus. Henry Goldberg and David Longfellow at Leidos (formerly SAIC)
provided the partitioning of corpus data and initial selection of
target keywords.


2.0 Corpus Structure

2.1 Organization of Directories

The directory structure is organized by partition; each partition has
an "audio" subdirectory subdivided by channel; there are also
directories for "re-inverted" audio and frame-by-frame skewview
outputs, all subdivided by channel:

  data/
    {dev,progress,train}/
      {audio,skewview}/
        {A,B,C,D,E,F,G,H,I,J,K,L,M,N,O,P,Q,R,S,T,U,V,src,XMT}/
      audio_rvrt/
        {B,E,L,O}/
  docs/
    *.tab -- (4 files) see explanations in section 3.0 below
    segmentation_docs -- see segmentation_docs/DOCUMENTATION_README.txt

2.2 File Name Patterns

All data file names fall into three basic patterns:

  (a) Original source (clean) audio file names:
        {srcid}_src.flac

  (b) Transmit Station (clean) audio file names:
        {transmission datetime}_{srcid}_XMT.flac

  (c) Over-The-Air Transmission Recording audio file names:
        {transmission datetime}_{srcid}_{transmission channel}.flac

"srcid" can take the following forms, according to the partition:

  {part}_sch[ABCD]_NN     ("part" = "dev", "progress", "train")

The middle field of the file name indicates which one of four distinct
schemas was used to assemble the series of varied audio snippets that
were concatenated to build the source audio file. The two-digit final
field is a sequence number that starts at "01" for each schema; so
within each partition (dev, progress, train), each of the four schemas
(A, B, C, D) has a set of files numbered "01", "02", etc. (See the
"segmentation_docs" directory for more information about this data
set.)


3.0 Structure of Documentation Tables

The "docs/segmentation_docs" directory contains
DOCUMENTATION_README.txt and a collection of tables to describe the
design and assembly of the source audio files for the Low Speech
Density data set.

The "docs/segmentation_docs/map_files" directory contains annotation
for each source audio file. The map files are organized by partition.
Each map file shows the start time, end time, and segment type for the
segments which were used to construct a given source audio file.

The "docs/segmentation_docs/srcid2lng.tab" file provides the language
for each file. Language codes are as follows:

  Language Code   Language
  alv             Levantine Arabic
  eng             English
  fas             Farsi
  pus             Pashto
  urd             Urdu

The "docs" directory also contains four table files; except as noted,
these are tab-delimited tables with an initial line of column
headings:

 * file_md5s.list
     list of all data files (*.flac and *.skvw-dat) with file MD5
     checksums (space-delimited, no header line)

 * flac_info.tab
     list of all flac files with duration, compressed KB, uncompressed
     KB, compression ratio, and MD5 checksum of sample data

 * skvw_results.tab
     list of all retransmitted (degraded) channel flac files with 5
     columns:

       1. task (lsdd)
       2. part (dv2, prg or trn)
       3. file_id
       4. lagsec: seconds offset relative to a clean version of the
          audio
       5. stddev: the skewview measure of variability for alignment

     Columns 4 and 5 were derived by running skewview on the specific
     channel file, using as a "reference" the XMT version of the
     source audio (the recording of the signal as presented to the
     transmitters during the session). High magnitude values for the
     "lagsec" and/or "stddev" columns indicate that skewview was not
     able to form a reliable alignment between the clean and
     radio-channel versions of the given source audio.

 * src_xmt_skvw_res.tab
     list of all *_XMT files, with skewview results computed using the
     corresponding *_src audio file as the reference; this provides
     the relative time offset between a given "src" audio file and the
     corresponding "XMT" channel that was recorded during the session.
     (Skewview results for the receiver channels were then computed
     using the XMT file as the reference.) This table uses the same
     column arrangement as skvw_results.tab.

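As an illustration only, the file name patterns from section 2.2 and
the skvw_results.tab layout described above could be handled along the
following lines in Python. This is a minimal sketch under stated
assumptions, not a tool shipped with the corpus: the function names
are hypothetical, the header strings are assumed to match the column
names listed above (task, part, file_id, lagsec, stddev), and the two
thresholds are arbitrary placeholders, since this README does not
define what counts as a "high magnitude" value. The format of the
transmission datetime field is not specified here, so it is matched
loosely.

```python
import csv
import io
import re

# File name patterns (a)-(c) from section 2.2. The datetime prefix is
# optional (absent for *_src files) and matched loosely, because its
# exact format is not given in this README.
NAME_RE = re.compile(
    r"^(?:(?P<xmit_time>.+)_)?"
    r"(?P<part>dev|progress|train)_sch(?P<schema>[ABCD])_(?P<seq>\d{2})"
    r"_(?P<channel>src|XMT|[A-V])\.flac$"
)


def parse_name(name):
    """Split a data file name into its fields, or return None."""
    match = NAME_RE.match(name)
    return match.groupdict() if match else None


def flag_unreliable(table_text, max_abs_lag=1.0, max_stddev=0.1):
    """Return file_ids whose skewview values exceed the given limits.

    Both limits are assumed placeholder values, not corpus-defined.
    Header names are assumed to be: task, part, file_id, lagsec, stddev.
    """
    reader = csv.DictReader(io.StringIO(table_text), delimiter="\t")
    flagged = []
    for row in reader:
        lagsec = float(row["lagsec"])
        stddev = float(row["stddev"])
        if abs(lagsec) > max_abs_lag or stddev > max_stddev:
            flagged.append(row["file_id"])
    return flagged
```

In practice the table text would come from reading
docs/skvw_results.tab, and the limits would be chosen after inspecting
the distribution of lagsec and stddev values for the channels of
interest.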

4.0 Description of Data Files

All audio files are presented here as single-channel, 16-bit PCM,
16000 samples per second; lossless FLAC compression is used on all
files; when uncompressed, the files have typical "MS-WAV" (RIFF) file
headers.

Each LSDD source file was constructed according to one of four
schemas, which defined the overall proportions of speech, comms sounds
and silence; in all four schemas, the amount of speech, ranging
between 0 and about 12% of file duration, is much lower than in the
original full-call-side recordings, which were typically 45% to 50%
speech. The specifications for the four schemas are described in
detail in the file "schema_specs.txt".

In the retransmission sessions recorded for this release, there were
some variations in the resulting inventory of channels represented.
There were intermittent failures on channel D that affected 56 of the
135 LSDD train sessions and 3 of the 135 LSDD dev set sessions.

Except as noted above, all sessions for all tasks have a full
complement of channel recordings.


5.0 Description of Audio Collection Process

5.1 Information on Transmitters and Receivers

All radio transmitters were located in the LDC office suite at 3600
Market Street, Philadelphia. The radio receivers were installed at
three locations:

  L: 3600 Market Street, 16 receivers
  R: 3401 Walnut Street, 2 receivers
  I: 3600 Market Street, 2 receivers

(The "L" and "I" locations were different rooms within the LDC office
suite.)

The table below lists the specifics of the 20 channels represented in
this release:

LTR Loc Receiver_name        Transmitter_name     Properties
A   L01 TenTec RX331 chA     Alinco DX-SR8T       SSB
B   L02 Trisquare TSX300     Trisquare TSX300     Inverted Spectrum over FHSS
C   L03 TenTec RX400 chA     Motorola CDM1550     G4GUO AMBE Vocoder Digital Voice
D   L04 Motorola DTR650      Motorola DTR650      Spread Spectrum
E   L05 Icom IC-R8500        Wouxun KG-UV3D       Discriminator Tap & Inverted Spectrum
F   L06 Motorola XPR6580     Motorola XPR4580     MOTOTrbo DMR Vocoder Digital Voice
G   L07 TenTec RX400 chC     Motorola CDM1550     Rowetel Codec 2 Digital Voice
H   L08 Vostek VRX-24LTS     Vostek LX-3000       Wideband FM
I   L09 TenTec RX331 chB     Alinco DX-SR8T       Digital Noise Reduction
J   L10 AOR AR5001D_01       Motorola HT1250      NFM & 15KHz IF
K   L11 TenTec RX400 chB     Motorola CDM1550     Co-channel Interference
L   L12 Icom IC-R75          Ranger RCI2950DX     Inverted Spectrum over SSB
M   L13 TenTec RX340         Uniden BC980SSB      Variable BFO & Slow AGC SSB
N   L14 AOR AR5001D_02       Motorola CDM1550     NFM & 6KHz IF
O   L15 Icom IC-R8500        Wouxun KG-UV3D       Inverted Spectrum over NFM
P   L16 AOR AR5001D_03       Kenwood TK-5120      APCO P25
Q   R01 AOR AR8200mk3        Motorola CDM1550     Remote Receiver, AFC, Wide IF
R   R02 AOR AR8200mk3        Motorola CDM1550     Remote Receiver, AFC, Narrow IF
S   I01 Alinco DJ-X11        Icom IC-F2821-UT110  Rolling Code Inverted Spectrum, Scrambled
T   I02 Icom IC-F2821-UT110  Icom IC-F2821-UT110  Rolling Code Inverted Spectrum, Unscrambled
U   L17 RadioShack Pro-96    Motorola CDM1550     Active Scanner using Priority Channel
V   L18 Kenwood NX200        Kenwood NX800        NXDN CAI with AMBE+2 Vocoder

SPECIAL NOTES ABOUT CHANNEL PROPERTIES:

(1) Channels S and T are essentially equivalent to each other, except
    for the presence/absence of encryption in the transmission.
    Because channel S was encrypted, it was not subject to the
    "skewview" analysis performed on other channels, but the results
    of skewview and other time-based analyses from channel T are
    applicable to channel S.

(2) Channels B, E, L, O employed an "inverted spectrum" protocol; for
    these four channels, LDC did post-processing on the four audio
    files from each session, to create the "re-inverted" (i.e.
    non-inverted) version of the audio. The "rvrt" versions of the
    audio are stored in separate directory paths for each
    task/test-set.

(3) Channels Q and R use Automatic Frequency Control (AFC), which
    means that the receiver will try to find the strongest signal
    within a certain range of the frequency set by the user. For weak
    signals in areas with lots of adjacent channel interference, AFC
    behavior can be unpredictable, which is why we enabled it for
    these channels.

--------
README Created by David Graff
March 30, 2018
Updated by Kevin Walker and Karen Jones
August 29, 2019