README FILE FOR LDC CATALOG ID: LDC2023S06

TITLE: 2019 OpenSAT Public Safety Communications Simulation

AUTHORS: Dana Delgado, Karen Jones, Stephanie Strassel, Kevin Walker,
Christopher Caruso, David Graff

1. Overview

This package contains the simulated public safety communications
training, development and evaluation/test data, along with the
evaluation reference materials, used in the National Institute of
Standards and Technology (NIST) Open Speech Analytic Technologies
(OpenSAT) 2019 Evaluation's Automatic Speech Recognition (ASR), Speech
Activity Detection (SAD) and Keyword Search (KWS) tasks. The data is a
portion of the Speech Analysis For Emergency Response Technology
(SAFE-T) corpus, which was created by LDC under the NIST Public Safety
project in support of NIST's OpenSAT evaluation campaign. For more
information on OpenSAT19 and to access the evaluation plan, visit:

https://www.nist.gov/itl/iad/mig/opensat

The SAFE-T corpus was designed to record speakers engaged in a
collaborative problem-solving activity representative of public safety
communications in terms of speech content, noise types and noise
levels. The goal was to elicit speech exhibiting specific features
found in first responder communications, including:

- the Lombard effect, in which speech behavior is altered due to
  background noise
- a range of high and low vocal effort
- speaker stress due to the perception of situational urgency
- spontaneous speech
- lexical items that occur in the public safety domain

This release contains 141 hours of speech recordings and transcripts
for 60 hours of that data. Transcripts are included for all development
and evaluation data and for 50 hours of training data. Each recording
in this release is single-channel and consists of a U.S. English
speaker playing the board game Flash Point: Fire Rescue.
Background noise, which was played through a participant's headset
during each recording, has been mixed into the single-channel
recording at a reduced level; see sections 5 and 6 for more details.

Included in this release are:

- 5 hours of development data, as 100 3-minute recording files
- 100 development transcript files
- 5 hours of evaluation data, as 100 3-minute recording files
- 100 evaluation transcript files
- 131 hours of training data, as 372 2-minute, 10-minute or 30-minute
  recording files
- 100 training transcript files, corresponding to the 30-minute
  recording files

This package contains all of the data released to OpenSAT19 performers
and the OpenSAT19 evaluation transcripts released to NIST. The data
was originally released as the following project corpora:

LDC2019E36 SAFE-T Speech Recording Training Data Transcripts V1.1
LDC2019E37 SAFE-T Corpus Speech Recording Audio Training Data R1 V1.1
LDC2019E50 2019 OpenSAT Public Safety Communications Simulation Evaluation Data V1.1
LDC2019E53 2019 OpenSAT Public Safety Communications Simulation Development Data V1.1
LDC2019R17 SAFE-T Speech Recording Evaluation Data Transcripts V1.1

2. Directory Structure and Content Summary

The directory structure and contents of the package are summarized
below - paths shown are relative to the root directory of the package:

README.txt -- this file
docs/
  audio_stats.tab
  md5sum.txt
  OpenSAT19_psc_eval.ecf.xml
  OpenSAT19_psc_eval.kwlist.xml
  OpenSAT19_psc_eval_sad_trials.tsv
  SAFE-T_CTR-V0.5.pdf
  SAFE-T_QTR-V0.6.pdf
  sessions.tab
  subjects.tab
  transcript_info.tab
data/
  audio/
    dev/   -- development set, 100 flac files
    eval/  -- evaluation/test set, 100 flac files
    train/ -- training set, 372 flac files
  transcripts/
    dev/   -- development set, 100 tsv files
    eval/  -- evaluation/test set, 100 tsv files
    train/ -- training set, 100 tsv files

2.1. Data composition

2.1.1 Audio files

All audio files are 48kHz, 16-bit, mono flac files.
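Package integrity can be checked against docs/md5sum.txt. Below is a
minimal Python sketch, assuming the standard two-column md5sum line
format ("checksum  filename"); the paths used are hypothetical:

```python
# Sketch: verify audio files against an md5sum-style checksum listing.
# Assumes the conventional "checksum  filename" two-space format.
import hashlib
import os


def parse_md5sum(text):
    """Parse md5sum-style lines into {filename: checksum}."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        checksum, _, filename = line.partition("  ")
        entries[filename.strip()] = checksum
    return entries


def file_md5(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(md5sum_path, root="."):
    """Yield (filename, ok) for every entry in the checksum file."""
    with open(md5sum_path, encoding="utf-8") as f:
        entries = parse_md5sum(f.read())
    for filename, checksum in entries.items():
        path = os.path.join(root, filename)
        yield filename, os.path.isfile(path) and file_md5(path) == checksum
```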
Naming conventions:

Each recording session consists of two 30-minute games. "part1" or
"part2" designates whether the recording is from the first or second
game of the session. The PIN is the participant ID for the audio
channel recorded in the file.

Dev and eval audio files consist of four 3-minute snippets selected
from the six 5-minute sections drawn from the 30-minute recording. See
section 4 for more details.

Training audio files consist of two types of recordings:

1) Single-channel audio files of a single 30-minute game session:

   <PIN>_<date>_<time>_part<1|2>_train_mixed.flac

2) Snippet files consisting of the remainder of the audio data from
   recordings selected for dev data:

   <PIN>_<date>_<time>_part<1|2>_<sections>_train_mixed.flac

The lettering scheme mirrors the section names, showing from which
5-minute "section" of the 30-minute game the 2-minute or 10-minute
training data portion has been selected.

Note: 10-minute files have two section letters (e.g.
9256_20190118_163117_part1_FG_train.flac), meaning that this file
contains the two 5-minute sections 'F' and 'G'. However, 'AB' is a
single 5-minute section.

The "_mixed" portion of the file name indicates that the audio content
is a mixture of the clean speech channel and the background noise
signal used during the recording session. (Note that in the
corresponding transcription data, the string "_mixed" is NOT included
in the transcript file name.)

2.1.2 Transcript files

All transcript files are tab-separated (.tsv) and in UTF-8 encoding,
with one segment per line in this format:

start.time <TAB> end.time <TAB> spkrID <TAB> transcript

Naming conventions:

-- training transcript:
   <PIN>_<date>_<time>_part<1|2>_train.tsv
-- dev or eval transcript:
   <PIN>_<date>_<time>_part<1|2>_<section>_<dev|eval>.tsv

The lettering scheme mirrors the section names, showing from which
5-minute "section" of the 30-minute game the 3-minute dev or eval
portion has been selected. If no section letters are present, the full
approximately 30-minute audio file has been transcribed.
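The four-column transcript layout described above is straightforward
to consume programmatically. A minimal Python sketch (the segment
values in `sample` are invented for illustration, not taken from the
corpus; note that real transcripts also contain non-speech segments):

```python
# Sketch: read a SAFE-T transcript .tsv (start.time, end.time, spkrID,
# transcript per line) and total the transcribed speech duration.
import csv
import io


def read_segments(tsv_text):
    """Parse tab-separated transcript rows into a list of segment dicts."""
    segments = []
    for row in csv.reader(io.StringIO(tsv_text), delimiter="\t"):
        if len(row) != 4:
            continue  # skip blank or malformed lines
        start, end, spkr, text = row
        segments.append({"start": float(start), "end": float(end),
                         "spkr": spkr, "text": text})
    return segments


def speech_seconds(segments):
    """Sum of segment durations (cf. speech_sec in transcript_info.tab)."""
    return sum(s["end"] - s["start"] for s in segments)


# Invented example rows (not from the corpus):
sample = "0.00\t2.50\t9256\twe need to move\n3.10\t4.60\t9256\tcopy that\n"
segs = read_segments(sample)
```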
3. Documentation included in this release

audio_stats.tab -- lists file ID and duration for each flac file

md5sum.txt -- checksum and file name list for flac files

SAFE-T_CTR-V0.5.pdf -- Transcription guidelines for dev and eval
transcripts. Please see this document for transcription mark-up and
conventions.

SAFE-T_QTR-V0.6.pdf -- Transcription guidelines for training
transcripts. Please see this document for transcription mark-up and
conventions.

sessions.tab -- lists relevant recording sessions, one row per stereo
recording, with the following fields:

  1  year-mo-dy               - Date of recording.
  2  time                     - Time of recording (UTC).
  3  pinA                     - Participant PIN for A channel audio.
  4  pinB                     - Participant PIN for B channel audio.
  5  background_noise_audio_1 - Background audio for part 1 of session.
  6  background_noise_audio_2 - Background audio for part 2 of session.
  7  n_parts                  - Number of parts in the recording session.
  8  minutes                  - Length of recording session, in minutes.

subjects.tab -- lists demographic information for all participants. It
is tab-delimited with the following fields:

  1  pin      (4 digits)
  2  n_ssns   (number of sessions containing this pin)
  3  minutes  (total minutes recorded in those sessions for this subject)
  4  gender   (m, f, o)
  5  yr_born  (four digits)
  6  edu_lvl  (degree type)
  7  born_in  (city name, state abbreviation)
  8  raised   (city name, state abbreviation)

transcript_info.tab -- lists the following for each transcript file:

  1  "file_id"
  2  "span_sec"     -- total duration covered by transcript file
  3  "speech_sec"   -- sum of durations of transcribed speech segments
  4  "speech_nsegs" -- total number of transcribed speech segments

NOTE: Transcripts also contain two types of non-speech segments:

  "background"       -- other voices audible, but not transcribed
  "background_noise" -- presence of other noise

4. Background noise levels and babble

Each recording consists of one 30-minute game, during which the
participants hear background noise at different levels, as well as
babble, following the composition in the tables below.
Recordings made on or before 02/05/2019 follow the table 1
specification, and recordings made after 02/05/2019 follow table 2.
The difference is that after 02/05/2019 the first two minutes of each
30-minute game were recorded with the participants' headphones off;
recordings before that date were made with the participants' headphones
on for the entire 30-minute recording, even though the first two
minutes of recording did not have any background noise.

Two noise levels were used, with the following dB ranges:

- quiet:  0-14 dB
- loud:  70-85 dB

4.1 Table 1 - files recorded on 02/05/2019 (filenames with 20190205)
or earlier:

first row: A-G are the sections of each 30-minute recording
second row:
  m = minutes
  S = no background, headphones on
  Q = quiet background
  L = loud background
  B = with babble

+-------+------+--------+------+------+------+--------+
|   A   |  B   |   C    |  D   |  E   |  F   |   G    |
+-------+------+--------+------+------+------+--------+
| 2m S  | 3m Q | 5m L/B | 5m Q | 5m L | 5m Q | 5m L/B |
+-------+------+--------+------+------+------+--------+

Babble is added to two of the three loud background sections that the
participants hear. The babble is removed from the background files for
this package.

4.2 Table 2 - files recorded after 02/05/2019 (filenames with 20190206
or later):

first row: A-G are the sections of each 30-minute recording
second row:
  m  = minutes
  NH = no headphones
  Q  = quiet background
  L  = loud background
  B  = with babble

+-------+------+--------+------+------+------+--------+
|   A   |  B   |   C    |  D   |  E   |  F   |   G    |
+-------+------+--------+------+------+------+--------+
| 2m NH | 3m Q | 5m L/B | 5m Q | 5m L | 5m Q | 5m L/B |
+-------+------+--------+------+------+------+--------+

Babble is added to two of the three loud background sections that the
participants hear. The babble is removed from the background files for
this package.
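The section layout in the tables above can also be expressed in code,
e.g. to compute where each section falls within a 30-minute game. A
minimal Python sketch using the table 2 (post-02/05/2019) layout; the
structure and helper names are our own:

```python
# Sketch: start/end offsets (in minutes) of sections A-G within a
# 30-minute game, following the table 2 layout (post-02/05/2019).
SECTIONS = [  # (name, minutes, condition)
    ("A", 2, "no headphones"),
    ("B", 3, "quiet"),
    ("C", 5, "loud + babble"),
    ("D", 5, "quiet"),
    ("E", 5, "loud"),
    ("F", 5, "quiet"),
    ("G", 5, "loud + babble"),
]


def section_offsets(sections=SECTIONS):
    """Map section name -> (start_min, end_min) by accumulating durations."""
    offsets, t = {}, 0
    for name, minutes, _ in sections:
        offsets[name] = (t, t + minutes)
        t += minutes
    return offsets
```

Accumulating the durations confirms that the seven sections tile the
full 30 minutes, with the six 5-minute sections being AB, C, D, E, F
and G.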
5. Background Files, Signal Chain Creation and Percentages

5.1 Background file creation

Each background file is created by concatenating a set of seven
corresponding component audio files (following the table in section
4.1) into a single 30-minute file using sox, with additional
modifications to noise levels in order to make quiet or loud audio.
Each component file is created by concatenating a set of randomly
selected segments of original source audio. Relationships between
original source audio, segment audio, component audio and background
audio are recorded in a database and can be used to recreate any of
the audio files.

5.2 Signal Chain

The background files are played through the Digigram audio interface
output with 0dB gain/attenuation. The Digigram audio interface is
connected to inputs 1 & 2 of the Lectrosonics Matrix Mixer. These
inputs have been set to 0dB gain. The inputs are routed to both
speakers' headsets with +3dB gain. The routing includes both the
matrix crosspoints for the two headsets and the amplifier stage of the
mixer connected to each headset. The amplifier stage of the mixer is
set to 10dB gain for each headset.

[ microphone ] ->
[ microphone preamplifier with 50dB gain ]
  -> [ matrix with 20dB attenuation ] -> [ amplifier stage with 10dB gain ] -> [ headset ]
  -> [ matrix with 0dB attenuation ] -> [ audio interface with 0dB gain ] -> [ audacity with 0dB attenuation ]

"Quiet"
[ background file with -36dBFS RMS signal ] ->
[ audio interface output with 0dB gain ] ->
[ mixer input with 0dB gain ] ->
[ matrix with +3dB gain ]

"Loud"
[ background file with -3dBFS RMS signal ] ->
[ audio interface output with 0dB gain ] ->
[ mixer input with 0dB gain ] ->
[ matrix with +3dB gain ]

5.3 Percentages

Describing the mixture of background signal, microphone A and
microphone B in terms of percentages:

The microphones are both connected to microphone preamplifiers which
are set to provide 50dB of gain.
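The gain and attenuation stages in the section 5.2 diagrams combine by
simple addition in dB. A minimal Python sketch of that arithmetic
(stage values taken from section 5.2; the helper names are our own):

```python
# Sketch: compose the dB gain/attenuation stages from section 5.2.
# Gains add in dB; a linear amplitude ratio is 10**(dB / 20).

def net_db(stages):
    """Total gain in dB for a chain of (label, dB) stages."""
    return sum(db for _, db in stages)


def db_to_amplitude(db):
    """Convert a dB gain to a linear amplitude ratio."""
    return 10 ** (db / 20)


# Microphone-to-headset (sidetone) path, per the section 5.2 diagram:
mic_path = [("preamp", 50), ("matrix crosspoint", -20), ("amp stage", 10)]

# Background-file-to-headset path, per the section 5.2 diagram:
bg_path = [("interface out", 0), ("mixer input", 0), ("matrix", 3)]
```

Under these stage values the sidetone path nets +40dB of gain and the
background path nets +3dB, which is the arithmetic underlying the
mixture percentages discussed next.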
The signal from a given microphone feeds 3 significant crosspoints in
the mixer:

1) the crosspoint which terminates at the recording input, which is
   set to provide 0dB gain
2) the crosspoint which terminates at the speaker A headset, which is
   set to provide 20dB attenuation
3) the crosspoint which terminates at the speaker B headset, which is
   set to provide 20dB attenuation

The background audio files are played with Audacity through the
Digigram digital audio interface analog outputs 1 and 2. The audio
interface outputs are set to provide 0dB gain/attenuation. The signal
from the audio interface feeds the matrix mixer at inputs 1 and 2.
These inputs are set to provide 0dB gain/attenuation. The inputs feed
the matrix crosspoint stage.

The mixture which is routed to a speaker's headset has 3 components:
the sidetone, the interlocutor's speech, and the background audio
signal. The background audio signal makes up 68% of the signal in the
quiet condition and 86% of the signal in the loud condition. The
sidetone and interlocutor signals are evenly split, making up the
remainder of the signal (16%/16% quiet, 7%/7% loud).

The effective levels of each component depend on the activity of the
signal. For example, if neither the interlocutor nor the sidetone is
active (i.e. no one is speaking), then the signal which reaches the
headset is 100% background. As the level of speech from the
interlocutor and sidetone increases, the effective percentages
associated with each component will vary accordingly.

6. Mixed Files

A goal of the SAFE-T corpus was to create speech recordings that would
mimic first-responder communication, with realistic background noise
that would be reasonably challenging for developers in the OpenSAT19
evaluation. Through testing, NIST found that mixing the speech
recordings with the background files at the full level heard by
participants made the loud sections too challenging for use as
evaluation data.
It was therefore decided that the background noise recordings should
be mixed with the speech recordings at a reduced level. Various levels
were tested until a consensus was reached.

The mixed recording files were created post-recording. During
recording, clean channel recordings are created while a background
file with babble is played through the participants' headphones. The
background file is then mixed with the clean channel recording at a
reduced level to produce the mixed file.

To prepare the mixed files, the background file signal levels are
reduced in the loud sections to better match the signal levels of the
clean channel recordings. SoX was used to reduce the levels, and the
background files were combined with the clean channel recordings using
the "sox --combine mix-power" command.

The original background file consists of silence followed by
alternating quiet (-27dBFS) and loud (-3dBFS max) sections. This is
the version of the background file that is played through the headset
of the participant during the recording session. The reduced-level
background file consists of silence followed by alternating quiet
(-36dBFS max RMS) and loud (-12dBFS max RMS) sections, which are
normalized - the amplitude of the digital audio is scaled down
relative to the max RMS level.

7. Known Issues

Subject PINs 6354 and 0544 represent claques; this data has been
excluded from selection and the associated audio files are not
included in this release (i.e. channel A from sessions with subject
6354 has been excluded).

Subject PIN 7822 is possibly not a native English speaker. However, 3
files associated with this PIN are included in this release, since
they were released as training data to OpenSAT19 performers.

Some sessions consist of only one part due to a technical error in one
game of the session. Any recordings with technical errors have been
excluded from release.
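The PIN exclusions above can be applied mechanically when selecting
channels from sessions.tab. A minimal Python sketch (the (pinA, pinB)
row format follows the sessions.tab fields; the helper itself is a
hypothetical illustration, not a tool shipped with the corpus):

```python
# Sketch: flag audio channels to skip based on the excluded subject
# PINs noted in section 7. Each session row is a (pinA, pinB) pair.
EXCLUDED_PINS = {"6354", "0544"}


def usable_channels(sessions):
    """Yield (session_index, channel, pin) for channels NOT excluded."""
    for i, (pin_a, pin_b) in enumerate(sessions):
        for channel, pin in (("A", pin_a), ("B", pin_b)):
            if pin not in EXCLUDED_PINS:
                yield i, channel, pin
```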
Participants normally have up to 4 sessions, but in a few cases there
are 5 sessions, due to only one game being recorded in a session and a
make-up session being recorded.

8. Acknowledgments

The authors acknowledge the following contributors to this data set:

  Frederick Byers (NIST)
  Omid Sadjadi (NIST)
  Jonathan Wright (LDC)

9. References

Dana Delgado, Kevin Walker, Stephanie Strassel, Karen Jones,
Christopher Caruso and David Graff. "The SAFE-T Corpus: A New Resource
for Simulated Public Safety Communications." LREC 2020: 12th Edition
of the Language Resources and Evaluation Conference, Marseille, May
11-16, 2020.
http://www.lrec-conf.org/proceedings/lrec2020/pdf/2020.lrec-1.794.pdf

10. Copyright Information

(c) 2019 Trustees of the University of Pennsylvania

11. Contacts

For further information about this data release, the NIST-PS project
or the SAFE-T corpus, contact the following project staff at LDC:

  Dana Delgado - coordinator
  Stephanie Strassel - PI
  Kevin Walker