                         README FILE FOR:

                RATS Keyword Spotting (KWS) Corpus

                  LDC Catalog-ID: LDC2017S20

  Authors: David Graff, Xiaoyi Ma, Stephanie Strassel,
           Kevin Walker, Karen Jones

0.0 Overview of README contents

  1.0 Introduction
  2.0 Corpus Structure
    2.1 Organization of Directories
    2.2 File Name Patterns
  3.0 Structure of Documentation Tables
    3.1 FLAC file information for all audio: all_flac_info.tab
    3.2 Source audio summary: source_file_info.tab
    3.3 Retransmitted audio set summary: retrans_fileset_info.tab
  4.0 Description of Data Files
    4.1 FLAC-compressed Audio
    4.2 Tab-Delimited Annotations
  5.0 Description of Audio Collection Process
    5.1 Transceiver Channel Specifications
    5.2 Recording, Signal Processing, Alignment and Quality Control
  6.0 Description of KWS Annotation Process
    6.1 Transcription
    6.2 Keyword Selection

1.0 Introduction

The RATS KWS corpus comprises audio and annotation data created by
the LDC to provide training, development and initial test sets for
the Keyword Spotting (KWS) task in the DARPA RATS (Robust Automated
Transcription of Speech) program.

The goal of the RATS program was to develop Human Language Technology
(HLT) systems capable of performing speech detection, language
identification, speaker identification and keyword spotting on the
severely degraded audio signals that are typical of various radio
communication channels, especially those employing various types of
handheld portable transceiver systems.

To support that goal, the LDC assembled a specialized system for
transmission, reception and digital capture of audio data, such that
a single source audio signal could be distributed and recorded over
eight distinct transceiver configurations simultaneously.  The
relatively clear source audio data was annotated manually to provide
the labels needed for a given HLT task - e.g. time boundaries and
text of speech utterances - and these annotations were then projected
onto the corresponding eight channels of audio that were recorded
from the radio receivers.  Further details are provided in later
sections of this README file.

The source audio used to create the KWS corpus consisted entirely of
conversational telephone speech (CTS) recordings, taken either from
previous LDC CTS corpora, or from CTS data collected specifically for
the RATS program.  Corpus creation for RATS KWS focused on two
languages: Levantine Arabic (abbreviated "alv" in this corpus) and
Farsi (abbreviated "fas").  For these languages, existing calls were
drawn from two previous corpora: CallFriend Farsi (LDC2014S01) and
Fisher Levantine Arabic (LDC2006S29).

The amount of source audio data in the corpus can be summarized by
language as follows:

          #SOURCE   #SOURCE
   LNG    FILES     HOURS
   ------------------------
   alv     1350     223.7
   fas      600     165.1
   ------------------------
   Total   1950     388.8

All files are single-channel, and the hours reported above are based
on file duration; each file is one side of a two-party conversation,
so the number of hours of actual speech is roughly half the amount
shown above.  Between 5 and 8 transmission channels were recorded for
each source file, so the total amount of audio in the corpus,
counting both source and retransmission data, is 3186.7 hours in
15948 files.

Acknowledgments:

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. D10PC20016.  The
content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.
We would like to express special thanks to Dan Ellis at Columbia
University, and John Hansen at the University of Texas at Dallas, for
their substantial technical assistance during the creation of the
RATS corpus.  Henry Goldberg and David Longfellow at Leidos (formerly
SAIC) provided the partitioning of corpus data and the initial
selection of target keywords.

2.0 Corpus Structure

2.1 Organization of Directories

The corpus is presented on a single hard disk; the overall directory
structure is as follows:

  index.html
  docs/
    README.txt                -- this file
    all_flac_info.tab         -- list/attributes of all audio files
    source_file_info.tab      -- list/attributes of source audio
    retrans_fileset_info.tab  -- list/attributes of degraded audio sets
    Outline_of_KWS_Annotation_Task.pdf
  data/
    dev-1/
      audio/
        alv/
          {A,B,C,D,E,F,G,H,src}/
        fas/
          {A,B,C,D,E,F,G,H,src}/
      kws/
        alv/
          {A,B,C,D,E,F,G,H,src}/
        fas/
          {A,B,C,D,E,F,G,H,src}/
    dev-2/
      audio/
        alv/
          {A,B,C,D,E,F,G,H,src}/
      kws/
        alv/
          {A,B,C,D,E,F,G,H,src}/
    train/
      audio/
        alv/
          {A,B,C,D,E,F,G,H,src}/
        fas/
          {A,B,C,D,E,F,G,H,src}/
      kws/
        alv/
          {A,B,C,D,E,F,G,H,src}/
        fas/
          {A,B,C,D,E,F,G,H,src}/

To summarize the organization of the data:

 - The primary division marks the inventories designated for use as
   the training set, initial development set (dev-1), and initial
   evaluation set (dev-2 -- note that the initial evaluation used
   only Levantine Arabic data).  These partitions were based on
   sampling recommendations provided by a team at SAIC, whose task
   was to administer and score the evaluations of the HLT systems
   developed for RATS.

 - Each partition is subdivided into separate paths for audio and
   KWS annotation files.

 - Each audio and kws set is further subdivided according to the
   language used in the source CTS recordings.

 - The lowest directory level divides the data according to the
   channel condition, i.e. source audio (src) or one of the eight
   transceiver channels, labeled "A" - "H".

2.2 File Name Patterns

All data file names consist of a numeric ID, the language used in the
recording, and a channel identifier, as follows:

  source audio files:  {iiiii}_{lng}_src.{type}
  transceiver files:   {iiiii}_{jjjjj}_{lng}_{c}.{type}

where:

  {iiiii} is a 5-digit source audio identifier
  {jjjjj} is a 5-digit transmission session identifier
  {lng}   is one of:  alv: Levantine Arabic
                      fas: Farsi (Persian)
  {c}     is one of:  A B C D E F G H
  {type}  is one of:  "flac", "tab"

Relating a given source file to all of the transmitted files derived
from it therefore involves matching the first five digits of the file
names; all the files recorded simultaneously from transceivers in a
given transmission session will have identical file names, except for
the single letter representing the particular channel.

Over the course of the RATS corpus creation project, a given source
audio file could be used more than once for retransmission, so that
the 5-digit source ID ({iiiii} above) could be associated with two or
more 5-digit transmission session IDs ({jjjjj} above); this re-use of
a given source file does not occur in the data used for KWS.
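As an illustration of these naming conventions, here is a minimal
Python sketch (not part of the corpus tooling; the file names in the
example are made-up) that parses both patterns with a single regular
expression:

  # parse_rats_name.py -- illustrative sketch only; parses the two
  # file name patterns described in section 2.2.

  import re

  # source:      {iiiii}_{lng}_src.{type}
  # transceiver: {iiiii}_{jjjjj}_{lng}_{c}.{type}
  NAME_RE = re.compile(
      r"^(?P<src_id>\d{5})"
      r"(?:_(?P<retrans_id>\d{5}))?"
      r"_(?P<lng>alv|fas)"
      r"_(?P<chan>src|[A-H])"
      r"\.(?P<type>flac|tab)$"
  )

  def parse_rats_name(filename):
      """Return a dict of name fields, or None if the name doesn't match."""
      m = NAME_RE.match(filename)
      return m.groupdict() if m else None

  # Both names below share the same 5-digit source ID, so the
  # transceiver file is derived from the source file.
  print(parse_rats_name("12345_alv_src.flac"))
  print(parse_rats_name("12345_67890_alv_G.flac"))

For source audio files, the "retrans_id" field comes back as None;
for transceiver files it holds the transmission session ID.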
3.0 Structure of Documentation Tables

3.1 FLAC file information for all audio: all_flac_info.tab

This table has six tab-delimited columns, as follows:

  Col#  Content
   1    FILEPATH -- starting at "data/"
   2    SECONDS  -- audio duration in seconds (floating point)
   3    FLC_KB   -- flac-compressed file size in kilobytes
   4    UNC_KB   -- uncompressed audio size in kilobytes
   5    C_RATIO  -- compression ratio (floating point)
   6    SAMP_MD5 -- checksum of uncompressed sample data

Note that the "SAMP_MD5" value, which comes from the FLAC header of
each file, is computed over just the uncompressed sample data,
excluding all file-header content; it therefore differs from the MD5
checksum of the flac file taken as a whole.

3.2 Source audio summary: source_file_info.tab

This table has six tab-delimited columns, as follows:

   1    SRC_ID   -- 5-digit numeric ID assigned to the source audio
   2    CORPUS   -- corpus of origin: "cf", "fisher" or "rats"
   3    LNG      -- "alv" or "fas"
   4    SRC_FILE -- full name of source file in corpus of origin
   5    MINUTES  -- source recording duration (floating point)
   6    PART     -- partition: "train", "dev-1" or "dev-2"

The SRC_ID field is used as the file-ID portion of the corresponding
file names in the "src" sub-directories under "audio" and "kws"
(e.g. "audio/LNG/src/12345_LNG_src.flac",
"kws/LNG/src/12345_LNG_src.tab").

3.3 Retransmitted audio set summary: retrans_fileset_info.tab

This table has five tab-delimited columns, as follows:

   1    RETRANS_ID     -- 5-digit_5-digit source_retrans ID
   2    RETRANS_BGN_AT -- transmit start time (EST): YYYYMMDD_hhmmss
   3    N_PTT          -- number of "push-to-talk" regions (integer)
   4    CHANNELS       -- indicator of presence/absence of channels
   5    PART           -- partition: "train", "dev-1" or "dev-2"

The RETRANS_ID field matches the initial numeric ID fields of the
corresponding file names.

N_PTT is an _approximate_ count of the number of push-to-talk regions
in the transmission session, based on a log file created by the
voice-activation relay in the transmission system.  The actual number
of button-on/button-off transitions may vary within the session from
one channel to another, due to mechanical limitations or other
factors in the various transceiver configurations.  (Note that
channel G was not mediated by any PTT device - its transceiver was
always engaged over the entire duration of every session.)

CHANNELS is an eight-character string in which a given position will
either be the letter for the given channel (ABCDEFGH), or an
underscore character, indicating that the particular channel failed
during that session.  In most sessions, all channels worked as
intended, but there were many sessions in which one or more channels
failed to produce usable recordings, for various reasons.  Here is a
summary of the distinct patterns in the CHANNELS field:

    733  ABCDEFGH
    146  ABCDEFG_
      1  ABCDE_GH
      7  ABC_EFGH
      2  AB_DEFGH
    846  _BCDEFGH
     32  _BCDEFG_
      2  _BC_EFGH
      1  _B_DEFGH
     10  __CDEFGH
    170  ___DEFGH
   ----
   1950  sessions total

This shows that two channels (E, G) worked as intended in all
sessions; channel F failed in only one session, but all other
channels worked in that session; channel A failed in more than half
the sessions (1061 in all), and channels B, C, D and H also failed in
some of those sessions; B failed a total of 180 times, C failed 173
times, D failed 9 times, and H failed 178 times.

Note that the term "worked as intended" is a generalization; there
may be portions of audio on a given channel in a given session where
a typical human user would say that the transmission failed, because
one or more utterances were unintelligible.  The metrics used to
determine whether a given channel "succeeded" were based on aggregate
signal-processing measures over the entire session for the given
channel (described in more detail in section 5 below), and not on the
intelligibility of utterances contained in the session.
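For readers who want to verify the failure counts cited above, the
following Python sketch recomputes them from the CHANNELS patterns
(the tallies are hard-coded here for illustration; in practice the
CHANNELS values would be read from column 4 of
retrans_fileset_info.tab):

  # Tally per-channel failures from the CHANNELS patterns above.

  from collections import Counter

  patterns = {
      "ABCDEFGH": 733, "ABCDEFG_": 146, "ABCDE_GH": 1, "ABC_EFGH": 7,
      "AB_DEFGH": 2, "_BCDEFGH": 846, "_BCDEFG_": 32, "_BC_EFGH": 2,
      "_B_DEFGH": 1, "__CDEFGH": 10, "___DEFGH": 170,
  }

  failures = Counter()
  for pattern, n_sessions in patterns.items():
      for slot, letter in zip(pattern, "ABCDEFGH"):
          if slot == "_":          # underscore = channel failed
              failures[letter] += n_sessions

  # Expected result: A=1061, B=180, C=173, D=9, F=1, H=178;
  # E and G never failed.
  print(dict(failures), "of", sum(patterns.values()), "sessions")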
4.0 Description of Data Files

4.1 FLAC-compressed Audio

All audio files are presented here as single-channel, 16-bit PCM,
16000 samples per second; lossless FLAC compression is used on all
files; when uncompressed, the files have typical "MS-WAV" (RIFF) file
headers.

All the source CTS audio data, whether previously published in the
Fisher and CallFriend corpora or newly collected for RATS, was
originally captured as 8-bit mu-law or a-law, 8000 samples per
second.  That original format has been converted to 16-bit, 16-kHz
for consistency and ease of use in combination with the original
capture format of the retransmission audio channels.

4.2 Tab-Delimited Annotations

All RATS annotation files are presented as tab-delimited tables with
12 columns per row, as follows:

   1   data partition (train, dev-1, dev-2)
   2   file_ID ("{iiiii}_{lng}_src" or "{iiiii}_{jjjjj}_{lng}_{c}")
   3   start time (floating-point seconds)
   4   end time (floating-point seconds)
   5   Speech Activity Detection (SAD) segment label
   6   SAD provenance
   7   speaker ID
   8   SID provenance
   9   language ID
  10   LID provenance
  11   transcript
  12   transcript provenance

This layout was designed to provide a consistent tabular format for
use across all RATS tasks; as such, only a subset of the fields are
relevant to KWS annotation: fields 1-6 and fields 9-12.  The others
(speaker ID, SID provenance) are not relevant, and are left empty in
the KWS annotation files.

The "language ID" field shows the expected language of the source
audio file (alv or fas); the "LID provenance" in all cases is shown
as "original", to indicate that, with respect to the KWS annotations,
no additional effort was made to confirm the language being used in
any given utterance in the file.

Field 5 (SAD segment label) can have one of the following values:

  S  : speech segment
  NS : non-speech segment
  NT : "button-off" segment (not transmitted according to PTT log)
  RX : "button-off" segment (not transmitted according to RMS scan)

Field 6 (SAD provenance) can have one of the following values:

  "manual"   : time-stamp boundaries were set by RATS SAD annotators

  "original" : time-stamp boundaries for the segment were drawn from
               the original transcription (i.e. set by a transcriber,
               not necessarily following RATS annotation conventions
               for marking speech boundaries)

  "automatic": time-stamps result from integration of original/manual
               labels with subsequent automatic signal processing

The "src" channel segment labels are all either "original" or
"manual", and comprise only "S" and "NS" labels.  The other labels,
which co-occur with "automatic" provenance, apply only to transmitted
audio channels.

Field 11 (transcript) is typically UTF-8 Arabic text, though the
transcription conventions provide for code switching into English or
other Latin-based languages, as well as a variety of annotations
rendered in ASCII.

Field 12 (transcript provenance) is either "original" or "manual".
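As a reading aid for the annotation tables, here is a minimal Python
sketch, assuming the 12-column layout above with no header row; the
file path in the usage example is hypothetical:

  # read_kws_tab.py -- minimal reading sketch for the 12-column
  # annotation layout of section 4.2; not part of the corpus tooling.

  COLUMNS = [
      "partition", "file_id", "start", "end",
      "sad_label", "sad_prov", "speaker_id", "sid_prov",
      "lang_id", "lid_prov", "transcript", "trans_prov",
  ]

  def read_annotations(path):
      """Yield one dict per row of a RATS KWS .tab file (12 tab-
      delimited columns, assumed to have no header row)."""
      with open(path, encoding="utf-8") as f:
          for line in f:
              rec = dict(zip(COLUMNS, line.rstrip("\n").split("\t")))
              rec["start"] = float(rec["start"])
              rec["end"] = float(rec["end"])
              yield rec

  # Example: total speech time in one (hypothetical) channel file.
  # "S" segments are speech; "NT"/"RX" mark speech regions that were
  # never transmitted on this channel.
  speech = [r for r in read_annotations(
      "data/train/kws/alv/G/12345_67890_alv_G.tab")
      if r["sad_label"] == "S"]
  print(sum(r["end"] - r["start"] for r in speech), "seconds of speech")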
5.0 Description of Audio Collection Process

5.1 Transceiver Channel Specifications

The layout of hardware on the eight transceiver channels is as
follows:

        ::  Transmitter          ::  Receiver             :: RF Band /
   CHN  ::  Make      Model      ::  Make      Model      :: Modulation
   --------------------------------------------------------------------
    A   ::  Motorola  HT1250     ::  AOR       AR5001/D   :: UHF/NFM
    B   ::  Midland   GXT1050*   ::  AOR       AR5001/D   :: UHF/NFM
    C   ::  Midland   GXT1050*   ::  TenTec    RX400      :: UHF/NFM
    D   ::  Galaxy    DX2547     ::  Icom      IC-R75     :: HF/SSB
    E   ::  Icom      IC-F70D    ::  Icom      IC-R8500   :: VHF/NFM
    F   ::  Trisquare TSX300     ::  Trisquare TSX300     :: UHF/FHSS
    G   ::  Vostek    LX-3000    ::  Vostek    VRX-24LTS  :: UHF/WFM
    H   ::  Magnum    1012 HT    ::  TenTec    RX340      :: HF/AM

Explanation of "RF Band / Modulation" acronyms:

  HF:   High Frequency
  VHF:  Very High Frequency
  UHF:  Ultra High Frequency
  AM:   Amplitude Modulation
  FHSS: Frequency Hopping Spread Spectrum
  NFM:  Narrow-band Frequency Modulation
  SSB:  Single-Side-Band
  WFM:  Wide-band Frequency Modulation

5.2 Recording, Signal Processing, Alignment and Quality Control

The transmission and capture of eight simultaneous channels from a
single source audio file ran as a continuous, automated process.  It
was managed via a database, in which source audio files were assigned
a numeric "src_id" as they were queued up for transmission, and each
resulting set of 8-channel received audio files was assigned a common
"sig_id" when the transmission system created the recordings.  These
"src_id" and "sig_id" numbers were used in the names of the various
audio files.

As the recordings were uploaded from the capture system to network
storage, additional automated processes were applied to establish the
time offsets for "button-on" regions in the PTT channels, measure
signal levels at 20 msec intervals on each file, and do
cross-correlation based alignment between the source audio data and
channel G (which was generally the least degraded, and was always
engaged, rather than being controlled by PTT button events).

Frame-based RMS signal level measures were used to determine the
"peak-to-valley" dynamic range for each transceiver audio file; if
the peak energy or the difference between the highest and lowest
frame energy never reached given thresholds (established
heuristically for each channel), the particular file was flagged as
unusable in the database.

"Button-on" regions were identified on the basis of time-stamp log
data from the voice-activated relay (VAR) system that triggered
button events for transmission on channels A-E and H.  For channel F,
the PTT button transitions were identified using a simple signal
analysis process for finding peak-clipped transients in the audio.

For channels A, B, C, E and H, the frame-based RMS data was also used
in combination with the log-based "button-on" regions, to identify
portions where the radio carrier signal dropped out prematurely
(i.e. where transmission on a given channel stopped before the VAR
turned the transmit button off - this could happen on extended
utterances, if the utterance duration exceeded operating parameters
for the transceiver, or the device overheated, etc.).
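To make the screening procedure concrete, here is an illustrative
Python sketch of frame-based RMS measurement and dynamic-range
screening, using the third-party "soundfile" package to read FLAC;
the numeric thresholds below are placeholders, not the heuristic
per-channel values actually used:

  # rms_screen.py -- illustrative sketch of the frame-based RMS
  # screening described in section 5.2; thresholds are placeholders.

  import numpy as np
  import soundfile as sf

  FRAME_SEC = 0.020  # signal levels were measured at 20 msec intervals

  def frame_rms_db(path):
      """Return per-frame RMS levels in dB for a mono audio file."""
      samples, rate = sf.read(path)
      frame_len = int(rate * FRAME_SEC)
      n_frames = len(samples) // frame_len
      frames = samples[:n_frames * frame_len].reshape(n_frames, frame_len)
      rms = np.sqrt(np.mean(frames ** 2, axis=1))
      return 20.0 * np.log10(np.maximum(rms, 1e-10))

  def is_usable(path, min_peak_db=-30.0, min_range_db=15.0):
      """Flag a file unusable if its peak level or its peak-to-valley
      dynamic range never reaches the (placeholder) thresholds."""
      levels = frame_rms_db(path)
      peak, valley = levels.max(), levels.min()
      return peak >= min_peak_db and (peak - valley) >= min_range_db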
Based on the results of these processes involving PTT events in each
session, the original "speech / non-speech" (S/NS) annotations on the
source audio data have been modified by further subdividing the "S"
regions, and assigning new labels to certain portions:

  "NT": "button-off" state indicated by VAR log or channel F analysis
  "RX": loss of carrier detected by frame-based RMS measures

In effect, both NT and RX segments in annotation files indicate that
(some portion of) a speech region was not transmitted.  The
difference between NT and RX has to do with how they are asserted,
and how they are distributed among the various channels:

 - NT regions are based on the VAR time-stamp log, and will be found
   consistently across all the PTT-mediated channels (A-F, H) for a
   given instance within a given transmission session; there may be
   slight differences in NT offsets for channel F relative to the
   others, because button events on channel F are determined directly
   from signal analysis, rather than from the VAR log.

 - RX regions are based on per-frame RMS levels in a given channel,
   and are typically unrelated to transmission behavior on other
   channels.

Regarding time alignment, initial study of the 8-channel capture data
showed small but noticeable and consistent differences in time
offsets among the 8 transceivers.  Relative to channel G, the other
seven channels showed time lags between 0.004 and 0.031 seconds.
Meanwhile, the start-up of the 8-channel capture was physically
isolated from the start-up for playing the source signal into the
transmitters - the transmission and reception/capture systems were
only approximately coordinated in time.

The "S/NS" annotations, which were done manually on the clean source
audio file, were adapted to each of the 8 transceiver audio files as
follows:

First, an alignment tool called "skewview", created and made
available by Dan Ellis of Columbia University during the early stages
of the RATS program, was used to establish an alignment offset
between the source audio file and channel G.  This tool (which can be
found at http://labrosa.ee.columbia.edu/projects/skewview/) uses
normalized cross-correlation to compute the optimal time offset
between two related signals on a frame-by-frame basis.  Its output
can reveal both gradual drifts in offsets (due to marginally
different sampling rates) and abrupt discontinuities (e.g. due to
sampling dropouts caused by buffer overruns, etc.); total absence of
coherent alignment over consecutive frames indicates that the two
signals being compared have nothing in common.

For the sessions that yielded a strong and steady correlation between
source audio and channel G, the median time offset for the session as
a whole was used as a baseline reference, and relative offsets for
the other 7 channels were added to this value.  Then the time stamps
and labels of the source annotations were transposed to the adjusted
alignment offset for each transceiver channel in the given session,
and this version of the annotation was used in combination with the
button-on timing data to produce the eight transceiver annotations.

A final step of QC was to arbitrarily select a speech ("S") segment
up to 5 seconds in duration from the session (typically from the
middle or latter part of the source audio file), use the computed
alignment offsets for each channel to extract the corresponding
transceiver segment (given that these were also labeled "S" - i.e.
had presumably been transmitted as expected), and then use skewview
again, comparing the source audio extract to each channel extract.
If this run yielded less than 1 msec offset difference between the
source and each transceiver extract, the session data and alignments
were flagged as fully successful.
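The alignment idea can be illustrated with a toy normalized
cross-correlation sketch in Python; this is an independent stand-in
for the technique, not the skewview implementation, and it estimates
only a single global offset rather than skewview's frame-by-frame
analysis:

  # align_offset.py -- toy illustration of estimating a global time
  # offset between two related signals via normalized cross-correlation.

  import numpy as np

  def estimate_offset(source, channel, rate, max_lag_sec=0.1):
      """Return the lag (seconds) at which `channel` best matches
      `source`, searching 0 .. max_lag_sec (channels lag the source)."""
      max_lag = int(max_lag_sec * rate)
      src = (source - source.mean()) / (source.std() + 1e-12)
      chn = (channel - channel.mean()) / (channel.std() + 1e-12)
      scores = []
      for lag in range(max_lag + 1):
          n = min(len(src), len(chn) - lag)
          scores.append(np.dot(src[:n], chn[lag:lag + n]) / n)
      return int(np.argmax(scores)) / rate

  # Toy check: delay a noise signal by 10 msec and recover the offset.
  rng = np.random.default_rng(0)
  sig = rng.standard_normal(16000)
  delayed = np.concatenate([np.zeros(160), sig])  # 160 samples = 10 msec
  print(estimate_offset(sig, delayed, rate=16000))  # prints ~0.010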
6.0 Description of KWS Annotation Process

6.1 Transcription

The data for both languages in this task were drawn from previously
published CTS speech and transcript corpora, and were supplemented
with CTS data that was newly collected under the RATS program.  The
table below breaks down the corpus content by language and source;
LDC corpus catalog-IDs are provided for the other publications that
contain the same source data:

  CORPUS    LNG  #FILES  #HOURS
  FISHER    alv    1242   196.5  (LDC2006S29, LDC2006T07)
  RATS      alv     108    27.2
  CALLFRND  fas     181    77.8  (LDC2014S01, LDC2014T01)
  RATS      fas     419    87.3

Note that the number of files refers to the number of single-channel
call sides; for Fisher and CallFriend calls, these call sides
typically fall into pairs to form two-sided conversations.  (The
"source_file_info" table indicates which files form conversation
pairs.)  The RATS calls involved a small number of recruiters making
numerous calls to other people, so only the callee side of the
conversation is included in the corpus (not the claque side, i.e. the
side spoken by the recruited frequent callers).

6.1.1 Arabic Transcription

The guidelines provided to Arabic transcribers for the RATS CTS data
were equivalent to those used in the original Fisher Levantine Arabic
corpus.  The resulting UTF-8 Arabic script text is not diacritized.

6.1.2 Farsi Transcription

For the CallFriend Farsi calls, which were originally recorded in
1995-96, the initial transcripts were created in 2000-01 using a
quasi-phoneme-based Romanization rather than Arabic script
orthography.  Some years later, researchers outside the LDC produced
a lexicon for mapping the Romanized word forms into standard Persian
orthography using Arabic script letters.  The LDC used this mapping
lexicon to transliterate most of the transcript content, but the
lexicon did not provide complete coverage, and yielded a quantity of
ambiguous mappings (one Romanized form mapping to more than one
script form, or two Romanized forms mapping to the same script form).
Farsi annotators went over each transcript using a customized editing
tool to focus on and resolve all unmapped or ambiguous tokens (a
schematic sketch of this resolution step appears at the end of
section 6.1).

The RATS Farsi calls were transcribed by native Farsi speakers using
guidelines similar in nature and detail to those used for Arabic.
The goal was to produce a verbatim, time-aligned transcript with
minimal markup.  The elements of the transcription process include
producing verbatim transcription, time-aligned segment boundaries,
speaker identification (A or B), and standard treatment of common
spoken phenomena.

6.1.3 Transcript Quality Control

In both Arabic and Farsi transcription of new RATS calls, automatic
SAD-based segmentation was done first to provide a starting point for
transcription; transcribers were instructed to adjust the boundaries
of the segments where necessary, to ensure that no speech was cut off
at the beginning or end of a segment and that segments did not
contain long periods of non-speech.  Transcribers also deleted any
segments that did not contain speech, and inserted segments for any
speech that was present in the audio but not detected by the SAD
system.  Transcribers then transcribed the content of each segment in
Arabic script orthography without diacritic marks.

LDC conducted a quality audit of the transcripts, in which a
native-language annotator compared the transcription and audio for
three randomly chosen 1-minute sections of the recording.  Auditors
checked the accuracy of segmentation and transcript content.
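Here is the schematic sketch referenced in section 6.1.2, showing how
Romanized tokens can be classified against a mapping lexicon; the
lexicon entries and token forms below are invented placeholders, not
actual corpus data:

  # romanization_map.py -- schematic sketch of the lexicon-based
  # transliteration pass described in section 6.1.2.  The lexicon
  # and tokens are invented placeholders, not actual corpus data.

  lexicon = {
      # Romanized form -> list of candidate Persian-script forms
      "salam": ["\u0633\u0644\u0627\u0645"],                  # unambiguous
      "sher": ["\u0634\u0639\u0631", "\u0634\u06cc\u0631"],   # ambiguous
  }

  def map_token(token):
      """Classify a Romanized token as mapped, ambiguous, or unmapped;
      ambiguous and unmapped tokens were resolved manually by the
      Farsi annotators using a customized editing tool."""
      candidates = lexicon.get(token)
      if candidates is None:
          return ("unmapped", token)        # needs manual transliteration
      if len(candidates) > 1:
          return ("ambiguous", candidates)  # needs manual disambiguation
      return ("mapped", candidates[0])

  for tok in ["salam", "sher", "xoshhal"]:
      print(tok, "->", map_token(tok))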
6.2 Keyword Selection

Potential target keywords were selected by Leidos from the
transcripts on the basis of overall word frequencies, to fall within
a given range of target-word likelihood per hour of speech.  All the
words selected as targets were then reviewed by native-speaker
annotators at the LDC, to confirm that each selection was a regular
word or multi-word expression of more than three syllables.

Annotators also had the option of giving a rough English translation
of the word to assist with quality control.  For items that were
rejected, annotators had the option of giving reasons for the
rejection.  Items that consisted of multiple words were acceptable as
keywords as long as all constituent words were complete; an item
consisting of parts of incomplete words was rejected.

--------
README created by David Graff
Updated by Stephanie Strassel
10 November 2015