                        README FILE FOR:

            RATS Language Identification (LID) Corpus

                  LDC Catalog-ID: LDC2018S10

    Authors: David Graff, Xiaoyi Ma, Stephanie Strassel,
             Kevin Walker, Karen Jones

0.0 Overview of README contents

  1.0 Introduction
  2.0 Corpus Structure
    2.1 Organization of Directories
    2.2 File Name Patterns
  3.0 Structure of Documentation Tables
    3.1 FLAC file information for all audio: all_flac_info.tab
    3.2 Content summary for all audio: fileset_info.tab
    3.3 Annotation data: annotations/*.tab
  4.0 Description of Data Files
    4.1 FLAC-compressed Audio
    4.2 Tab-Delimited Annotations
  5.0 Description of Audio Collection Process
    5.1 Transceiver Channel Specifications
    5.2 Recording, Signal Processing, Alignment and Quality Control
  6.0 Description of LID Annotation Process
    6.1 Annotation

1.0 Introduction

The RATS LID corpus comprises audio and annotation data created by
the LDC to provide training, development and initial test sets for
the Language Identification (LID) task in the DARPA RATS (Robust
Automatic Transcription of Speech) program.

The goal of the RATS program was to develop Human Language Technology
(HLT) systems capable of performing speech detection, language
identification, speaker identification and key-word spotting on the
severely degraded audio signals that are typical of various radio
communication channels, especially those employing various types of
handheld portable transceiver systems.

To support that goal, the LDC assembled a specialized system for
transmission, reception and digital capture of audio data, such that
a single source audio signal could be distributed and recorded over
eight distinct transceiver configurations simultaneously.  The
relatively clear source audio data was annotated manually to provide
the labels needed for a given HLT task, and these annotations were
then projected onto the corresponding eight channels of audio that
were recorded from the radio receivers.  Further details are provided
in later sections of this README file.

The source audio used to create the RATS LID corpus was drawn from
two types of sources:

 (a) conversational telephone speech (CTS) recordings, taken either
     from previous LDC CTS corpora, or from CTS data collected
     specifically for the RATS program;

 (b) portions of VOA broadcast news recordings containing narrow-band
     speech ("Broadcast Narrow-Band Speech", or BNBS), taken from
     data used in the 2009 NIST Language Recognition Evaluation
     (LRE2009).

The RATS LID task focused on five languages, listed here with the
abbreviations used for them in this corpus:

  alv : Levantine Arabic
  fas : Farsi
  prs : Dari
  pus : Pashto
  urd : Urdu

Additional data was selected from a wide assortment of other
languages, to serve as "background model" training and "distractor"
test segments; all such data are grouped together under the single
'pseudo-language' label "mul".

The amount of source audio data in the corpus can be summarized by
language as follows:

         #SOURCE  #SOURCE
  LNG     FILES    HOURS
  ------------------------
  alv      4322    165.1
  fas      1012     38.6
  prs       251      9.6
  pus      3047    112.6
  urd      2201     80.9
  mul      3948    203.7
  ------------------------
  Total   14781    610.4

All files are single-channel, and the hours reported above are based
on file duration.  As explained in section 5 below, the density of
speech in the source files is relatively high, because most files
were either assembled by concatenation of speech segments, or drawn
from broadcast sources, where speech tends to be much denser than in
CTS.
(It's notable that the "mul" category is actually a mixture of BNBS
and CTS source data, where the CTS sources were not subject to
concatenation of speech regions; the "mul" files from CTS sources
tend to be longer and have relatively less dense speech, but the
"nominal" amount of actual speech per file is roughly equivalent.)

Between 1 and 8 transmission channels were recorded for each source
file, so the total amount of audio in the corpus, counting both
source and retransmission data, is 5437.3 hours in 127,282 files.

Acknowledgments:

This material is based upon work supported by the Defense Advanced
Research Projects Agency (DARPA) under Contract No. D10PC20016.  The
content does not necessarily reflect the position or the policy of
the Government, and no official endorsement should be inferred.

We would like to express special thanks to Dan Ellis at Columbia
University, and John Hansen at the University of Texas at Dallas, for
their substantial technical assistance during the creation of the
RATS corpus.  Henry Goldberg and David Longfellow at Leidos (formerly
SAIC) provided the partitioning of the corpus to support the RATS
evaluations.

2.0 Corpus Structure

2.1 Organization of Directories

The corpus is presented on a single hard disk; the overall directory
structure is as follows:

  index.html
  docs/  -- the tables here are explained in detail in section 3
    README.txt         -- this file
    all_flac_info.tab  -- paths and attributes of all audio files
    fileset_info.tab   -- file-ID, source-type, radio-channels, language
    annotations/       -- the annotation files are explained in section 4.2
    RATS_LID_Guidelines.pdf
  data/
    dev-1/
      audio/
        {A,B,C,D,E,F,G,H,src}/
    dev-2/
      audio/
        {A,B,C,D,E,F,G,H,src}/
    train/
      audio/
        alv/
          {A,B,C,D,E,F,G,H,src}/
        fas/
          {A,B,C,D,E,F,G,H,src}/
        prs/
          {A,B,C,D,E,F,G,H,src}/
        pus/
          {A,B,C,D,E,F,G,H,src}/
        urd/
          {A,B,C,D,E,F,G,H,src}/
        mul/
          {A,B,C,D,E,F,G,H,src}/

To summarize the organization of the data:

 - The primary division marks the inventories designated for use as
   training, initial development set (dev-1), and initial evaluation
   set (dev-2).  These partitions were based on sampling
   recommendations provided by a team at SAIC, whose task was to
   administer and score the evaluations of the HLT systems developed
   for RATS.

 - Each partition contains an "audio" directory.  (In data releases
   for other RATS tasks, such as KWS, each partition also contained a
   directory for the given type of annotation; for the LID task,
   there are relatively few annotations on individual files, so these
   have been grouped together into tables of annotations in the
   "docs" directory; see section 4.2 below.)

 - In the "train" partition, audio is further subdivided by language,
   with "mul" containing a mixed collection of "non-target"
   languages; this language subdivision does not apply to the "dev"
   partitions.

 - The lowest directory level divides the data according to the
   channel condition, i.e. source audio (src) or one of the eight
   transceiver channels, labeled "A" - "H".

2.2 File Name Patterns

All audio file names consist of a partition label, a numeric ID, a
channel identifier, and a ".flac" file extension.  The partition
label is one of: trn, dv1, dv2; the most significant digit of the
numeric ID is 0 for train, 1 for dv1 and 2 for dv2.  For example:

 -- in train/audio/: src/trn_000001_src.flac, A/trn_000001_A.flac, ...
 -- in dev-1/audio/: src/dv1_111373_src.flac, A/dv1_111373_A.flac, ...
 -- in dev-2/audio/: src/dv2_200001_src.flac, A/dv2_200001_A.flac, ...
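The pattern above is easy to handle programmatically.  As a minimal
illustration (this sketch is not part of the corpus distribution, and
the helper names "parse_name" and "build_path" are ours), the
following Python code splits a file name into its three parts and
rebuilds the corresponding path under data/:

  import re
  from pathlib import Path

  NAME_RE = re.compile(r"^(trn|dv1|dv2)_(\d{6})_(src|[A-H])\.flac$")
  PARTITION_DIR = {"trn": "train", "dv1": "dev-1", "dv2": "dev-2"}

  def parse_name(filename):
      """Split e.g. 'trn_000001_A.flac' into (partition, ID, channel)."""
      m = NAME_RE.match(filename)
      if m is None:
          raise ValueError("not a RATS LID audio file name: " + filename)
      return m.group(1), m.group(2), m.group(3)

  def build_path(filename, language=None):
      # The train partition inserts a language subdirectory (e.g.
      # "alv") between "audio" and the channel directory; the dev
      # partitions do not.
      prt, idnum, chan = parse_name(filename)
      parts = ["data", PARTITION_DIR[prt], "audio"]
      if prt == "trn":
          if language is None:
              raise ValueError("train files need a language subdirectory")
          parts.append(language)
      parts.append(chan)
      return Path(*parts) / filename

For example, build_path("dv1_111373_A.flac") returns
data/dev-1/audio/A/dv1_111373_A.flac, and
build_path("trn_000001_src.flac", language="alv") returns
data/train/audio/alv/src/trn_000001_src.flac.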
3.0 Structure of Documentation Tables

3.1 FLAC file information for all audio: all_flac_info.tab

This table has one row for every FLAC audio file, with six
tab-delimited columns per row, as follows:

  Col#  Content
   1    FILEPATH -- starting at "data/"
   2    SECONDS  -- audio duration in seconds (floating point)
   3    FLC_KB   -- flac-compressed file size in kilobytes
   4    UNC_KB   -- uncompressed audio size in kilobytes
   5    C_RATIO  -- compression ratio (floating point)
   6    SAMP_MD5 -- checksum of uncompressed sample data

Note that the "SAMP_MD5" value, which comes from the FLAC header of
each file, is computed over just the uncompressed sample data,
excluding all file-header content; it therefore differs from the MD5
checksum of the flac file taken as a whole.

3.2 Content summary for all audio: fileset_info.tab

This table has one row for each clean-source audio file, with four
tab-delimited columns, as follows:

   1    FILE_ID  -- {PRT}_{IDNUM}, e.g. "trn_000001"
   2    SRC_TYPE -- one of: "cts_concat", "nist_lre", "voa_bnbs"
   3    CHANNELS -- indicates presence/absence of radio channels
   4    LANGUAGE -- see below

SRC_TYPE indicates whether the audio content was created by
concatenation of segments from CTS data, or copied directly from the
NIST LRE 2009 Test Set, or copied directly from VOA BNBS samples that
were provided as supplementary training data for NIST LRE 2009
participants.

CHANNELS is an eight-character string in which a given position will
either be the letter for the given channel (ABCDEFGH), or an
underscore character, indicating that the particular channel failed
during that session.  In most sessions, all channels worked as
intended, but there were many sessions in which one or more channels
failed to produce usable recordings, for various reasons.  Here's a
summary of the distinct patterns in the CHANNELS field:

   9848  ABCDEFGH
   1294  ABCDEFG_
      4  ABC_EFGH
      1  AB__E___
   3211  _BCDEFGH
     65  _BCDEFG_
      4  _B_DEFGH
      3  _B___FG_
      1  _B______
      9  __CDEFGH
      1  __CDEFG_
      1  __C_EFGH
    312  ___DEFGH
     25  ___DEFG_
      1  _____FG_
      1  _______H
  -----
  14781  sessions total

Channels F and G are lacking in only three sessions, and E is lacking
in only 6 sessions.  Channel A failed in about 25% of the sessions
(3634 failures, 11147 successes); channel H had about 9% loss (1391
failures, 13390 successes); channels B and C each had about 2% loss
(350 and 348 failures, respectively).

Note that the term "success" here is a generalization; there may be
portions of audio on a given channel in a given session where a
typical human user would say that the transmission failed, because
some portion(s) may be unintelligible.  The metrics used to determine
whether a given channel "succeeded" were based on aggregate signal
processing measures over the entire session for the given channel
(described in more detail in section 5 below), and not on the
intelligibility of utterances contained in the session.

3.3 Annotation data: annotations/*.tab

A separate annotation table file is provided for each channel (A-H
and src) in each partition.  Contents of these files are explained in
section 4.2.

4.0 Description of Data Files

4.1 FLAC-compressed Audio

All audio files are presented here as single-channel, 16-bit PCM,
16000 samples per second; lossless FLAC compression is used on all
files; when uncompressed, the files have typical "MS-WAV" (RIFF) file
headers.

All the source CTS audio data, whether previously published in
Fisher, CallFriend or NIST LRE corpora, or newly collected for RATS,
was originally captured as 8-bit mu-law or a-law, 8000 samples per
second.  That original format has been converted to 16-bit, 16 kHz,
for consistency and ease of use in combination with the original
capture format of the retransmission audio channels.
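As an illustration of the SAMP_MD5 note in section 3.1, the following
Python sketch recomputes that checksum from the decoded samples and
reads the stored value from the FLAC header for comparison.  It
assumes the third-party "soundfile" and "mutagen" packages and a
little-endian host (FLAC defines this checksum over the raw
little-endian PCM sample stream); it is a convenience sketch, not
part of the corpus tooling:

  import hashlib
  import mutagen.flac
  import soundfile as sf

  def sample_md5(path):
      # MD5 over the raw 16-bit PCM sample stream only, excluding all
      # file-header content, which is what SAMP_MD5 contains.
      samples, rate = sf.read(path, dtype="int16")
      assert rate == 16000, "all corpus audio is 16000 samples/second"
      return hashlib.md5(samples.tobytes()).hexdigest()

  def header_md5(path):
      # The checksum stored in the FLAC STREAMINFO block; mutagen
      # exposes it as an integer, formatted here as 32 hex digits.
      return format(mutagen.flac.FLAC(path).info.md5_signature, "032x")

  # e.g. sample_md5(p) == header_md5(p) should hold for every file.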
4.2 Tab-Delimited Annotations

All RATS annotation files are presented as tab-delimited tables with
12 columns per row, as follows:

   1   data partition (train, dev-1, dev-2)
   2   file_ID ("{prt}_{iiiiii}_src" or "{prt}_{iiiiii}_{c}")
   3   start time (floating-point seconds)
   4   end time (floating-point seconds)
   5   Speech Activity Detection (SAD) segment label
   6   SAD provenance
   7   speaker ID
   8   SID provenance
   9   language ID
  10   LID provenance
  11   transcript
  12   transcript provenance

This layout was designed to provide a consistent tabular format for
use across all RATS tasks; as such, only a subset of fields are
relevant to LID annotation: fields 1-6 and fields 9-10; the others
(speaker ID, SID provenance, transcript, transcript provenance) are
not relevant, and are left empty in the LID annotation files.  Fields
5 and 6 are relevant only to the radio channels that use a
"push-to-talk" (PTT) protocol.

The *_lid_src.tab files have only one line of annotation per audio
file; start time is always 0.0, end time is always end-of-file,
fields 5-8 and 11-12 are always empty, and fields 9-10 contain the
only relevant annotations.  Since channel G does not use the PTT
protocol, the *_lid_G.tab files are nearly identical to the
*_lid_src.tab files, except that fields 5 and 6 always contain "T"
(transmitted) and "automatic", respectively.

For the other seven radio channels, field 5 (SAD segment label) can
have one of the following values:

  T  : "button-on" segment (Transmitted)
  NT : "button-off" segment; not transmitted according to PTT log
  RX : segment expected to be "button-on" according to PTT log, but
       rejected based on RMS scan or "findNT" processing (this label
       appears only in dev-2 and train data)

Portions of the dev-1 audio data were submitted to an "adjudication"
audit: when all RATS LID systems that were tested in an early
evaluation were scored as having "missed" the expected LID result on
certain segments, these segments were reviewed by LDC auditors to
determine if speech was entirely lacking in the segment (e.g. due to
a transmission failure), or if channel distortions had rendered the
speech fully unintelligible.  Instead of the "RX" label used in
training and dev-2 annotations, the dev-1 files have the following
labels (in addition to "T" and "NT"):

  RS : segment was judged to contain no speech
  RI : segment was deemed to be unintelligible to a native speaker

Field 6 (SAD provenance) can have one of the following values:

  "manual"    : time-stamp boundaries were set by RATS SAD annotators
  "automatic" : time-stamps result from integration of original/manual
                labels with subsequent automatic signal processing

In field 9 (Language ID), the language label is either the 3-letter
abbreviation for one of the main RATS languages (alv, fas, prs, pus,
or urd), or is a string of varying length with the full name of a
given distractor language.  Because distractor segments have been
drawn from a variety of NIST LRE test sets and related training
material, the amount of detail in the distractor language labels
varies.  In particular, there are a few different regional varieties
represented for English and Mandarin Chinese ("zho.mandarin"), and a
few other languages, but in each case, there is a quantity of
segments that were not originally identified as to regional variety;
e.g. some segments are "zho.mandarin.mainland",
"zho.mandarin.taiwan", etc., while others are simply "zho.mandarin".

Field 10 (LID provenance) is either "manual" or "original"; the
former applies to data newly collected for the RATS project and
audited to confirm that speakers were using one of the primary
languages, while the latter applies to data derived from NIST LRE
test sets.
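Because the 12-column layout is fixed, the annotation tables can be
read with standard tab-delimited parsing.  The following Python
sketch (the field names are ours, chosen to mirror the list above)
extracts just the fields that carry LID-relevant information:

  import csv
  from collections import namedtuple

  Row = namedtuple("Row", [
      "partition", "file_id", "start", "end",
      "sad_label", "sad_prov", "speaker_id", "sid_prov",
      "lang_id", "lid_prov", "transcript", "trans_prov"])

  def read_lid_table(path):
      # Yield (file_id, start, end, SAD label, language, LID
      # provenance) per row; the other six fields are empty in the
      # LID tables and are ignored here.
      with open(path, newline="", encoding="utf-8") as f:
          for fields in csv.reader(f, delimiter="\t"):
              row = Row(*fields)
              yield (row.file_id, float(row.start), float(row.end),
                     row.sad_label, row.lang_id, row.lid_prov)

Summing (end - start) over rows whose SAD label is "T", grouped by
the field-9 language label, then gives a per-language tally of
transmitted speech for a given channel's table.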
5.0 Description of Audio Collection Process

5.1 Transceiver Channel Specifications

The layout of hardware on the eight transceiver channels is as
follows:

      ::  Transmitter         ::  Receiver             ::  RF Band /
  CHN ::  Make      Model     ::  Make      Model      ::  Modulation
  -------------------------------------------------------------------
   A  ::  Motorola  HT1250    ::  AOR       AR5001/D   ::  UHF/NFM
   B  ::  Midland   GXT1050*  ::  AOR       AR5001/D   ::  UHF/NFM
   C  ::  Midland   GXT1050*  ::  TenTec    RX400      ::  UHF/NFM
   D  ::  Galaxy    DX2547    ::  Icom      IC-R75     ::  HF/SSB
   E  ::  Icom      IC-F70D   ::  Icom      IC-R8500   ::  VHF/NFM
   F  ::  Trisquare TSX300    ::  Trisquare TSX300     ::  UHF/FHSS
   G  ::  Vostek    LX-3000   ::  Vostek    VRX-24LTS  ::  UHF/WFM
   H  ::  Magnum    1012 HT   ::  TenTec    RX340      ::  HF/AM

Explanation of "RF Band / Modulation" acronyms:

  HF   : High Frequency
  VHF  : Very High Frequency
  UHF  : Ultra High Frequency
  AM   : Amplitude Modulation
  FHSS : Frequency Hopping Spread Spectrum
  NFM  : Narrow-band Frequency Modulation
  SSB  : Single-Side-Band
  WFM  : Wide-band Frequency Modulation

5.2 Recording, Signal Processing, Alignment and Quality Control

The transmission and capture of eight simultaneous channels from a
single source audio file ran as a continuous, automated process.  It
was managed via a database, in which source audio files were assigned
a numeric "src_id" as they were queued up for transmission, and each
resulting set of 8-channel received audio files was assigned a common
"sig_id" when the transmission system created the recordings.  These
"src_id" and "sig_id" numbers were used in the names of the various
audio files.

As the recordings were uploaded from the capture system to network
storage, additional automated processes were applied to establish the
time offsets for "button-on" regions in the PTT channels, measure
signal levels at 20 msec intervals on each file, and do
cross-correlation based alignment between the source audio data and
channel G (which was generally the least degraded, and was always
engaged, rather than being controlled by PTT button events).

Frame-based RMS signal level measures were used to determine the
"peak-to-valley" dynamic range for each transceiver audio file; if
the peak energy or the difference between highest and lowest frame
energy never reached given thresholds (established heuristically for
each channel), the particular file was flagged as unusable in the
database.

"Button-on" regions were identified on the basis of time-stamp log
data from the voice-activated relay (VAR) system that triggered
button events for transmission on channels A-E and H.  For channel F,
the PTT button transitions were identified using a simple signal
analysis process for finding peak-clipped transients in the audio.
For channels A, B, C, E and H, the frame-based RMS data was also used
in combination with the log-based "button-on" regions, to identify
portions where the radio carrier signal dropped out prematurely
(i.e. where transmission on a given channel stopped before the VAR
turned the transmit button off - this could happen on extended
utterances, if the utterance duration exceeded operating parameters
for the transceiver, or the device overheated, etc.).
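The frame-based measures described above can be sketched as follows
in Python; the 20 msec frame size is taken from this section, but the
threshold values and function names are illustrative stand-ins, not
LDC's actual per-channel settings:

  import numpy as np
  import soundfile as sf

  def frame_rms_db(path, frame_sec=0.020):
      # RMS level, in dB relative to full scale, over consecutive
      # 20 msec frames.
      samples, rate = sf.read(path, dtype="float64")
      n = int(rate * frame_sec)
      nframes = len(samples) // n
      frames = samples[:nframes * n].reshape(nframes, n)
      rms = np.sqrt((frames ** 2).mean(axis=1))
      return 20.0 * np.log10(np.maximum(rms, 1e-10))  # floor avoids log(0)

  def usable(path, peak_db=-30.0, range_db=15.0):
      # Keep a file only if the peak energy and the "peak-to-valley"
      # dynamic range both reach their (here: made-up) thresholds.
      db = frame_rms_db(path)
      return db.max() >= peak_db and (db.max() - db.min()) >= range_db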
Based on the results of these processes involving PTT events in each
session, the original "speech / non-speech" (S/NS) annotations on the
source audio data have been modified by further subdividing the "S"
regions, and assigning new labels to certain portions:

  "NT" : "button-off" state indicated by VAR log or channel F analysis
  "RX" : loss of carrier detected by frame-based RMS measures

In effect, both NT and RX segments in annotation files indicate that
(some portion of) a speech region was not transmitted.  The
difference between NT and RX has to do with how they are asserted,
and how they are distributed among the various channels:

 - NT regions are based on the VAR time-stamp log, and will be found
   consistently across all the PTT-mediated channels (A-F,H) for a
   given instance within a given transmission session; there may be
   slight differences in NT offsets for channel F relative to the
   others, because button events on channel F are determined directly
   from signal analysis, rather than from the VAR log.

 - RX regions are based on per-frame RMS levels in a given channel,
   and are typically unrelated to transmission behavior on other
   channels.

Regarding time alignment, initial study of the 8-channel capture data
showed small but noticeable and consistent differences in time
offsets among the 8 transceivers.  Relative to channel G, the other
seven channels showed time lags between 0.004 and 0.031 seconds.
Meanwhile, the start-up of the 8-channel capture was physically
isolated from the start-up for playing the source signal into the
transmitters - the transmission and reception/capture systems were
only approximately coordinated in time.

The "S/NS" annotations, which were done manually on the clean source
audio file, were adapted to each of the 8 transceiver audio files as
follows:

First, an alignment tool called "skewview", created and made
available by Dan Ellis of Columbia University during the early stages
of the RATS program, was used to establish an alignment offset
between the source audio file and channel G.  This tool (which can be
found at http://labrosa.ee.columbia.edu/projects/skewview/) uses
normalized cross-correlation to compute the optimal time offset
between two related signals on a frame-by-frame basis.  Its output
can reveal both gradual drifts in offsets (due to marginally
different sampling rates) and abrupt discontinuities (e.g. due to
sampling dropouts caused by buffer overruns, etc.); total absence of
coherent alignment over consecutive frames indicates that the two
signals being compared have nothing in common.

For the sessions that yielded a strong and steady correlation between
source audio and channel G, the median time offset for the session as
a whole was used as a baseline reference, and relative offsets for
the other 7 channels were added to this value.  Then the time stamps
and labels of the source annotations were transposed to the adjusted
alignment offset for each transceiver channel in the given session,
and this version of the annotation was used in combination with the
button-on timing data to produce the eight transceiver annotations.
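skewview computes offsets frame by frame; the simplified Python
sketch below (our code, not skewview's) shows only the core idea,
finding the single global lag that maximizes the (here unnormalized)
cross-correlation between two mean-removed recordings:

  import numpy as np
  import soundfile as sf
  from scipy.signal import correlate

  def estimate_offset(src_path, chan_path, max_lag_sec=2.0):
      # Return the lag, in seconds, that best aligns the channel
      # recording to the source audio (positive = channel is late).
      src, rate = sf.read(src_path, dtype="float64")
      chan, _ = sf.read(chan_path, dtype="float64")
      n = min(len(src), len(chan))
      src = src[:n] - src[:n].mean()
      chan = chan[:n] - chan[:n].mean()
      xc = correlate(chan, src, mode="full")
      lags = np.arange(-n + 1, n)       # sample lags for "full" mode
      keep = np.abs(lags) <= int(max_lag_sec * rate)
      best = lags[keep][np.argmax(xc[keep])]
      return best / rate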
A final step of QC was to arbitrarily select a speech ("S") segment
up to 5 seconds in duration from the session (typically from the
middle or latter part of the source audio file), use the computed
alignment offsets for each channel to extract the corresponding
transceiver segment (given that these were also labeled "S" -
i.e. had presumably been transmitted as expected), and then use
skewview again, comparing the source audio extract to each channel
extract.  If this run yielded less than 1 msec of offset difference
between the source and each transceiver extract, the session data and
alignments were flagged as fully successful.

6.0 Description of LID Annotation Process

6.1 Annotation

For the LID annotation task, LDC recruited native speakers of
Levantine Arabic, Pashto, Urdu, Farsi and Dari.  Annotators listened
to short recordings and determined whether the audio was in their
language.  The guidelines for the LID annotation task can be found in
docs/RATS_LID_Guidelines.pdf.

-----
README Created by Dave Graff
Updated by Stephanie Strassel
10 November 2015