Second DIHARD Challenge Development - Eleven Sources
LDC2021S10

February 2, 2021
Linguistic Data Consortium

1. Overview
===========

This release contains development audio data and annotation for the
Second DIHARD Challenge, with two exceptions:

- the child language domain audio and annotation are distributed in a
  separate LDC release (LDC2021S11)
- due to licensing restrictions, LDC is unable to release the CHiME-5
  audio files used for the DIHARD II multichannel condition (track 3
  and track 4); to obtain these files, please apply for and obtain the
  necessary license from the University of Sheffield:

      https://licensing.sheffield.ac.uk/product/chime5

For additional details regarding the challenge, please see the
evaluation plan:

    docs/second_dihard_eval_plan_v1.2.pdf

or consult the website:

    https://dihardchallenge.github.io/dihard2

For instructions on how to reconstruct the complete DIHARD II
development set (released during the challenge as LDC2019E31) from this
release and LDC2021S11, please see Section 7 of this README.

2. Directory structure
======================

- data/single_channel/flac/  --  single channel condition FLAC files
- data/single_channel/rttm/  --  single channel condition RTTM files
  containing the reference diarization
- data/single_channel/sad/  --  single channel condition HTK label
  files containing the reference speech segmentation
- data/single_channel/uem/  --  single channel condition UEM files for
  use with "dscore"; one UEM file per domain as well as a file
  "all.uem" that contains **ALL** scoring regions for **ALL**
  recordings
- data/multichannel/rttm/  --  multichannel condition RTTM files
  containing the reference diarization
- data/multichannel/sad/  --  multichannel condition HTK label files
  containing the reference speech segmentation
- data/multichannel/uem/  --  multichannel condition UEM files for use
  with "dscore"; one UEM file per domain as well as a file "all.uem"
  that contains **ALL** scoring regions for **ALL** recordings
- docs/file.tbl  --  listing of MD5 checksums, sizes, dates, and file
  names
- docs/README.txt  --  this file
- docs/sources.tbl  --  listing of domain, source, and language for
  each file
- docs/second_dihard_eval_plan_v1.2.pdf  --  evaluation plan
- tools/combine_dev_releases.py  --  Python script for reconstructing
  LDC2019E31

3. File formats
===============

3.1 Audio

All single channel condition audio is provided as 16 kHz, 16-bit,
mono-channel FLAC files. When multiple channels were present in the
source audio, these were remixed to a single channel for the delivery
version. Similarly, if the source audio was recorded at a sample rate
other than 16 kHz, it was resampled to 16 kHz. All audio conversion
was performed using SoX.

Note that the multichannel condition audio is **NOT** provided as part
of this release.
For these recordings, please apply for and obtain the necessary
license from the University of Sheffield:

    https://licensing.sheffield.ac.uk/product/chime5

3.2 Diarization

The diarization for each recording is stored as a NIST Rich
Transcription Time Marked (RTTM) file. RTTM files are space-delimited
text files containing one turn per line, each line containing ten
fields:

- type  --  segment type; should always be "SPEAKER"
- file id  --  unique identifier for the recording that the turn is on
  (e.g., "DH_0001")
- channel id  --  channel (1-indexed) that the turn is on; should
  always be "1"
- turn onset  --  onset of the turn in seconds from the beginning of
  the recording
- turn duration  --  duration of the turn in seconds
- orthography field  --  should always be "<NA>"
- speaker type  --  should always be "<NA>"
- speaker name  --  name of the speaker of the turn; should be unique
  within the scope of each file
- confidence score  --  system confidence (probability) that the
  information is correct; should always be "<NA>"
- signal lookahead time  --  should always be "<NA>"

While the RTTM format allows a single file to contain turns from
multiple recordings, the DIHARD RTTM files always contain turns from a
single recording; e.g., "DH_0001.rttm" contains only turns from
recording "DH_0001".

3.3 Speech segmentation

For each recording, a reference speech segmentation was generated by
merging all overlapping speaker turns. These segmentations are stored
as HTK label files ending in the ".lab" extension. Each file contains
one speech segment per line, each line containing three tab-delimited
fields:

- onset  --  onset of the speech segment in seconds from the beginning
  of the recording
- offset  --  offset of the speech segment in seconds from the
  beginning of the recording
- label  --  label of the segment; always "speech"

3.4 Scoring regions

The scoring regions for each recording are specified by un-partitioned
evaluation map (UEM) files.
These files contain one line per scoring region, each line consisting
of four space-delimited fields:

- file id  --  unique identifier for the recording that the scoring
  region is on (e.g., "DH_0001")
- channel  --  channel (1-indexed) that the scoring region is on
- onset  --  onset of the scoring region in seconds from the beginning
  of the recording
- offset  --  offset of the scoring region in seconds from the
  beginning of the recording

For each audio condition (single channel and multichannel) there is a
single UEM listing **ALL** scoring regions for **ALL** recordings. For
the single channel condition, this is:

    data/single_channel/uem/all.uem

while for the multichannel condition it is:

    data/multichannel/uem/all.uem

Additionally, UEMs have been provided for each individual domain of
the single channel condition:

    data/single_channel/uem/audiobooks.uem
    data/single_channel/uem/broadcast_interview.uem
    data/single_channel/uem/clinical.uem
    ...

4. Recording metadata
=====================

4.1 sources.tbl

Metadata for each recording is stored in "docs/sources.tbl".
This file is a tab-delimited table containing one recording per line,
each line having the following four fields:

- file id  --  unique identifier for the recording
- lang  --  predominant language of the recording; ISO 639-3 language
  code
- domain  --  domain of the recording; see Section 4.2 for details
- source  --  source of the recording; see Section 4.3 for details

4.2 Recording domains

Recordings are drawn from the following 12 domains:

- AUDIOBOOKS  --  amateur audiobooks
- BROADCAST-INTERVIEW  --  radio interviews
- CHILD  --  child language acquisition recordings*
- CLINICAL  --  Autism Diagnostic Observation Schedule (ADOS)
  interviews
- COURT  --  courtroom recordings
- DINNER  --  dinner party recordings
- MAPTASK  --  map tasks
- MEETING  --  meeting speech
- RESTAURANT  --  conversational speech recorded during lunches in
  restaurants
- SOCIO-FIELD  --  sociolinguistic interviews recorded in the field
- SOCIO-LAB  --  sociolinguistic interviews recorded in a laboratory
  setting
- WEBVIDEO  --  amateur video from sites such as YouTube

* All data from the CHILD domain is distributed as LDC2021S11.

4.3 Recording sources

Recordings are drawn from the following 12 sources:

- ADOS  --  unpublished ADOS interviews recorded at CHOP
- CHIME  --  CHiME-5 recordings
- CIR  --  unpublished LDC collection of restaurant conversation
- DCIEM  --  DCIEM map task (LDC96S38)
- LIBRIVOX  --  audiobook recordings from LibriVox
- MIXER6  --  interviews from MIXER6 (LDC2013S03)
- RT04S  --  meeting speech from the 2004 Spring NIST Rich
  Transcription (RT-04S) dev (LDC2007S11) and eval (LDC2007S12)
  releases
- SCOTUS  --  2001 U.S. Supreme Court oral arguments from the OYEZ
  project
- SEEDLINGS  --  child language recordings collected as part of
  SEEDLingS*
- SLX  --  sociolinguistic interviews drawn from SLX (LDC2003T15)
- VAST  --  web video collected as part of the Video Annotation for
  Speech Technologies (VAST) project
- YOUTHPOINT  --  unpublished corpus of student-led radio interviews
  conducted for YouthPoint, a 1970s radio program

* All SEEDLingS data is distributed as LDC2021S11.

4.4 Additional information

For full information regarding each source and domain, please consult
Appendix A of the evaluation plan.

5. File table
=============

Expected sizes, modification times, and MD5 checksums for all files
within the "data/" directory are recorded in "docs/file.tbl". This is
a tab-delimited table containing one file per line, each line having
the following four fields:

- checksum  --  MD5 checksum of the file
- size  --  size of the file in bytes
- datetime  --  last modification date in YYYY-MM-DD_HH:MM:SS format
- path  --  path to the file relative to the root of the release
  directory

6. Relationship to DIHARD II tracks
===================================

Tracks 1 and 2 of DIHARD II use the single channel condition audio and
annotations; that is, the contents of:

    data/single_channel

Tracks 3 and 4 use the multichannel condition annotations:

    data/multichannel

As a reminder, LDC is **NOT** distributing the audio for the
multichannel condition. To obtain that audio, please contact the CHiME
organizers:

    https://licensing.sheffield.ac.uk/product/chime5

7. Reconstructing LDC2019E31
============================

To reconstruct the full DIHARD II development set (LDC2019E31) from
this release and LDC2021S11, please run the script
"tools/combine_dev_releases.py" as follows:

    python tools/combine_dev_releases.py 11-sources-dir seedlings-dir combined-dir

where:

- 11-sources-dir  --  path to LDC2021S10 (this release)
- seedlings-dir  --  path to LDC2021S11
- combined-dir  --  output path for the reconstructed development set

Please note that this script requires that "pandas" be installed in
your Python environment.

8. Contacts
===========

If you have questions about this data release, please contact the
following LDC personnel:

    Neville Ryant
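9. Example: reading the annotation files
========================================

The RTTM and UEM formats described in Section 3 are simple to parse.
The following Python sketch is illustrative only; it is not part of
the release or of the official DIHARD tooling, and the type names
("Turn", "ScoringRegion") and helper functions are our own:

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    """A single speaker turn from an RTTM file (see Section 3.2)."""
    file_id: str
    onset: float     # seconds from beginning of recording
    duration: float  # seconds
    speaker: str


class ScoringRegion(NamedTuple):
    """A single scoring region from a UEM file (see Section 3.4)."""
    file_id: str
    channel: int
    onset: float
    offset: float


def load_rttm(path: str) -> List[Turn]:
    """Load speaker turns from a space-delimited RTTM file."""
    turns = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if not fields or fields[0] != "SPEAKER":
                continue
            # Field order: type, file id, channel id, turn onset,
            # turn duration, orthography, speaker type, speaker name,
            # confidence score, signal lookahead time.
            turns.append(Turn(fields[1], float(fields[3]),
                              float(fields[4]), fields[7]))
    return turns


def load_uem(path: str) -> List[ScoringRegion]:
    """Load scoring regions from a space-delimited UEM file."""
    regions = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            fields = line.split()
            if len(fields) == 4:
                regions.append(ScoringRegion(fields[0], int(fields[1]),
                                             float(fields[2]),
                                             float(fields[3])))
    return regions
```

For example, load_rttm("data/single_channel/rttm/DH_0001.rttm") would
return one Turn per line of that file (the file name here is
hypothetical), and load_uem("data/single_channel/uem/all.uem") would
return every scoring region for the single channel condition.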