Third DIHARD Challenge Development

                               SUB2021Z01
                            September 17, 2020

                       Linguistic Data Consortium


1. Overview
===========
This release contains development audio data and annotation for the Third
DIHARD Challenge. It is identical to the release distributed during the
challenge to participants (LDC2020E12).

For additional details regarding the challenge, please see the evaluation
plan:

    docs/third_dihard_eval_plan_v1.2.pdf

or consult the website:

    https://dihardchallenge.github.io/dihard3


2. Directory structure
======================
- data/flac/  --  FLAC files
- data/rttm/  --  RTTM files containing reference diarization
- data/sad/  --  HTK label files containing reference speech segmentation
- data/uem/  --  UEM files containing scoring regions for each recording; one
  UEM per recording
- data/uem_scoring/core/  --  UEM files for use with ``dscore`` for the core
  DEV set; one UEM file per domain as well as a file ``all.uem`` that
  contains **ALL** scoring regions for **ALL** recordings
- data/uem_scoring/full/  --  as ``data/uem_scoring/core/``, but for the full
  DEV set
- docs/file.tbl  --  listing of MD5 checksums, sizes, dates, and file names
- docs/README.txt  --  this file
- docs/recordings.tbl  --  domain, source, language, duration, etc. for each
  recording
- docs/third_dihard_eval_plan_v1.2.pdf  --  evaluation plan


3. File formats
===============
3.1 Audio

All audio is provided in the form of 16 kHz, 16-bit mono-channel FLAC files.
When multiple channels were present in the source audio, these were remixed to
a single channel in the delivery version. Similarly, if the source audio was
recorded at a sample rate other than 16 kHz, the audio was resampled to 16
kHz (e.g., telephone speech was upsampled from 8 kHz to 16 kHz). All audio
conversion was performed using SoX.


3.2 Diarization

The diarization for each recording is stored as a NIST Rich Transcription Time
Marked (RTTM) file. RTTM files are space-separated text files containing one
turn per line, each line containing ten fields:

- type  --  segment type; should always be "SPEAKER"
- uri  --  unique resource identifier (URI) identifying the recording;
  basename of the recording minus extension (e.g., "DH_DEV_0001")
- channel ID  --  channel (1-indexed) that turn is on; should always be "1"
- turn onset  --  onset of turn in seconds from beginning of recording
- turn duration  --  duration of turn in seconds
- orthography field  --  should always be "<NA>"
- speaker type  --  should always be "<NA>"
- speaker name  --  name of speaker of turn; speaker names are unique within
  the scope of the release (e.g., if "speaker1" is present in two recordings,
  it is the same speaker for both)
- confidence score  --  system confidence (probability) that information is
  correct; should always be "<NA>"
- signal lookahead time  --  should always be "<NA>"

While the RTTM format allows for a single file to contain turns from multiple
recordings, the DIHARD RTTM files will always contain turns from a single
recording; e.g., "DH_DEV_0001.rttm" only contains turns from recording
"DH_DEV_0001".
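
As an illustration, the ten fields described above can be read with a few
lines of Python. This is a minimal sketch and not part of the release: the
names ``Turn``, ``parse_rttm_line``, and ``load_rttm`` are hypothetical, and
the code assumes that every non-blank line is a ten-field "SPEAKER" line as
specified in this section.

```python
from typing import List, NamedTuple


class Turn(NamedTuple):
    """One speaker turn from an RTTM file (hypothetical helper type)."""
    uri: str
    onset: float      # seconds from beginning of recording
    duration: float   # seconds
    speaker: str

    @property
    def offset(self) -> float:
        """End of the turn in seconds (onset + duration)."""
        return self.onset + self.duration


def parse_rttm_line(line: str) -> Turn:
    """Parse one whitespace-separated RTTM line into a Turn."""
    fields = line.split()
    if len(fields) != 10 or fields[0] != "SPEAKER":
        raise ValueError(f"not a ten-field SPEAKER line: {line!r}")
    # Field order: type, uri, channel, onset, duration, orthography,
    # speaker type, speaker name, confidence, signal lookahead time.
    return Turn(uri=fields[1], onset=float(fields[3]),
                duration=float(fields[4]), speaker=fields[7])


def load_rttm(path: str) -> List[Turn]:
    """Read all turns from a single RTTM file, skipping blank lines."""
    with open(path, encoding="utf-8") as f:
        return [parse_rttm_line(line) for line in f if line.strip()]
```

For example, ``load_rttm("data/rttm/DH_DEV_0001.rttm")`` would return the
turns of recording "DH_DEV_0001" in file order.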


3.3 Speech segmentation

For each recording, a reference speech segmentation is generated by merging all
overlapping speaker turns. These segmentations are stored as HTK label files
ending in the ".lab" extension. Each file contains one speech segment per
line, each line containing three space-delimited fields:

- onset  --  onset of speech segment in seconds from beginning of recording
- offset  --  offset of speech segment in seconds from beginning of recording
- label  --  label of segment; always "speech"
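
The merging step described above can be sketched as a standard interval
merge. This is an illustration under stated assumptions, not the script used
to produce the release: the function names are hypothetical, turns that
merely touch (one ends exactly where the next begins) are merged along with
truly overlapping ones, and the three-decimal output precision is a guess.

```python
from typing import Iterable, List, Tuple

Segment = Tuple[float, float]  # (onset, offset) in seconds


def merge_turns(turns: Iterable[Segment]) -> List[Segment]:
    """Merge overlapping (onset, offset) turns into speech segments."""
    merged: List[List[float]] = []
    for onset, offset in sorted(turns):
        if merged and onset <= merged[-1][1]:
            # Overlaps (or abuts) the previous segment: extend it.
            merged[-1][1] = max(merged[-1][1], offset)
        else:
            merged.append([onset, offset])
    return [(on, off) for on, off in merged]


def format_lab(segments: Iterable[Segment]) -> str:
    """Render segments as label lines: onset, offset, label."""
    # Assumption: three decimal places; the release may use another precision.
    return "\n".join(f"{on:.3f} {off:.3f} speech" for on, off in segments)
```

Turn offsets are not stored in the RTTM files directly; each would be
computed as turn onset plus turn duration before calling ``merge_turns``.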


3.4 Scoring regions

The scoring regions for each recording are specified by un-partitioned
evaluation map (UEM) files. These files contain one line per scoring region,
each line consisting of four space-delimited fields:

- uri  --  recording URI (e.g., "DH_DEV_0001")
- channel  --  channel (1-indexed) that scoring region is on
- onset  --  onset of scoring region in seconds from beginning of recording
- offset  --  offset of scoring region in seconds from beginning of recording
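
A UEM file in this four-field format can be loaded as in the following
sketch. The helper names ``parse_uem`` and ``scored_duration`` are
hypothetical, and the code assumes every non-blank line has exactly the four
fields listed above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def parse_uem(lines: Iterable[str]) -> Dict[str, List[Tuple[float, float]]]:
    """Group scoring regions (onset, offset) by recording URI."""
    regions = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # ignore blank lines
        uri, _channel, onset, offset = line.split()
        regions[uri].append((float(onset), float(offset)))
    return dict(regions)


def scored_duration(
        regions: Dict[str, List[Tuple[float, float]]]) -> Dict[str, float]:
    """Total scored time in seconds for each recording."""
    return {uri: sum(off - on for on, off in segs)
            for uri, segs in regions.items()}
```

Because ``parse_uem`` accepts any iterable of lines, an open file object can
be passed directly, e.g. ``parse_uem(open("data/uem/DH_DEV_0001.uem"))``.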

Recording-level UEM files are provided for each recording under the directory
"data/uem/"; e.g.:

    data/uem/DH_DEV_0001.uem
    data/uem/DH_DEV_0002.uem
    data/uem/DH_DEV_0003.uem
    ...

As a convenience, we also provide domain-level UEM files, each listing **ALL**
scoring regions from **ALL** recordings in one domain of the full or core DEV
set. For the full DEV set, these domain-level UEM files are located under
"data/uem_scoring/full/":

    data/uem_scoring/full/all.uem
    data/uem_scoring/full/audiobooks.uem
    data/uem_scoring/full/broadcast_interview.uem
    ...

and for the core DEV set under "data/uem_scoring/core/":

    data/uem_scoring/core/all.uem
    data/uem_scoring/core/audiobooks.uem
    data/uem_scoring/core/broadcast_interview.uem
    ...

In each directory, the file "all.uem" is not limited to a single domain; it
covers **ALL** recordings in the corresponding DEV set.


4. Recording metadata
=====================
4.1 recordings.tbl

Metadata for each recording is stored in "docs/recordings.tbl". This file is a
tab-delimited table containing one recording per line, each line having the
following 9 fields:

- uri  --  recording URI
- in_core  --  whether or not the recording is in the core DEV set;
  True or False
- lang  --  predominant language of recording; ISO 639-3 language code
- domain  --  domain of recording; see section 4.2 for details
- source  --  source of recording; see section 4.3 for details
- duration  --  duration in seconds of recording
- speech_duration  --  duration in seconds of speech in recording;
  overlapped speech is only counted once
- overlap_duration  --  duration in seconds of overlapped speech in the
  recording
- num_speakers  --  number of speakers present


4.2 Recording domains

Recordings are drawn from the following 11 domains:

- AUDIOBOOKS  --  amateur audiobooks
- BROADCAST-INTERVIEW  --  radio interviews
- CLINICAL  --  Autism Diagnostic Observation Schedule (ADOS) interviews
- COURT  --  courtroom recordings
- CTS  --  conversational telephone speech
- MAPTASK  --  map tasks
- MEETING  --  meeting speech
- RESTAURANT  --  conversational speech recorded during lunches in restaurants
- SOCIO-FIELD  --  sociolinguistic interviews recorded in the field
- SOCIO-LAB  --  sociolinguistic interviews recorded in laboratory setting
- WEBVIDEO  --  amateur video from sites such as YouTube


4.3 Recording sources

Recordings are drawn from the following 11 sources:

- ADOS  --  unpublished ADOS interviews recorded at CHOP
- CIR  --  unpublished LDC collection of restaurant conversation
- DCIEM  --  DCIEM map task (LDC96S38)
- FISHER  --  unpublished telephone calls from the Fisher English collection
- LIBRIVOX  --  audiobook recordings from LibriVox
- MIXER6  --  interviews from MIXER6 (LDC2013S03)
- RT04S  --  meeting speech from 2004 Spring NIST Rich Transcription (RT-04S)
  dev (LDC2007S11) and eval (LDC2007S12) releases
- SCOTUS  --  2001 U.S. Supreme Court oral arguments from OYEZ project
- SLX  --  sociolinguistic interviews drawn from SLX (LDC2003T15)
- VAST  --  web video collected as part of the Video Annotation for Speech
  Technologies (VAST) project
- YOUTHPOINT  --  unpublished corpus of student-led radio interviews conducted
  for YouthPoint, a 1970s radio program


4.4 Additional information

For full information regarding each source and domain, please consult
Appendix A of the evaluation plan (docs/third_dihard_eval_plan_v1.2.pdf).


5. File table
=============
Expected sizes, modification times, and MD5 checksums for all files within the
"data/" directory are recorded in "docs/file.tbl". This is a tab-delimited table
containing one file per line, each line having the following 4 fields:

- checksum  --  MD5 checksum of file
- size  --  size of file in bytes
- datetime  --  last modification date in YYYY-MM-DD_HH:MM:SS format
- path  --  path to file relative to root of release directory
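
The table can be used to check a downloaded copy of the release, as in the
following sketch. The names ``md5sum`` and ``verify`` are hypothetical, and
the code assumes that any line whose size field is not an integer is a header
row and can be skipped. Modification times are not checked, since they often
change when files are copied.

```python
import hashlib
from pathlib import Path
from typing import List, Tuple


def md5sum(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify(root: str, table_path: str) -> List[Tuple[str, str]]:
    """Return (path, problem) pairs for files that fail a check."""
    problems = []
    with open(table_path, encoding="utf-8") as f:
        for line in f:
            if not line.strip():
                continue
            checksum, size, _datetime, rel_path = line.rstrip("\n").split("\t")
            if not size.isdigit():
                continue  # assumed header row
            target = Path(root) / rel_path
            if not target.is_file():
                problems.append((rel_path, "missing"))
            elif target.stat().st_size != int(size):
                problems.append((rel_path, "size mismatch"))
            elif md5sum(target) != checksum.lower():
                problems.append((rel_path, "checksum mismatch"))
    return problems
```

Calling ``verify(".", "docs/file.tbl")`` from the root of the release
directory returns an empty list when every listed file is present and intact.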


6. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant
    <nryant@ldc.upenn.edu>