Second DIHARD Challenge Evaluation - SEEDLingS

                               LDC2022S07
                            February 2, 2021

                       Linguistic Data Consortium


1. Overview
===========
This release contains evaluation audio data and annotation for the child
language domain (i.e., SEEDLingS) from the Second DIHARD Challenge.

For additional details regarding the challenge, please see the evaluation plan:

    docs/second_dihard_eval_plan_v1.2.pdf

or consult the website:

    https://dihardchallenge.github.io/dihard2

For instructions on how to reconstruct the complete DIHARD II evaluation set
(released during the challenge as LDC2019E32) from this release and LDC2022S06,
please see Section 7 of this README.


2. Directory structure
=======================
- data/single_channel/flac/  --  SEEDLingS FLAC files
- data/single_channel/rttm/  --  RTTM files containing reference diarization
  for SEEDLingS
- data/single_channel/sad/  --  HTK label files containing reference speech
  segmentation for SEEDLingS
- data/single_channel/uem/  --  UEM files for use with "dscore"; one UEM file
  per domain as well as a file "all.uem" that contains **ALL** scoring regions
  for **ALL** recordings
- docs/README.txt -- this file
- docs/file.tbl   --  listing of md5 checksums, sizes, dates, and file names
- docs/sources.tbl  --  listing of domain, source, and language for each file
- docs/speakers.tbl  --  speaker metadata
- docs/second_dihard_eval_plan_v1.2.pdf  --  evaluation plan
- tools/combine_eval_releases.py  --  Python script for reconstructing LDC2019E32


3. File formats
===============
3.1 Audio

All audio is provided in the form of 16 kHz, 16-bit mono-channel FLAC files.
When multiple channels were present in the source audio, these were remixed to
a single channel in the delivery version. Similarly, if the source audio was
recorded at a sample rate other than 16 kHz, the audio was resampled to 16 kHz.
All audio conversion was performed using SoX.


3.2 Diarization

The diarization for each recording is stored as a NIST Rich Transcription Time
Marked (RTTM) file. RTTM files are space-delimited text files containing one
turn per line, each line containing ten fields:

- type  --  segment type; should always be "SPEAKER"
- file id  --  unique identifier for recording that turn is on (e.g.,
  "DH_0001")
- channel id  --  channel (1-indexed) that turn is on; should always be "1"
- turn onset  --  onset of turn in seconds from beginning of recording
- turn duration  -- duration of turn in seconds
- orthography field  --  should always be "<NA>"
- speaker type  --  should always be "<NA>"
- speaker name  --  name of speaker of turn; should be unique within scope of
  each file
- confidence score  --  system confidence (probability) that information is
  correct; should always be "<NA>"
- signal lookahead time  --  should always be "<NA>"

While the RTTM format allows for a single file to contain turns from multiple
recordings, the DIHARD RTTM files always contain turns from a single recording;
e.g., "DH_0001.rttm" only contains turns from recording "DH_0001".
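
As an illustration of the ten-field layout described above, the following is
a minimal Python sketch of an RTTM reader. It is not part of this release;
the names "Turn" and "parse_rttm" are hypothetical, and the sketch assumes
that every non-blank line is a well-formed "SPEAKER" line.

```python
from typing import Iterable, List, NamedTuple


class Turn(NamedTuple):
    """One speaker turn (hypothetical container, not part of the release)."""
    file_id: str
    speaker: str
    onset: float      # seconds from beginning of recording
    duration: float   # seconds


def parse_rttm(lines: Iterable[str]) -> List[Turn]:
    """Parse RTTM lines into a list of speaker turns."""
    turns = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 10 or fields[0] != "SPEAKER":
            raise ValueError(f"unexpected RTTM line: {line!r}")
        # Fields 2, 8, 4, and 5 (1-indexed): file id, speaker name,
        # turn onset, turn duration.
        turns.append(
            Turn(fields[1], fields[7], float(fields[3]), float(fields[4])))
    return turns
```

The function accepts any iterable of lines, so an open file handle for one
of the files under "data/single_channel/rttm/" can be passed directly.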


3.3 Speech segmentation

For each recording a reference speech segmentation is generated by merging all
overlapping speaker turns. These segmentations are stored as HTK label files
ending in the ".lab" extension. Each file contains one speech segment per
line, each line containing three tab-delimited fields:

- onset  --  onset of speech segment in seconds from beginning of recording
- offset  --  offset of speech segment in seconds from beginning of recording
- label  --  label of segment; always "speech"
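
The merging step described above can be sketched in Python as follows. This
is an illustration only, not the code used to produce the release. The
function names are hypothetical, and two details are assumptions: turns that
merely touch (one onset equal to the previous offset) are merged, and times
are written with three decimal places.

```python
def merge_turns(turns):
    """Merge overlapping (onset, offset) pairs into speech segments."""
    merged = []
    for onset, offset in sorted(turns):
        if merged and onset <= merged[-1][1]:
            # Overlaps (or touches) the previous segment: extend it.
            merged[-1][1] = max(merged[-1][1], offset)
        else:
            merged.append([onset, offset])
    return [tuple(segment) for segment in merged]


def format_lab(segments):
    """Render segments as tab-delimited HTK label lines."""
    return "".join(
        f"{onset:.3f}\t{offset:.3f}\tspeech\n" for onset, offset in segments)
```

Note that RTTM files store an onset and a duration for each turn, so offsets
must first be computed as onset plus duration.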


3.4 Scoring regions

The scoring regions for each recording are specified by un-partitioned
evaluation map (UEM) files. These files contain one line per scoring region,
each line consisting of four space-delimited fields:

- file id  --  unique identifier for recording that scoring region is on
  (e.g., "DH_0001")
- channel  --  channel (1-indexed) that scoring region is on
- onset  --  onset of scoring region in seconds from beginning of recording
- offset  --  offset of scoring region in seconds from beginning of recording

There are two UEM files:

- data/single_channel/uem/all.uem  --  **ALL** scoring regions for **ALL**
  recordings
- data/single_channel/uem/child.uem  --  identical to above; included to
  allow reconstruction of LDC2019E32
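
For readers who want to work with the scoring regions directly rather than
through "dscore", a minimal Python sketch of a UEM reader follows. The
function name "load_uem" is hypothetical, and the sketch assumes that every
non-blank line has exactly the four fields listed above.

```python
from collections import defaultdict


def load_uem(lines):
    """Map each file id to its list of (onset, offset) scoring regions."""
    regions = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        file_id, _channel, onset, offset = fields
        regions[file_id].append((float(onset), float(offset)))
    return dict(regions)
```

The total scored duration of a recording is then the sum of offset minus
onset over its regions.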


4. Recording metadata
======================

4.1 sources.tbl

Metadata for each recording is stored in "docs/sources.tbl". This file is a
tab-delimited table containing one recording per line, each line having the
following four fields:

- file id  --  unique identifier for recording
- lang  --  predominant language of recording; ISO 639-3 language code
- domain  --  domain of recording; see section 4.2 for details
- source  --  source of recording; see section 4.3 for details
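
A minimal Python sketch for loading this table is shown below. The function
name "read_sources" and the dictionary keys are hypothetical. Whether the
file begins with a header row is not stated in this README, so the sketch
exposes that as a "has_header" argument and assumes no header by default.

```python
import csv

# Column names follow the field list above; they are labels chosen for
# this sketch, not necessarily names that appear in the file itself.
SOURCE_FIELDS = ["file_id", "lang", "domain", "source"]


def read_sources(lines, has_header=False):
    """Map each file id to its language, domain, and source."""
    rows = [row for row in csv.reader(lines, delimiter="\t") if row]
    if has_header:
        rows = rows[1:]  # drop the header row
    return {row[0]: dict(zip(SOURCE_FIELDS[1:], row[1:])) for row in rows}
```

An open file handle for "docs/sources.tbl" can be passed as "lines".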


4.2 Recording domains

Since this release only includes SEEDLingS data, there is a single domain:

- CHILD  --  child language acquisition recordings


4.3 Recording sources

Since this release only includes SEEDLingS data, there is a single source:

- SEEDLINGS  --  child language recordings collected as part of SEEDLingS


4.4 Additional information

For full information regarding each source and domain, please consult
Appendix A from the evaluation plan.


5. Speaker metadata
===================

5.1 speakers.tbl

Contains the annotator-assigned speaker type and (biological) sex for each
SEEDLingS speaker. This file is a tab-delimited table containing one speaker
per line, each line having the following four fields:

- file_id  --  unique identifier for recording
- speaker_id  --  name of speaker used in RTTM files
- speaker_type  --  speaker label assigned by annotator
- speaker_sex  --  speaker (biological) sex assigned by annotator

For each speaker, annotators assigned one of the following types:
    - adult
    - child
    - radio
    - toy
    - tv
    - unknown

Annotators also made a best guess at the biological sex of the speaker
("male" or "female"), indicating uncertainty using the label "unknown".
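
As a usage example for this table, the Python sketch below tallies speakers
by annotated type. The function name "count_speaker_types" is hypothetical.
As with the other tables, the presence of a header row is an assumption
controlled by the "has_header" argument.

```python
from collections import Counter


def count_speaker_types(lines, has_header=False):
    """Count speakers.tbl rows for each speaker_type value."""
    counts = Counter()
    for index, line in enumerate(lines):
        if has_header and index == 0:
            continue  # skip the header row
        fields = line.rstrip("\n").split("\t")
        if len(fields) != 4:
            continue  # skip blank or malformed lines
        _file_id, _speaker_id, speaker_type, _speaker_sex = fields
        counts[speaker_type] += 1
    return counts
```

The same pattern applies to the speaker_sex column by counting the fourth
field instead of the third.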


6. File table
=============
Expected sizes, modification times, and MD5 checksums for all files within the
"data/" directory are recorded in "docs/file.tbl". This is a tab-delimited
table containing one file per line, each line having the following four fields:

- checksum  --  MD5 checksum of file
- size  --  size of file in bytes
- datetime  --  last modification date in YYYY-MM-DD_HH:MM:SS format
- path  --  path to file relative to root of release directory
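
The table can be used to check a local copy of the release for corruption.
The following Python sketch compares sizes and MD5 checksums and is offered
as an illustration only. The function names are hypothetical, modification
times are deliberately not checked because they often change on copy, and
any line whose size field is not an integer (for example, a header row, if
one exists) is skipped.

```python
import hashlib
import os


def md5sum(path, chunk_size=1 << 20):
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_file_table(table_path, root):
    """Return relative paths that are missing or fail the size/MD5 check."""
    mismatches = []
    with open(table_path, encoding="utf-8") as f:
        for line in f:
            fields = line.rstrip("\n").split("\t")
            if len(fields) != 4 or not fields[1].isdigit():
                continue  # skip blank, malformed, or header lines
            checksum, size, _datetime, rel_path = fields
            path = os.path.join(root, rel_path)
            if (not os.path.isfile(path)
                    or os.path.getsize(path) != int(size)
                    or md5sum(path) != checksum.lower()):
                mismatches.append(rel_path)
    return mismatches
```

Calling verify_file_table("docs/file.tbl", ".") from the root of the release
directory should return an empty list if all listed files are intact.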


7. Reconstructing LDC2019E32
============================
To reconstruct the full DIHARD II evaluation set (LDC2019E32) from this
release and LDC2022S06, please run the script "tools/combine_eval_releases.py"
as follows:

    python tools/combine_eval_releases.py 11-domains-dir seedlings-dir combined-dir

where:

- 11-domains-dir  --  path to LDC2022S06
- seedlings-dir  --  path to LDC2022S07 (this release)
- combined-dir  --  output path for reconstructed evaluation set

Please note that this script requires "pandas" to be installed in your
Python environment.


8. Contacts
===========
If you have questions about this data release, please contact the following
LDC personnel:

    Neville Ryant
    <nryant@ldc.upenn.edu>