First DIHARD Challenge Development - Eight Sources
LDC2019S09

November 28, 2018
Linguistic Data Consortium

1. Overview
===========
The First DIHARD Challenge was an attempt to reinvigorate work on
diarization through a shared task focusing on "hard" diarization; that is,
speech diarization for challenging corpora where there was an expectation
that existing state-of-the-art systems would fare poorly. As such, it
included speech from a wide sampling of domains representing diversity in
number of speakers, speaker demographics, interaction style, recording
quality, and environmental conditions, including, but not limited to:

- clinical interviews
- extended child language acquisition recordings
- YouTube recordings
- conversations collected in restaurants

The challenge ran from February 1, 2018 through March 23, 2018, with
results presented at a special session at Interspeech 2018 in Hyderabad.
For additional details regarding the structure of the challenge
(annotation, scoring, reporting of results, etc.), please consult the
evaluation plan: docs/first_dihard_eval_plan_v1.3.pdf. For the final
leaderboard, links to participants' submissions, and system descriptions,
please see the challenge website:

https://coml.lscp.ens.fr/dihard/2018/index.html

This release, when combined with First DIHARD Challenge Development -
SEEDLingS (LDC2019S10), contains the development set audio data and
annotation as well as the official scoring tool. To combine the two
releases, merge the "flac", "rttm", and "sad" folders from each release's
"data" directory. This release is identical in contents to the package
made available to performers during the challenge itself with the
following exceptions:

- all documentation has been updated and expanded, particularly
  documentation regarding use of the scoring tool

2. Directory structure
======================
- data/flac/ -- FLAC files
- data/rttm/ -- RTTM files containing diarization
- data/sad/ -- HTK label files containing reference speech segmentation
- docs/dev.uem -- unpartitioned evaluation map (UEM) file indicating
  scoring regions
- docs/file.tbl -- listing of MD5 checksums, sizes, dates, and file names
  for contents of "data/"
- docs/first_dihard_eval_plan_v1.3.pdf -- evaluation plan
- docs/sources.tbl -- listing of source and language for each file
- tools/dscore-1.0.1 -- official scoring tool

3. File formats
===============
3.1 Audio

All audio is provided in the form of 16 kHz, mono-channel FLAC files.

In the case of files selected from Autism Diagnostic Observation Schedule
(ADOS) interviews, regions containing personal identifying information
have been filtered to make these portions of the recording
unrecognizable. Pitch information in these regions is still recoverable,
but the amplitude levels have been reduced relative to the original
signal. Filtering was done with a 10th order Butterworth filter with a
passband of 0 to 400 Hz. To avoid abrupt transitions in the resulting
waveform, the effect of the filter was gradually faded in and out at the
beginning and end of the regions using a ramp of 40 ms.
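
As a quick sanity check of the audio format, a recording can be loaded
and inspected in a few lines of Python. The following is a minimal
sketch, not part of this release; it assumes the third-party "soundfile"
package is installed, and the file name shown is illustrative only:

    import soundfile as sf  # third-party; pip install soundfile

    # "DH_0001.flac" is an illustrative file name, following the id
    # pattern shown in the sample scoring output in section 5.
    samples, sample_rate = sf.read("data/flac/DH_0001.flac")

    # All audio in this release is 16 kHz, mono-channel.
    assert sample_rate == 16000
    assert samples.ndim == 1  # mono audio is returned as a 1-D array
    print(f"duration: {len(samples) / sample_rate:.2f} seconds")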

3.2 Diarization

The diarization for each recording is stored as a NIST Rich Transcription
Time Marked (RTTM) file. RTTM files are space-separated text files
containing one turn per line, each line containing ten fields:

- Type -- segment type; should always be "SPEAKER"
- File ID -- file name; basename of the recording minus extension
  (e.g., "rec1_a")
- Channel ID -- channel (1-indexed) that turn is on; should always be "1"
- Turn Onset -- onset of turn in seconds from beginning of recording
- Turn Duration -- duration of turn in seconds
- Orthography Field -- should always be "<NA>"
- Speaker Type -- should always be "<NA>"
- Speaker Name -- name of speaker of turn; should be unique within the
  scope of each file
- Confidence Score -- system confidence (probability) that information is
  correct; should always be "<NA>"
- Signal Lookahead Time -- should always be "<NA>"

3.3 Speech segmentation

For each recording a reference speech segmentation is generated by
merging all overlapping speaker turns. These segmentations are stored as
HTK label files ending in the ".lab" extension. Each file contains one
speech segment per line, each line containing three space-delimited
fields:

- Onset -- onset of speech segment in seconds from beginning of recording
- Offset -- offset of speech segment in seconds from beginning of
  recording
- Label -- label of segment; always "speech"

3.4 Scoring regions

The file "docs/dev.uem" provides the scoring regions for each recording.
This file is a NIST unpartitioned evaluation map (UEM) file, which is a
plain text file containing one scoring region per line, each line
consisting of four space-delimited fields:

- File ID -- file name; basename of the recording minus extension
  (e.g., "rec1_a")
- Channel ID -- channel (1-indexed) that scoring region is on; should
  always be "1"
- Onset -- onset of scoring region in seconds from beginning of recording
- Offset -- offset of scoring region in seconds from beginning of
  recording
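
The relationship between the RTTM files (3.2) and the label files (3.3)
can be made concrete with a short script. The following is an
illustrative sketch only, not part of the release or of the scoring
tool: it reads the speaker turns from an RTTM file and merges all
overlapping turns, reproducing the kind of speech segmentation stored in
the ".lab" files. The file path is hypothetical:

    from pathlib import Path

    def read_turns(rttm_path):
        """Return (onset, offset) pairs for SPEAKER turns in an RTTM file."""
        turns = []
        for line in Path(rttm_path).read_text().splitlines():
            fields = line.split()
            if fields and fields[0] == "SPEAKER":
                onset = float(fields[3])  # Turn Onset field
                dur = float(fields[4])    # Turn Duration field
                turns.append((onset, onset + dur))
        return turns

    def merge_turns(turns):
        """Merge overlapping (onset, offset) intervals into segments."""
        segments = []
        for onset, offset in sorted(turns):
            if segments and onset <= segments[-1][1]:
                # Overlaps (or abuts) the previous segment; extend it.
                segments[-1][1] = max(segments[-1][1], offset)
            else:
                segments.append([onset, offset])
        return segments

    # Hypothetical file id; substitute any recording from data/rttm/.
    for onset, offset in merge_turns(read_turns("data/rttm/DH_0001.rttm")):
        print(f"{onset:.3f} {offset:.3f} speech")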

4. Sources
==========
Recordings are assigned unique ids that mask their source. However, this
information is available from "docs/sources.tbl", which indicates the
language and recording source of each file. Each line describes a single
recording, represented by three tab-delimited fields:

- File ID -- the unique, anonymous id assigned to a recording
- Language -- three letter ISO 639-3 language code
- Source -- the dataset from which the recording was drawn

4.1 Sources list

- ADOS -- Autism Diagnostic Observation Schedule (ADOS) interviews
- DCIEM -- DCIEM map task (LDC96S38)
- LIBRIVOX -- audiobook recordings from LibriVox
- RT04S -- meeting speech from the 2004 Spring NIST Rich Transcription
  (RT-04S) dev (LDC2007S11) and eval (LDC2007S12) releases
- SCOTUS -- 2001 U.S. Supreme Court oral arguments
- SLX -- sociolinguistic interviews drawn from SLX (LDC2003T15)
- VAST -- web video collected as part of the Video Annotation for Speech
  Technologies (VAST) project
- YP -- YouthPoint radio interviews

For more detailed information about each source, please consult the
evaluation plan.

5. Scoring tool
===============
This release includes version 1.0.1 of "dscore"
(https://github.com/nryant/dscore), the scoring tool used during the
challenge. To score a set of system output RTTMs collected in a directory
"sys_rttm/" against the corresponding reference RTTMs in this package
(located under "data/rttm/"), the command line would be:

    python score.py -u docs/dev.uem -s sys_rttm/*.rttm -r data/rttm/*.rttm

where "dev.uem" is the unpartitioned evaluation map file in the docs
directory. The overall and per-file results will be printed to STDOUT as
a table; for instance:

File             DER    B3-Precision  B3-Recall  B3-F1  GKT(ref, sys)  GKT(sys, ref)  H(ref|sys)  H(sys|ref)  MI    NMI
---------------  -----  ------------  ---------  -----  -------------  -------------  ----------  ----------  ----  ----
DH_0001          14.86  0.84          0.88       0.86   0.79           0.73           0.55        0.39        1.20  0.72
DH_0002          28.02  0.73          0.76       0.75   0.64           0.59           0.80        0.66        0.90  0.55
DH_0003          62.05  0.36          0.47       0.41   0.25           0.23           1.97        1.28        0.72  0.31
...
DH_0164          81.97  0.14          0.78       0.23   0.30           0.04           3.70        0.54        0.27  0.15
*** OVERALL ***  36.05  0.66          0.75       0.70   0.75           0.66           1.12        0.69        8.08  0.90

For further information, please consult "dscore-1.0.1/README.md".

6. Known issues
===============
- DIHARD development set data were excerpted from a number of different
  corpora, some created for other purposes, and have undergone varying
  amounts of quality control. For example, unlike the test set recordings,
  the development set SEEDLingS recordings have undergone only a single
  annotation pass and still contain anomalies; in particular, there are
  regions of missed speech and false alarms as well as inexact turn
  boundaries. However, as there is very little annotated child speech
  available to support diarization, we release them as is in case they
  are helpful.
- Some of the RT04S RTTMs and label files have boundaries that are
  slightly (< 500 ms) longer than the actual audio duration. These issues
  existed in the original release and have not been corrected.

7. Citing
=========
Users of this corpus should provide appropriate acknowledgement in all
presentations, articles, reports, and other documents describing the Data
and/or the results of work performed. Papers should cite the DIHARD
challenge evaluation plan:

  Ryant, N., Church, K., Cieri, C., Cristia, A., Du, J., Ganapathy, S.,
  and Liberman, M. (2018). First DIHARD Challenge Evaluation Plan.
  https://zenodo.org/record/1199638.

and the DIHARD and SEEDLingS corpora:

  Bergelson, E. (2016). Bergelson Seedlings HomeBank Corpus.
  doi: 10.21415/T5PK6D.

  Ryant et al. (2018). DIHARD Corpus. Linguistic Data Consortium.

8. Contacts
===========
If you have questions about this data release, please contact the
following LDC personnel:

  Neville Ryant