README FILE FOR:

  2017 NIST Language Recognition Evaluation Training and Development Sets

LDC Catalog-ID:  ...
Authors:
 Craig Greenberg, Omid Sadjadi, Doug Reynolds, Elliot Singer, David Graff

1.0 Introduction

This release contains audio data that was designated for use as training and
development test material in the 2017 NIST Language Recognition Evaluation
(LRE17).  Much of the data has appeared in previous LDC publications,
including "CallFriend" corpora, earlier NIST LRE test sets, "Fisher" telephone
collections, the "VAST" video/audio collection, and other speech corpora in
various languages.  Some of the data previously appeared only in restricted
LDC distributions (provided only to participants and/or sponsors of earlier
projects and evaluations); other data are being released through the LDC for
the first time, having been collected and made available in the past by other
organizations for various projects.

The data in both the "train" and "dev" partitions cover the 14 distinct
language varieties used in the LRE17 test set.  Each language variety is
identified by a 7-character string in which the first three letters represent
a language group and the last three letters, following a hyphen, represent a
language, dialect or variety within that group, as follows:

  ara-acm : Arabic, Iraqi
  ara-apc : Arabic, Levantine
  ara-ary : Arabic, Maghrebi
  ara-arz : Arabic, Egyptian

  eng-gbr : English, British
  eng-usg : English, General American

  qsl-pol : Slavic, Polish
  qsl-rus : Slavic, Russian

  por-brz : Portuguese, Brazilian (grouped with Spanish as 'Iberian')
  spa-car : Spanish, Caribbean
  spa-eur : Spanish, European
  spa-lac : Spanish, Latin American Continental

  zho-cmn : Chinese, Mandarin
  zho-nan : Chinese, Min Nan
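
As an illustration of the code layout described above, the following minimal
Python sketch splits a 7-character code into its two parts and groups the 14
codes by their three-letter prefix.  The function names are hypothetical and
not part of this release.  Note that grouping by prefix alone keeps "por"
separate from "spa", whereas the list above treats por-brz as part of the
'Iberian' group.

```python
# The 14 language-variety codes listed in this README.
CODES = [
    "ara-acm", "ara-apc", "ara-ary", "ara-arz",
    "eng-gbr", "eng-usg",
    "qsl-pol", "qsl-rus",
    "por-brz", "spa-car", "spa-eur", "spa-lac",
    "zho-cmn", "zho-nan",
]


def split_variety(code):
    """Split a code such as 'ara-acm' into (prefix, variety)."""
    if len(code) != 7 or code[3] != "-":
        raise ValueError(f"not a 7-character prefix-variety code: {code!r}")
    return code[:3], code[4:]


def by_prefix(codes):
    """Map each three-letter prefix to the varieties that carry it."""
    groups = {}
    for code in codes:
        prefix, variety = split_variety(code)
        groups.setdefault(prefix, []).append(variety)
    return groups


if __name__ == "__main__":
    for prefix, varieties in by_prefix(CODES).items():
        print(prefix, varieties)
```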

All of the "train" audio files are single-channel NIST SPHERE files with an
8-kHz sample rate, but they vary in sample encoding: most (about 70%) are
mu-law, and the rest are either A-law or 16-bit PCM (depending on what was
used in the original collection of the data).

The "dev" audio files are also all single-channel, but vary in format: either
SPHERE or FLAC-compressed MSWAV (RIFF).  All "*.flac" files are 16-bit PCM
with a 44.1-kHz sample rate; the "*.sph" files are all 8-kHz, with either
mu-law or 16-bit PCM samples.
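
Because sample encoding varies from file to file, it can help to read it from
each SPHERE header before decoding.  The sketch below parses the plain-text
SPHERE header, assuming the usual layout: a "NIST_1A" line, a header-size
line, "name -type value" fields, and a closing "end_head" line.  The sample
header is synthetic, and its field values (for example "ulaw") are assumptions
rather than values copied from files in this release.

```python
def parse_sphere_header(raw):
    """Parse the ASCII header at the start of a NIST SPHERE file.

    `raw` is a bytes object holding at least the header.
    Returns (header_size_in_bytes, dict_of_fields).
    """
    lines = raw.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("missing NIST_1A signature")
    header_size = int(lines[1].strip())
    fields = {}
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # "-sN": string of N bytes
            fields[name] = value
    return header_size, fields


# Synthetic header for illustration only (not taken from the corpus).
SAMPLE = (
    b"NIST_1A\n   1024\n"
    b"channel_count -i 1\n"
    b"sample_rate -i 8000\n"
    b"sample_coding -s4 ulaw\n"
    b"end_head\n"
)

if __name__ == "__main__":
    size, fields = parse_sphere_header(SAMPLE)
    print(size, fields)
```

For a real file, the same function could be applied to the first bytes read
with open(path, "rb").read(1024), assuming a 1024-byte header.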


2.0 Directory Structure and Contents

The directory structure is as follows:

   ./docs/ -- contains 6 files (see section 3.0)
   ./data/
      dev/    --  3661 audio files, 62 hours
      train/  -- 15904 audio files, 2066.5 hours in 14 subdirectories

2.1 Distribution of dev data by language

lng      nsegs  hours
---------------------
ara-acm    312    4.5
ara-apc    269    5.4
ara-ary    299    5.0
ara-arz    267    2.3
eng-gbr    281    3.2
eng-usg    272    3.9
por-brz    247    5.0
qsl-pol    241    4.9
qsl-rus    165    4.8
spa-car    152    5.0
spa-eur    259    4.7
spa-lac    332    5.5
zho-cmn    264    3.8
zho-nan    301    4.0

2.2 Distribution of train data by language

lng      nsegs  hours
---------------------
ara-acm   1306  129.9
ara-apc   3409  439.8
ara-ary    819   80.9
ara-arz    440  190.9
eng-gbr     98    4.8
eng-usg   2448  327.7
por-brz    444    4.1
qsl-pol    587   59.3
qsl-rus   1221   69.5
spa-car    688  166.3
spa-eur    121   24.7
spa-lac    898  175.9
zho-cmn   3330  379.4
zho-nan     95   13.3


3.0 Summary of documentation

The files in docs/ are described in the following subsections.

3.1  data_md5s.txt

This is a two-column list of all audio files under ./data/; each line contains
the MD5 checksum, then two spaces, then the path/name of the file (relative to
the data/ directory).
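
The checksum list can be used to verify a local copy of the audio.  The
following Python sketch, using only the standard library, reads lines in the
format described above (checksum, two spaces, relative path) and reports any
file whose MD5 does not match.  The function names and the example paths in
the comments are hypothetical.

```python
import hashlib
import os


def parse_md5_line(line):
    """Split 'checksum<two spaces>relative/path' into its two fields."""
    checksum, rel_path = line.rstrip("\n").split("  ", 1)
    return checksum, rel_path


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the hex MD5 of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(list_path, data_dir):
    """Return the relative paths whose checksum differs from the list."""
    bad = []
    with open(list_path, encoding="ascii") as fh:
        for line in fh:
            if not line.strip():
                continue
            expected, rel_path = parse_md5_line(line)
            actual = md5_of_file(os.path.join(data_dir, rel_path))
            if actual != expected.lower():
                bad.append(rel_path)
    return bad


# Example call (paths are assumptions about where the release is unpacked):
#   find_mismatches("docs/data_md5s.txt", "data")
```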

3.2  dev_info.tab

This is a table of 7 columns, with one row for each of the dev segments; the
first row of the file contains column labels:

    1.	language_code
    2.	segmentid
    3.	sample_coding
    4.	file_duration
    5.	sample_rate
    6.	length_condition
    7.	data_source

3.3  train_info.tab

This is a table of 4 columns, with one row for each of the train segments; the
first row of the file contains column labels:

    1.	language_code
    2.	segmentid
    3.	sample_coding
    4.	file_duration
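
The per-language counts in section 2 could be re-derived from either info
table.  The sketch below rests on two assumptions not stated in this README:
that the ".tab" files are tab-delimited, and that file_duration is expressed
in seconds.  It uses only the language_code and file_duration columns, which
both tables share.  The function name and the sample rows are hypothetical.

```python
import csv
import io
from collections import defaultdict


def hours_by_language(lines):
    """Return {language_code: (segment_count, total_hours)}.

    `lines` is any iterable of text lines whose first line holds the
    column labels (assumed tab-delimited).
    """
    reader = csv.DictReader(lines, delimiter="\t")
    counts = defaultdict(int)
    seconds = defaultdict(float)
    for row in reader:
        lang = row["language_code"]
        counts[lang] += 1
        # Assumption: file_duration is in seconds.
        seconds[lang] += float(row["file_duration"])
    return {lang: (counts[lang], seconds[lang] / 3600.0) for lang in counts}


# Made-up rows in the 4-column train_info.tab layout, for illustration only.
SAMPLE = io.StringIO(
    "language_code\tsegmentid\tsample_coding\tfile_duration\n"
    "ara-acm\tseg_a\tulaw\t1800.0\n"
    "ara-acm\tseg_b\tulaw\t1800.0\n"
    "zho-nan\tseg_c\tpcm\t900.0\n"
)

if __name__ == "__main__":
    for lang, (nsegs, hours) in sorted(hours_by_language(SAMPLE).items()):
        print(f"{lang}  {nsegs:5d}  {hours:6.1f}")
```

With a real table, the function could be passed an open file handle, for
example open("docs/train_info.tab", newline="").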

3.4  lre17_dev_trials.txt, lre17_dev_segments.key

These two files were supplied by NIST as the sole documentation for the
original distribution of the LRE17 Development Test Set.

"lre17_dev_trials.txt" is simply the list of dev segment file names.

"lre17_dev_segments.key" is a table of four space-separated columns
(segmentid, language_code, data_source, speech_duration), which are a subset
of the columns found in "dev_info.tab" above.  Note that speech_duration is a
duration category based roughly on the amount of speech in the segment, not
the full duration of the segment based on sample count; the same value is
labeled "length_condition" in "dev_info.tab".

3.5  README.txt -- this file.


4.0 Known Issues

The set of development segments is known to include two files with identical
content:

  dev/lre17_bjfsfjit.flac
  dev/lre17_ytgfvwpa.flac

In the initial, limited release of the training set (as provided to the
participants in LRE17), there were a few hundred cases of duplicate files;
those duplicates have been eliminated, so the current release contains only
unique training files.

==================
README file created by David Graff, June 11, 2021
            updated by Stephanie Strassel, January 31, 2022