README FILE FOR:  2017 NIST Language Recognition Evaluation
                  Training and Development Sets

LDC Catalog-ID: ...

Authors: Craig Greenberg, Omid Sadjadi, Doug Reynolds, Elliot Singer,
         David Graff


1.0 Introduction

This release contains audio data that was designated for use as
training and development test material in the 2017 NIST Language
Recognition Evaluation (LRE17).

Much of the data has appeared in previous LDC publications, including
"CallFriend" corpora, earlier NIST LRE test sets, "Fisher" telephone
collections, the "VAST" video/audio collection, and other speech
corpora in various languages.  Some data in this release previously
appeared only in restricted LDC distributions (provided only to
participants and/or sponsors of earlier projects and evaluations), or
are being released through the LDC for the first time (having been
collected and made available by other organizations in the past for
various projects).

The data in both "train" and "dev" partitions cover the 14 distinct
language varieties used in the LRE17 test set.  Each language variety
is identified by a 7-character string, where the first three letters
represent a language group, and the last three letters represent a
language, dialect or variety within that group, as follows:

  ara-acm : Arabic, Iraqi
  ara-apc : Arabic, Levantine
  ara-ary : Arabic, Maghrebi
  ara-arz : Arabic, Egyptian
  eng-gbr : English, British
  eng-usg : English, General American
  qsl-pol : Slavic, Polish
  qsl-rus : Slavic, Russian
  por-brz : Portuguese, Brazilian (grouped with Spanish as 'Iberian')
  spa-car : Spanish, Caribbean
  spa-eur : Spanish, European
  spa-lac : Spanish, Latin American Continental
  zho-cmn : Chinese, Mandarin
  zho-nan : Chinese, Min Nan

All of the "train" audio files are single-channel, 8-KHz sample rate
in NIST SPHERE format, but vary in sample encoding: most (about 70%)
are mu-law, and the rest are either A-law or 16-bit PCM (depending on
what was used in the original collection of the data).
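
Because the sample encoding varies from file to file, it can be useful
to inspect the SPHERE header before decoding.  The following is a
minimal sketch, not part of the original distribution.  It assumes the
usual SPHERE header layout: an ASCII header that begins with the line
"NIST_1A", then a line giving the header size in bytes, then one
"name -type value" line per field, ending with "end_head".  The
function names are hypothetical.

```python
def parse_sphere_header(text):
    """Parse the ASCII text of a SPHERE header into a dict of fields."""
    fields = {}
    # The first two lines are the "NIST_1A" marker and the header size.
    for line in text.splitlines()[2:]:
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):
            continue  # skip blank lines and comment lines
        parts = line.split(None, 2)
        if len(parts) != 3:
            continue  # not a "name -type value" line
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # "-sN": string value of length N
            fields[name] = value
    return fields


def read_sphere_header(path):
    """Read and parse the header of the SPHERE file at `path`."""
    with open(path, "rb") as f:
        if f.readline().strip() != b"NIST_1A":
            raise ValueError(f"{path}: not a NIST SPHERE file")
        header_size = int(f.readline().decode("ascii").strip())
        f.seek(0)
        text = f.read(header_size).decode("ascii", errors="replace")
    return parse_sphere_header(text)
```

Calling read_sphere_header() on a train file and looking up the
"sample_coding" and "sample_rate" fields should show which encoding
that file uses.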

The "dev" audio files are also all single-channel, but vary in format:
either SPHERE or FLAC-compressed MSWAV (RIFF).  All "*.flac" files are
16-bit PCM, 44.1 KHz sample rate; the "*.sph" files are all 8-KHz,
with either mu-law or 16-bit PCM samples.


2.0 Directory Structure and Contents

The directory structure is as follows:

  ./docs/      -- contains 6 files (see section 3.0)
  ./data/
      dev/     -- 3661 audio files, 62 hours
      train/   -- 15904 audio files, 2066.5 hours in 14 subdirectories

2.1 Distribution of dev data by language

  lng      nsegs  hours
  ---------------------
  ara-acm    312    4.5
  ara-apc    269    5.4
  ara-ary    299    5.0
  ara-arz    267    2.3
  eng-gbr    281    3.2
  eng-usg    272    3.9
  por-brz    247    5.0
  qsl-pol    241    4.9
  qsl-rus    165    4.8
  spa-car    152    5.0
  spa-eur    259    4.7
  spa-lac    332    5.5
  zho-cmn    264    3.8
  zho-nan    301    4.0

2.2 Distribution of train data by language

  lng      nsegs  hours
  ---------------------
  ara-acm   1306  129.9
  ara-apc   3409  439.8
  ara-ary    819   80.9
  ara-arz    440  190.9
  eng-gbr     98    4.8
  eng-usg   2448  327.7
  por-brz    444    4.1
  qsl-pol    587   59.3
  qsl-rus   1221   69.5
  spa-car    688  166.3
  spa-eur    121   24.7
  spa-lac    898  175.9
  zho-cmn   3330  379.4
  zho-nan     95   13.3


3.0 Summary of documentation

The files in docs/ are described in the following subsections.

3.1 data_md5s.txt

This is a two-column list of all audio files under ./data/; each line
contains the MD5 checksum, then two spaces, then the path/name of the
file (relative to the data/ directory).

3.2

This is a table of 7 columns, with one row for each of the dev
segments; the first row of the file contains column labels:

  1. language_code
  2. segmentid
  3. sample_coding
  4. file_duration
  5. sample_rate
  6. length_condition
  7. data_source

3.3

This is a table of 4 columns, with one row for each of the train
segments; the first row of the file contains column labels:

  1. language_code
  2. segmentid
  3. sample_coding
  4. file_duration

3.4 lre17_dev_trials.txt, lre17_dev_segments.key

These two files were supplied by NIST as the sole documentation for
the original distribution of the LRE17 Development Test Set;
"trials.txt" is simply the list of dev segment file names;
"segments.key" is a table of four space-separated columns, which are
simply a subset of the ones found in "" above (segmentid,
language_code, data_source, speech_duration -- note that this last
column represents a duration category based roughly on the amount of
speech in the segment, rather than the full duration of the segment
based on sample count; this value is labeled "length_condition" in
"").

3.5 README.txt -- this file.


4.0 Known Issues

The set of development segments is known to include two files with
identical content:

  dev/lre17_bjfsfjit.flac
  dev/lre17_ytgfvwpa.flac

In the initial, limited release of the training set (as provided to
the participants in LRE2017), there had been a few hundred cases of
duplicate files; those duplicates have been eliminated, so that the
current release contains only unique training files.

==================
README file created by David Graff, June 11, 2021
  updated by Stephanie Strassel, January 31, 2022
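
The checksum list described in section 3.1 (MD5 checksum, two spaces,
then a path relative to data/) can be used both to check file
integrity and to look for repeated files such as the pair noted in
section 4.0.  The following is a minimal sketch, not part of the
original distribution.  The function names are hypothetical, and the
duplicate check assumes that "identical content" means byte-identical
files with equal checksums, which this README does not state.

```python
import hashlib
from collections import defaultdict


def parse_md5_list(lines):
    """Yield (checksum, path) pairs from '<md5>  <path>' lines."""
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # ignore blank lines
        checksum, path = line.split("  ", 1)
        yield checksum.lower(), path


def duplicate_groups(entries):
    """Return sorted lists of paths that share the same checksum."""
    by_sum = defaultdict(list)
    for checksum, path in entries:
        by_sum[checksum].append(path)
    return [sorted(paths) for paths in by_sum.values() if len(paths) > 1]


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()
```

To verify a copy of the data, each listed checksum can be compared
with the result of md5_of_file() on the corresponding path under
data/; passing the lines of docs/data_md5s.txt through
parse_md5_list() and duplicate_groups() lists any paths whose
checksums coincide.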