2011 NIST Language Recognition Evaluation Test Set

Authors: Craig Greenberg, Alvin Martin, David Graff, Kevin Walker, Karen Jones,
         Stephanie Strassel

1.0 Introduction

The goal of the NIST (National Institute of Standards and Technology) Language
Recognition Evaluation (LRE) is to establish the baseline of current performance
capability for language recognition in conversational telephone speech, and to
lay the groundwork for further research efforts in the field. NIST conducted
prior language recognition evaluations in 1996, 2003, 2005, 2007 and 2009. The
2011 NIST Language Recognition Evaluation Plan (LRE11) can be found at:

  https://www.nist.gov/sites/default/files/documents/itl/iad/mig/LRE11_EvalPlan_releasev1.pdf

The NIST Language Recognition Evaluation for 2011 (LRE11) makes use of 10,730
manually-audited speech segments in 24 distinct language varieties, and presumes
that relevant audio segments from prior LRE cycles are available for use as
training data. The audio presented in this release was selected from recordings
made by the Linguistic Data Consortium (LDC), including both Conversational
Telephone Speech (CTS) and narrow-band speech in broadcast audio (BNBS).

The data distributed by NIST to participants in LRE11 includes training data for
nine language varieties that had not been represented in prior LRE cycles. The
training data is structured as follows (original NIST distribution labels are
shown in parentheses):

  - 893 audited segments of roughly 30 seconds duration each (r136_1_1), 8-kHz
  - 400 full-length CTS recordings (r137_1_1), 8-kHz

(Note: the original NIST distribution to LRE11 participants also included a set
of 59 full-length program recordings from a handful of Arabic broadcast sources.
Due to constraints on intellectual property rights, the LDC is unable to include
these full-length broadcasts in the present release; only the manually-audited
30-sec BNBS segments from these programs are provided here.)

The evaluation test set (NIST distribution label r139_1_1) comprises a total of
29,511 audio files, all manually audited at the LDC for language, and divided
equally into three test conditions according to the nominal amount of speech
content per segment:

  - 9837 segments of 2 to 4 seconds
  - 9837 segments of 7 to 13 seconds
  - 9837 segments of over 13 seconds (subgrouped: 1311 of 13-25 sec; 8526 of
    25-35 sec)

Note that all the segments in the two shorter categories were extracted from
segments in the longest category. (LDC auditing was performed only on the long
segments.) The answer-key table (see section 4.1) identifies the relationships
between the shorter and longer files. The actual durations of the data files
vary with the amount of non-speech content per segment.

Also note that the 893 training segments described above were included among the
test set inventory, and received the same treatment of extracting two
shorter-duration segments from each original audited segment. The answer-key
table explicitly marks all of these segments (893 * 3 = 2679 answer-key entries)
to indicate that they are not to be used in scoring when system performance is
reported. (An additional 38 audited segments -- 114 test files across the three
duration conditions -- have also been marked as not scorable, for other reasons.)
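The sketch below illustrates, in Python, how the evaluation answer key
(docs/NIST_LRE11_EVAL_DATA_KEY.v0.tab, section 4.1) can be used to recover the
grouping of the shorter test files with the long audited segments they were cut
from, while skipping the non-scorable entries. It assumes the key's header line
uses the column names listed in section 4.1, and that the shorter extracts carry
the same "ldcid" value as their long source segment; the path is relative to the
root of this release.

    import csv
    from collections import defaultdict

    key_path = "docs/NIST_LRE11_EVAL_DATA_KEY.v0.tab"    # relative to the release root

    by_source = defaultdict(list)                        # ldcid -> [(condition, segmentid)]
    with open(key_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["is_scored"] != "Y":                  # drop training-derived/unsuitable entries
                continue
            by_source[row["ldcid"]].append((row["duration_category"], row["segmentid"]))

    # Each scorable audited segment should yield three test files, one per duration condition.
    complete = sum(1 for segs in by_source.values() if len(segs) == 3)
    print(len(by_source), "audited sources;", complete, "with all three duration conditions")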
2.0 Content Summary by Language

The tables below give a breakdown of data quantity by language and genre (CTS,
BNBS) for each partition (evaluation, training); note that the minutes columns
are based on summing the audio durations of the files, and the actual amount of
speech data may be slightly less.

Table 1: Evaluation Test Set

  Language          BNBS segs  BNBS mins  CTS segs  CTS mins  Total segs  Total mins
  arabic_iraqi              0        0.0       408     224.3         408       224.3
  arabic_levantine          0        0.0       408     224.0         408       224.0
  arabic_maghrebi           0        0.0       405     220.8         405       220.8
  arabic_msa              406      222.6         0       0.0         406       222.6
  bengali                 227      115.4       220     120.4         447       235.8
  czech                   279      127.4       179     100.0         458       227.4
  dari                    376      168.2        27      15.0         403       183.2
  english_american        331      107.7       121      66.8         452       174.5
  english_indian          366      191.4        50      27.6         416       219.0
  farsi                   208      113.8       197     111.1         405       224.9
  hindi                   348      131.5        70      38.7         418       170.2
  lao                     125       40.1       126      67.9         251       108.0
  mandarin                173       77.0       259     141.3         432       218.3
  panjabi                  11        4.7       397     218.3         408       223.0
  pashto                  257      135.1       155      85.9         412       221.0
  polish                  239      100.5       242     136.4         481       236.9
  russian                 302      165.6       139      78.0         441       243.6
  slovak                  242      123.4       172      95.7         414       219.1
  spanish                 188      103.1       231     126.9         419       230.0
  tamil                   214      117.1       200     111.0         414       228.1
  thai                    338      176.7        65      36.2         403       212.9
  turkish                 305      105.5       167      93.5         472       199.0
  ukrainian                67       32.5       119      66.9         186        99.4
  urdu                    256      140.3       222     121.0         478       261.3

Table 2: Training Data

  Language          BNBS segs  BNBS mins  CTS segs  CTS mins  Total segs  Total mins
  arabic_iraqi              0        0.0       100      54.9         100        54.9
  arabic_levantine          0        0.0       100      54.8         100        54.8
  arabic_maghrebi           0        0.0       100      54.2         100        54.2
  arabic_msa              100       54.8         0       0.0         100        54.8
  czech                     0        0.0       100      55.5         100        55.5
  lao                       0        0.0        93      49.8          93        49.8
  panjabi                   0        0.0       100      54.8         100        54.8
  polish                    0        0.0       100      56.2         100        56.2
  slovak                    0        0.0       100      56.1         100        56.1

3.0 Data Collection Methods

The LRE11 data was collected by the LDC in 2010 and 2011.

The CTS data was obtained using a "claque" collection model, in which recruited
speakers (claques) call friends or relatives in their own social network for a
10-minute conversation in the claque's native language, such that each call
involved a unique callee. Participants were free to speak on topics of their own
choosing. All calls were routed through a telephone collection system at the LDC,
which stored the raw mu-law sample stream into a separate audio file for each
call side. Auditing and selection were applied to the callee side of every call,
and to the caller (claque) side of at most one call made by each claque.
Contiguous regions containing between 25 and 35 seconds of speech were identified
by signal analysis and extracted for manual audit. In some cases, shorter
segments (down to a minimum of 13 seconds) were also selected for audit.

Broadcast audio was recorded either by capturing satellite-receiver MPEG streams
or by digitizing the output of analog audio receivers at 16 kHz. Platforms for
data capture were located at the LDC and also in Tunisia and India. Recordings
were analyzed to extract contiguous segments of narrow-band speech of at least
33 seconds duration; longer segments were trimmed to a maximum length of 35
seconds for audit.

All audited segments for training and test are presented as 8-kHz, 16-bit PCM,
single-channel audio files with NIST SPHERE headers.
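Since all audited segments share this one audio format, the following minimal
Python sketch shows one way to read such a file using only the standard library.
It assumes an uncompressed 16-bit PCM payload (as stated above) and a plain-text
SPHERE header whose total size appears on its second line; the file name in the
usage line is hypothetical.

    import struct

    def read_sphere_pcm(path):
        """Return (sample_rate, samples) for a single-channel 16-bit PCM SPHERE file."""
        with open(path, "rb") as f:
            f.readline()                                  # magic line, b"NIST_1A\n"
            header_size = int(f.readline().strip())       # total header size in bytes, e.g. 1024
            fields = {}
            header_text = f.read(header_size - f.tell()).decode("ascii", "replace")
            for line in header_text.splitlines():
                parts = line.split(None, 2)               # e.g. "sample_rate -i 8000"
                if parts and parts[0] == "end_head":
                    break
                if len(parts) == 3:
                    fields[parts[0]] = parts[2]
            f.seek(header_size)                           # sample data begins right after the header
            n = int(fields["sample_count"])
            endian = "<" if fields.get("sample_byte_format", "01") == "01" else ">"
            samples = struct.unpack(f"{endian}{n}h", f.read(2 * n))
        return int(fields["sample_rate"]), samples

    # Hypothetical file name; actual paths are listed in docs/audio_info.tab.
    rate, samples = read_sphere_pcm("data/lre11_example.sph")
    print(rate, "Hz,", len(samples) / rate, "seconds")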
4.0 Documentation and Answer-Key Data

In addition to this README.txt file, the "docs/" directory of this release
contains the following:

4.1 NIST_LRE11_EVAL_DATA_KEY.v0.tab (header line + 29,511 data rows)

The columns of this tab-delimited file are:

   1  segmentid                -- file-ID ("lre11....") of test-segment audio
   2  ldcid                    -- original segment-ID used in selection and auditing
   3  language                 -- labels as tabulated in section 2
   4  duration_category        -- 02sec-04sec, 07sec-13sec, 13sec-25sec, 25sec-35sec
   5  source                   -- telephone or broadcast_...(channel_label)...
   6  is_only_speech           -- Y/N
   7  is_single_speaker        -- Y/N
   8  sounds_like_narrow_band  -- Y/N
   9  speaker_sex              -- M/F
  10  speaker_nativeness       -- normal, marked, non-native
  11  noise_amount             -- easy, middling, difficult
  12  noise_type               -- 0 or more of: bgnoise distortion dropouts interference other
  13  noise_comment            -- empty or free text
  14  speaker_comment          -- empty or free text
  15  language_comment         -- empty or free text
  16  is_scored                -- Y/N

Columns 12-15 may be empty, or may contain one or more space-separated strings.
Segments that have "N" in column 16 ("is_scored") must not be used when scoring
system performance; these segments contain audio that was also used in the
training partition, or was deemed unsuitable for test purposes.

4.2 NIST_LRE11_DEV_DATA_KEY.v2.tab (header line + 893 data rows)

The columns of this tab-delimited file are:

   1  segmentid                -- segment file-ID (6 digits . 6 digits . 4 digits)
   2  sourceid                 -- file-ID of original full-length source recording
   3  language                 -- labels as tabulated in section 2
   4  is_only_speech           -- Y/N
   5  is_single_speaker        -- Y/N
   6  sounds_like_narrow_band  -- Y/N
   7  speaker_sex              -- M/F
   8  speaker_nativeness       -- normal, marked, non-native
   9  noise_amount             -- easy, middling, difficult
  10  noise_type               -- 0 or more of: bgnoise distortion dropouts interference other
  11  noise_comment            -- empty or free text
  12  speaker_comment          -- empty or free text
  13  language_comment         -- empty or free text

Columns 10-13 may be empty, or may contain one or more space-separated strings.

For the entries in the "EVAL" table that were derived from training data (most
of the rows having an "is_scored" value of "N" in column 16), the "ldcid" value
in that table matches the "segmentid" value in the "DEV" table.

4.3 audio_info.tab (header line + 30,804 data rows)

The columns of this tab-delimited file are:

  1  file_id        -- file name minus extension
  2  file_path      -- directory_path/filename.extension (relative to "data/")
  3  channel_count  -- 1 or 2
  4  sample_rate    -- 8000
  5  duration_sec   -- floating-point number

4.4 file.tbl (no header)

This is a 4-column, space-separated table with one line per data file in this
release. Variable spacing is used for vertical column alignment; the columns are:

  1  MD5 checksum
  2  file size in bytes
  3  file modification date_time
  4  file path (relative to the base directory of the release package)

A sketch showing how this table can be used to verify the release contents
appears at the end of this README.

-------------
README file created by David Graff, Feb. 16, 2017
modified by David Graff, March 3, 2017
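As referenced in section 4.4, the following minimal Python sketch verifies the
release contents against file.tbl by recomputing each file's MD5 checksum and
byte size; the base-directory path is a placeholder.

    import hashlib
    import os

    base = "/path/to/this/release"                       # placeholder for the package location
    with open(os.path.join(base, "file.tbl")) as table:
        for line in table:
            parts = line.split()
            if not parts:
                continue
            md5, size, relpath = parts[0], parts[1], parts[-1]   # columns 1, 2 and 4
            with open(os.path.join(base, relpath), "rb") as f:
                data = f.read()
            if hashlib.md5(data).hexdigest() != md5 or len(data) != int(size):
                print("MISMATCH:", relpath)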