2022 NIST Language Recognition Evaluation Test and Development Sets

Authors

NIST: Craig Greenberg, Yooyoung Lee, Asad Butt
LDC:  Kevin Walker, Karen Jones, Christopher Caruso, Jonathan Wright,
      Stephanie Strassel

1.0 Introduction

This release comprises the evaluation test and development sets for the 2022
NIST Language Recognition Evaluation (LRE22). LRE is an ongoing evaluation
series designed to measure how well systems can automatically detect a target
language given a test segment. The LRE22 evaluation involved both
conversational telephone speech (CTS) data and broadcast narrowband speech
(BNBS) data, with an emphasis on low-resource African languages.

This package contains a total of 30,673 LRE22 test and development segments,
covering 14 evaluation languages, drawn from the following datasets collected
by LDC to support LRE:

  - Maghrebi Language Identification Corpus (MAGLIC)
  - Speech Archive of South African Languages (SASAL)
  - Low Resource African Languages (LRAL)

The test data, development data, and documentation as supplied to LRE22
evaluation participants are included in this package.
2.0 Directory Structure

The contents of this package are organized as follows:

  /data/{dev,eval}
  ./docs
    README.txt                           -- this file
    languages.tab
    lre22_testset_key.tsv
    odyssey24-maglic.pdf
    file_md5s.txt                        -- md5sums for the dev audio
    NIST_LRE22_eval_plan_2022-08-31.pdf
    lre22_dev_metadata.tsv
    lre22_eval_trials.tsv
    README_dev.txt                       -- more information about the dev
                                            data and documentation
    README_test.txt                      -- more information about the eval
                                            (test) data and documentation

3.0 Languages

LRE22 test data includes segments in 14 languages:

  language                                language_code
  Afrikaans                               afr-afr
  Tunisian Arabic                         ara-aeb
  Algerian Arabic                         ara-arq
  Libyan Arabic                           ara-ayl
  South African English                   eng-ens
  Indian-accented South African English   eng-iaf
  North African French                    fra-ntf
  Ndebele                                 nbl-nbl
  Oromo                                   orm-orm
  Tigrinya                                tir-tir
  Tsonga                                  tso-tso
  Venda                                   ven-ven
  Xhosa                                   xho-xho
  Zulu                                    zul-zul

4.0 Data Sources

Test segments were drawn from the MAGLIC, SASAL and LRAL corpora described
below.

SASAL: The SASAL Corpus was collected by LDC to support development and
testing of language recognition and related technologies. SASAL consists of
both CTS and BNBS data in a variety of South African languages. For the CTS
collection, a small number of native speakers known as "claques" were
recruited for each language to make single calls to multiple individuals in
their social network. Both claques and callees provided consent to be
recorded under a protocol approved by the University of Pennsylvania's IRB.
Calls lasted 8-15 minutes and speakers were free to discuss any topic. The
BNBS data were collected by LDC from streaming radio programming, focusing on
programs that included narrowband speech (e.g., call-ins to a talk show).
Portions of the CTS callee call sides and portions of each broadcast
recording were manually audited by native speakers to verify language and
quality.

MAGLIC: The MAGLIC Corpus consists of speech recordings in 3 varieties of
Maghrebi Arabic (Tunisian, Libyan and Algerian) and North African French.
These recordings were CTS only and collected in accordance with the CTS
protocol described above for the SASAL collection. Additional information
about the MAGLIC Corpus can be found in the following paper, included in this
release at /docs/odyssey24-maglic.pdf:

  Jones et al., MAGLIC: The Maghrebi Language Identification Corpus,
  Odyssey 2024: The Speaker and Language Recognition Workshop, Quebec,
  June 18-21.

LRAL: The Oromo and Tigrinya recordings were derived from broadcast
recordings collected off-the-air from broadcasts in Ethiopia and Eritrea.
This collection effort followed the same BNBS collection parameters as
described under the SASAL collection.

5.0 Data

Test and development segments were extracted by NIST from SASAL and MAGLIC
CTS callee call sides (and comparatively few claque sides), and from
narrowband portions of the SASAL and LRAL BNBS data. All test and dev
segments are presented as 8-bit (a-law) SPHERE files sampled at 8kHz.

The total number of segments per corpus, and the number of full recordings
represented, are summarized in the table below:

  Corpus   #test segs   #test ldc_audio_ids   #dev segs   #dev ldc_audio_ids
  MAGLIC         9349                   956        1200                  120
  LRAL           1280                   168         600                   60
  SASAL         15844                  1650        2400                  240
  TOTAL         26473                  2774        4200                  420

The ldc_audio_ids are the stretches of audited source recordings from which
the test segments were extracted. The amount of speech contained in the test
segments ranges from 3 to 30 seconds, whereas the amount of speech in the dev
segments ranges from 3.02 to 93.15 seconds.
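Each segment file begins with a plain-text SPHERE ("NIST_1A") header that
records properties such as the sample rate and coding. The sketch below
parses that header; the function name and its output dictionary are
illustrative, not part of this release:

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the plain-text NIST_1A header at the start of a SPHERE file."""
    # A SPHERE file starts with the magic line "NIST_1A", then a line giving
    # the header size in bytes, then "key -type value" lines up to "end_head".
    lines = raw.decode("ascii", errors="replace").splitlines()
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE file")
    fields = {}
    for line in lines[2:]:              # skip the magic and header-size lines
        if line.strip() == "end_head":
            break
        key, ftype, value = line.split(None, 2)
        if ftype == "-i":               # integer field
            fields[key] = int(value)
        elif ftype == "-r":             # real-valued field
            fields[key] = float(value)
        else:                           # -sN: string field of N bytes
            fields[key] = value
    return fields
```

For the segments in this package, one would expect sample_rate 8000,
sample_n_bytes 1, and an a-law sample_coding, matching the 8-bit a-law,
8kHz description above.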
The genre breakdown of test and dev segments by language is as follows:

  lang code   #test CTS segs   #test BNBS segs   #dev CTS segs   #dev BNBS segs
  afr-afr               2133                 0             300                0
  ara-aeb               2401                 0             300                0
  ara-arq               2622                 0             300                0
  ara-ayl               2332                 0             300                0
  eng-ens               1353               159             280               20
  eng-iaf                260               310             160              140
  fra-ntf               1994                 0             300                0
  nbl-nbl                647              1414              70              230
  orm-orm                  0               383               0              300
  tir-tir                  0               897               0              300
  tso-tso               2439                 0             300                0
  ven-ven               1463               731             160              140
  xho-xho               1556              1213             100              200
  zul-zul               2166                 0             300                0
  TOTALS               21366              5107            2870             1330

6.0 Metadata and Answer Keys

The ./docs directory includes:

6.1 languages.tab

This table lists the language varieties included in the LRE22 evaluation
along with their six-letter LDC language codes. Fields are:

  language_code - six-letter LDC code
  language      - name of language variety

6.2 lre22_testset_key.tsv

This table reveals the language for each segment. Fields are:

  segmentid
  language_code
  ldc_audio_id

Example:

  segmentid          language_code   ldc_audio_id
  lre22_test_aaagk   zul-zul         7042910

6.3 odyssey24-maglic.pdf

This paper describes the design and creation of the MAGLIC collection.

6.4 Additional Metadata

README_dev.txt and README_test.txt describe the dev and test packages as
distributed to LRE22 participants, along with the documentation that was
included in each.

-------------
README file created by Karen Jones, December 20, 2024
Updated by Dana Delgado, December 16, 2025