2022 NIST Language Recognition Evaluation Test and Development Sets

Authors

NIST: Craig Greenberg, Yooyoung Lee, Asad Butt
LDC:  Kevin Walker, Karen Jones, Christopher Caruso, Jonathan Wright,
      Stephanie Strassel

1.0 Introduction

This release comprises the evaluation test and development sets for the 2022
NIST Language Recognition Evaluation (LRE22). LRE is an ongoing evaluation
series designed to measure how well systems can automatically detect a target
language given a test segment. The LRE22 evaluation involved both
conversational telephone speech (CTS) data and broadcast narrowband speech
(BNBS) data, with an emphasis on low-resource African languages.

This package contains a total of 30,673 LRE22 test and development segments,
covering 14 evaluation languages, drawn from the following datasets collected
by LDC to support LRE:

  - Maghrebi Language Identification Corpus (MAGLIC)
  - Speech Archive of South African Languages (SASAL)
  - Low Resource African Languages (LRAL)

The test data, development data, and documentation as supplied to LRE22
evaluation participants are included in this package.
2.0 Directory Structure

The contents of this package are organized as follows:

  /data/{dev,eval}
  ./docs
    README.txt                           -- this file
    languages.tab
    lre22_testset_key.tsv
    odyssey24-maglic.pdf
    file_md5s.txt                        -- md5sums for the dev audio
    NIST_LRE22_eval_plan_2022-08-31.pdf
    lre22_dev_metadata.tsv
    lre22_eval_trials.tsv
    README_dev.txt                       -- more information about the dev
                                            data and documentation
    README_test.txt                      -- more information about the eval
                                            (test) data and documentation

3.0 Languages

LRE22 test data includes segments in 14 languages:

  language                                language_code
  Afrikaans                               afr-afr
  Tunisian Arabic                         ara-aeb
  Algerian Arabic                         ara-arq
  Libyan Arabic                           ara-ayl
  South African English                   eng-ens
  Indian-accented South African English   eng-iaf
  North African French                    fra-ntf
  Ndebele                                 nbl-nbl
  Oromo                                   orm-orm
  Tigrinya                                tir-tir
  Tsonga                                  tso-tso
  Venda                                   ven-ven
  Xhosa                                   xho-xho
  Zulu                                    zul-zul

4.0 Data Sources

Test segments were drawn from the MAGLIC, SASAL and LRAL corpora described
below.

SASAL: The SASAL Corpus was collected by LDC to support development and
testing of language recognition and related technologies. SASAL consists of
both CTS and BNBS data in a variety of South African languages. For the CTS
collection, a small number of native speakers known as "claques" were
recruited for each language to make single calls to multiple individuals in
their social network. Both claques and callees provided consent to be
recorded under a protocol approved by the University of Pennsylvania's IRB.
Calls lasted 8-15 minutes and speakers were free to discuss any topic. The
BNBS data were collected by LDC from streaming radio programming, focusing on
programs that included narrowband speech (e.g., call-ins to a talk show).
Portions of the CTS callee call sides and portions of each broadcast
recording were manually audited by native speakers to verify language and
quality.

MAGLIC: The MAGLIC Corpus consists of speech recordings in 3 varieties of
Maghrebi Arabic (Tunisian, Libyan and Algerian) and North African French.
These recordings were CTS only and collected in accordance with the CTS
protocol described above for the SASAL collection. Additional information
about the MAGLIC Corpus can be found in the following paper, included in this
release at /docs/odyssey24-maglic.pdf:

  Jones et al., MAGLIC: The Maghrebi Language Identification Corpus,
  Odyssey 2024: The Speaker and Language Recognition Workshop, Quebec,
  June 18-21.

LRAL: The Oromo and Tigrinya recordings were derived from broadcast
recordings collected off-the-air from broadcasts in Ethiopia and Eritrea.
This collection effort followed the same BNBS collection parameters as
described under the SASAL collection.

5.0 Data

Test and development segments were extracted by NIST from SASAL and MAGLIC
CTS callee call sides (and comparatively few claque sides), and from
narrowband portions of the SASAL and LRAL BNBS data. All test and dev
segments are presented as 8-bit (a-law) SPHERE files sampled at 8kHz.

The total number of segments per corpus, and the number of full recordings
represented, are summarized in the table below:

  Corpus   #test segs   #test ldc_audio_ids   #dev segs   #dev ldc_audio_ids
  MAGLIC         9349                   956        1200                  120
  LRAL           1280                   168         600                   60
  SASAL         15844                  1650        2400                  240
  TOTAL         26473                  2774        4200                  420

The ldc_audio_ids are the stretches of audited source recordings from which
the test segments were extracted. The amount of speech contained in the test
segments ranges from 3 to 30 seconds, whereas the amount of speech in the dev
segments ranges from 3.02 to 93.15 seconds.
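Each segment file begins with a plain-text SPHERE ("NIST_1A") header that
records properties such as the sample rate and coding. The sketch below
parses that header; the function name and its output dictionary are
illustrative, not part of this release:

```python
def parse_sphere_header(raw: bytes) -> dict:
    """Parse the plain-text NIST_1A header at the start of a SPHERE file."""
    # A SPHERE file starts with the magic line "NIST_1A", then a line giving
    # the header size in bytes, then "key -type value" lines up to "end_head".
    lines = raw.decode("ascii", errors="replace").splitlines()
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a SPHERE file")
    fields = {}
    for line in lines[2:]:              # skip the magic and header-size lines
        if line.strip() == "end_head":
            break
        key, ftype, value = line.split(None, 2)
        if ftype == "-i":               # integer field
            fields[key] = int(value)
        elif ftype == "-r":             # real-valued field
            fields[key] = float(value)
        else:                           # -sN: string field of N bytes
            fields[key] = value
    return fields
```

For the segments in this package, one would expect sample_rate 8000,
sample_n_bytes 1, and an a-law sample_coding, matching the 8-bit a-law,
8kHz description above.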
The genre breakdown of test and dev segments by language is as follows:

  lang code   #test CTS segs   #test BNBS segs   #dev CTS segs   #dev BNBS segs
  afr-afr               2133                 0             300                0
  ara-aeb               2401                 0             300                0
  ara-arq               2622                 0             300                0
  ara-ayl               2332                 0             300                0
  eng-ens               1353               159             280               20
  eng-iaf                260               310             160              140
  fra-ntf               1994                 0             300                0
  nbl-nbl                647              1414              70              230
  orm-orm                  0               383               0              300
  tir-tir                  0               897               0              300
  tso-tso               2439                 0             300                0
  ven-ven               1463               731             160              140
  xho-xho               1556              1213             100              200
  zul-zul               2166                 0             300                0
  TOTALS               21366              5107            2870             1330

6.0 Metadata and Answer Keys

The ./docs directory includes:

6.1 languages.tab

This table lists the language varieties included in the LRE22 evaluation
along with their six-letter LDC language codes. Fields are:

  language_code - six-letter LDC code
  language      - name of language variety

6.2 lre22_testset_key.tsv

This table reveals the language for each segment. Fields are:

  segmentid
  language_code
  ldc_audio_id

Example:

  segmentid          language_code   ldc_audio_id
  lre22_test_aaagk   zul-zul         7042910

6.3 odyssey24-maglic.pdf

This paper describes the design and creation of the MAGLIC collection.

6.4 Additional Metadata

README_dev.txt and README_test.txt describe the dev and test packages as
distributed to LRE22 participants, along with the documentation that was
included in each.

-------------
README file created by Karen Jones, December 20, 2024
Updated by Dana Delgado, December 16, 2025