
2022 NIST Language Recognition Evaluation Test and Development Sets

Item Name:	2022 NIST Language Recognition Evaluation Test and Development Sets
Author(s):	Craig Greenberg, Kevin Walker, Karen Jones, Jonathan Wright, Stephanie Strassel
LDC Catalog No.:	LDC2026S03
ISLRN:	266-982-188-107-3
DOI:	https://doi.org/10.35111/jzws-6m63
Release Date:	February 16, 2026
Member Year(s):	2026
DCMI Type(s):	Sound, Text
Sample Type:	8-bit a-law
Sample Rate:	8000 Hz
Data Source(s):	broadcast conversation, telephone speech
Project(s):	NIST LRE
Application(s):	language identification
Language(s):	Tunisian Arabic, Algerian Arabic, Libyan Arabic, South Ndebele, Oromo, Tigrinya, Tsonga, Venda, Xhosa, Zulu, Afrikaans, Algerian Saharan Arabic, English, French
Language ID(s):	aeb, arq, ayl, nbl, orm, tir, tso, ven, xho, zul, afr, aao, eng, fra
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2026S03 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Greenberg, Craig, et al. 2022 NIST Language Recognition Evaluation Test and Development Sets LDC2026S03. Web Download. Philadelphia: Linguistic Data Consortium, 2026.
Related Works:	isSimilarWith: LDC2006S31 2003 NIST Language Recognition Evaluation; LDC2008S05 2005 NIST Language Recognition Evaluation; LDC2009S04 2007 NIST Language Recognition Evaluation Test Set; LDC2014S06 2009 NIST Language Recognition Evaluation Test Set; LDC2018S06 2011 NIST Language Recognition Evaluation Test Set; LDC2022S10 2017 NIST Language Recognition Evaluation Training and Development Sets

Introduction

2022 NIST Language Recognition Evaluation Test and Development Sets was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). This release contains the test and development data, metadata, answer keys, and documentation for the 2022 NIST Language Recognition Evaluation (LRE22). The source speech data consists of approximately 222 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) in 14 languages: Afrikaans, Tunisian Arabic, Algerian Arabic, Libyan Arabic, South African English, Indian-accented South African English, North African French, Ndebele, Oromo, Tigrinya, Tsonga, Venda, Xhosa, and Zulu.

The goals of NIST's Language Recognition Evaluation are to advance language recognition technologies, to facilitate technology development, and to measure the performance of current state-of-the-art technology. LRE22 emphasized language recognition for African languages, including low-resource languages, and expanded the range of test segment durations. Further information about the 2022 evaluation can be found in the 2022 NIST Language Recognition Evaluation Plan.

Data

The test and development segments in this release were drawn from three datasets developed by LDC: the Speech Archive of South African Languages (SASAL) (CTS, BNBS), the Maghrebi Linguistic Information Corpus (MAGLIC) (CTS), and the Low Resource African Languages (LRAL) collection (BNBS).

For the SASAL CTS collection, a small number of native speakers known as "claques" were recruited for each language to make single calls to multiple individuals in their social network. Calls lasted 8-15 minutes and speakers were free to discuss any topic. The BNBS data was collected from streaming radio programming, focusing on programs that included narrowband speech (e.g., call-ins to a talk show). Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality.

MAGLIC consists of conversational telephone speech recordings in three varieties of Maghrebi Arabic (Tunisian, Libyan, and Algerian) and North African French, collected in accordance with the SASAL CTS protocol.

LRAL contains Oromo and Tigrinya narrowband speech from off-the-air broadcasts in Ethiopia and Eritrea, following the parameters used in the SASAL BNBS collection.

NIST extracted the test and development segments from SASAL and MAGLIC CTS callee call sides (and comparatively few claque sides) and from SASAL and LRAL BNBS data.

All test and development segments are presented as single channel, 8-bit a-law SPHERE files sampled at 8 kHz.
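
As an illustration of the sample format, the sketch below expands 8-bit a-law bytes into 16-bit linear PCM values using the standard G.711 a-law expansion, and derives a duration from the 8 kHz sample rate. It is not part of the release. The function names are hypothetical, and the code assumes the SPHERE header has already been stripped so that the input holds only sample bytes.

```python
# Minimal sketch: G.711 a-law expansion to 16-bit linear PCM.
# Assumption: `payload` holds raw a-law sample bytes only; reading and
# stripping the SPHERE file header is not shown here.

SAMPLE_RATE_HZ = 8000  # sample rate stated for this release


def alaw_to_linear(byte: int) -> int:
    """Expand one a-law byte (0-255) to a signed 16-bit linear value."""
    a = byte ^ 0x55                 # undo the alternate-bit inversion
    value = (a & 0x0F) << 4         # 4-bit mantissa
    segment = (a & 0x70) >> 4       # 3-bit segment (exponent)
    if segment == 0:
        value += 8
    else:
        value += 0x108
        if segment > 1:
            value <<= segment - 1
    # In a-law, a set sign bit marks a positive sample.
    return value if a & 0x80 else -value


def decode_alaw(payload: bytes) -> list[int]:
    """Decode a sequence of a-law bytes into linear PCM samples."""
    return [alaw_to_linear(b) for b in payload]


def duration_seconds(n_samples: int, rate: int = SAMPLE_RATE_HZ) -> float:
    """Single-channel a-law uses one byte per sample, so samples / rate."""
    return n_samples / rate
```

Because each single-channel a-law sample occupies one byte, a 30-second segment corresponds to 240,000 sample bytes after the header.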

Metadata for the development partition is provided as a tab-separated file listing the file name, language code, LDC audio identifier, source time offset, and duration for each audio segment.
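
A table with this layout could be read along the following lines. This is a sketch under assumptions: the column order follows the description above, but the presence of a header row, the units of the offset and duration fields (taken here to be seconds), and every value in the sample rows are invented for illustration and do not come from the release.

```python
import csv
import io
from collections import Counter
from dataclasses import dataclass


@dataclass
class Segment:
    file_name: str
    language: str
    audio_id: str      # LDC audio identifier
    offset_s: float    # source time offset (assumed to be seconds)
    duration_s: float  # segment duration (assumed to be seconds)


def read_segments(handle, skip_header: bool = True) -> list[Segment]:
    """Parse tab-separated rows in the column order described above."""
    reader = csv.reader(handle, delimiter="\t")
    if skip_header:
        next(reader, None)  # assumption: the first row names the columns
    return [
        Segment(row[0], row[1], row[2], float(row[3]), float(row[4]))
        for row in reader
        if row  # ignore blank lines
    ]


# Invented rows for illustration only; not real file names or identifiers.
SAMPLE = (
    "file_name\tlanguage\taudio_id\toffset\tduration\n"
    "seg_0001.sph\tafr\taudio_A\t12.5\t30.0\n"
    "seg_0002.sph\tzul\taudio_B\t0.0\t10.0\n"
    "seg_0003.sph\tafr\taudio_A\t60.0\t5.0\n"
)

segments = read_segments(io.StringIO(SAMPLE))

# Total duration per language code, in seconds.
seconds_by_language = Counter()
for seg in segments:
    seconds_by_language[seg.language] += seg.duration_s
```

With a real metadata file, `io.StringIO(SAMPLE)` would be replaced by an open file handle, for example `open(path, newline="", encoding="utf-8")`.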
