2022 NIST Language Recognition Evaluation Test and Development Sets
| Item Name: | 2022 NIST Language Recognition Evaluation Test and Development Sets |
| Author(s): | Craig Greenberg, Kevin Walker, Karen Jones, Jonathan Wright, Stephanie Strassel |
| LDC Catalog No.: | LDC2026S03 |
| ISLRN: | 266-982-188-107-3 |
| DOI: | https://doi.org/10.35111/jzws-6m63 |
| Release Date: | February 16, 2026 |
| Member Year(s): | 2026 |
| DCMI Type(s): | Sound, Text |
| Sample Type: | 8-bit a-law |
| Sample Rate: | 8000 |
| Data Source(s): | broadcast conversation, telephone speech |
| Project(s): | NIST LRE |
| Application(s): | language identification |
| Language(s): | Tunisian Arabic, Algerian Arabic, Libyan Arabic, South Ndebele, Oromo, Tigrinya, Tsonga, Venda, Xhosa, Zulu, Afrikaans, Algerian Saharan Arabic, English, French |
| Language ID(s): | aeb, arq, ayl, nbl, orm, tir, tso, ven, xho, zul, afr, aao, eng, fra |
| License(s): | LDC User Agreement for Non-Members |
| Online Documentation: | LDC2026S03 Documents |
| Licensing Instructions: | Subscription & Standard Members, and Non-Members |
| Citation: | Greenberg, Craig, et al. 2022 NIST Language Recognition Evaluation Test and Development Sets LDC2026S03. Web Download. Philadelphia: Linguistic Data Consortium, 2026. |
Introduction
2022 NIST Language Recognition Evaluation Test and Development Sets was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). This release contains the test and development data, metadata, answer keys, and documentation for the 2022 NIST Language Recognition Evaluation (LRE22). The source speech data consists of approximately 222 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) in 14 languages: Afrikaans, Tunisian Arabic, Algerian Arabic, Libyan Arabic, South African English, Indian-accented South African English, North African French, Ndebele, Oromo, Tigrinya, Tsonga, Venda, Xhosa, and Zulu.
The goals of NIST's Language Recognition Evaluation are to advance language recognition technologies, to facilitate technology development, and to measure the performance of current state-of-the-art technology. LRE22 emphasized language recognition for African languages, including low-resource languages, and expanded the range of test segment durations. Further information about the 2022 evaluation can be found in the 2022 NIST Language Recognition Evaluation Plan.
Data
The test and development segments in this release were drawn from three datasets developed by LDC: the Speech Archive of South African Languages (SASAL) (CTS, BNBS), the Maghrebi Linguistic Information Corpus (MAGLIC) (CTS), and the Low Resource African Languages (LRAL) collection (BNBS).
For the SASAL CTS collection, a small number of native speakers known as "claques" were recruited for each language to make single calls to multiple individuals in their social network. Calls lasted 8-15 minutes and speakers were free to discuss any topic. The BNBS data was collected from streaming radio programming, focusing on programs that included narrowband speech (e.g., call-ins to a talk show). Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality.
MAGLIC consists of conversational telephone speech recordings in three varieties of Maghrebi Arabic (Tunisian, Libyan, and Algerian) and North African French, collected in accordance with the SASAL CTS protocol.
LRAL contains Oromo and Tigrinya narrowband speech recorded off the air from broadcasts in Ethiopia and Eritrea, following the parameters used in the SASAL BNBS collection.
NIST extracted the test and development segments from SASAL and MAGLIC CTS callee call sides (plus comparatively few claque sides) and from SASAL and LRAL BNBS data.
All test and development segments are presented as single channel, 8-bit a-law SPHERE files sampled at 8 kHz.
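As an illustration of the audio format described above, the following Python sketch parses the ASCII header of a SPHERE file and expands 8-bit A-law samples to 16-bit linear PCM values using the standard G.711 decoding rule. It assumes the samples are stored uncompressed immediately after the header; whether the files in this release use SPHERE's optional compression is not stated here. The function names are illustrative only and are not part of any LDC or NIST tool.

```python
def parse_sphere_header(raw: bytes):
    """Parse the ASCII header at the start of a NIST SPHERE file.

    Returns (fields, header_size), where header_size is the byte offset
    at which the audio samples begin (commonly 1024).
    """
    magic, size_line = raw[:16].split(b"\n")[:2]
    if magic != b"NIST_1A":
        raise ValueError("not a NIST SPHERE file")
    header_size = int(size_line.decode("ascii").strip())

    fields = {}
    text = raw[16:header_size].decode("ascii", errors="replace")
    for line in text.split("\n"):
        parts = line.strip().split(None, 2)
        if parts and parts[0] == "end_head":
            break
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        name, ftype, value = parts
        if ftype == "-i":
            fields[name] = int(value)
        elif ftype == "-r":
            fields[name] = float(value)
        else:  # string fields are typed "-sN", N being the string length
            fields[name] = value
    return fields, header_size


def alaw_to_linear(byte: int) -> int:
    """Expand one G.711 A-law byte to a signed 16-bit linear sample."""
    a = byte ^ 0x55                 # undo the even-bit inversion
    t = (a & 0x0F) << 4             # 4-bit mantissa
    seg = (a & 0x70) >> 4           # 3-bit segment (exponent)
    if seg == 0:
        t += 8
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)
    return t if a & 0x80 else -t    # sign bit set means positive


def decode_sphere_alaw(raw: bytes):
    """Return (header_fields, pcm_samples) for an uncompressed A-law file."""
    fields, offset = parse_sphere_header(raw)
    # Assumption: an uncompressed file reports sample_coding as "alaw".
    if fields.get("sample_coding") != "alaw":
        raise ValueError("expected uncompressed A-law audio")
    return fields, [alaw_to_linear(b) for b in raw[offset:]]
```

For a file on disk, the bytes can be obtained with `open(path, "rb").read()` and passed to `decode_sphere_alaw`. At 8 kHz with one byte per sample, each second of audio occupies 8000 bytes after the header.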
Metadata for the development partition is provided as a tab-separated file listing the file name, language code, LDC audio identifier, source time offset, and duration for each audio segment.
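A file of this kind could be read with Python's standard `csv` module, as in the minimal sketch below. The field names, the column order (taken from the order in which the fields are listed above), the presence of a header row, and the assumption that offsets and durations are expressed in seconds are all hypothetical; the documentation included in the release is the authority on the actual layout.

```python
import csv
from collections import defaultdict

# Hypothetical field names, in the order the fields are described above.
FIELDS = ["file_name", "language_code", "audio_id", "source_offset", "duration"]


def read_segments(stream, has_header=True):
    """Read tab-separated segment metadata from an open text stream."""
    reader = csv.reader(stream, delimiter="\t")
    if has_header:
        next(reader, None)  # discard the header row, if the file has one
    rows = []
    for record in reader:
        if len(record) != len(FIELDS):
            continue  # skip blank or malformed lines
        row = dict(zip(FIELDS, record))
        # Assumption: both values are numeric and expressed in seconds.
        row["source_offset"] = float(row["source_offset"])
        row["duration"] = float(row["duration"])
        rows.append(row)
    return rows


def hours_per_language(rows):
    """Sum segment durations by language code and convert to hours."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["language_code"]] += row["duration"]
    return {lang: seconds / 3600.0 for lang, seconds in totals.items()}
```

Passing an open file handle (opened with `newline=""`, as the `csv` documentation recommends) to `read_segments` and the result to `hours_per_language` gives a quick check of how the development audio is distributed across languages.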
Updates
No updates at this time.