2015 NIST Language Recognition Evaluation Test Set

Item Name: 2015 NIST Language Recognition Evaluation Test Set
Author(s): Craig Greenberg, Omid Sadjadi, David Graff, Kevin Walker, Karen Jones, Christopher Caruso, Stephanie Strassel, Jonathan Wright
LDC Catalog No.: LDC2025S02
ISLRN: 411-138-775-382-3
DOI: https://doi.org/4975-nz38
Release Date: March 17, 2025
Member Year(s): 2025
Sample Type: linear pcm
Sample Rate: 8000
Data Source(s): broadcast news, telephone speech
Project(s): NIST LRE
Application(s): language identification
Language(s): Mesopotamian Arabic, North Levantine Arabic, Standard Arabic, Moroccan Arabic, Egyptian Arabic, English, Haitian, French, Portuguese, Spanish, Chinese, Wu Chinese, Yue Chinese, Min Dong Chinese, Polish, Russian
Language ID(s): acm, apc, arb, ary, arz, eng, hat, fra, por, spa, zho, wuu, yue, cdo, pol, rus
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2025S02 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Greenberg, Craig, et al. 2015 NIST Language Recognition Evaluation Test Set LDC2025S02. Web Download. Philadelphia: Linguistic Data Consortium, 2025.

Introduction

2015 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It contains the evaluation test set for the 2015 NIST Language Recognition Evaluation: approximately 867 hours of conversational telephone speech (CTS) and broadcast narrowband speech (BNBS) collected by LDC in 20 languages. The languages are grouped into six clusters of related languages: Arabic (Egyptian, Iraqi, Levantine, Maghrebi, Modern Standard Arabic); Spanish (Caribbean, European, Latin American, Brazilian Portuguese); English (British, Indian, General American English); Chinese (Cantonese, Mandarin, Min Nan, Wu); Slavic (Polish, Russian); and French (West African, Haitian Creole).

The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005, 2007, 2009, 2011, 2015, 2017, and 2022. LRE15 expanded the range of test segment durations and added a test condition that allowed systems to make use of unrestricted training data when developing models. Further information about the 2015 evaluation can be found in the 2015 NIST Language Recognition Evaluation Plan.

Data

The test segments in this release were drawn from the Multi-Language Speech Corpus (MLS14) (CTS and BNBS data) and designated Babel corpora (CTS data).

For the MLS14 CTS collection, a small number of native speakers known as "claques" were recruited for each language to make single calls to multiple individuals in their social network. Calls lasted 8-15 minutes and speakers were free to discuss any topic. The BNBS data was collected by LDC from streaming and satellite radio programming, focusing on programs that included narrowband speech (e.g. call-ins to a talk show). Portions of the CTS callee call sides and portions of each broadcast recording were manually audited by native speakers to verify language and quality.

Additional test segments for two languages, Cantonese and Haitian Creole, were drawn from the IARPA Babel series. These segments are CTS data collected in 2012-2013 from male and female speakers of a variety of ages, using a range of phone types in diverse settings with varying noise conditions.

Test segments were extracted by NIST from MLS14 CTS callee call sides, narrowband portions of the MLS14 BNBS data, and from designated Babel recordings. All test segments are presented in single channel, 16-bit 8 kHz linear PCM format with NIST SPHERE headers.
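
As an illustration of the file format described above, the following is a minimal Python sketch of reading a NIST SPHERE header and checking it against the stated properties (single channel, 16-bit, 8 kHz, linear PCM). It is not part of the release or of any LDC or NIST tool. The function names read_sphere_header and matches_release_format are hypothetical, the demo header is synthetic rather than taken from the corpus, and the sketch assumes the common SPHERE field names (channel_count, sample_rate, sample_n_bytes, sample_coding, sample_count) and that a missing sample_coding field means PCM.

```python
import io


def read_sphere_header(stream):
    """Parse a NIST SPHERE header from a binary stream.

    Returns (header_size_in_bytes, dict_of_fields). Hypothetical helper,
    written for illustration only.
    """
    magic_line = stream.readline()
    if magic_line.strip() != b"NIST_1A":
        raise ValueError("missing NIST_1A magic line")
    size_line = stream.readline()
    header_size = int(size_line.strip())

    # The rest of the header is ASCII text padded out to header_size bytes.
    remaining = header_size - len(magic_line) - len(size_line)
    body = stream.read(remaining).decode("ascii", errors="replace")

    fields = {}
    for line in body.splitlines():
        line = line.strip()
        if line == "end_head":
            break
        if not line or line.startswith(";"):  # skip blanks and comments
            continue
        name, ftype, value = line.split(None, 2)
        if ftype == "-i":        # integer field
            fields[name] = int(value)
        elif ftype == "-r":      # real-valued field
            fields[name] = float(value)
        else:                    # string field, written as -sN (N = length)
            fields[name] = value[: int(ftype[2:])]
    return header_size, fields


def matches_release_format(fields):
    """Check the properties stated for this release's test segments.

    Assumption: an absent sample_coding field is treated as PCM.
    """
    return (
        fields.get("channel_count") == 1
        and fields.get("sample_rate") == 8000
        and fields.get("sample_n_bytes") == 2
        and fields.get("sample_coding", "pcm") == "pcm"
    )


# Synthetic 1024-byte header used only to exercise the parser.
demo_fields = (
    "channel_count -i 1\n"
    "sample_rate -i 8000\n"
    "sample_n_bytes -i 2\n"
    "sample_coding -s3 pcm\n"
    "sample_count -i 240000\n"
    "end_head\n"
)
demo = ("NIST_1A\n   1024\n" + demo_fields).encode("ascii").ljust(1024, b" ")

size, fields = read_sphere_header(io.BytesIO(demo))
duration = fields["sample_count"] / fields["sample_rate"]  # seconds
print(size, fields["sample_rate"], duration, matches_release_format(fields))
```

For a real segment, the same function could be called on a file opened in binary mode. The audio samples begin at the byte offset given by the returned header size.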

Samples

SPHERE audio file

Updates

None at this time.
