
2017 NIST Language Recognition Evaluation Training and Development Sets

Item Name:	2017 NIST Language Recognition Evaluation Training and Development Sets
Author(s):	Craig Greenberg, Omid Sadjadi, Douglas Reynolds, Elliot Singer, David Graff
LDC Catalog No.:	LDC2022S10
ISBN:	1-58563-999-0
ISLRN:	854-427-979-036-7
DOI:	https://doi.org/10.35111/awny-7397
Release Date:	October 17, 2022
Member Year(s):	2022
DCMI Type(s):	Sound, Text
Sample Type:	PCM, u-law, a-law
Sample Rate:	8000, 44100
Data Source(s):	broadcast conversation, telephone speech, video
Project(s):	NIST LRE
Application(s):	language identification
Language(s):	Arabic, English, Polish, Russian, Portuguese, Spanish, Mandarin Chinese, Min Nan Chinese
Language ID(s):	ara, eng, pol, rus, por, spa, cmn, nan
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2022S10 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Greenberg, Craig, et al. 2017 NIST Language Recognition Evaluation Training and Development Sets LDC2022S10. Web Download. Philadelphia: Linguistic Data Consortium, 2022.
Related Works:
isOutcomeOf:	LDC2026S07 Multi-Language Conversational Telephone Speech 2014 - Spanish & Portuguese
isSimilarWith:	LDC2006S31 2003 NIST Language Recognition Evaluation
isSimilarWith:	LDC2008S05 2005 NIST Language Recognition Evaluation
isSimilarWith:	LDC2009S04 2007 NIST Language Recognition Evaluation Test Set
isSimilarWith:	LDC2009S05 2007 NIST Language Recognition Evaluation Supplemental Training Set
isSimilarWith:	LDC2014S06 2009 NIST Language Recognition Evaluation Test Set
isSimilarWith:	LDC2018S06 2011 NIST Language Recognition Evaluation Test Set
isSimilarWith:	LDC2025S02 2015 NIST Language Recognition Evaluation Test Set
isSimilarWith:	LDC2026S03 2022 NIST Language Recognition Evaluation Test and Development Sets
relatesTo:	LDC2023S07 LDC Spoken Language Sampler - Sixth Release

Introduction

2017 NIST Language Recognition Evaluation Training and Development Sets contains training and development material for the 2017 NIST Language Recognition Evaluation. It consists of approximately 2,100 hours of conversational telephone speech, broadcast conversation, broadcast narrow band speech, and speech from video in the following 14 languages, dialects, and varieties: Arabic (Iraqi, Levantine, Maghrebi, Egyptian), English (British, American), Polish, Russian, Portuguese (Brazilian), Spanish (Caribbean, European, Latin American Continental), and Chinese (Mandarin, Min Nan).

The goal of the NIST (National Institute of Standards and Technology) Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005, 2007, 2009, 2011, and 2015. The 2017 evaluation focused on differentiating closely related language pairs. In addition to conversational telephone speech, broadcast conversation, and broadcast narrow band speech, it used speech excerpts extracted from video data. Further information about this evaluation can be found in the evaluation plan, which is included in the documentation for this release.

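LDC released the prior LREs as:

2003 NIST Language Recognition Evaluation (LDC2006S31)
2005 NIST Language Recognition Evaluation (LDC2008S05)
2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
2009 NIST Language Recognition Evaluation Test Set (LDC2014S06)
2011 NIST Language Recognition Evaluation Test Set (LDC2018S06)
2015 NIST Language Recognition Evaluation Test Set (LDC2025S02)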

Data

This release includes data from LDC's CALLFRIEND and Fisher telephone collections, the VAST video collection, various broadcast sources and earlier NIST LRE test sets.

The training audio files are single-channel, 8 kHz sample rate, in NIST SPHERE format, encoded as mu-law, A-law, or 16-bit PCM. The development audio files are also single-channel but vary in format: either SPHERE or FLAC-compressed MS-WAV (RIFF). All "*.flac" files are 16-bit PCM at a 44.1 kHz sample rate; all "*.sph" files are 8 kHz, with either mu-law or 16-bit PCM samples.
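The format description above can be checked against a file's own header. The following is a minimal Python sketch, not part of the release, that parses the ASCII header of a NIST SPHERE file and tests whether it describes single-channel 8 kHz audio in one of the listed encodings. The function names are hypothetical, and the sketch assumes the headers use the common SPHERE field names `channel_count`, `sample_rate`, and `sample_coding`, with coding values beginning with `pcm`, `ulaw`, or `alaw`; verify these against the actual files and the release documentation.

```python
def parse_sphere_header(data: bytes) -> dict:
    """Parse the ASCII header of a NIST SPHERE file into a dict of fields."""
    lines = data.decode("ascii", errors="replace").split("\n")
    if lines[0].strip() != "NIST_1A":
        raise ValueError("not a NIST SPHERE header")
    fields = {}
    # Line 0 is the magic string, line 1 is the header size in bytes.
    for line in lines[2:]:
        line = line.strip()
        if line == "end_head":
            break
        parts = line.split(None, 2)  # field name, type tag, value
        if len(parts) != 3:
            continue
        name, type_tag, value = parts
        if type_tag == "-i":
            fields[name] = int(value)
        elif type_tag == "-r":
            fields[name] = float(value)
        else:  # "-sN": string value of length N
            fields[name] = value
    return fields


def read_sphere_header(path: str) -> dict:
    """Read only the header portion of a SPHERE file from disk."""
    with open(path, "rb") as f:
        preamble = f.read(16)  # "NIST_1A\n" followed by the 8-byte size line
        size = int(preamble[8:16].decode("ascii").strip())
        f.seek(0)
        return parse_sphere_header(f.read(size))


def matches_sph_description(fields: dict) -> bool:
    """Check for single-channel, 8 kHz audio in mu-law, A-law, or PCM.

    Assumption: a missing sample_coding field is treated as PCM.
    """
    coding = fields.get("sample_coding", "pcm")
    return (
        fields.get("channel_count") == 1
        and fields.get("sample_rate") == 8000
        and coding.startswith(("pcm", "ulaw", "alaw"))
    )
```

Calling `matches_sph_description(read_sphere_header(path))` on each "*.sph" file would flag any file whose header disagrees with the description. The "*.flac" development files need a FLAC-aware reader and are not covered here.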

Samples

Please view the following audio sample.

Updates

None at this time.

Copyright

Portions © 2013-2014 Agora Radio Group, © 2013 BBC, © 2013 Bethel Church of Redding, © 2013 BFBS, © 2013 Blago Foundation, © 2013 Brazil Communication Company, © 2010-2011 Cable News Network, LP, LLLP, © 2013 El Pando Zambrano.com, © 2013-2014 Global, © 2010-2011 New Tang Dynasty TV, © 2010-2011 Phoenix New Media Limited, © 2013 Radio Amistad, C.por A., © 2013 Radio UNAL, © 2013 Spanish Radio and Television Corporation, © 2013 The New Television of the South CA (TVSUR), © 2013 University of Puerto Rico Radio Network, © 2010 WorldNetCast/TVNET, © 2011-2018 You Tube, LLC, © 1996-1999, 2001-2011, 2013-2014, 2018, 2022 Trustees of the University of Pennsylvania