
2007 NIST Language Recognition Evaluation Supplemental Training Set

Item Name:	2007 NIST Language Recognition Evaluation Supplemental Training Set
Author(s):	Alvin Martin, Audrey Le, David Graff, Jan van Santen
LDC Catalog No.:	LDC2009S05
ISBN:	1-58563-530-8
ISLRN:	498-359-265-464-3
DOI:	https://doi.org/10.35111/gqmf-6p19
Release Date:	November 20, 2009
Member Year(s):	2009
DCMI Type(s):	Sound
Sample Type:	8 bit u-law
Sample Rate:	8000 Hz
Data Source(s):	telephone conversations
Project(s):	NIST LRE
Application(s):	language identification
Language(s):	Yue Chinese, Wu Chinese, Urdu, Thai, Tamil, Spanish, Russian, Min Nan Chinese, Mandarin Chinese, Bengali, Egyptian Arabic
Language ID(s):	yue, wuu, urd, tha, tam, spa, rus, nan, cmn, ben, arz
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2009S05 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Martin, Alvin, et al. 2007 NIST Language Recognition Evaluation Supplemental Training Set LDC2009S05. Web Download. Philadelphia: Linguistic Data Consortium, 2009.
Related Works:
isOutcomeOf:	LDC2023S02 Mixer 3 Speech
hasContinuation:	LDC2009S04 2007 NIST Language Recognition Evaluation Test Set
isSimilarWith:	LDC2006S31 2003 NIST Language Recognition Evaluation; LDC2008S05 2005 NIST Language Recognition Evaluation; LDC2014S06 2009 NIST Language Recognition Evaluation Test Set; LDC2018S06 2011 NIST Language Recognition Evaluation Test Set; LDC2022S10 2017 NIST Language Recognition Evaluation Training and Development Sets
relatesTo:	LDC2025S02 2015 NIST Language Recognition Evaluation Test Set

Introduction

2007 NIST Language Recognition Evaluation Supplemental Training Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It consists of 118 hours of conversational telephone speech segments in the following languages and dialects: Arabic (Egyptian colloquial), Bengali, Min Nan Chinese, Wu Chinese, Taiwan Mandarin, Cantonese, Russian, Mexican Spanish, Thai, Urdu, and Tamil.

The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted three previous language recognition evaluations, in 1996, 2003 and 2005. The most significant differences between those evaluations and the 2007 task were the increased number of languages and dialects, the greater emphasis on a basic detection task for evaluation and the variety of evaluation conditions. Thus, in 2007, given a segment of speech and a language of interest to be detected (i.e., a target language), the task was to decide whether that target language was in fact spoken in the given telephone speech segment (yes or no), based on an automated analysis of the data contained in the segment.
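The detection task described above can be sketched as a simple yes/no decision per target language. This is only an illustration: the function name `detect_language`, the score dictionary, and the threshold of 0.0 are assumptions made for the example and are not part of the NIST evaluation plan or of this corpus.

```python
def detect_language(scores, target, threshold=0.0):
    """Return True ("yes") if the target language is judged to be spoken.

    scores: hypothetical mapping from language ID to a detection score
            produced by some recognizer for one speech segment.
    threshold: hypothetical decision threshold; real systems tune this.
    """
    return scores[target] >= threshold


# Hypothetical scores for a single telephone speech segment.
segment_scores = {"arz": 1.7, "ben": -0.4, "tha": -2.1}

# One independent yes/no decision per target language.
decisions = {lang: detect_language(segment_scores, lang) for lang in segment_scores}
```

Each target language is tested on its own, so a segment can in principle receive "yes" for several languages or for none.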

Data

The supplemental training material in this release consists of the following:

Approximately 53 hours of conversational telephone speech segments in Arabic (Egyptian colloquial), Bengali, Cantonese, Min Nan Chinese, Wu Chinese, Russian, Thai, and Urdu. This material is taken from LDC's CALLHOME, CALLFRIEND, and Mixer collections.
Approximately 65 hours of full telephone conversations in Mandarin Chinese (Taiwan), Spanish (Mexican), and Tamil. This material was collected by Oregon Health and Science University (OHSU), Beaverton, Oregon. The test segments used in the 2005 NIST Language Recognition Evaluation (LDC2008S05) were derived from these full conversations.

In addition to the supplemental material contained in this release, the training data for the 2007 NIST Language Recognition Evaluation (LDC2009S04) included data from previous LRE test sets, namely 2003 NIST Language Recognition Evaluation (LDC2006S31) and 2005 NIST Language Recognition Evaluation (LDC2008S05).
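The catalog entry lists the audio as 8 bit u-law sampled at 8000 Hz. As a rough sketch of what that encoding means, the following expands u-law bytes to 16-bit linear samples using the standard G.711 expansion. It assumes the input is a raw, single-channel sample payload with any file header already removed; the container format of the corpus files is not covered here, and the function names are made up for the example.

```python
SAMPLE_RATE_HZ = 8000  # sample rate listed in the catalog entry


def ulaw_to_pcm16(code: int) -> int:
    """Expand one 8-bit u-law code to a signed 16-bit linear sample (G.711)."""
    code = ~code & 0xFF              # u-law bytes are stored bit-inverted
    sign = code & 0x80               # top bit carries the sign
    exponent = (code >> 4) & 0x07    # 3-bit segment number
    mantissa = code & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(payload: bytes) -> list[int]:
    """Decode a raw u-law payload (assumed headerless, single channel)."""
    return [ulaw_to_pcm16(b) for b in payload]


def duration_seconds(payload: bytes) -> float:
    """Duration of a single-channel payload: one byte per sample."""
    return len(payload) / SAMPLE_RATE_HZ
```

At one byte per sample and 8000 samples per second, one minute of single-channel audio occupies 480,000 bytes.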

LDC has also released data from other LREs; these are listed under Related Works above.

Samples

For an example of the data in this corpus, please listen to this Egyptian Arabic sample (WAV) from the data set.

Updates

None at this time.
