2009 NIST Language Recognition Evaluation Test Set
Item Name: | 2009 NIST Language Recognition Evaluation Test Set |
Author(s): | Alvin Martin, Craig Greenberg, David Graff, Kevin Walker, Linda Brandschain |
LDC Catalog No.: | LDC2014S06 |
ISBN: | 1-58563-682-7 |
ISLRN: | 180-783-854-340-4 |
DOI: | https://doi.org/10.35111/qv7y-5026 |
Release Date: | July 15, 2014 |
Member Year(s): | 2014 |
DCMI Type(s): | Sound |
Sample Type: | ulaw |
Sample Rate: | 8000 |
Data Source(s): | broadcast news, telephone conversations |
Project(s): | NIST LRE |
Application(s): | language identification |
Language(s): | Amharic, Haitian, English, French, Hindi, Spanish, Urdu, Bosnian, Croatian, Georgian, Korean, Portuguese, Turkish, Vietnamese, Yue Chinese, Dari, Persian, Hausa, Mandarin Chinese, Russian, Ukrainian, Pushto |
Language ID(s): | amh, hat, eng, fra, hin, spa, urd, bos, hrv, kat, kor, por, tur, vie, yue, prs, fas, hau, cmn, rus, ukr, pus |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2014S06 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Martin, Alvin, et al. 2009 NIST Language Recognition Evaluation Test Set LDC2014S06. Web Download. Philadelphia: Linguistic Data Consortium, 2014. |
Introduction
2009 NIST Language Recognition Evaluation Test Set was developed by the Linguistic Data Consortium (LDC) and the National Institute of Standards and Technology (NIST). It contains approximately 215 hours of conversational telephone speech and radio broadcast conversation collected by LDC in the following 23 languages and dialects: Amharic, Bosnian, Cantonese, Creole (Haitian), Croatian, Dari, English (American), English (Indian), Farsi, French, Georgian, Hausa, Hindi, Korean, Mandarin, Pashto, Portuguese, Russian, Spanish, Turkish, Ukrainian, Urdu, and Vietnamese.
The goal of NIST's Language Recognition Evaluation (LRE) is to establish the baseline of current performance capability for language recognition of conversational telephone speech and to lay the groundwork for further research efforts in the field. NIST conducted language recognition evaluations in 1996, 2003, 2005, and 2007. The 2009 evaluation increased the number of target languages. Most of the test data came from multilingual Voice of America (VOA) radio broadcasts, using portions assessed as being of telephone bandwidth; the rest was conversational telephone speech. Further information regarding this evaluation can be found in the evaluation plan, which is included in the documentation for this release.
LDC has also released the following LRE corpora:
- 2003 NIST Language Recognition Evaluation (LDC2006S31)
- 2005 NIST Language Recognition Evaluation (LDC2008S05)
- 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04)
- 2007 NIST Language Recognition Evaluation Supplemental Training Set (LDC2009S05)
- 2011 NIST Language Recognition Evaluation Test Set (LDC2018S06)
Data
The VOA speech data was collected by LDC in 2000 and 2001 and constitutes approximately 75% of the test set. The telephone speech was taken from LDC's Mixer 3 collection recorded between 2005 and 2007.
All test speech segments are presented as a sampled data stream in standard 8-bit 8-kHz μ-law format. Each segment is stored separately in a single-channel SPHERE format file.
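As a rough illustration of what the 8-bit 8-kHz μ-law format implies, the sketch below expands μ-law bytes to linear PCM using the standard G.711 expansion and derives a segment's length from its sample count. It assumes the audio payload has already been read from the SPHERE file as raw bytes, with the header removed; the function names `ulaw_to_linear`, `decode_ulaw`, and `duration_seconds` are hypothetical and are not part of this release.

```python
SAMPLE_RATE_HZ = 8000  # sample rate stated for this corpus


def ulaw_to_linear(byte):
    """Expand one 8-bit mu-law code to a signed linear PCM value (G.711)."""
    u = ~byte & 0xFF              # mu-law bytes are stored bit-inverted
    sign = u & 0x80               # top bit carries the sign
    exponent = (u >> 4) & 0x07    # 3-bit segment number
    mantissa = u & 0x0F           # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


def decode_ulaw(payload):
    """Decode raw mu-law bytes (header already stripped) to a list of ints."""
    return [ulaw_to_linear(b) for b in payload]


def duration_seconds(num_samples, sample_rate=SAMPLE_RATE_HZ):
    """Length of a single-channel segment, given its number of samples."""
    return num_samples / sample_rate
```

Because each sample occupies one byte and each file holds a single channel, the payload size in bytes equals the sample count, so an 80,000-byte payload corresponds to 10 seconds of audio.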
The test segments contain three nominal durations of speech: 3 seconds, 10 seconds, and 30 seconds. Actual speech durations vary but were constrained to the ranges of 2-4 seconds, 7-13 seconds, and 23-35 seconds, respectively. Non-speech portions were retained so that each segment is a continuous sample of the source recording. As a result, a test segment may be significantly longer than its speech duration, depending on how much non-speech it includes.
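The mapping from a measured speech duration to its nominal class can be sketched as a simple range lookup. This is an illustrative helper only: the name `nominal_duration` is hypothetical, the range endpoints are assumed to be inclusive (the text gives only the ranges), and the input is the amount of speech in a segment, not the total file length, which may be longer.

```python
# Nominal duration (seconds) -> allowed range of actual speech (seconds),
# taken from the ranges stated above. Inclusive bounds are an assumption.
NOMINAL_RANGES = {
    3: (2.0, 4.0),
    10: (7.0, 13.0),
    30: (23.0, 35.0),
}


def nominal_duration(speech_seconds):
    """Return the nominal class (3, 10, or 30) for a speech duration,
    or None if the duration falls outside every stated range."""
    for nominal, (low, high) in NOMINAL_RANGES.items():
        if low <= speech_seconds <= high:
            return nominal
    return None
```

For example, a segment containing 12 seconds of speech would fall in the 10-second class, while 5 seconds of speech matches none of the three ranges.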
Samples
For an example of the data in this corpus, please listen to this sample (WAV).
Updates
None at this time.