2015 NIST Language Recognition Evaluation Test Set

Authors:
  NIST: Craig Greenberg, Omid Sadjadi
  LDC:  David Graff, Kevin Walker, Karen Jones, Christopher Caruso,
        Jonathan Wright, Stephanie Strassel

1.0 Introduction

This release comprises the evaluation test set for the 2015 NIST Language
Recognition Evaluation (LRE15). LRE is an ongoing evaluation series designed
to measure how well systems can automatically detect a target language in a
given test segment. The LRE15 evaluation involved both conversational
telephone speech (CTS) data and broadcast narrowband speech (BNBS) data,
with an emphasis on distinguishing closely related languages.

This package contains a total of 164,334 LRE15 test segments, covering 20
evaluation languages. Test data was drawn primarily from the 2014
Multi-Language Speech Collection (MLS14), collected by LDC to support LRE.
Additional test segments for two languages were drawn from the IARPA Babel
collection.

2.0 Directory Structure

The contents of this package are organized as follows:

  /data - contains the test segments in .sph format
  /docs - contains metadata and documentation:
    languages.tab - list of evaluation languages
    lre15_md5s.tsv - md5 checksums for each test segment
    lre15_segment_key.tsv - data source, duration and session id for each
      test segment
    lre15_trial_key.tsv - answer key for test segments
    lre15_evalplan_v23.pdf - NIST LRE15 evaluation plan
    lrec2016-multi-language-speech-collection-nist-lre.pdf - details of the
      MLS14 corpus
  /README.txt - this document

3.0 Data

LRE15 test data includes segments in 20 languages, covering 6 clusters of
related linguistic varieties:

  1. Arabic language cluster, including Egyptian Arabic, Iraqi Arabic,
     Levantine Arabic, Maghrebi Arabic and Modern Standard Arabic
  2. Spanish language cluster, including Caribbean Spanish, European
     Spanish, Latin American Spanish and Brazilian Portuguese
  3.
     English language cluster, including British English, Indian English
     and General American English
  4. Chinese language cluster, including Cantonese, Mandarin, Min Nan and Wu
  5. Slavic language cluster, including Polish and Russian
  6. French language cluster, including West African French and Haitian
     Creole

Test segments were drawn from the MLS14 and Babel corpora, described below.

3.1 MLS14

The MLS14 Corpus was collected by LDC to support development and testing of
language recognition and related technologies. MLS14 consists of both CTS
and BNBS data.

For the CTS collection, a small number of native speakers known as "claques"
were recruited for each language to make single calls to multiple
individuals in their social network. Both claques and callees provided
consent to be recorded under a protocol approved by the University of
Pennsylvania's IRB. Calls lasted 8-15 minutes, and speakers were free to
discuss any topic.

The BNBS data was collected by LDC from streaming and satellite radio
programming, focusing on programs that included narrowband speech (e.g.,
call-ins to a talk show). Portions of the CTS callee call sides and portions
of each broadcast recording were manually audited by native speakers to
verify language and quality.

Additional information about the MLS14 Corpus can be found in the following
paper, included in the /docs directory:

  Jones, Karen, et al. "Multi-language speech collection for NIST LRE."
  Proceedings of the Tenth International Conference on Language Resources
  and Evaluation (LREC'16), 2016.

3.2 Babel

Additional test segments for two languages, Cantonese and Haitian Creole,
were drawn from the IARPA Babel collection. This data was collected under
the IARPA Babel program, which sought to build speech resources for
under-represented languages. LRE15 test data from Babel includes calls
collected in 2012-2013 from male and female speakers of a variety of ages.
A range of phone types was used, and calls were made in diverse settings
with varying noise conditions.

Full recordings for the Babel data can be found in these LDC publications:

  LDC2016S02 IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
  LDC2017S03 IARPA Babel Haitian Creole Language Pack IARPA-babel201b-v0.2b

3.3 Test Segments

Test segments were extracted by NIST from MLS14 CTS callee call sides, from
narrowband portions of the MLS14 BNBS data, and from designated Babel
recordings. All test segments are presented in single-channel, 16-bit,
8 kHz linear PCM format with NIST SPHERE headers.

The total number of segments per corpus, and the number of full recordings
represented, are summarized in the table below:

  Collection   Test Segment Count   Full Recording Count
  MLS14                    124408                   6600
  Babel                     39926                   2127

Test segment duration ranges from 3 to 30 seconds, with multiple segments
of differing duration derived from the same full recordings. The table
below summarizes the number of segments for each duration:

  Seconds   Segment Count
  3                 33784
  5                 34645
  10                34842
  15                26162
  20                17449
  25                 8725
  30                 8727

The genre breakdown of test segments by language is as follows:

  Lang        BNBS     CTS
  ara-acm        0    8994
  ara-apc        0    6802
  ara-arb     2447       0
  ara-ary        0    8264
  ara-arz        0    8023
  eng-gbr     7998       0
  eng-sas      133    6799
  eng-usg     1045    5935
  fre-hat        0   28741
  fre-waf      722    6213
  por-brz     4400     245
  qsl-pol     3452    1366
  qsl-rus     2241     810
  spa-car      471    1861
  spa-eur     1957    3846
  spa-lac     1801    5172
  zho-cdo        0    8542
  zho-cmn     1482    4544
  zho-wuu        0    7496
  zho-yue      722   21810
  TOTALS     28871  135463

4.0 Test Segment Metadata and Answer Keys

The /docs directory includes four tab-separated files presenting metadata
about the test segments.

4.1 languages.tab

This table lists the language varieties included in the LRE15 evaluation,
along with their six-letter LDC language codes.
Fields are:
  language_code - six-letter LDC code
  language_name - name of language variety

Example:
  ara-acm   Arabic - Iraqi

4.2 lre15_md5s.tsv

This table lists all of the audio files with their md5 checksums. Fields
are:
  segment_ID - NIST segment ID
  md5 - md5 checksum

Example:
  lre15_suykam.sph   1ad40b0edb926cabb1e355d629189b49

4.3 lre15_segment_key.tsv

This table provides information about each test segment. Fields are:
  segmentid - NIST segment ID
  data_source - corpus the segment came from
  speech_duration - duration of the segment in seconds
  sessionid - indicates the source recording

Example:
  lre15_aaakle.sph   MLS14   5.0   1058582

4.4 lre15_trial_key.tsv

This table gives the language code for each segment and indicates whether
the segment was used for scoring. Fields are:
  test_segment_ID - NIST segment ID
  language_code - LDC six-letter language code
  is_scored - T/F; indicates whether the segment was used for scoring

Example:
  lre15_hnhjxu   ara-apc   T

-------------
README file created by Karen Jones, February 13, 2024
Updated by Stephanie Strassel, May 14, 2024
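The key files described in sections 4.2-4.4 are plain tab-separated text, so
they can be consumed with standard tooling. The sketch below is not part of
the release; it shows one way, in Python, to compute a file's md5 checksum
for comparison against lre15_md5s.tsv and to pull out the scored trials from
lre15_trial_key.tsv. The field layout follows this README; the assumption
that the key files carry no header row is ours, and the second sample row
(lre15_abcdef) is invented for illustration.

```python
import csv
import hashlib
import io
import os
import tempfile


def file_md5(path):
    """Compute the MD5 checksum of a file, reading in 64 KiB chunks."""
    h = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 16), b""):
            h.update(chunk)
    return h.hexdigest()


def load_key(stream, fieldnames):
    """Parse a tab-separated key file (assumed headerless) into dicts."""
    return list(csv.DictReader(stream, fieldnames=fieldnames, delimiter="\t"))


def scored_languages(trial_rows):
    """Map segment ID -> language code for scored trials (is_scored == 'T')."""
    return {r["test_segment_ID"]: r["language_code"]
            for r in trial_rows if r["is_scored"] == "T"}


# Checksum demo on a throwaway file (the MD5 of b"abc" is well known).
with tempfile.NamedTemporaryFile(delete=False) as tf:
    tf.write(b"abc")
    tmp_path = tf.name
checksum = file_md5(tmp_path)        # '900150983cd24fb0d6963f7d28e17f72'
os.remove(tmp_path)

# Trial-key demo on an in-memory sample; the first row mirrors the
# section 4.4 example, the second row is a hypothetical unscored trial.
sample = io.StringIO("lre15_hnhjxu\tara-apc\tT\n"
                     "lre15_abcdef\teng-gbr\tF\n")
trials = load_key(sample, ["test_segment_ID", "language_code", "is_scored"])
print(scored_languages(trials))      # {'lre15_hnhjxu': 'ara-apc'}
```

For the real release, load_key would be pointed at open("docs/lre15_trial_key.tsv")
and file_md5 at each segment under /data, comparing the result against the
checksum column of lre15_md5s.tsv.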