2011 NIST Language Recognition Evaluation Test Set

Authors: Craig Greenberg, Alvin Martin, David Graff, Kevin Walker, Karen Jones,
         Stephanie Strassel

1.0 Introduction

The goal of the NIST (National Institute of Standards and Technology) Language
Recognition Evaluation (LRE) is to establish the baseline of current performance
capability for language recognition in conversational telephone speech, and to
lay the groundwork for further research efforts in the field. NIST conducted
prior language recognition evaluations in 1996, 2003, 2005, 2007 and 2009. The
2011 NIST Language Recognition Evaluation Plan (LRE11) can be found at:

  https://www.nist.gov/sites/default/files/documents/itl/iad/mig/LRE11_EvalPlan_releasev1.pdf

The NIST Language Recognition Evaluation for 2011 (LRE11) makes use of 10,730
manually-audited speech segments in 24 distinct language varieties, and presumes
that relevant audio segments from prior LRE cycles are available for use as
training data. The audio presented in this release was selected from recordings
made by the Linguistic Data Consortium (LDC), including both Conversational
Telephone Speech (CTS) and narrow-band speech in broadcast audio (BNBS).

The data distributed by NIST to participants in LRE11 includes training data for
nine language varieties that had not been represented in prior LRE cycles. The
training data is structured as follows (original NIST distribution labels are
shown in parentheses):

  - 893 audited segments of roughly 30 seconds duration each (r136_1_1), 8-kHz
  - 400 full-length CTS recordings (r137_1_1), 8-kHz

(Note: the original NIST distribution to LRE11 participants also included a set
of 59 full-length program recordings from a handful of Arabic broadcast sources.
Due to constraints on intellectual property rights, the LDC is unable to include
these full-length broadcasts in the present release; only the manually-audited
30-sec BNBS segments from these programs are provided here.)

The evaluation test set (NIST distribution label r139_1_1) comprises a total of
29,511 audio files, all manually audited at the LDC for language, and divided
equally into three test conditions according to the nominal amount of speech
content per segment:

  - 9837 segments of 2 to 4 seconds
  - 9837 segments of 7 to 13 seconds
  - 9837 segments of over 13 seconds (subgrouped: 1311 of 13-25 sec; 8526 of
    25-35 sec)

Note that all the segments in the two shorter categories were extracted from
segments in the longest category. (LDC auditing was performed only on the long
segments.) The answer-key table (see section 4.1) identifies the relationships
between the shorter and longer files. The actual durations of the data files
vary with the amount of non-speech content per segment.

Also note that the 893 training segments described above were included among the
test set inventory, and received the same treatment of extracting two
shorter-duration segments from each original audited segment. The answer-key
table explicitly marks all of these segments (893 * 3 = 2679 answer-key entries)
to indicate that they are not to be used in scoring when system performance is
reported. (An additional 38 audited segments -- 114 test files across the three
duration conditions -- have also been marked as not scorable, for other reasons.)
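The sketch below illustrates, in Python, how the evaluation answer key
(docs/NIST_LRE11_EVAL_DATA_KEY.v0.tab, section 4.1) can be used to recover the
grouping of the shorter test files with the long audited segments they were cut
from, while skipping the non-scorable entries. It assumes the key's header line
uses the column names listed in section 4.1, and that the shorter extracts carry
the same "ldcid" value as their long source segment; the path is relative to the
root of this release.

    import csv
    from collections import defaultdict

    key_path = "docs/NIST_LRE11_EVAL_DATA_KEY.v0.tab"    # relative to the release root

    by_source = defaultdict(list)                        # ldcid -> [(condition, segmentid)]
    with open(key_path, newline="") as f:
        for row in csv.DictReader(f, delimiter="\t"):
            if row["is_scored"] != "Y":                  # drop training-derived/unsuitable entries
                continue
            by_source[row["ldcid"]].append((row["duration_category"], row["segmentid"]))

    # Each scorable audited segment should yield three test files, one per duration condition.
    complete = sum(1 for segs in by_source.values() if len(segs) == 3)
    print(len(by_source), "audited sources;", complete, "with all three duration conditions")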
2.0 Content Summary by Language

The tables below give a breakdown of data quantity by language and genre (CTS,
BNBS) for each partition (evaluation, training); note that the minutes columns
are based on summing the audio durations of the files, and the actual amount of
speech data may be slightly less.

Table 1: Evaluation Test Set

  Language          BNBS segs  BNBS mins  CTS segs  CTS mins  Total segs  Total mins
  arabic_iraqi              0        0.0       408     224.3         408       224.3
  arabic_levantine          0        0.0       408     224.0         408       224.0
  arabic_maghrebi           0        0.0       405     220.8         405       220.8
  arabic_msa              406      222.6         0       0.0         406       222.6
  bengali                 227      115.4       220     120.4         447       235.8
  czech                   279      127.4       179     100.0         458       227.4
  dari                    376      168.2        27      15.0         403       183.2
  english_american        331      107.7       121      66.8         452       174.5
  english_indian          366      191.4        50      27.6         416       219.0
  farsi                   208      113.8       197     111.1         405       224.9
  hindi                   348      131.5        70      38.7         418       170.2
  lao                     125       40.1       126      67.9         251       108.0
  mandarin                173       77.0       259     141.3         432       218.3
  panjabi                  11        4.7       397     218.3         408       223.0
  pashto                  257      135.1       155      85.9         412       221.0
  polish                  239      100.5       242     136.4         481       236.9
  russian                 302      165.6       139      78.0         441       243.6
  slovak                  242      123.4       172      95.7         414       219.1
  spanish                 188      103.1       231     126.9         419       230.0
  tamil                   214      117.1       200     111.0         414       228.1
  thai                    338      176.7        65      36.2         403       212.9
  turkish                 305      105.5       167      93.5         472       199.0
  ukrainian                67       32.5       119      66.9         186        99.4
  urdu                    256      140.3       222     121.0         478       261.3

Table 2: Training Data

  Language          BNBS segs  BNBS mins  CTS segs  CTS mins  Total segs  Total mins
  arabic_iraqi              0        0.0       100      54.9         100        54.9
  arabic_levantine          0        0.0       100      54.8         100        54.8
  arabic_maghrebi           0        0.0       100      54.2         100        54.2
  arabic_msa              100       54.8         0       0.0         100        54.8
  czech                     0        0.0       100      55.5         100        55.5
  lao                       0        0.0        93      49.8          93        49.8
  panjabi                   0        0.0       100      54.8         100        54.8
  polish                    0        0.0       100      56.2         100        56.2
  slovak                    0        0.0       100      56.1         100        56.1

3.0 Data Collection Methods

The LRE11 data was collected by the LDC in 2010 and 2011.

The CTS data was obtained using a "claque" collection model, in which recruited
speakers (claques) call friends or relatives in their own social network for a
10-minute conversation in the claque's native language, such that each call
involved a unique callee. Participants were free to speak on topics of their own
choosing. All calls were routed through a telephone collection system at the LDC,
which stored the raw mu-law sample stream into a separate audio file for each
call side. Auditing and selection were applied to the callee side of every call,
and to the caller (claque) side of at most one call made by each claque.
Contiguous regions containing between 25 and 35 seconds of speech were identified
by signal analysis and extracted for manual audit. In some cases, shorter
segments (down to a minimum of 13 seconds) were also selected for audit.

Broadcast audio was recorded either by capturing satellite-receiver MPEG streams
or by digitizing the output of analog audio receivers at 16 kHz. Platforms for
data capture were located at the LDC and also in Tunisia and India. Recordings
were analyzed to extract contiguous segments of narrow-band speech of at least
33 seconds duration; longer segments were trimmed to a maximum length of 35
seconds for audit.

All audited segments for training and test are presented as 8-kHz, 16-bit PCM,
single-channel audio files with NIST SPHERE headers.
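Since all audited segments share this one audio format, the following minimal
Python sketch shows one way to read such a file using only the standard library.
It assumes an uncompressed 16-bit PCM payload (as stated above) and a plain-text
SPHERE header whose total size appears on its second line; the file name in the
usage line is hypothetical.

    import struct

    def read_sphere_pcm(path):
        """Return (sample_rate, samples) for a single-channel 16-bit PCM SPHERE file."""
        with open(path, "rb") as f:
            f.readline()                                  # magic line, b"NIST_1A\n"
            header_size = int(f.readline().strip())       # total header size in bytes, e.g. 1024
            fields = {}
            header_text = f.read(header_size - f.tell()).decode("ascii", "replace")
            for line in header_text.splitlines():
                parts = line.split(None, 2)               # e.g. "sample_rate -i 8000"
                if parts and parts[0] == "end_head":
                    break
                if len(parts) == 3:
                    fields[parts[0]] = parts[2]
            f.seek(header_size)                           # sample data begins right after the header
            n = int(fields["sample_count"])
            endian = "<" if fields.get("sample_byte_format", "01") == "01" else ">"
            samples = struct.unpack(f"{endian}{n}h", f.read(2 * n))
        return int(fields["sample_rate"]), samples

    # Hypothetical file name; actual paths are listed in docs/audio_info.tab.
    rate, samples = read_sphere_pcm("data/lre11_example.sph")
    print(rate, "Hz,", len(samples) / rate, "seconds")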
4.0 Documentation and Answer-Key Data

In addition to this README.txt file, the "docs/" directory of this release
contains the following:

4.1 NIST_LRE11_EVAL_DATA_KEY.v0.tab (header line + 29,511 data rows)

The columns of this tab-delimited file are:

   1  segmentid                -- file-ID ("lre11....") of test-segment audio
   2  ldcid                    -- original segment-ID used in selection and auditing
   3  language                 -- labels as tabulated in section 2
   4  duration_category        -- 02sec-04sec, 07sec-13sec, 13sec-25sec, 25sec-35sec
   5  source                   -- telephone or broadcast_...(channel_label)...
   6  is_only_speech           -- Y/N
   7  is_single_speaker        -- Y/N
   8  sounds_like_narrow_band  -- Y/N
   9  speaker_sex              -- M/F
  10  speaker_nativeness       -- normal, marked, non-native
  11  noise_amount             -- easy, middling, difficult
  12  noise_type               -- 0 or more of: bgnoise distortion dropouts interference other
  13  noise_comment            -- empty or free text
  14  speaker_comment          -- empty or free text
  15  language_comment         -- empty or free text
  16  is_scored                -- Y/N

Columns 12-15 may be empty, or may contain one or more space-separated strings.
Segments that have "N" in column 16 ("is_scored") must not be used when scoring
system performance; these segments contain audio that was also used in the
training partition, or was deemed unsuitable for test purposes.

4.2 NIST_LRE11_DEV_DATA_KEY.v2.tab (header line + 893 data rows)

The columns of this tab-delimited file are:

   1  segmentid                -- segment file-ID (6 digits . 6 digits . 4 digits)
   2  sourceid                 -- file-ID of original full-length source recording
   3  language                 -- labels as tabulated in section 2
   4  is_only_speech           -- Y/N
   5  is_single_speaker        -- Y/N
   6  sounds_like_narrow_band  -- Y/N
   7  speaker_sex              -- M/F
   8  speaker_nativeness       -- normal, marked, non-native
   9  noise_amount             -- easy, middling, difficult
  10  noise_type               -- 0 or more of: bgnoise distortion dropouts interference other
  11  noise_comment            -- empty or free text
  12  speaker_comment          -- empty or free text
  13  language_comment         -- empty or free text

Columns 10-13 may be empty, or may contain one or more space-separated strings.

For the entries in the "EVAL" table that were derived from training data (most
of the rows having an "is_scored" value of "N" in column 16), the "ldcid" value
in that table matches the "segmentid" value in the "DEV" table.

4.3 audio_info.tab (header line + 30,804 data rows)

The columns of this tab-delimited file are:

  1  file_id        -- file name minus extension
  2  file_path      -- directory_path/filename.extension (relative to "data/")
  3  channel_count  -- 1 or 2
  4  sample_rate    -- 8000
  5  duration_sec   -- floating-point number

4.4 file.tbl (no header)

This is a 4-column, space-separated table with one line per data file in this
release. Variable spacing is used for vertical column alignment; the columns are:

  1  MD5 checksum
  2  file size in bytes
  3  file modification date_time
  4  file path (relative to the base directory of the release package)

A sketch showing how this table can be used to verify the release contents
appears at the end of this README.

-------------
README file created by David Graff, Feb. 16, 2017
modified by David Graff, March 3, 2017
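As referenced in section 4.4, the following minimal Python sketch verifies the
release contents against file.tbl by recomputing each file's MD5 checksum and
byte size; the base-directory path is a placeholder.

    import hashlib
    import os

    base = "/path/to/this/release"                       # placeholder for the package location
    with open(os.path.join(base, "file.tbl")) as table:
        for line in table:
            parts = line.split()
            if not parts:
                continue
            md5, size, relpath = parts[0], parts[1], parts[-1]   # columns 1, 2 and 4
            with open(os.path.join(base, relpath), "rb") as f:
                data = f.read()
            if hashlib.md5(data).hexdigest() != md5 or len(data) != int(size):
                print("MISMATCH:", relpath)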