README FILE FOR
2007 NIST Language Recognition Evaluation Supplemental Training Set
------------------------------------------

LDC Catalog-ID: LDC2009S05

This DVD contains audio data and documentation that have been prepared
specifically for the 2007 NIST Language Recognition Evaluation (LRE07).

The audio files contained here (under the "audio" directory) are
single-channel, 8kHz, ulaw, SPHERE-formatted. Each file represents
one side of a 2-channel telephone conversation. The files have been
subdivided according to the language used in each conversation.

Whereas the LRE07 Evaluation Plan involves a large number of distinct
target languages and dialects, this training release only provides
audio data in those languages for which no training data are available
in existing LRE-related corpora. The languages provided here are:

  ABBREV.  Full Name
  -------------------
  ARB      Arabic
  BEN      Bengali
  CFR      Min Nan Chinese
  RUS      Russian
  THA      Thai
  URD      Urdu
  WUU      Wu Chinese
  YUH      Cantonese Chinese

Directory and file names are all in lower-case; each language
abbreviation is used as a directory name within the "audio" directory,
and individual data file names follow the pattern:

  lre07_tr_{lng}_{nnn}_{c}.sph

where:

  {lng} is the 3-letter language abbreviation
  {nnn} is a 3-digit sequential call-ID number
  {c}   is the single-letter channel-ID ("a" or "b")

Note that if two files have the same call-ID number, they represent
the two sides of a single conversation. In all (across all eight
languages), 212 distinct conversations are represented, 108 of which
have both sides presented in this release (104 conversations have only
one side present). In all cases, each audio file contains a distinct,
unique speaker.

The "docs" directory contains a table (call_side_info.csv) that
provides basic demographic data about each call side.
The comma-delimited fields in that table are:

  LNG -- Language spoken (also the audio subdirectory name)
  CID -- 3-digit numeric call-ID (part of the file name)
  CHN -- Channel ("a" or "b", also part of the file name)
  SEX -- Speaker gender ("M" or "F")
  AGE -- Speaker age in years
  PHN -- Type of phone used ("L" = land-line, "C" = cell)

There are 40 call sides for each of the eight languages, so there are
320 rows of data in the table (plus an initial line providing the
column headings). Age information is unavailable for some subjects.
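
The file-naming pattern and the table layout described above can be
tied together with a short script. The following is a minimal Python
sketch, not part of the release. The function names (parse_name,
audio_path, load_call_sides) are hypothetical. It assumes that the
heading line of call_side_info.csv uses exactly the six field names
listed above, and that a missing age appears as an empty or
non-numeric field; neither detail is confirmed here. The sample rows
are made up for illustration and are not taken from the real table.

```python
import csv
import io
import re

# File-name pattern from this README: lre07_tr_{lng}_{nnn}_{c}.sph
NAME_RE = re.compile(r"^lre07_tr_([a-z]{3})_(\d{3})_([ab])\.sph$")


def parse_name(filename):
    """Split a data file name into (language, call-ID, channel)."""
    match = NAME_RE.match(filename)
    if match is None:
        raise ValueError(f"not an LRE07 training file name: {filename!r}")
    return match.groups()


def audio_path(lng, cid, chn):
    """Build the relative path of one call side under "audio"."""
    lng = lng.strip().lower()      # directory and file names are lower-case
    cid = cid.strip().zfill(3)     # call-ID is written with 3 digits
    chn = chn.strip().lower()
    return f"audio/{lng}/lre07_tr_{lng}_{cid}_{chn}.sph"


def load_call_sides(handle):
    """Read table rows as dicts, adding a parsed age and the audio path."""
    rows = []
    for row in csv.DictReader(handle):
        age = (row.get("AGE") or "").strip()
        # Assumption: a missing age is an empty or non-numeric field.
        row["AGE"] = int(age) if age.isdigit() else None
        row["path"] = audio_path(row["LNG"], row["CID"], row["CHN"])
        rows.append(row)
    return rows


# Made-up sample rows for illustration only (not real corpus data).
SAMPLE = """LNG,CID,CHN,SEX,AGE,PHN
ARB,001,a,M,34,L
ARB,001,b,F,,C
"""

sides = load_call_sides(io.StringIO(SAMPLE))
for side in sides:
    print(side["path"], side["SEX"], side["AGE"], side["PHN"])
```

To run this against the release itself, open docs/call_side_info.csv
and pass the file handle to load_call_sides in place of the sample
text. Per the description above, the result should then hold 320 rows,
and two rows that share the same LNG and CID values are the two sides
of one conversation.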