THE LDC MULTI-LANGUAGE CONVERSATIONAL TELEPHONE SPEECH CORPUS 2011
SOUTH ASIAN GROUP

1.0 Introduction

This corpus is a collection of telephone calls among acquainted
individuals in each of five distinct language varieties of South Asia
(i.e. the Indian subcontinent): Bengali, Hindi, Punjabi, Tamil and Urdu.

The data were collected primarily to support research and technology
evaluation in automatic language identification, and portions of these
calls were used in the NIST 2011 Language Recognition Evaluation (LRE)
Test Set.

For each language, a fairly small number of native speakers were
recruited and instructed to contact as many of their acquaintances as
possible, get their informed consent to be recorded, and make a single
telephone call to each, lasting up to 15 minutes.

2.0 Languages and groupings

The languages in the South Asian group are identified as follows
(strings in parentheses are used as the names of directories):

    South Asian group (s_asian):
        Bengali (ben)
        Hindi   (hin)
        Punjabi (pnb)
        Tamil   (tam)
        Urdu    (urd)

3.0 Data format and quantities

All audio data are presented here in FLAC-compressed MS-WAV (RIFF) file
format (*.flac). When uncompressed, each file has 2 channels (caller on
the "left/A" channel, callee on the "right/B" channel), recorded at 8000
samples/second, with samples stored as 16-bit signed integers. This
represents a lossless conversion from the original mu-law sample data as
captured digitally from the public telephone network.

We expect that the number of distinct callees (channel B speakers) is
equal to the number of calls (i.e. each call involves a different
callee), though no special auditing has been done to confirm this. See
section 4 below for more detailed information about distinct callers
(channel A speakers).

The following table summarizes the total number of calls, total number
of hours of recorded audio, and the total size of compressed data (in
megabytes); the languages are identified by their respective directory
names.
The "#MB" values below represent amounts of compressed data.

    group     lng      #calls   #hours    #MB
    s_asian   ben         118     26.6   1374
    s_asian   hin          37      7.4    383
    s_asian   pnb         207     38.8   1921
    s_asian   tam         101     22.6   1095
    s_asian   urd         116     22.9   1140
    s_asian   Totals      579    118.3   5913

4.0 Additional documentation

In addition to this README file, the "docs" directory contains the
following:

4.1 AuditingInstructions_v3.pdf

This is the set of instructions for the auditing that was done on all
calls. It includes screenshots of the auditing tool. (Please note that
the URL cited in these instructions is no longer active.)

Automatically selected portions of each conversation were manually
audited by native speakers, to confirm that the intended language was
being spoken, and to record judgments about the overall quality of the
call. In many cases, people who were recruited to make calls were also
tasked to serve as auditors; the auditing process was controlled to
ensure that no one would audit their own calls.

4.2 CTSInstructions_V6.pdf

This is the set of instructions given to people who were recruited to
make calls to their acquaintances for the collection.

4.3 Odyssey2012-new-resources-lre11.pdf

This is the full text of a paper presented at the Odyssey 2012
Conference, describing the larger collection effort that LDC conducted
to support the NIST 2011 LRE test program. The collection included
broadcast narrow-band speech (BNBS) as well as conversational telephone
speech (CTS).

4.4

This is a tab-delimited table with a one-line header followed by one row
for each CTS recording in the corpus. The columns of the table are as
follows:

    1  language_path_file_name (e.g. "arabic/acm/20110115_123359_91.flac")
    2  dur_sec  -- file duration in seconds (e.g. 468.408)
    3  cmp_kb   -- compressed file size in kilobytes
    4  unc_kb   -- uncompressed file size in kilobytes
    5  c_ration -- compression ratio (e.g. 0.37)

4.5

This is a tab-delimited table with a one-line header followed by one row
for each CTS recording in the corpus.
The columns of the table are as follows:

    1  call_file_id (e.g. "20110121_153617_175")
    2  lng          (three-letter symbol for the language, e.g. "eng")
    3  clr_id       (numeric ID for the recruited caller)
    4  clr_sex      (gender - M or F - of the recruited caller)

In the South Asian group, there were 19 callers, who made between 5 and
67 calls each.

4.6

This is a tab-delimited table with a one-line header followed by one row
for each CTS recording in the corpus. The columns of the table are as
follows:

    1  call_file_id (e.g. "20110121_153617_175")
    2  lng          (three-letter symbol for the language, e.g. "eng")
    3  cle_sex      (callee gender: M, F or u (unknown))
    4  cle_typ      (callee dialect: "normal", "marked", "non-native")
    5  noise_amt    (extent of noise: "easy", "middling", "difficult")
    6  noise_typ    (type of noise - one or more of:
                     distortion bgnoise dropouts interference other)
    7  auditor      (numeric ID of auditor)

Throughout the collection process, no attempt was made to request
demographic information about callees. Auditors were presented with two
portions from each callee call side, and filled in a web form to label
the callee's gender (though this proved difficult for some callees).
Auditors also indicated whether the callee was speaking a predominant
("normal") dialectal variant of the language (as opposed to being
"non-native", or using a "marked" variant, e.g. a minority regional
dialect), and the amount and type(s) of any noise present in the two
portions. Each portion contained between 30 and 35 seconds of contiguous
speech (after automatic removal of silence intervals within each
portion).

Note that when there are multiple values in column 6 (noise_typ), they
are separated by commas.
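
As an illustration of the layout described in section 4.6, the following
Python sketch reads a tab-delimited callee table and splits the
comma-separated noise_typ field into a list. It is only a sketch under
stated assumptions: the header strings are assumed to match the column
names listed above, the function names are hypothetical, and the sample
rows are invented for demonstration and are not taken from the corpus.

```python
import csv
import io
from collections import Counter

# Hypothetical sample in the seven-column layout of section 4.6.
# ASSUMPTION: the header line uses the column names listed in the README.
# The data rows are invented for illustration; they are not corpus data.
SAMPLE = (
    "call_file_id\tlng\tcle_sex\tcle_typ\tnoise_amt\tnoise_typ\tauditor\n"
    "20110121_153617_175\tben\tF\tnormal\teasy\tbgnoise\t12\n"
    "20110122_101500_176\ttam\tu\tmarked\tmiddling\tdistortion,dropouts\t7\n"
)


def read_callee_table(stream):
    """Read tab-delimited rows; turn noise_typ into a list of labels."""
    reader = csv.DictReader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    rows = []
    for row in reader:
        # Column 6 may hold several noise types separated by commas.
        row["noise_typ"] = [t for t in row["noise_typ"].split(",") if t]
        rows.append(row)
    return rows


def count_noise_types(rows):
    """Count how often each noise label occurs across all rows."""
    counts = Counter()
    for row in rows:
        counts.update(row["noise_typ"])
    return counts


rows = read_callee_table(io.StringIO(SAMPLE))
noise_counts = count_noise_types(rows)
```

To use this with the actual table, the io.StringIO(SAMPLE) argument would
be replaced by an open file handle for the table in the "docs" directory,
after checking that its header matches the assumed column names.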