README FILE FOR:  LDC2026S07

CORPUS TITLE:     Multi-language Conversational Telephone Speech 2014 -
                  Spanish & Portuguese Group

AUTHORS:          Karen Jones, Stephanie Strassel, Kevin Walker,
                  David Graff, Jonathan Wright, Preston Cabe


1.0 Introduction

This corpus is a collection of 569 Spanish and Portuguese telephone recordings totaling 123.8 hours of audio. The calls were collected from acquainted individuals as part of the 2014 Multi-language Speech (MLS14) collection effort.

LDC's 2014 MLS Collection covered a total of 20 languages in the following "clusters" of confusable linguistic varieties:

  ARABIC:              Egyptian Arabic, Iraqi Arabic, Levantine Arabic,
                       Maghrebi Arabic, Modern Standard Arabic
  CHINESE:             Cantonese, Mandarin, Min Nan, Wu
  ENGLISH:             British English, Indian English,
                       General American English
  FRENCH:              Haitian Creole, West African French
  SLAVIC:              Polish, Russian
  SPANISH/PORTUGUESE:  Brazilian Portuguese, Caribbean Spanish,
                       European Spanish, Latin American Spanish

The data for MLS14 were collected primarily to support research and technology evaluation in automatic language identification, and portions of these recordings were used in the NIST 2015 Language Recognition Evaluation (LRE) Test Set and the NIST 2017 LRE Test Set.

Two genres of speech were collected for MLS14: broadcast narrowband speech (BNBS) and conversational telephone speech (CTS). This corpus consists of CTS data only.


2.0 Language, codes and grouping

The languages in the Spanish group and their 6-letter language codes are listed below:

  Spanish group (spa):
    Brazilian Portuguese    (por-brz)
    Caribbean Spanish       (spa-car)
    European Spanish        (spa-eur)
    Latin American Spanish  (spa-lac)


3.0 CTS Collection Protocol

Data in this release stemmed from 18 recruited native speakers who enlisted their own relatives and acquaintances to take part in a recorded telephone conversation.
Speakers were instructed to discuss topics of their own choosing in the designated variety (Portuguese or Spanish), which helped ensure that the collected speech was conversational and natural.

To produce recorded calls, the recruited speaker dialed a dedicated study telephone number, pressed "1" on their handset to indicate their consent to be recorded, entered a unique PIN, then entered the telephone number of their acquaintance. The telephone system then automatically dialed out to the call partner and, once the call partner had also consented to be recorded, both speakers were connected and recording began. Speakers were instructed to hold a conversation for at least 8 minutes.

Since the main purpose of the collection was to support language recognition research, it was important that there was no association between language and telephone channel. For this reason, the call sides containing the speech of the recruited speaker (channel A) were not used to extract evaluation test segments. Instead, to ensure channel variety, the callee sides of the call (channel B), which contained speech from unique speakers, were the main sides of interest. Accordingly, the recruited speaker encouraged their call partner to do most of the talking during the recorded conversation.

A detailed description of the collection protocol can be found in the enclosed document docs/lrec2016-multi-language-speech-collection-nist-lre.pdf.


4.0 Auditing

Automatically-selected portions of each conversation were manually audited by native speakers to confirm that the intended language was spoken, to record judgments about the overall quality and noise conditions of the call, to judge whether the speech came from a single speaker who was a native speaker of the language, and to judge speaker sex. In many cases, people who were recruited to make calls were also tasked to serve as auditors; the auditing process was controlled to ensure that no one would audit their own calls.
Auditors were presented with batches of audit segments in "kits" consisting of target segments (believed to be in the auditor's own language) and three other types of segments:

  a) 10% of segments in the kit were from another dialect in the same linguistic cluster (to assess language/dialect confusability),
  b) 5% of the segments were in the target dialect and had been audited by another target-language auditor (to measure inter-annotator agreement), and
  c) 5% of segments were from a completely different cluster of languages to keep auditors alert.

Auditors were presented with several portions from each channel B side of the call and they filled in a web form to indicate their judgments. Since auditors judged multiple snippets from the same call, some calls received more than one answer to an audit question depending on which segment was judged, for example where one part of the call was judged "clear" and another part "some_unclear". Such cases, where the same call has more than one audit answer for the same question, are marked in "audit_info.tab" with the value "mixed".
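
As a rough illustration of how the "mixed" marker might be used, the Python sketch below lists the calls that have a "mixed" value in any judgment column. It assumes that the header line of audit_info.tab uses the column names given in section 7.6 and that the table is in the docs directory; neither has been checked against the release. The function name and the two sample rows are invented for illustration and are not corpus data.

```python
import csv
import io

# Judgment columns of audit_info.tab, as named in section 7.6 of this README.
# Assumption: the header line of the real table uses exactly these names.
JUDGMENT_COLUMNS = [
    "all_target_lang",
    "mostly_speech",
    "speech_clarity",
    "single_speaker",
    "native_speaker",
    "speaker_sex",
]


def find_mixed_calls(tab_file):
    """Map each call_file_id to the judgment columns whose value is "mixed"."""
    # QUOTE_NONE keeps any quote characters in free-text comments literal.
    reader = csv.DictReader(tab_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    mixed = {}
    for row in reader:
        columns = [c for c in JUDGMENT_COLUMNS if row.get(c) == "mixed"]
        if columns:
            mixed[row["call_file_id"]] = columns
    return mixed


# Invented rows for illustration only; these are not taken from the corpus.
SAMPLE = (
    "call_file_id\tauditor_id\tlng\tall_target_lang\tlng_comment\t"
    "mostly_speech\tspeech_clarity\tsingle_speaker\tnative_speaker\tspeaker_sex\n"
    "example_call_1.flac\tA01\tspa-eur\tyes\t\tyes\tmixed\tyes\tyes\tfemale\n"
    "example_call_2.flac\tA02\tspa-car\tyes\t\tyes\tclear\tyes\tyes\tmale\n"
)

if __name__ == "__main__":
    print(find_mixed_calls(io.StringIO(SAMPLE)))
    # prints {'example_call_1.flac': ['speech_clarity']}

    # To run on the real table (path assumed relative to the package root):
    # with open("docs/audit_info.tab", newline="", encoding="utf-8") as f:
    #     print(find_mixed_calls(f))
```

Calls flagged this way received differing answers across their audited snippets, so users who need a single clean judgment per call may prefer to review or exclude them.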


5.0 Directory Structure

The directory structure and contents of the package are summarized below; paths shown are relative to the base (root) directory of the package:

  ./data/
      spa/
          por-brz -- 20 Brazilian Portuguese .flac files
          spa-car -- 105 Caribbean Spanish .flac files
          spa-eur -- 205 European Spanish .flac files
          spa-lac -- 239 Latin American Spanish .flac files

  ./docs/ -- contains this README, various tables and lists (see
             section 7 below), and PDF files detailing the collection
             protocol, calling instructions and auditing guidelines:

      lrec2016-multi-language-speech-collection-nist-lre.pdf
      Multi_LangTelephoneCallingInstructions.pdf
      Multi-languageAuditInstructions.pdf


6.0 Content Summary

All audio data are presented in FLAC-compressed MS-WAV (RIFF) file format (*.flac); when uncompressed, each file is 2 channels (caller on "left/A" channel, callee on "right/B" channel), recorded at 8000 samples/second with samples stored as 16-bit signed integers, representing a lossless conversion from the original mu-law sample data as captured digitally from the public telephone network. Mu-law companding algorithms are used for voice traffic in North America.

We expect that the number of distinct callees (channel B speakers) is equal to the number of calls (i.e. each call involves a different callee), though no special auditing has been done to confirm this.

The following table summarizes the total number of calls, total number of hours of recorded audio, and the total size of compressed data (in gigabytes). The "GB" values below represent amounts of compressed data.

  group    lng      calls    hrs     GB
  Spanish  por-brz     20   4.78   0.26
  Spanish  spa-car    105  23.69   1.16
  Spanish  spa-eur    205  44.20   2.37
  Spanish  spa-lac    239  51.13   2.51


7.0 Documentation Summary

In addition to this README file, the "docs" directory contains the following:

7.1 Multi_LangTelephoneCallingInstructions.pdf

This is the set of instructions given to people who were recruited to make calls to their acquaintances for the collection.

7.2 Multi-languageAuditInstructions.pdf

This is the set of instructions for the auditing that was done on all calls. It includes screenshots of the auditing tool.

7.3 lrec2016-multi-language-speech-collection-nist-lre.pdf

This is the full text of a paper presented at the 2016 Language Resources and Evaluation Conference (LREC 2016), describing the larger collection effort that LDC conducted to support the NIST 2015 and 2017 LRE evaluations.

7.4 flac_info.tab

This is a tab-delimited table with a one-line header followed by one row for each CTS recording in the corpus. The columns of the table are as follows:

  1 language_path_file_name (e.g. "spa/por-brz/20131008_235928_6063.flac")
  2 dur_sec -- file duration in seconds (e.g. 909.948)
  3 cmp_kb -- compressed flac file size in kilobytes

7.5 caller_info.tab

This is a tab-delimited table with a one-line header followed by one row for each CTS recording in the corpus. The columns of the table are as follows:

  1 call_file_id (e.g. "20131008_235928_6063.flac")
  2 lng (six-letter symbol for the language, e.g. "por-brz")
  3 clr_id (numeric ID for the recruited caller)

7.6 audit_info.tab

This is a tab-delimited table with a one-line header followed by one row for each CTS recording in the corpus. The columns of the table are as follows:

   1 call_file_id (e.g. "20140225_165934_7683.flac")
   2 auditor_id (alphanumeric ID of auditor)
   3 lng (six-letter symbol for the language, e.g. "por-brz")
   4 all_target_lang (auditor judgment whether entire call is in specified language; values: "yes", "no", "mixed")
   5 lng_comment (free text auditor comment about language)
   6 mostly_speech (auditor judgment on speech amount; values: "yes", "no", "canttell", "mixed")
   7 speech_clarity (auditor judgment on speech clarity; values: "clear", "some_unclear", "very_unclear", "mixed")
   8 single_speaker (auditor judgment on whether single speaker in call; values: "yes", "no", "NO_RESPONSE", "mixed")
   9 native_speaker (auditor judgment on whether callee is native speaker of language; values: "yes", "no", "unsure", "NO_RESPONSE", "mixed")
  10 speaker_sex (auditor judgment on callee gender; values: "female", "male", "unsure", "NO_RESPONSE", "mixed")


8.0 Copyright Information

© 2013-2014 Trustees of the University of Pennsylvania


9.0 Contacts

If you have questions about this data release, please contact the following personnel at LDC.

  Kevin Walker - Technical Lead

README created by Joshua Parry
April 14, 2025