README FILE FOR: LDC2026S07
CORPUS TITLE: Multi-language Conversational Telephone Speech 2014 - Spanish & Portuguese Group
AUTHORS: Karen Jones, Stephanie Strassel, Kevin Walker, David Graff,
         Jonathan Wright, Preston Cabe

1.0 Introduction

This corpus is a collection of 569 Spanish and Portuguese telephone recordings
totaling 123.8 hours of audio. The calls were collected from acquainted
individuals as part of the 2014 Multi-language Speech (MLS14) collection effort.

LDC's 2014 MLS Collection covered a total of 20 languages in the
following "clusters" of confusable linguistic varieties:

ARABIC: Egyptian Arabic, Iraqi Arabic, Levantine Arabic, Maghrebi Arabic,
        Modern Standard Arabic

CHINESE: Cantonese, Mandarin, Min Nan, Wu

ENGLISH: British English, Indian English, General American English

FRENCH: Haitian Creole, West African French

SLAVIC: Polish, Russian

SPANISH/PORTUGUESE: Brazilian Portuguese, Caribbean Spanish, European
                    Spanish, Latin American Spanish

The data for MLS14 were collected primarily to support research and
technology evaluation in automatic language identification, and
portions of these recordings were used in the NIST 2015 Language
Recognition Evaluation (LRE) Test Set and the NIST 2017 LRE Test Set.

Two genres of speech were collected for MLS14: broadcast narrowband
speech (BNBS) and conversational telephone speech (CTS). This corpus
consists of CTS data only.

2.0 Language, codes and grouping

The languages in the Spanish group and their 6-letter language
codes are listed below:

  Spanish group (spa):
     Brazilian Portuguese (por-brz)
     Caribbean Spanish (spa-car)
     European Spanish (spa-eur)
     Latin American Spanish (spa-lac)

3.0 CTS Collection Protocol

Data in this release stemmed from 18 recruited native speakers who
enlisted their own relatives and acquaintances to take part in a
recorded telephone conversation. Speakers were instructed to discuss
topics of their own choosing in the designated variety (Portuguese or
Spanish) so that the collected speech would be conversational and
natural.

To produce recorded calls, the recruited speaker dialed a
dedicated study telephone number, pressed "1" on their handset to
indicate their consent to be recorded, entered a unique PIN, then
entered the telephone number of their acquaintance. The telephone
system then automatically dialed out to their call partner and once
their consent to be recorded had been obtained, both speakers
were connected and recording began.

Speakers were instructed to hold a conversation for at least 8 minutes.
Since the main purpose of the collection was to support language
recognition research, it was important that there be no association
between language and telephone channel. For this reason, the callsides
containing the speech of the recruited speaker (channel A) were not used
to extract evaluation test segments. Instead, to ensure channel variety,
the callee sides of the call (channel B), which contained speech from
unique speakers, were the main sides of interest. Accordingly, the
recruited speaker encouraged their call partner to do most of the talking
during the recorded conversation.

A detailed description of the collection protocol can be found in
the enclosed document
docs/lrec2016-multi-language-speech-collection-nist-lre.pdf.

4.0 Auditing

Automatically-selected portions of each conversation were manually
audited by native speakers to confirm that the intended language was
spoken, to record judgments about the overall quality and noise
conditions of the call, to judge whether the speech came from a single,
native speaker of the language, and to judge speaker sex. In many cases,
people who were recruited to make calls were also tasked to serve as
auditors; the auditing process was controlled to ensure that no one
would audit their own calls. Auditors were presented with batches of
audit segments in "kits" consisting of target segments (believed to be
in the auditor's own language) and three other types of segments: a) 10%
of segments in the kit were from another dialect in the same linguistic
cluster (to assess language/dialect confusability), b) 5% of the
segments were in the target dialect and had been audited by another
target-language auditor (to measure inter-annotator agreement), and c)
5% of segments were from a completely different cluster of languages to
keep auditors alert.

Auditors were presented with several portions from each channel B side
of the call and they filled in a web form to indicate their judgments.
Since auditors judged multiple snippets from the same call, some calls
received more than one answer to an audit question depending on which
segment was judged, for example where one part of the call was judged
"clear" and another part "some_unclear". Such cases, where the same
question received differing answers within one call, are marked in
"audit_info.tab" with the value "mixed".

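The "mixed" convention described above can be sketched in Python as
follows. This is only an illustration of the rule as stated in this
README; the function name collapse_judgments is hypothetical and is not
part of any LDC tool.

```python
def collapse_judgments(judgments):
    """Collapse per-segment answers to one per-call value.

    Returns the shared answer when all audited segments agree,
    and "mixed" when they differ (hypothetical helper).
    """
    distinct = set(judgments)
    if not distinct:
        raise ValueError("at least one judgment is required")
    if len(distinct) == 1:
        return distinct.pop()
    return "mixed"


print(collapse_judgments(["clear", "clear"]))         # prints: clear
print(collapse_judgments(["clear", "some_unclear"]))  # prints: mixed
```
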
5.0 Directory Structure

The directory structure and contents of the package are summarized below;
paths shown are relative to the base (root) directory of the package:

 ./data/
     spa/
        por-brz  -- 20 Brazilian Portuguese .flac files
        spa-car  -- 105 Caribbean Spanish .flac files
        spa-eur  -- 205 European Spanish .flac files
        spa-lac  -- 239 Latin American Spanish .flac files

 ./docs/
      -- contains this README, various tables and lists (see section 7
         below), and PDF files detailing the collection protocol, calling
         instructions and auditing guidelines:
           lrec2016-multi-language-speech-collection-nist-lre.pdf
           Multi_LangTelephoneCallingInstructions.pdf
           Multi-languageAuditInstructions.pdf
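
As a quick sanity check after unpacking, the per-language file counts
listed above can be compared against what is on disk. The sketch below
assumes the layout shown in this section (data/<group>/<language>/*.flac);
the function name count_flac_files and the EXPECTED table are
illustrative only.

```python
from pathlib import Path

# File counts as listed in this README.
EXPECTED = {"por-brz": 20, "spa-car": 105, "spa-eur": 205, "spa-lac": 239}


def count_flac_files(package_root):
    """Count .flac files per language directory under <root>/data."""
    counts = {}
    data_dir = Path(package_root) / "data"
    if not data_dir.is_dir():
        return counts
    for flac in data_dir.glob("*/*/*.flac"):
        lng = flac.parent.name  # e.g. "por-brz"
        counts[lng] = counts.get(lng, 0) + 1
    return counts
```

Calling count_flac_files(".") from the package root and comparing the
result with EXPECTED should reveal any missing or extra files.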

6.0 Content Summary

All audio data are presented in FLAC-compressed MS-WAV (RIFF) file
format (*.flac); when uncompressed, each file is 2 channels (caller on
"left/A" channel, callee on "right/B" channel), recorded at 8000
samples/second with samples stored as 16-bit signed integers,
representing a lossless conversion from the original mu-law sample data
as captured digitally from the public telephone network. Mu-law
companding algorithms are used for voice traffic in North America.
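
To illustrate why the conversion from mu-law to 16-bit linear samples
loses no information, the sketch below applies the standard G.711 mu-law
expansion to every possible 8-bit code. It is an assumption that the
corpus audio was expanded with an equivalent routine; the function name
mulaw_to_linear16 is hypothetical.

```python
def mulaw_to_linear16(code):
    """Expand one 8-bit G.711 mu-law code to a signed linear sample."""
    code = ~code & 0xFF            # mu-law bytes are transmitted bit-inverted
    sign = code & 0x80             # top bit carries the sign
    exponent = (code >> 4) & 0x07  # 3-bit segment number
    mantissa = code & 0x0F         # 4-bit step within the segment
    magnitude = (((mantissa << 3) + 0x84) << exponent) - 0x84
    return -magnitude if sign else magnitude


samples = [mulaw_to_linear16(c) for c in range(256)]
# Every expanded value fits in a 16-bit signed integer.
print(min(samples), max(samples))  # prints: -32124 32124
```
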

We expect that the number of distinct callees (channel B speakers) is equal
to the number of calls (i.e. each call involves a different callee), though
no special auditing has been done to confirm this.

The following table summarizes, for each language, the total number of
calls, the total hours of recorded audio, and the total size of the
FLAC-compressed data in gigabytes (GB).

group	lng	calls	hrs	GB
Spanish	por-brz	20	4.78	0.26
Spanish	spa-car	105	23.69	1.16
Spanish	spa-eur	205	44.20	2.37
Spanish	spa-lac	239	51.13	2.51
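
The per-language rows can be checked against the corpus totals given in
the Introduction (569 recordings, 123.8 hours). The short sketch below
simply re-adds the figures from the table; the variable names are
illustrative.

```python
# (lng, calls, hrs, GB) copied from the table above.
ROWS = [
    ("por-brz", 20, 4.78, 0.26),
    ("spa-car", 105, 23.69, 1.16),
    ("spa-eur", 205, 44.20, 2.37),
    ("spa-lac", 239, 51.13, 2.51),
]

total_calls = sum(row[1] for row in ROWS)
total_hours = round(sum(row[2] for row in ROWS), 2)
total_gb = round(sum(row[3] for row in ROWS), 2)

print(total_calls, total_hours, total_gb)  # prints: 569 123.8 6.3
```
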

7.0 Documentation Summary

In addition to this README file, the "docs" directory contains the
following:

7.1 Multi_LangTelephoneCallingInstructions.pdf

This is the set of instructions given to people who were recruited to
make calls to their acquaintances for the collection.

7.2 Multi-languageAuditInstructions.pdf

This is the set of instructions for the auditing that was done on all
calls.  It includes screenshots of the auditing tool.

7.3 lrec2016-multi-language-speech-collection-nist-lre.pdf

This is the full text of a paper presented at the 2016
Language Resources and Evaluation Conference (LREC 2016),
describing the larger collection effort that LDC conducted to
support the NIST 2015 and 2017 LRE evaluations.

7.4 flac_info.tab

This is a tab-delimited table with a one-line header followed by one
row for each CTS recording in the corpus. The columns of the table
are as follows:

  1 language_path_file_name (e.g. "spa/por-brz/20131008_235928_6063.flac")
  2 dur_sec -- file duration in seconds (e.g. 909.948)
  3 cmp_kb -- compressed flac file size in kilobytes
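
A minimal sketch of reading this table and totaling audio hours per
language is shown below. It assumes that the header row uses the column
names listed above and that the language code is the second component of
the path. The second sample row and both cmp_kb values are made up for
illustration; the function name hours_per_language is hypothetical.

```python
import csv
import io
from collections import defaultdict


def hours_per_language(tab_file):
    """Sum dur_sec per language code and convert to hours."""
    totals = defaultdict(float)
    reader = csv.DictReader(tab_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        # Path looks like "spa/por-brz/20131008_235928_6063.flac".
        _group, lng, _name = row["language_path_file_name"].split("/")
        totals[lng] += float(row["dur_sec"]) / 3600.0
    return dict(totals)


# Illustrative input only; real use would open docs/flac_info.tab.
SAMPLE = (
    "language_path_file_name\tdur_sec\tcmp_kb\n"
    "spa/por-brz/20131008_235928_6063.flac\t909.948\t13000\n"
    "spa/spa-eur/example_call.flac\t600.0\t9000\n"
)
print(hours_per_language(io.StringIO(SAMPLE)))
```
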

7.5 caller_info.tab

This is a tab-delimited table with a one-line header followed by one
row for each CTS recording in the corpus. The columns of the table
are as follows:

  1 call_file_id (e.g. "20131008_235928_6063.flac")
  2 lng (six-letter symbol for the language, e.g. "por-brz")
  3 clr_id (numeric ID for the recruited caller)
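
Because each recruited caller placed many calls, it can be useful to see
how calls are distributed across callers, for example when building
train/test splits that keep channel A speakers apart. The sketch below
assumes the header row uses the column names listed above; the clr_id
values and the second and third file names are made up, and the function
name calls_per_caller is hypothetical.

```python
import csv
import io
from collections import Counter


def calls_per_caller(tab_file):
    """Count recordings for each (lng, clr_id) pair."""
    reader = csv.DictReader(tab_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    return Counter((row["lng"], row["clr_id"]) for row in reader)


# Illustrative input only; real use would open docs/caller_info.tab.
CALLER_SAMPLE = (
    "call_file_id\tlng\tclr_id\n"
    "20131008_235928_6063.flac\tpor-brz\t1001\n"
    "example_call_2.flac\tpor-brz\t1001\n"
    "example_call_3.flac\tspa-eur\t1002\n"
)
for (lng, clr_id), n in sorted(calls_per_caller(io.StringIO(CALLER_SAMPLE)).items()):
    print(lng, clr_id, n)
```
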

7.6 audit_info.tab

This is a tab-delimited table with a one-line header followed by one
row for each CTS recording in the corpus. The columns of the table
are as follows:

  1 call_file_id (e.g. "20140225_165934_7683.flac")
  2 auditor_id (alphanumeric ID of auditor)
  3 lng (six-letter symbol for the language, e.g. "por-brz")
  4 all_target_lang (auditor judgment whether entire call is in specified
    language; values: "yes", "no", "mixed")
  5 lng_comment (free text auditor comment about language)
  6 mostly_speech (auditor judgment on speech amount;
    values: "yes", "no", "canttell", "mixed")
  7 speech_clarity (auditor judgment on speech clarity;
    values: "clear", "some_unclear", "very_unclear", "mixed")
  8 single_speaker (auditor judgment on whether single speaker in call;
    values: "yes", "no", "NO_RESPONSE", "mixed")
  9 native_speaker (auditor judgment on whether callee is native speaker of
    language; values: "yes", "no", "unsure", "NO_RESPONSE", "mixed")
  10 speaker_sex (auditor judgment on callee gender;
     values: "female", "male", "unsure", "NO_RESPONSE", "mixed")
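
One possible use of this table is to select calls whose audit results
meet a chosen quality bar. The sketch below keeps calls judged to be
entirely in the target language, from a native speaker, and not
"very_unclear". The selection criteria are an example only, not an LDC
recommendation. It assumes the header row uses the column names listed
above; the auditor IDs and the second and third sample rows are made up,
and the function name select_calls is hypothetical.

```python
import csv
import io

COLUMNS = [
    "call_file_id", "auditor_id", "lng", "all_target_lang", "lng_comment",
    "mostly_speech", "speech_clarity", "single_speaker", "native_speaker",
    "speaker_sex",
]


def select_calls(tab_file, lng):
    """Return call_file_id values passing an example audit filter."""
    keep = []
    # QUOTE_NONE keeps quote characters in free-text comments literal.
    reader = csv.DictReader(tab_file, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if (row["lng"] == lng
                and row["all_target_lang"] == "yes"
                and row["native_speaker"] == "yes"
                and row["speech_clarity"] in ("clear", "some_unclear")):
            keep.append(row["call_file_id"])
    return keep


# Illustrative input only; real use would open docs/audit_info.tab.
AUDIT_ROWS = [
    ["20140225_165934_7683.flac", "A01", "spa-eur", "yes", "", "yes",
     "clear", "yes", "yes", "female"],
    ["example_call_2.flac", "A02", "spa-eur", "mixed", "", "yes",
     "clear", "yes", "yes", "male"],
    ["example_call_3.flac", "A01", "spa-eur", "yes", "", "yes",
     "very_unclear", "yes", "yes", "male"],
]
AUDIT_SAMPLE = "\n".join("\t".join(r) for r in [COLUMNS] + AUDIT_ROWS) + "\n"
print(select_calls(io.StringIO(AUDIT_SAMPLE), "spa-eur"))
# prints: ['20140225_165934_7683.flac']
```
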

8.0 Copyright Information

© 2013-2014 Trustees of the University of Pennsylvania

9.0 Contacts

If you have questions about this data release, please contact the
following personnel at LDC.

Kevin Walker - Technical Lead <walkerk@ldc.upenn.edu>

README created by Joshua Parry April 14, 2025