CSLU: Spoltech Brazilian Portuguese Version 1.0

Item Name: CSLU: Spoltech Brazilian Portuguese Version 1.0
Authors: Mauricio C. Schramm, Luis Felipe R. Freitas, Adriano Zanuz, and Dante Barone
LDC Catalog No.: LDC2006S16
ISBN: 1-58563-383-6
Release Date: Apr 17, 2006
Data Type: speech
Sample Rate: 44100 Hz
Sampling Format: 1-channel pcm
Data Source(s): microphone speech
Application(s): language identification, language modeling, language teaching, machine learning, machine translation
Language(s): Portuguese
Language ID(s): por
Distribution: 1 DVD
Member Fee: $0 for 2006 members
Non-member Fee: US $150.00
Reduced-License Fee: US $150.00
Extra-Copy Fee: US $150.00
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Mauricio C. Schramm, et al.
CSLU: Spoltech Brazilian Portuguese Version 1.0
Linguistic Data Consortium, Philadelphia


CSLU: Spoltech Brazilian Portuguese Version 1.0, Linguistic Data Consortium (LDC) catalog number LDC2006S16 and ISBN 1-58563-383-6, contains microphone speech from a variety of regions in Brazil with phonetic and orthographic transcriptions. The utterances consist of both read speech (for phonetic coverage) and responses to questions (for spontaneous speech). The corpus contains 477 speakers and 8,080 separate utterances. A total of 2,540 utterances have been transcribed at the word level (without time alignments), and 5,479 utterances have been transcribed at the phoneme level (with time alignments). Protocol design, recording and transcription were performed by the Universidade Federal do Rio Grande do Sul and the Universidade de Caxias do Sul.


The audio was recorded at 44.1 kHz (mono, 16-bit) and stored in RIFF format. Recording used a direct connection from the microphone to a SoundBlaster-compatible sound card. For the prompted sentences, the sentence was hidden from view when recording began, so that the speaker would utter it more naturally. Recording quality was verified immediately after each utterance; the data-collection software allowed the speaker to re-record an utterance if the recording was not of sufficient quality. The acoustic environment was deliberately left uncontrolled, to allow for the background conditions that would occur in application environments.
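As a minimal sketch of how the stated recording format could be checked, the Python snippet below reads a file's header with the standard-library wave module and compares it against 44.1 kHz, mono, 16-bit. It assumes the corpus audio files are plain PCM RIFF/WAVE files that wave can open directly, which the documentation above does not confirm. The names read_wav_format, matches_corpus_format and write_silent_wav are hypothetical helpers introduced for illustration, and the demo runs on a generated file rather than on corpus data.

```python
import os
import tempfile
import wave

# Format stated in the corpus description: 44.1 kHz, mono, 16-bit (2 bytes per sample).
EXPECTED_FORMAT = {
    "channels": 1,
    "sample_width_bytes": 2,
    "sample_rate_hz": 44100,
}


def read_wav_format(path):
    """Return channel count, sample width and sample rate from a RIFF/WAVE header."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_corpus_format(path):
    """True if the file header matches the format stated in the description."""
    return read_wav_format(path) == EXPECTED_FORMAT


def write_silent_wav(path, channels=1, sample_width_bytes=2,
                     sample_rate_hz=44100, n_frames=100):
    """Hypothetical helper: write a short silent file so the demo has an input."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(sample_width_bytes)
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00" * (channels * sample_width_bytes * n_frames))


if __name__ == "__main__":
    # Stand-in file; replace with the path to an actual corpus recording.
    demo_path = os.path.join(tempfile.mkdtemp(), "demo.wav")
    write_silent_wav(demo_path)
    print(read_wav_format(demo_path))
    print(matches_corpus_format(demo_path))
```

A check like this only inspects the header; it says nothing about recording quality or background noise, which the description notes was left uncontrolled.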




Content Copyright

Portions © 1994-2002 Center for Spoken Language Understanding, Oregon Health & Science University, © 2006 Trustees of the University of Pennsylvania