
CIEMPIESS

Item Name:	CIEMPIESS
Author(s):	Carlos Daniel Hernández Mena, Abel Herrera
LDC Catalog No.:	LDC2015S07
ISBN:	1-58563-720-3
ISLRN:	838-468-581-053-6
DOI:	https://doi.org/10.35111/32r7-8k96
Release Date:	June 15, 2015
Member Year(s):	2015
DCMI Type(s):	Sound, Text
Sample Type:	pcm
Sample Rate:	16000
Data Source(s):	broadcast conversation
Application(s):	speech recognition
Language(s):	Spanish
Language ID(s):	spa
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2015S07 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Hernández Mena, Carlos Daniel, and Abel Herrera. CIEMPIESS LDC2015S07. Web Download. Philadelphia: Linguistic Data Consortium, 2015.
Related Works:	hasVersion LDC2017S23 CIEMPIESS Light isSimilarWith LDC2016S04 CHM150 LDC2018S11 CIEMPIESS Balance LDC2019S07 CIEMPIESS Experimentation isCreatedBy Praat http://www.fon.hum.uva.nl/praat/

Introduction

CIEMPIESS (Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social) was developed by the Speech Processing Laboratory of the Faculty of Engineering at the National Autonomous University of Mexico (UNAM) and consists of approximately 18 hours of Mexican Spanish radio speech, associated transcripts, pronouncing dictionaries and language models. The goal of this work was to create acoustic models for automatic speech recognition.

For more information and documentation see the CIEMPIESS-UNAM Project website.

LDC has released the following data sets in the CIEMPIESS series:

CHM150 (LDC2016S04)
CIEMPIESS Light (LDC2017S23)
CIEMPIESS Balance (LDC2018S11)
CIEMPIESS Experimentation (LDC2019S07)

Data

The speech recordings are from 43 one-hour FM radio programs broadcast by Radio IUS, a UNAM radio station. They consist of spontaneous conversations between a radio moderator and guests, principally about legal issues. Approximately 78% of the speakers were male and 22% were female.

The audio was recorded in MP3 stereo format, using a 44.1 kHz sample rate and a bit rate of 128 kbps or higher. Only "clean" utterances were selected from the raw data, meaning utterances made by a single speaker with no background noise, whispers, music, foreign accents, white noise or static. The audio files were converted to 16 kHz, 16-bit PCM WAV format for this release.
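As a quick sanity check after download, the release format described above (16 kHz, 16-bit PCM WAV) can be verified with Python's standard `wave` module. This is a minimal sketch, not part of the corpus tooling: the function names `describe_wav` and `matches_release_format` and the file name in the usage note are illustrative assumptions. The channel count is reported but not checked, since this page does not state whether the converted files are mono or stereo.

```python
import wave


def describe_wav(path):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        # getsampwidth() returns bytes per sample, so multiply by 8 for bits.
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_release_format(path, rate_hz=16000, bits=16):
    """Check a file against the 16 kHz, 16-bit PCM format stated for this release.

    The defaults come from the description above; the helper itself is hypothetical.
    """
    actual_rate, actual_bits, _channels = describe_wav(path)
    return actual_rate == rate_hz and actual_bits == bits
```

For example, `matches_release_format("utterance.wav")` (a hypothetical file name) would return `True` for a file that matches the stated format, and `False` for an unconverted 44.1 kHz file.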

The recordings were transcribed using PRAAT, a tool designed for phonetics research. The transcripts are in Mexbet, a phonetic alphabet designed for Mexican Spanish based on Worldbet (Hieronymus, 1994). Plain text transcripts, TextGrid-format time labels and files useful for performing experiments with the SPHINX3 recognition software are also included.
