-------------------------------------------------------------------------------------------------
CIEMPIESS LIGHT CORPUS
Audio and Transcripts of Mexican Spanish Broadcast Conversations.
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

The CIEMPIESS LIGHT Corpus is an enhanced version of the CIEMPIESS Corpus (LDC item LDC2015S07).
CIEMPIESS LIGHT is "light" because it omits many of the files included in the first version of
CIEMPIESS, and "enhanced" because it incorporates many improvements, some of them suggested by
our community of users, that make this version more convenient for newer speech recognition
engines such as Kaldi (http://kaldi-asr.org/).

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

The CIEMPIESS LIGHT Corpus was created at the "Laboratorio de Tecnologías del Lenguaje" of the
"Facultad de Ingeniería" (FI) at the "Universidad Nacional Autónoma de México" (UNAM) between
2015 and 2016 by Carlos Daniel Hernández Mena, supervised by José Abel Herrera Camacho, head of
the laboratory.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio
Social".

CIEMPIESS LIGHT is a radio corpus designed for creating acoustic models for automatic speech
recognition. It is made up of recordings of spontaneous conversations in Mexican Spanish between
a radio moderator and his guests.
These recordings were obtained in MP3 format from "PODCAST UNAM" (http://podcast.unam.mx/). They
were produced by "RADIO-IUS" (http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio
station that belongs to UNAM, and by "Mirador Universitario" (http://mirador.cuaed.unam.mx/), a
TV program that also belongs to UNAM.

For more information and documentation see the CIEMPIESS-UNAM Project website at:

http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
COMPARISON BETWEEN THE CIEMPIESS LIGHT AND THE CIEMPIESS
-------------------------------------------------------------------------------------------------

Differences between the CIEMPIESS LIGHT (CL) and the original CIEMPIESS (OC) include:

- The organization of the directories is simpler in the CL than in the OC.
- The CL doesn't include files for running experiments with Sphinx like the OC does.
- The CL doesn't include "time labels" marking word beginnings and word endings like the OC
  does.
- The CL is slightly bigger (18 hours 25 minutes) than the OC (17 hours 30 minutes).
- The CL is made up of 16,663 audio files with transcripts. The OC is made up of 17,017 audio
  files divided into a train set and a test set.
- The CL doesn't impose a test set or a train set like the OC does.
- Data in the CL is classified by gender and speaker, so one can easily select audio from a
  particular set of speakers to run experiments.
- Female speakers were added to the CL, but it is still a gender-unbalanced corpus (female
  speakers account for 24.85% of the CL against 22.14% of the OC).
- The CL contains 34 female speakers and 53 male speakers. 12,521 of the audio files belong to
  male speakers and 4,142 belong to female speakers. In contrast, the training set of the OC
  has 12,407 audio files that belong to male speakers and 3,610 that belong to female speakers.
- Transcription files in the OC are harder to read because they are not completely
  orthographic. For example, this is a transcription in the OC:

      pEro nO podrIa eKSistIr lA sAna distAncia doctOr (0166M_08ALX_15OCT12)

  This is the same transcription in the CL:

      pero no podría existir la sana distancia doctor CMPL_M_09_08ALX_00244

  As one can see, the transcription in the CL is easier to understand.
- As in the OC, the file keys in the CL provide a lot of useful information about each audio
  file. Nevertheless, keys in the CL are numbered differently from the keys in the OC.
- Faulty audio files in the OC were eliminated in the CL. For example, a few files in the OC
  contain only noise and no voice. Some others contain just one unintelligible word.
- Good-quality audio files were added to the CL. These new audio files come from the TV show
  "Mirador Universitario", which is produced by UNAM. The topics in the new files are similar
  to the topics discussed in the old files, and they were downloaded in MP3 format from the
  same page (http://podcast.unam.mx/).
- Audio files in the CL are all of the same type. There are no files from other corpora like in
  the OC.

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_LIGHT directory contains the following files and directories:

- docs : Here one can find the transcription file (CIEMPIESS_LIGHT.transcription), the audio
         paths file (CIEMPIESS_LIGHT.paths) and the file "audiosxspk.txt", which shows the
         number of audio files per speaker.
- data : Here one can find the audio files in FLAC format (16 kHz, 16-bit, mono). These files
         are organized by gender and also by speaker.
- LICENSE.txt
- README.txt

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS LIGHT Corpus has an identification key with the following
format:

    CMPL_M_52_13ALX_00021

    CMPL          M                 52          13ALX                  00021

    Acronym       Gender of         Number      An internal            Number of the
    for the       Speaker:          of          key that indicates     audio file of
    "CIEMPIESS    "M" for Male      Speaker.    the show where the     a particular
    LIGHT".       "F" for Female                audio was extracted    speaker.
                                                from.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------
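
As an illustration of the identification key format described above, the following Python
sketch splits a key into its five fields. It is not part of the corpus distribution. The names
CiempiessKey, parse_key and split_transcription_line are hypothetical, and the code assumes that
the speaker and audio file fields are purely numeric and that, as in the example transcription
shown earlier, each transcription line ends with its key.

```python
import re
from typing import NamedTuple, Tuple


class CiempiessKey(NamedTuple):
    """Fields of a CIEMPIESS LIGHT identification key (hypothetical helper type)."""
    acronym: str   # "CMPL", the acronym for CIEMPIESS LIGHT
    gender: str    # "M" for male, "F" for female
    speaker: str   # speaker number, kept as text to preserve leading zeros
    show: str      # internal key of the show the audio was extracted from
    audio: str     # number of the audio file for that speaker


# Assumption: numeric fields are digits only and the show key is alphanumeric.
KEY_PATTERN = re.compile(r"^(CMPL)_([MF])_(\d+)_([0-9A-Za-z]+)_(\d+)$")


def parse_key(key: str) -> CiempiessKey:
    """Split a key such as CMPL_M_52_13ALX_00021 into its five fields."""
    match = KEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a CIEMPIESS LIGHT key: {key!r}")
    return CiempiessKey(*match.groups())


def split_transcription_line(line: str) -> Tuple[str, CiempiessKey]:
    """Separate the text from the key, assuming the key is the last token."""
    text, key = line.strip().rsplit(None, 1)
    return text, parse_key(key)


if __name__ == "__main__":
    print(parse_key("CMPL_M_52_13ALX_00021"))
    print(split_transcription_line(
        "pero no podría existir la sana distancia doctor CMPL_M_09_08ALX_00244"))
```

Keeping the numeric fields as strings preserves leading zeros, so the parsed fields can be
joined with underscores to recover the original key. Keys in the original CIEMPIESS format,
such as 0166M_08ALX_15OCT12, are rejected with a ValueError.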