-------------------------------------------------------------------------------------------------
                                   CIEMPIESS BALANCE CORPUS
                Audio and Transcripts of Mexican Spanish Broadcast Conversations.
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

The CIEMPIESS BALANCE Corpus is designed to complement the CIEMPIESS LIGHT Corpus (LDC2017S23).
It is called "balance" because it balances the CIEMPIESS LIGHT: when the two are combined, the
result is a gender-balanced corpus.

To appreciate this, one needs to know that the CIEMPIESS LIGHT is, by itself, a gender-unbalanced
corpus with approximately 25% female speakers and 75% male speakers. Conversely, the CIEMPIESS
BALANCE is a gender-unbalanced corpus with approximately 25% male speakers and 75% female
speakers.

Furthermore, the match between the two datasets goes deeper than the number of speakers. In both
corpora, speakers are numbered as F_01, M_01, F_02, M_02, etc. The relation between the speakers
is that the speech of F_01 in the CIEMPIESS LIGHT has approximately the same duration as the
speech of M_01 in the CIEMPIESS BALANCE.

As a consequence of this speaker-to-speaker match, the CIEMPIESS BALANCE has a size of 18 hours
and 20 minutes against the 18 hours and 25 minutes of the CIEMPIESS LIGHT, a difference of only
5 minutes.

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

CIEMPIESS BALANCE belongs to the "CIEMPIESS" family of corpora for Speech Recognition in Mexican
Spanish.
The most distinguished member and founder of this family is the CIEMPIESS Corpus (LDC2015S07),
published in 2015 by the Linguistic Data Consortium (LDC).

The CIEMPIESS BALANCE Corpus was created between 2016 and 2018 by the social service program
"Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) in the "Universidad
Nacional Autónoma de México" (UNAM), headed by Carlos Daniel Hernández Mena.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio
Social".

The CIEMPIESS BALANCE is a radio corpus designed to create acoustic models for automatic speech
recognition. It is made up of recordings of spontaneous conversations in Mexican Spanish between
a radio moderator and his guests. Most of the recordings that constitute the CIEMPIESS BALANCE
come from "RADIO-IUS" (http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio station
that belongs to UNAM. Other recordings, needed to perform an accurate match between specific
speakers, were taken from the YouTube channels:

- "IUS Canal Multimedia"
  https://www.youtube.com/user/DEDUNAM/videos
- "Centro Universitario de Estudios Jurídicos (CUEJ UNAM)"
  https://www.youtube.com/channel/UCTxkzdUd0tiXT5BN5o6Xo-A/videos

For more information and documentation, see the CIEMPIESS-UNAM Project website at:

http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
-------------------------------------------------------------------------------------------------

The CIEMPIESS BALANCE Corpus has the following characteristics:

- The organization of the directories in the CIEMPIESS BALANCE (CB) is the same as in the
  CIEMPIESS LIGHT (CL) Corpus.
- The CL is slightly bigger (18 hours and 25 minutes) than the CB (18 hours and 20 minutes).
- The CB is made up of 8555 audio files with transcripts.
  2447 of those files (28.6%) come from male speakers and 6108 files (71.4%) come from female
  speakers.
- The CB contains 34 male speakers and 53 female speakers. The recordings from female speakers
  total 12 hours and 40 minutes, while the recordings from male speakers total 5 hours and
  40 minutes.
- Every audio file in the CB has a duration between 5 and 10 seconds, approximately.
- Speakers in the CB and the CL are different people. In fact, speakers in the CB are not
  present in any other CIEMPIESS dataset.
- Data in the CB is classified by gender and speaker, so one can easily select audio files from
  a particular set of speakers to do experiments.
- Audio files in the CL and the CB are all of the same type. In both, speakers talk about legal
  and lawyer-related issues. They also talk about things related to the UNAM and the "Facultad
  de Derecho de la UNAM".
- Audio files in the CB are distributed in a 16kHz@16bit mono format.

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_BALANCE directory contains the following files and directories:

- docs : Here one can find the transcription file (CIEMPIESS_BALANCE.transcription), the audio
  paths file (CIEMPIESS_BALANCE.paths), the file "audiosxspk.txt" that shows the number of audio
  files per speaker, and the file "CMPL_and_CMPB_Match.xls" that shows the speaker-to-speaker
  match between the CIEMPIESS LIGHT and the CIEMPIESS BALANCE.
- speech : Here one can find the audio files in a 16kHz@16bit mono format. These files are
  organized by gender and also by speaker.
- README.txt

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS BALANCE Corpus has an identification key with the following
format:

        CMPB_F_51_01CAR_00012

  CMPB          F                 51             01CAR                 00012

  Acronym       Gender of the     Number of      Internal key that     Number of the
  for           speaker:          the speaker.   indicates who         audio file of
  "CIEMPIESS    "M" for male,                    transcribed the       a particular
  BALANCE".     "F" for female.                  current audio file.   speaker.

-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their
support of the social service program "Desarrollo de Tecnologías del Habla". They also thank
the social service students for all their hard work.

Special thanks to Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the
"Facultad de Derecho de la UNAM" for donating most of the recordings that constitute the
CIEMPIESS BALANCE Corpus.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------
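
-------------------------------------------------------------------------------------------------
EXAMPLE: READING AN IDENTIFICATION KEY
-------------------------------------------------------------------------------------------------

The identification key format described in this README can be split into its five fields with a
few lines of Python. This is only a minimal sketch and is not part of the corpus tools: the
names CmpbKey and parse_key are hypothetical, and the pattern assumes (based on the single
example key CMPB_F_51_01CAR_00012) that fields are separated by underscores, that the speaker
and audio file numbers are digits, and that the transcriber key is alphanumeric.

```python
import re
from typing import NamedTuple


class CmpbKey(NamedTuple):
    """Fields of a CIEMPIESS BALANCE identification key (hypothetical helper type)."""
    corpus: str        # acronym, expected to be "CMPB"
    gender: str        # "M" for male, "F" for female
    speaker: int       # number of the speaker
    transcriber: str   # internal key of the person who did the transcription
    audio_number: int  # number of the audio file for that speaker


# Assumed pattern, inferred from the example key in this README.
_KEY_RE = re.compile(r"^(CMPB)_([FM])_(\d+)_([0-9A-Za-z]+)_(\d+)$")


def parse_key(key: str) -> CmpbKey:
    """Split an identification key into its fields, or raise ValueError."""
    match = _KEY_RE.match(key)
    if match is None:
        raise ValueError(f"not a CIEMPIESS BALANCE key: {key!r}")
    corpus, gender, speaker, transcriber, audio_number = match.groups()
    return CmpbKey(corpus, gender, int(speaker), transcriber, int(audio_number))


if __name__ == "__main__":
    print(parse_key("CMPB_F_51_01CAR_00012"))
```

Because the gender and the speaker number are separate fields, a helper like this could be used
to select the audio files of a particular set of speakers before running an experiment.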