-------------------------------------------------------------------------------------------------
                                  CIEMPIESS FEM CORPUS
                   Audio and Transcripts of Female Speakers in Spanish
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

Since the publication of the CIEMPIESS Corpus (LDC2015S07) in 2015, we have noticed a lack of
female speakers in the sources from which we traditionally take audio to create new CIEMPIESS
datasets. That is why we decided to create a corpus that helps to balance future
gender-imbalanced datasets.

The CIEMPIESS FEM Corpus was created from recordings and human transcripts of 21 different
women. 16 of these women are Mexican. The others come from other Latin American countries.

The CIEMPIESS FEM Corpus is considered a CIEMPIESS dataset because it only contains audio from
the same source as the first CIEMPIESS Corpus, and it is "FEM" because it only contains
recordings of female speakers.

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

CIEMPIESS FEM belongs to the "CIEMPIESS" family of corpora for Speech Recognition in Mexican
Spanish. The most distinguished member and founder of this family is the CIEMPIESS Corpus
(LDC2015S07), published in 2015 by the Linguistic Data Consortium (LDC).

The CIEMPIESS FEM Corpus was created between 2016 and 2018 by the social service program
"Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the
"Universidad Nacional Autónoma de México" (UNAM), under Carlos Daniel Hernández Mena, head of
the program.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio
Social".

The CIEMPIESS FEM is a radio corpus designed to create acoustic models for automatic speech
recognition. It is made up of recordings of spontaneous conversations in Spanish between a
radio moderator and his guests. Most of the speech in these conversations has the accent of
Central Mexico.

All the recordings that constitute the CIEMPIESS FEM come from "RADIO-IUS"
(http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio station that belongs to
UNAM. They were donated by Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo
from the "Facultad de Derecho de la UNAM" on the condition that they be used for academic and
research purposes only.

For more information and documentation, see the CIEMPIESS-UNAM Project website at:

http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
-------------------------------------------------------------------------------------------------

The CIEMPIESS FEM (CF) Corpus has the following characteristics:

- The CF has a total of 6505 audio files from 21 different women. It has a total duration of
  13 hours and 54 minutes.

- Every audio file in the CF has a duration of approximately 5 to 10 seconds.

- Data in the CF is classified by speaker and also by country, so one can easily select audio
  files from a particular set of speakers for experiments.

- Audio files in the CF and in the first CIEMPIESS are all of the same type. In both, speakers
  talk about legal and lawyer-related issues. They also talk about things related to UNAM and
  the "Facultad de Derecho de la UNAM".

- As in the first CIEMPIESS Corpus, transcriptions in the CF were made by humans.

- Speakers in the CF are not present in any other CIEMPIESS dataset.

- Audio files in the CF are distributed in a 16kHz@16bit mono format.

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_FEM directory contains the following files and directories:

- files : Contains the transcription file, the paths file and the "Speaker_Info.xls" file,
          which holds relevant information about all the speakers in the corpus.

- speech: Contains the speech files classified by speaker and also by country. The directory
          "mexican" holds all the speakers that come from Mexico. The directory "foreign"
          holds the speakers that come from a country other than Mexico.

- README.txt

-------------------------------------------------------------------------------------------------
THE CORPUS FILES
-------------------------------------------------------------------------------------------------

In the "files" directory one can find the following:

- CIEMPIESS_FEM.transcription : This is the transcription file in plain text format.

- CIEMPIESS_FEM.paths         : This file contains the relative paths from the "speech"
                                directory to every particular speech file.

- Speaker_info.xls            : This file contains relevant information about the speakers.
                                Specifically: nationality, number of audio files per speaker
                                and the total amount of speech time per speaker.

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS FEM Corpus has an identification key with the following
format:

                                  CMPF_F_01_MEX_0001

  CMPF         F                01         MEX                        0001

  Acronym      Gender of        Number     Country of the             Number of the
  for          the Speaker:     of         current speaker:           audio file of
  "CIEMPIESS   "F" for Female   Speaker.   MEX = Mexico               a particular
  FEM".                                    VEN = Venezuela            speaker.
                                           ARG = Argentina
                                           SLV = El Salvador
                                           DOM = Dominican Republic
                                           UNK = Unknown

-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for
their support of the social service program "Desarrollo de Tecnologías del Habla". They also
thank the social service students for all their hard work.

Special thanks to Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the
"Facultad de Derecho de la UNAM" for donating all the recordings that constitute the
CIEMPIESS FEM Corpus.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------
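
-------------------------------------------------------------------------------------------------
EXAMPLE: READING AN IDENTIFICATION KEY
-------------------------------------------------------------------------------------------------

As an illustration of the identification key format described in this README, the following
Python sketch splits a key such as CMPF_F_01_MEX_0001 into its fields. It is not part of the
corpus distribution. The names parse_key, UtteranceKey, KEY_PATTERN and COUNTRY_CODES are
hypothetical helpers introduced here, and the sketch assumes that every key has exactly the
five underscore-separated fields shown in the sample key.

```python
import re
from typing import NamedTuple

# Country codes as listed in the IDENTIFICATION KEY FORMAT section.
COUNTRY_CODES = {
    "MEX": "Mexico",
    "VEN": "Venezuela",
    "ARG": "Argentina",
    "SLV": "El Salvador",
    "DOM": "Dominican Republic",
    "UNK": "Unknown",
}

# Assumption: keys always look like the sample CMPF_F_01_MEX_0001, that is,
# acronym, gender, speaker number, country code and audio file number.
KEY_PATTERN = re.compile(r"^CMPF_(F)_(\d+)_([A-Z]{3})_(\d+)$")


class UtteranceKey(NamedTuple):
    gender: str        # always "F" in this corpus
    speaker: int       # number of the speaker
    country_code: str  # one of the keys of COUNTRY_CODES
    audio_number: int  # number of the audio file of that speaker


def parse_key(key: str) -> UtteranceKey:
    """Split an identification key into its fields, or raise ValueError."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a CIEMPIESS FEM identification key: {key!r}")
    gender, speaker, country_code, audio_number = match.groups()
    if country_code not in COUNTRY_CODES:
        raise ValueError(f"unknown country code in key: {key!r}")
    return UtteranceKey(gender, int(speaker), country_code, int(audio_number))
```

For instance, parse_key("CMPF_F_01_MEX_0001").country_code evaluates to "MEX", which
COUNTRY_CODES maps to "Mexico". Grouping keys by the speaker or country_code field is one way
to select the audio files of a particular set of speakers.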