-------------------------------------------------------------------------------------------------
CIEMPIESS TEST CORPUS
Audio and Transcripts of Mexican Spanish Broadcast Conversations.
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

When developing automatic speech recognition engines, or any other machine learning system, it is
good practice to keep the test data separate from the training data and never combine them. The
CIEMPIESS TEST Corpus was created to meet the need for a standard test set for measuring the
progress of the community of users of the CIEMPIESS datasets. We strongly recommend not using
the CIEMPIESS TEST for any other purpose.

The CIEMPIESS TEST Corpus is a gender-balanced corpus designed to test acoustic models for the
speech recognition task. It was built from recordings and human transcripts of 10 male and
10 female speakers.

The CIEMPIESS TEST Corpus is considered a CIEMPIESS dataset because it only contains audio from
the same source as the first CIEMPIESS Corpus. It has the word "TEST" in its name because it is
recommended for test purposes only.

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

CIEMPIESS TEST belongs to the "CIEMPIESS" family of corpora for speech recognition in Mexican
Spanish. The most distinguished member and founder of this family is the CIEMPIESS Corpus
(LDC2015S07), published in 2015 by the Linguistic Data Consortium (LDC).

The CIEMPIESS TEST Corpus was created between 2016 and 2018 by the social service program
"Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the "Universidad
Nacional Autónoma de México" (UNAM) and by Carlos Daniel Hernández Mena, head of the program.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio
Social".

The CIEMPIESS TEST is a radio corpus designed to test acoustic models for automatic speech
recognition. It is made up of recordings of spontaneous conversations in Spanish between a radio
moderator and his guests. Most of the speech in these conversations has the accent of Central
Mexico.

All the recordings that constitute the CIEMPIESS TEST come from "RADIO-IUS"
(http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio station that belongs to UNAM.
They were donated by Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the
"Facultad de Derecho de la UNAM" on the condition that they be used for academic and research
purposes only.

For more information and documentation, see the CIEMPIESS-UNAM Project website at:

http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
-------------------------------------------------------------------------------------------------

The CIEMPIESS TEST (CT) Corpus has the following characteristics:

- The CT has a total of 3558 audio files from 10 male speakers and 10 female speakers. It has a
  total duration of 8 hours and 8 minutes.
- The total number of audio files from male speakers is 1694, with a total duration of 4 hours
  and 3 minutes. The total number of audio files from female speakers is 1864, with a total
  duration of 4 hours and 4 minutes. So the speech time in the CT is almost perfectly balanced
  between genders.
- All of the speakers in the CT come from Mexico, except for speaker M_09, who comes from
  El Salvador.
- Every audio file in the CT has a duration of approximately 5 to 10 seconds.
- Data in the CT is classified by gender and also by speaker, so one can easily select audio
  files from a particular set of speakers for experiments.
- Audio files in the CT and in the first CIEMPIESS Corpus are all of the same type. In both,
  speakers talk about legal issues. They also talk about matters related to UNAM and the
  "Facultad de Derecho de la UNAM".
- As in the first CIEMPIESS Corpus, transcriptions in the CT were made by humans.
- Speakers in the CT are not present in any other CIEMPIESS dataset.
- Audio files in the CT are distributed in a 16kHz@16bit mono format.

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_TEST directory contains the following files and directories:

- files : Contains the transcription file, the paths file, and the "Speaker_Info.xls" file, which
  holds relevant information about all the speakers in the corpus.
- speech : Contains the audio files in a 16kHz@16bit mono format. These files are organized by
  gender and also by speaker.
- README.txt

-------------------------------------------------------------------------------------------------
THE CORPUS FILES
-------------------------------------------------------------------------------------------------

In the "files" directory one can find the following:

- CIEMPIESS_TEST.transcription : This is the transcription file in plain text format.
- CIEMPIESS_TEST.paths : This file contains the relative paths from the "speech" directory to
  every individual speech file.
- Speaker_info.xls : This file contains relevant information about the speakers.
  Specifically: the number of audio files per speaker, the total speech time per speaker, and
  the time match between male and female speakers.

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS TEST Corpus has an identification key with the following
format:

    CMPT_M_01_0001

    CMPT            M                  01            0001

    Acronym         Gender of          Number        Number of the
    for the         Speaker:           of            audio file of
    "CIEMPIESS      "F" for Female     Speaker.      a particular
    TEST".          "M" for Male                     speaker.

-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their
support of the social service program "Desarrollo de Tecnologías del Habla". They also thank the
social service students for all their hard work.

Special thanks to Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the
"Facultad de Derecho de la UNAM" for donating all the recordings that constitute the CIEMPIESS
TEST Corpus.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------
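
-------------------------------------------------------------------------------------------------
EXAMPLE: PARSING IDENTIFICATION KEYS
-------------------------------------------------------------------------------------------------

Since the data is classified by gender and by speaker, the identification key described above
can be used to select audio files for an experiment. The following is a minimal Python sketch,
not part of the corpus distribution. The names parse_key, group_by_speaker and UtteranceKey are
hypothetical helpers introduced here for illustration, and the sample keys are made up. The
sketch only assumes the key layout shown in the IDENTIFICATION KEY FORMAT section; it makes no
assumption about the layout of the paths or transcription files.

```python
import re
from collections import defaultdict
from typing import NamedTuple

# Key layout taken from the README: CMPT_<gender>_<speaker number>_<audio number>.
# The digit widths are left open (\d+) because only one example key is documented.
KEY_PATTERN = re.compile(r"^CMPT_(?P<gender>[FM])_(?P<speaker>\d+)_(?P<utt>\d+)$")


class UtteranceKey(NamedTuple):
    """Fields of one identification key (hypothetical helper type)."""
    gender: str       # "F" or "M"
    speaker_id: str   # e.g. "M_09", the form used in the README for speakers
    utterance: int    # number of the audio file for that speaker


def parse_key(key: str) -> UtteranceKey:
    """Split an identification key such as CMPT_M_01_0001 into its fields."""
    match = KEY_PATTERN.match(key.strip())
    if match is None:
        raise ValueError(f"not a CIEMPIESS TEST key: {key!r}")
    gender = match.group("gender")
    speaker_id = f"{gender}_{match.group('speaker')}"
    return UtteranceKey(gender, speaker_id, int(match.group("utt")))


def group_by_speaker(keys):
    """Map each speaker id to the list of keys that belong to that speaker."""
    groups = defaultdict(list)
    for key in keys:
        groups[parse_key(key).speaker_id].append(key)
    return dict(groups)


if __name__ == "__main__":
    # Made-up keys for illustration only.
    sample = ["CMPT_M_01_0001", "CMPT_M_01_0002", "CMPT_F_03_0001"]
    print(parse_key(sample[0]))
    print(sorted(group_by_speaker(sample)))
```

To select only female speakers, for example, one could keep the keys for which
parse_key(key).gender equals "F". A key that does not follow the documented layout raises a
ValueError instead of being silently skipped, so malformed entries are noticed early.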