-------------------------------------------------------------------------------------------------
CIEMPIESS Experimentation Package
Corpus and Tools to perform Speech Recognition Experiments in Mexican Spanish
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

The CIEMPIESS Experimentation Package is a set of three different corpora designed to solve specific problems when experimenting with speech recognition systems:

CIEMPIESS COMPLEMENTARY. This is a phonetically balanced corpus of isolated Spanish words spoken by people from Central Mexico. It was designed to compensate for a lack of instances of any given phoneme in Mexican Spanish training data. The CIEMPIESS COMPLEMENTARY provides documentation (written in English) for learning how to produce accurate phonetic transcriptions in Mexican Spanish, and it also provides an automatic phonetizer coded in Python 2.7 to create pronouncing dictionaries.

CIEMPIESS FEM. This is a corpus built from recordings of 21 different women. The motivation for the CIEMPIESS FEM is that we have noticed a lack of female speakers in the sources from which we traditionally take audio to create new CIEMPIESS datasets. This corpus was therefore designed to balance (in gender) up to 14 hours of male speaker recordings.

CIEMPIESS TEST. This corpus was created in response to the need for a standard test set to measure the progress of the community of users of the CIEMPIESS datasets.

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

The CIEMPIESS Experimentation Package belongs to the "CIEMPIESS" family of corpora for Speech Recognition in Mexican Spanish.
The most distinguished member and founder of this family is the CIEMPIESS Corpus (LDC2015S07), published in 2015 by the Linguistic Data Consortium (LDC). In 2017, the LDC published the CIEMPIESS LIGHT Corpus (LDC2017S23), which is a revisited and augmented version of the original CIEMPIESS.

We recommend combining the CIEMPIESS LIGHT, the CIEMPIESS BALANCE and the CIEMPIESS Experimentation Package to perform experiments with modern speech recognition engines such as Kaldi, SPHINX or HTK.

The CIEMPIESS Experimentation Package was created between 2016 and 2018 by the social service program "Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the "Universidad Nacional Autónoma de México" (UNAM), under Carlos Daniel Hernández Mena, head of the program.

CIEMPIESS is the acronym for: "Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social".

Most of the recordings that constitute the CIEMPIESS TEST and the CIEMPIESS FEM datasets, included in the CIEMPIESS Experimentation Package, come from "RADIO-IUS" (http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio station that belongs to UNAM. They were donated by Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the "Facultad de Derecho de la UNAM".

Other recordings were taken from the YouTube channels:

- "IUS Canal Multimedia"
  https://www.youtube.com/user/DEDUNAM/videos
- "Centro Universitario de Estudios Jurídicos (CUEJ UNAM)"
  https://www.youtube.com/channel/UCTxkzdUd0tiXT5BN5o6Xo-A/videos

The CIEMPIESS COMPLEMENTARY Corpus, included in the CIEMPIESS Experimentation Package, consists of read speech. It was designed and edited by Carlos Daniel Hernández Mena and recorded by Susana Alejandra Jiménez Sandoval.
For more information and documentation see the CIEMPIESS-UNAM Project website at: http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
MORE DETAILS ABOUT THE DATASETS
-------------------------------------------------------------------------------------------------

This section provides a more detailed explanation of the datasets included in the CIEMPIESS Experimentation Package. For full documentation see the "README" file that comes with every dataset.

- CIEMPIESS COMPLEMENTARY (1 hour)

The CIEMPIESS COMPLEMENTARY is a phonetically balanced corpus of isolated Spanish words spoken by people from Central Mexico. It was designed to solve one particular issue when training automatic speech recognition (ASR) systems in the Spanish of Central Mexico. This problem appears when someone collects training data, but the system then complains because it does not find enough instances of one or more particular phonemes.

The CIEMPIESS COMPLEMENTARY Corpus was created with the voices of 10 male and 10 female volunteers reading isolated words. The words were chosen to guarantee at least twenty instances of every phoneme and allophone of the Mexican phonetic alphabet called Mexbet.

Mexbet is a phonetic alphabet created for the Spanish of Central Mexico. It has several levels of granularity, but the two levels we work with are the T29 or phonological level, with 29 symbols, and the T66 or phonetic level, with 66 symbols.

In this edition of the CIEMPIESS COMPLEMENTARY we provide documentation (written in English) for learning how to produce accurate phonetic transcriptions using Mexbet, an automatic phonetizer coded in Python 2.7, and pronouncing dictionaries in T29 and T66 for all the words in the corpus. We also show the one-to-one equivalence between Mexbet, the International Phonetic Alphabet (IPA) and the X-SAMPA alphabet.
In conclusion, the CIEMPIESS COMPLEMENTARY Corpus is "COMPLEMENTARY" because it "complements" other datasets when training ASR systems in the Spanish of Central Mexico.

- CIEMPIESS FEM (14 hours)

Since the publication of the CIEMPIESS Corpus (LDC2015S07) in 2015 we have noticed a lack of female speakers in the sources from which we traditionally take audio to create new CIEMPIESS datasets. That is why we decided to create a corpus that helps to balance future gender-unbalanced datasets.

The CIEMPIESS FEM Corpus was created from recordings and human transcripts of 21 different women. Sixteen of these women are Mexican; the others come from other Latin American countries.

The CIEMPIESS FEM Corpus is considered a CIEMPIESS dataset because it only contains audio from the same source as the first CIEMPIESS Corpus, and it is "FEM" because it only contains recordings of female speakers.

- CIEMPIESS TEST (8 hours)

When developing automatic speech recognition engines, or any other machine learning system, it is good practice to keep the test data separate from the training data and never combine them. The CIEMPIESS TEST Corpus was created to meet this need for a standard test set to measure the progress of the community of users of the CIEMPIESS datasets, and we strongly recommend not using the CIEMPIESS TEST for any other purpose.

The CIEMPIESS TEST Corpus is a gender-balanced corpus designed to test acoustic models for the speech recognition task. It was created from recordings and human transcripts of 10 male and 10 female speakers.

The CIEMPIESS TEST Corpus is considered a CIEMPIESS dataset because it only contains audio from the same source as the first CIEMPIESS Corpus, and it has the word "TEST" in its name because it is recommended for test purposes only.
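The recommendation to keep the test set apart from the training data can be checked mechanically. The following is only a minimal sketch and not part of the package: the function name find_overlap and the utterance IDs are hypothetical, and it assumes that every recording can be identified by a unique string taken from each corpus's own file lists.

```python
def find_overlap(train_ids, test_ids):
    """Return the sorted list of IDs that appear in both collections."""
    return sorted(set(train_ids) & set(test_ids))


# Hypothetical utterance IDs, used here for illustration only.
train_ids = ["TRAIN_F_0001", "TRAIN_M_0002", "TRAIN_F_0003"]
test_ids = ["TEST_F_0001", "TEST_M_0002"]

overlap = find_overlap(train_ids, test_ids)
if overlap:
    print("Test utterances found in the training data: " + ", ".join(overlap))
else:
    print("Training and test sets are disjoint.")
```

Running a check like this before training is a cheap way to notice that test recordings were accidentally mixed into the training list.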
-------------------------------------------------------------------------------------------------
MEXBET DOCUMENTATION
-------------------------------------------------------------------------------------------------

Mexbet is a phonetic alphabet created for the Spanish of Central Mexico. It has several levels of granularity, but the two levels we work with are the T29 or phonological level, with 29 symbols, and the T66 or phonetic level, with 66 symbols.

In this edition of the CIEMPIESS Experimentation Package we provide documentation (written in English) for learning how to produce accurate phonetic transcriptions using Mexbet, an automatic phonetizer coded in Python 2.7, and pronouncing dictionaries in T29 and T66 for all the words in the CIEMPIESS COMPLEMENTARY corpus. We also show the one-to-one equivalence between Mexbet, the International Phonetic Alphabet (IPA) and the X-SAMPA alphabet.

In the "docs" directory one can find several documents that help users understand and adopt the Mexbet phonetic alphabet and its transcription rules.

The following files contain useful charts that show the manner of articulation and the place of articulation of the Mexbet phonemes and allophones. One chart also shows how to convert the Mexbet symbols to the IPA and X-SAMPA alphabets and how the IPA equivalences of Mexbet can be written in LaTeX. These charts are:

- Mexican_Spanish_Phonology_in_Mexbet_T29.pdf : A chart with the place of articulation and the manner of articulation of the Mexbet phonemes of the T29 level.

- Mexbet_T66_Phonetic_Alphabet.pdf : A chart with the place of articulation and the manner of articulation of the Mexbet phonemes and allophones of the T66 level.

- Equivalences_between_phonetic_alphabets.pdf : A chart that shows the one-to-one equivalences between the Mexbet, the IPA and the X-SAMPA symbols. It also shows how to write the IPA equivalences of Mexbet in LaTeX.
The following documents show the entire process needed to perform phonetic transcriptions in Mexbet. First, one has to determine the stressed vowel of the word to be transcribed. Then one has to divide the word into syllables. The next step is to use the grapheme-to-phoneme rules of Mexbet to produce a transcription at the T29 level. Finally, one has to use the phonetic rules of Mexbet to produce a transcription at the T66 level. The documents that show how to carry out this whole process are:

- Rules_for_Spanish_Accent_Marks.pdf : It shows how to determine whether or not a Spanish word needs an accent mark. This document also gives cues on how to identify the stressed syllable when the accent mark is not present.

- Syllabification_Rules_in_Spanish.pdf : It shows the rules for dividing a Spanish word into syllables.

- Mexbet_T29_Grapheme-to-Phoneme_Rules.pdf : It shows the grapheme-to-phoneme rules to perform transcriptions at the T29 level of Mexbet.

- Mexbet_T66_Phonetic_Rules.pdf : It shows the phonetic rules to perform transcriptions at the T66 level of Mexbet.

-------------------------------------------------------------------------------------------------
THE "fonetica3 library" SOFTWARE TOOL
-------------------------------------------------------------------------------------------------

The "fonetica3 library" is a software tool written in Python 2.7 that is based on the same rules shown in the "docs" directory. This means that users can carry out several phonetic tasks with it, such as syllabification or phonetic transcription.

The "fonetica3 library" comes with a very detailed README file that describes all of its functionality. What is worth highlighting in this section is how to use the "fonetica3 library" to create a pronouncing dictionary, both in T29 and in T66, with the stressed vowel properly indicated.
This task is very important in the ASR field, and we recommend using the library to create a pronouncing dictionary for the CIEMPIESS LIGHT Corpus and for all future members of the CIEMPIESS family.

The three functions analyzed in this section are:

- T29() : Performs transcriptions in Mexbet T29.

- T66() : Performs transcriptions in Mexbet T66.

- vocal_tonica() : Marks the stressed vowel of the word passed as argument with a capital letter. For example: vocal_tonica("camello") produces "camEllo", where "E" is the stressed vowel.

Note that all the functions in the "fonetica3 library" expect words in lowercase.

The following example shows how to use the T29() function in Python code to transcribe the word "canción":

    #-*- coding: utf-8 -*-
    import sys
    # Look for the fonetica3 package in the current directory
    sys.path.append(".")
    from fonetica3.T29 import T29
    print(T29("canción"))

This code produces the following output:

    k a n . s i o_7 n

The dot indicates the syllabification, and the "_7" next to the "o" indicates that the vowel "o" is the stressed vowel.

The following code shows how to use the T66() function to transcribe the same word:

    #-*- coding: utf-8 -*-
    import sys
    # Look for the fonetica3 package in the current directory
    sys.path.append(".")
    from fonetica3.T66 import T66
    print(T66("canción"))

This code produces the following output:

    k a n . s j O_7 n

Here too, the dot indicates the syllabification and the "_7" next to the "O" indicates the stressed vowel.

In both cases, it is obvious that the stressed vowel is the "o" because it carries an accent mark or tilde (ó). If the word doesn't have an accent mark, one can use the vocal_tonica() function to determine which vowel is stressed.
The following code shows how to use the vocal_tonica() function to determine the stressed vowel of the word "camello":

    #-*- coding: utf-8 -*-
    import sys
    # Look for the fonetica3 package in the current directory
    sys.path.append(".")
    from fonetica3.vocal_tonica import vocal_tonica
    print(vocal_tonica("camello"))

This code produces the following output:

    camEllo

Note that the stressed vowel is indicated by the capital letter "E". An uppercase vowel tells both the T29() and the T66() functions that the vowel is stressed.

The following code shows what happens if the stressed vowel is not indicated when using the T29() function:

    #-*- coding: utf-8 -*-
    import sys
    # Look for the fonetica3 package in the current directory
    sys.path.append(".")
    from fonetica3.T29 import T29
    print(T29("camello"))

It produces:

    k a . m e . Z o

Notice that there is no "_7" indicating the stressed vowel. Now let's see what happens when the stressed vowel is properly indicated:

    #-*- coding: utf-8 -*-
    import sys
    # Look for the fonetica3 package in the current directory
    sys.path.append(".")
    from fonetica3.T29 import T29
    print(T29("camEllo"))

It produces:

    k a . m e_7 . Z o

Eureka! The "_7" is next to the "e".

The three functions analyzed here work without errors for most Spanish words, but the output of T29() and T66() has one incompatibility with the pronouncing dictionaries used in the ASR field: the syllabification. One can easily notice that the pronouncing dictionaries included in the CIEMPIESS COMPLEMENTARY Corpus do not have dots indicating the syllabification.
For example, a few lines of the T66 dictionary are:

    ababuy a V a V w_7 i(
    absolutamente a V s o l u_7 t a m e_7 n_[ t e
    académicamente a k a D e_7 m i k a m e_7 n_[ t e
    accidentalmente a k s i D e n_[ t a_2_7 l m e_7 n_[ t e
    acolchonada a k O l_j tS o n a_7 D a

The following code shows how to remove the syllabification dots from a single T66 transcription of the word "camello":

    #-*- coding: utf-8 -*-
    import sys
    # Look for the fonetica3 package in the current directory
    sys.path.append(".")
    from fonetica3.T66 import T66
    word = "camello"
    transcription = T66(word)
    # Remove the syllable separators (a dot surrounded by spaces)
    transcription = transcription.replace(" . "," ")
    print(transcription)

It produces:

    k a m e Z o

In this case, the stressed vowel was not indicated. To indicate it, see the following code, which takes the same word "camello" and uses the vocal_tonica() function to determine the stressed vowel. At the end, the T66() function produces a transcription in T66 with the stressed vowel properly indicated and no syllabification dots.

    #-*- coding: utf-8 -*-
    import sys
    # Look for the fonetica3 package in the current directory
    sys.path.append(".")
    from fonetica3.T66 import T66
    from fonetica3.vocal_tonica import vocal_tonica
    word = "camello"
    # Mark the stressed vowel with a capital letter before transcribing
    stressed = vocal_tonica(word)
    transcription = T66(stressed)
    # Remove the syllable separators (a dot surrounded by spaces)
    transcription = transcription.replace(" . "," ")
    print(transcription)

It produces:

    k a m e_7 Z o

To look for updates of the "fonetica3 library" see: http://www.ciempiess.org/downloads

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS Experimentation Package directory contains the following files and directories:

- data: Contains each of the three corpora that make up the CIEMPIESS Experimentation Package.

- docs: Contains several manuals on how to produce accurate phonetic transcriptions using the Mexbet phonetic alphabet. There is also information about Mexican Spanish phonology and phonetics.
- tools: Contains a copy of the "fonetica3 library" software tool.

-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their support of the social service program "Desarrollo de Tecnologías del Habla". They also thank the social service students for all their hard work.

Thanks to Susana Alejandra Jiménez Sandoval from the "Facultad de Filosofía y Letras de la UNAM" for recording the utterances of the CIEMPIESS COMPLEMENTARY Corpus.

Special thanks to Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the "Facultad de Derecho de la UNAM" for donating most of the recordings that constitute the CIEMPIESS TEST and the CIEMPIESS FEM datasets.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------