-------------------------------------------------------------------------------------------------
CIEMPIESS COMPLEMENTARY CORPUS
Audio and Transcripts of Spanish Isolated Words.
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

The CIEMPIESS COMPLEMENTARY is a phonetically balanced corpus of isolated Spanish words spoken by people from Central Mexico. It was designed to solve one particular issue that arises when training automatic speech recognition (ASR) systems for the Spanish of Central Mexico: after collecting training data, one may find that the system complains because it does not find enough instances of one or more phonemes.

The CIEMPIESS COMPLEMENTARY Corpus was created with the voices of 10 male and 10 female volunteers reading isolated words. The words were chosen to guarantee users at least twenty instances of every phoneme and allophone of the Mexican phonetic alphabet called Mexbet.

Mexbet is a phonetic alphabet created for the Spanish of Central Mexico. It has several levels of granularity, but the two levels we work with are the T29 or phonological level, with 29 symbols, and the T66 or phonetic level, with 66 symbols.

In this edition of the CIEMPIESS COMPLEMENTARY we provide documentation (written in English) for learning how to produce accurate phonetic transcriptions using Mexbet, an automatic phonetizer coded in Python 2.7, and pronouncing dictionaries in T29 and T66 for all the words in the corpus. We also show the one-to-one equivalence between Mexbet, the International Phonetic Alphabet (IPA) and the X-SAMPA alphabet.
In conclusion, the CIEMPIESS COMPLEMENTARY Corpus is "COMPLEMENTARY" because it "complements" other datasets when training ASR systems for the Spanish of Central Mexico.

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

CIEMPIESS COMPLEMENTARY belongs to the "CIEMPIESS" family of corpora for speech recognition in Mexican Spanish. The most distinguished member and founder of this family is the CIEMPIESS Corpus (LDC2015S07), published in 2015 by the Linguistic Data Consortium (LDC). In 2017, the LDC published the CIEMPIESS LIGHT Corpus (LDC2017S23), which is a revised and augmented version of the original CIEMPIESS. We recommend combining the CIEMPIESS COMPLEMENTARY and the CIEMPIESS LIGHT with modern speech recognition engines such as Kaldi, SPHINX or HTK.

CIEMPIESS is the acronym for "Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social".

The CIEMPIESS COMPLEMENTARY Corpus was created at the "Universidad Nacional Autónoma de México" (UNAM) between 2016 and 2017 by the Social Service Program "Desarrollo de Tecnologías del Habla", which depends on the "Facultad de Ingeniería" (FI). It was designed and edited by Carlos Daniel Hernández Mena and recorded by Susana Alejandra Jiménez Sandoval.

Mexbet is a phonetic alphabet designed for the Spanish of Central Mexico. It was created in 2004 by the linguist Javier Cuétara at UNAM for experiments with another corpus of Mexican Spanish called DIMEx100 (see http://turing.iimas.unam.mx/~luis/DIME/CORPUS-DIMEX.html), where it gave good results. For these reasons, Mexbet is the preferred alphabet for the whole CIEMPIESS family.
For more information and documentation, see the CIEMPIESS-UNAM Project website at: http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
-------------------------------------------------------------------------------------------------

The CIEMPIESS COMPLEMENTARY Corpus has the following characteristics:

- It was recorded with a Sony ICD-PX312D recorder in a moderately noisy environment similar to a medium-sized library. The recordings were originally made in MP3 format at 44.1 kHz, 128 kbps, stereo.
- 10 male and 10 female volunteers from Central Mexico, aged 20 to 49, contributed 26 speech files each.
- It is 56 minutes long, with 520 speech files converted to 16 kHz, 16-bit, PCM, mono format.
- Every speaker reads the digits from zero to nine (1 speech file), the alphabet with some common nicknames of certain letters, like "i griega" for the "y" (3 speech files, 11 letters per file), and finally a list of 66 words (22 speech files, 3 words per file). In general, every speaker reads different words, but a few words are read by two different speakers.
- The 22 lists of Spanish words that every speaker reads are designed to ensure that each of the 66 phonemes and allophones of the T66 level of Mexbet occurs at least once for each speaker. Note that the 29 phonemes of the T29 level of Mexbet are included in the T66 level.
- Speakers in the CIEMPIESS COMPLEMENTARY are not present in any other CIEMPIESS dataset.
- Every pronouncing dictionary has 1357 entries. One dictionary shows the pronunciations in Mexbet T29 and the other one shows the pronunciations in Mexbet T66.
- Alternative pronunciations in the pronouncing dictionaries are designated with a number in parentheses.
- There are two files that show the phoneme frequency of the whole corpus in T29 and T66.
- There is a file that shows relevant data about all the speakers, such as gender, age, place of birth, etc.
- The corpus includes a transcription file with unique ids and a file with the relative paths to the speech files in the "speech" directory. The file id is designed to provide information about the particular speech file. This id is also compatible with software like the NIST SCLITE Scoring Package or the Kaldi Speech Recognition Engine.
- Full documentation of Mexbet is provided. There are files that show how to perform accurate transcriptions in both T29 and T66. There are also charts that show the place of articulation as well as the manner of articulation of every phoneme and allophone of Mexbet. The equivalences between the Mexbet, IPA and X-SAMPA alphabets are also shown, as well as the way to write the IPA symbols for Mexbet in LaTeX. Note that this whole documentation is written in English and has been revised and updated for the present edition of the CIEMPIESS COMPLEMENTARY corpus.
- A copy of the "fonetica3 library" is provided. It is a software tool written in Python 2.7 and based on the transcription rules of Mexbet. With this tool one can perform automatic transcriptions in T29 and T66, as well as syllabification, syllable count, stressed syllable deduction, accent mark deduction and so on. Both the pronouncing dictionaries and the example transcriptions in the documentation of Mexbet were made with the help of the "fonetica3 library".

-------------------------------------------------------------------------------------------------
LISTS OF WORDS
-------------------------------------------------------------------------------------------------

- Digits (1 speech file): Every speaker reads the digits from 0 to 9:

    01. cero uno dos tres cuatro cinco seis siete ocho nueve

- Alphabet (3 speech files, 11 letters per file): Every speaker reads the same letters and nicknames in the same order:

    01. a b "be grande" c "che" d e f g h i
    02. j k l "doble ele" m n ñ o p q r
    03. "doble erre" s t "u be" "be chica" "doble u" "doble be" x "i griega" "ye" z

- Words (22 speech files, 3 words per file): Every speaker reads different words, but in a few cases the same word has been read by two speakers. The words were chosen to be as long as possible. Every set of words contains at least one instance of each of the 66 phonemes and allophones of the T66 level of Mexbet. The T29 level of Mexbet is embedded in the T66 level. An example word list is shown below:

    01. yíu kakuy triunfo
    02. xiutlaltla bolchevique quetzitecátl
    03. tepezcuitle matacaballo confrontación
    04. ultrasensorial inyectándomelos estreñimiento
    05. inyectárnoslo autocomplaciente detalladamente
    06. confusamente esdrujulizaseis transmigración
    07. esdrujulizarías barbitúricos esquizogénesis
    08. congratulación irreprochable alfabetización
    09. atropellamiento incompatibilidad resplandeciente
    10. substancialmente cuestionamiento entrecruzamiento
    11. socioeconómico geográficamente supercomputadoras
    12. epistemológicas restablecimiento intercambiarán
    13. verdaderamente independentistas racionalización
    14. desertificación acondicionamiento improvisadamente
    15. concientización magníficamente desentendimiento
    16. latinoamericano importantemente iberoamericanos
    17. esterilización consideraciones anteroposterior
    18. conscientemente descentralizado trasdosearías
    19. revolucionarios selectivamente posteriormente
    20. preestablecido psicométricas penitenciaría
    21. judeocristiano metodológicas jurídicamente
    22. instrumentista inconvenientes hamamelidácea

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_COMPLEMENTARY directory contains the following files and directories:

- docs : Contains several manuals on how to produce accurate phonetic transcriptions using the Mexbet phonetic alphabet. There is also information about Mexican Spanish phonology and phonetics.
- files : Contains the transcription file, the paths file, the pronouncing dictionaries, the phoneme frequency files and the data about all the speakers.
- software : Contains a copy of the "fonetica3 library" software tool.
- speech : Contains the speech files, classified by gender and also by speaker.
- README.txt

-------------------------------------------------------------------------------------------------
THE CORPUS FILES
-------------------------------------------------------------------------------------------------

In the "files" directory one can find the following:

- CIEMPIESS_COMPLEMENTARY.transcription : This is the transcription file in plain text format.
- CIEMPIESS_COMPLEMENTARY.paths : This file contains the relative paths from the "speech" directory to every particular speech file.
- Speaker_info.xls : This file contains relevant information about the speakers, such as gender, age, place of birth, etc.
- CIEMPIESS_COMPLEMENTARY_T29.dic : This is a pronouncing dictionary of the whole corpus in Mexbet T29.
- CIEMPIESS_COMPLEMENTARY_T66.dic : This is a pronouncing dictionary of the whole corpus in Mexbet T66.
- CIEMPIESS_COMPLEMENTARY_T29.freq : This file shows the number of T29 phonemes counted in the whole corpus.
- CIEMPIESS_COMPLEMENTARY_T66.freq : This file shows the number of T66 phonemes and allophones counted in the whole corpus.
- CIEMPIESS_COMPLEMENTARY_T29.phones : This is a list of the 29 phonemes of the T29 Mexbet level.
- CIEMPIESS_COMPLEMENTARY_T66.phones : This is a list of the 66 phonemes and allophones of the T66 Mexbet level.

-------------------------------------------------------------------------------------------------
MEXBET DOCUMENTATION
-------------------------------------------------------------------------------------------------

In the "docs" directory one can find several documents that help users understand and adopt the Mexbet phonetic alphabet as well as its transcription rules.

The following files contain useful charts that show the manner of articulation and the place of articulation of the Mexbet phonemes and allophones. One chart also shows how to convert the Mexbet symbols to the IPA and X-SAMPA alphabets and how the IPA equivalences of Mexbet can be written in LaTeX. These charts are:

- Mexican_Spanish_Phonology_in_Mexbet_T29.pdf : A chart with the place of articulation and the manner of articulation of the Mexbet phonemes of the T29 level.
- Mexbet_T66_Phonetic_Alphabet.pdf : A chart with the place of articulation and the manner of articulation of the Mexbet phonemes and allophones of the T66 level.
- Equivalences_between_phonetic_alphabets.pdf : A chart that shows the one-to-one equivalences between the Mexbet, the IPA and the X-SAMPA symbols. It also shows how to write the IPA equivalences of Mexbet in LaTeX.

The following documents show the entire process needed to perform phonetic transcriptions in Mexbet. First, one has to determine the stressed vowel of the word to be transcribed. Then one has to divide the word into syllables. The next step is to use the grapheme-to-phoneme rules of Mexbet to produce a transcription at the T29 level. Finally, one has to use the phonetic rules of Mexbet to produce a transcription at the T66 level.
The documents that show how to implement this whole process are:

- Rules_for_Spanish_Accent_Marks.pdf : It shows how to determine whether or not a Spanish word needs an accent mark. This document also gives cues on how to identify the stressed syllable when the accent mark is not present.
- Syllabification_Rules_in_Spanish.pdf : It shows the rules for dividing a Spanish word into syllables.
- Mexbet_T29_Grapheme-to-Phoneme_Rules.pdf : It shows the grapheme-to-phoneme rules for transcriptions at the T29 level of Mexbet.
- Mexbet_T66_Phonetic_Rules.pdf : It shows the phonetic rules for transcriptions at the T66 level of Mexbet.

-------------------------------------------------------------------------------------------------
THE "fonetica3 library" SOFTWARE TOOL
-------------------------------------------------------------------------------------------------

The "fonetica3 library" is a software tool written in Python 2.7 that is based on the same rules shown in the "docs" directory. This means that users can perform several phonetic tasks with it, such as syllabification or phonetic transcription. The "fonetica3 library" comes with a very detailed README file that informs users about all its functionalities.

What is worth highlighting in this section is how to use the "fonetica3 library" to create a pronouncing dictionary in both T29 and T66 with the stressed vowel properly indicated. This task is very important in the ASR field, and we recommend using the library to create a pronouncing dictionary for the CIEMPIESS LIGHT Corpus and for all future members of the CIEMPIESS family.

The three functions analyzed in this section are:

- T29() : Performs transcriptions in Mexbet T29.
- T66() : Performs transcriptions in Mexbet T66.
- vocal_tonica() : Marks the stressed vowel of the word passed as its argument by writing that vowel as a capital letter. For example, vocal_tonica("camello") produces "camEllo", where "E" is the stressed vowel.
Note that all the functions in the "fonetica3 library" expect words in lowercase; an uppercase vowel is reserved for marking the stressed vowel, as explained below. The following example shows how to use the T29() function in Python code to transcribe the word "canción":

    #-*- coding: utf-8 -*-
    import sys
    # Make the fonetica3 package importable from the current directory
    sys.path.append(".")
    from fonetica3.T29 import T29
    print(T29("canción"))

This code produces the following output:

    k a n . s i o_7 n

The dot marks the syllable boundary and the "_7" next to the "o" indicates that the vowel "o" is the stressed vowel.

The following code shows how to use the T66() function to transcribe the same word:

    #-*- coding: utf-8 -*-
    import sys
    # Make the fonetica3 package importable from the current directory
    sys.path.append(".")
    from fonetica3.T66 import T66
    print(T66("canción"))

This code produces the following output:

    k a n . s j O_7 n

Here too, the dot marks the syllable boundary and the "_7" next to the "O" marks the stressed vowel.

In both cases, it is obvious that the stressed vowel is the "o" because it has an accent mark or tilde (ó). If the word doesn't have an accent mark, one can use the vocal_tonica() function to determine the stressed vowel. The following code shows how to use the vocal_tonica() function to determine the stressed vowel of the word "camello":

    #-*- coding: utf-8 -*-
    import sys
    # Make the fonetica3 package importable from the current directory
    sys.path.append(".")
    from fonetica3.vocal_tonica import vocal_tonica
    print(vocal_tonica("camello"))

This code produces the following output:

    camEllo

Note that the stressed vowel is indicated by the capital letter "E". A vowel in uppercase tells both the T29() and the T66() functions that the vowel is stressed.

The following code shows what happens if the stressed vowel is not indicated when using the T29() function:

    #-*- coding: utf-8 -*-
    import sys
    # Make the fonetica3 package importable from the current directory
    sys.path.append(".")
    from fonetica3.T29 import T29
    # No accent mark and no uppercase vowel: the stress is unknown
    print(T29("camello"))

It produces:

    k a . m e . Z o

Notice that there is no "_7" indicating the stressed vowel.
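A missing stress mark like this one is easy to overlook in a long word list. As a minimal sketch, a transcription string could be checked for symbols that carry the "_7" mark. The helper name stressed_phones is hypothetical and is not part of the "fonetica3 library". The sketch also assumes that "_7" is always the last suffix of a symbol, as in the examples shown in this README.

```python
def stressed_phones(transcription):
    """Return the symbols of a Mexbet transcription that carry the "_7" stress mark.

    Hypothetical helper, not part of the "fonetica3 library". It assumes that
    symbols are separated by whitespace and that "_7" is the final suffix of a
    stressed symbol (for example "e_7" or "a_2_7").
    """
    return [phone for phone in transcription.split() if phone.endswith("_7")]


# An empty result means that no stressed vowel was marked.
print(stressed_phones("k a . m e . Z o"))
print(stressed_phones("k a . m e_7 . Z o"))
```

An empty list signals that the word should first be passed through vocal_tonica(), or written with its accent mark, before it is transcribed.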
Now let's see what happens when the stressed vowel is properly indicated:

    #-*- coding: utf-8 -*-
    import sys
    # Make the fonetica3 package importable from the current directory
    sys.path.append(".")
    from fonetica3.T29 import T29
    # The uppercase "E" marks the stressed vowel
    print(T29("camEllo"))

It produces:

    k a . m e_7 . Z o

Eureka! The "_7" is next to the "e".

The three functions analyzed here work without errors for most Spanish words, but the output of T29() and T66() has one incompatibility with the pronouncing dictionaries used in the ASR field: the syllabification. One can easily notice that the pronouncing dictionaries included in the CIEMPIESS COMPLEMENTARY Corpus do not have dots indicating the syllabification. For example, a few lines of the T66 dictionary are:

    ababuy a V a V w_7 i(
    absolutamente a V s o l u_7 t a m e_7 n_[ t e
    académicamente a k a D e_7 m i k a m e_7 n_[ t e
    accidentalmente a k s i D e n_[ t a_2_7 l m e_7 n_[ t e
    acolchonada a k O l_j tS o n a_7 D a

The following code shows how to remove the syllabification dots from a single T66 transcription of the word "camello":

    #-*- coding: utf-8 -*-
    import sys
    # Make the fonetica3 package importable from the current directory
    sys.path.append(".")
    from fonetica3.T66 import T66
    word = "camello"
    transcription = T66(word)
    # Replace every " . " syllable separator with a single space
    transcription = transcription.replace(" . "," ")
    print(transcription)

It produces:

    k a m e Z o

In this case, the stressed vowel was not indicated. To indicate it, see the following code, which takes the same word "camello" and uses the vocal_tonica() function to determine the stressed vowel. At the end, the T66() function produces a T66 transcription with the stressed vowel properly indicated and no syllabification dots:

    #-*- coding: utf-8 -*-
    import sys
    # Make the fonetica3 package importable from the current directory
    sys.path.append(".")
    from fonetica3.T66 import T66
    from fonetica3.vocal_tonica import vocal_tonica
    word = "camello"
    # Mark the stressed vowel with an uppercase letter: "camEllo"
    stressed = vocal_tonica(word)
    transcription = T66(stressed)
    # Replace every " . " syllable separator with a single space
    transcription = transcription.replace(" . "," ")
    print(transcription)

It produces:

    k a m e_7 Z o

To look for updates of the "fonetica3 library" see: http://www.ciempiess.org/downloads

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS COMPLEMENTARY Corpus has an identification key with the following format:

    CMPC_F_01_W_0001

    CMPC             F                 01        W                   0001
    Acronym for      Gender of the     Number    Type of             Number of the audio file
    "CIEMPIESS       speaker:          of the    recording:          of a particular type of
    COMPLEMENTARY"   "M" for Male      speaker   "W" for Words       recording from a
                     "F" for Female              "D" for Digits      particular speaker
                                                 "A" for Alphabet

-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their support of the social service program "Desarrollo de Tecnologías del Habla". They also thank the social service students for all their hard work.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------
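The identification key format described in this README can be decoded with a few lines of Python. The following is only a sketch: the names parse_key and KeyInfo are hypothetical and are not shipped with the corpus. The field widths (two digits for the speaker number, four digits for the file number) are an assumption inferred from the single example key CMPC_F_01_W_0001.

```python
import re
from collections import namedtuple

# Hypothetical container for the five fields of an identification key.
KeyInfo = namedtuple(
    "KeyInfo", ["corpus", "gender", "speaker", "recording_type", "file_number"]
)

# Assumed layout: CMPC_<M|F>_<2 digits>_<W|D|A>_<4 digits>
KEY_PATTERN = re.compile(r"^(CMPC)_([MF])_(\d{2})_([WDA])_(\d{4})$")

GENDERS = {"M": "male", "F": "female"}
RECORDING_TYPES = {"W": "words", "D": "digits", "A": "alphabet"}


def parse_key(key):
    """Split an identification key such as "CMPC_F_01_W_0001" into its fields."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError("not a CIEMPIESS COMPLEMENTARY key: %r" % (key,))
    corpus, gender, speaker, recording_type, file_number = match.groups()
    return KeyInfo(
        corpus,
        GENDERS[gender],
        int(speaker),
        RECORDING_TYPES[recording_type],
        int(file_number),
    )


info = parse_key("CMPC_F_01_W_0001")
print(info.gender, info.speaker, info.recording_type, info.file_number)
```

Decoding the key this way makes it straightforward to group the utterances of the transcription file by speaker, gender or type of recording without consulting any other file.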