-------------------------------------------------------------------------------------------------
CIEMPIESS Experimentation Package
Corpus and Tools to perform Speech Recognition Experiments in Mexican Spanish
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

The CIEMPIESS Experimentation Package is a set of three different corpora designed to solve 
specific problems when experimenting with speech recognition systems:

CIEMPIESS COMPLEMENTARY. This is a phonetically balanced corpus of isolated Spanish words 
spoken by people of Central Mexico. It was designed to compensate for the lack of instances of
any given phoneme in Mexican Spanish. The CIEMPIESS COMPLEMENTARY provides documentation (written in 
English) for learning how to produce accurate phonetic transcriptions in Mexican Spanish, and it 
also provides an automatic phonetizer coded in Python 2.7 to create pronouncing dictionaries.

CIEMPIESS FEM. This is a corpus created from recordings of 21 different women. The motivation for 
the CIEMPIESS FEM is that we have noticed a lack of female speakers in the sources from which we 
traditionally take audio to create new CIEMPIESS datasets. So, this corpus was designed to 
balance (in gender) up to 14 hours of male speaker recordings.

CIEMPIESS TEST. This corpus was created in response to the need for a standard test 
set to measure the progress of the community of users of the CIEMPIESS datasets.

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

The CIEMPIESS Experimentation Package belongs to the "CIEMPIESS" family of corpora for Speech 
Recognition in Mexican Spanish. The most distinguished member and founder of this family is 
The CIEMPIESS Corpus (LDC2015S07), published in 2015 by the Linguistic Data Consortium (LDC). In 
2017, the LDC published the CIEMPIESS LIGHT Corpus (LDC2017S23), which is a revised and 
augmented version of the original CIEMPIESS. We recommend the combination of the CIEMPIESS LIGHT, 
the CIEMPIESS BALANCE and the CIEMPIESS Experimentation Package to perform experiments with 
modern speech recognition engines such as Kaldi, SPHINX or HTK.

The CIEMPIESS Experimentation Package was created by the social service program "Desarrollo de 
Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) in the "Universidad Nacional 
Autónoma de México" (UNAM) between 2016 and 2018 by Carlos Daniel Hernández Mena, head of the 
program.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica 
y Servicio Social".

Most of the recordings that constitute the CIEMPIESS TEST and the CIEMPIESS FEM datasets, 
included in the CIEMPIESS Experimentation Package, come from "RADIO-IUS" 
(http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio station that belongs 
to UNAM. They were donated by Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas 
Arevalo from the "Facultad de Derecho de la UNAM".

Other recordings were taken from the YouTube channels: 

	- "IUS Canal Multimedia" 
	https://www.youtube.com/user/DEDUNAM/videos

	- "Centro Universitario de Estudios Jurídicos (CUEJ UNAM)" 
	https://www.youtube.com/channel/UCTxkzdUd0tiXT5BN5o6Xo-A/videos

The CIEMPIESS COMPLEMENTARY Corpus, included in the CIEMPIESS Experimentation Package, 
consists of read speech. It was designed and edited by Carlos Daniel Hernández Mena and 
recorded by Susana Alejandra Jiménez Sandoval.

For more information and documentation see the CIEMPIESS-UNAM Project website at:

		             http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
MORE DETAILS ABOUT THE DATASETS
-------------------------------------------------------------------------------------------------

This section provides a more detailed explanation of the datasets included in the 
CIEMPIESS Experimentation Package. For full documentation see the "README" file that comes 
with every dataset.

- CIEMPIESS COMPLEMENTARY (1 hour)

The CIEMPIESS COMPLEMENTARY is a phonetically balanced corpus of isolated Spanish words spoken 
by people of Central Mexico. It was designed to solve one particular issue when training 
automatic speech recognition (ASR) systems in the Spanish of Central Mexico. This problem 
appears when someone collects some training data, but then the system complains because it 
does not find enough instances of one or more particular phonemes.

The CIEMPIESS COMPLEMENTARY Corpus was created with the voices of 10 male and 10 female
volunteers reading isolated words. The words were chosen to guarantee users at least
twenty instances of every single phoneme and allophone of the Mexican phonetic alphabet
called Mexbet.
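
The coverage requirement above can be checked on any set of transcriptions. The following
is only a minimal sketch and is not part of the corpus tools: the names count_phones() and
underrepresented() are hypothetical, and it assumes space-separated Mexbet transcriptions in 
which "." marks a syllable boundary and a trailing "_7" marks the stressed vowel.

```python
from collections import Counter

STRESS_MARK = "_7"  # stress suffix, assumed from the transcriptions in this README


def count_phones(transcriptions, ignore_stress=True):
    """Count how often each symbol occurs in space-separated transcriptions."""
    counts = Counter()
    for transcription in transcriptions:
        for symbol in transcription.split():
            if symbol == ".":  # syllable boundary, not a phone
                continue
            if ignore_stress and symbol.endswith(STRESS_MARK):
                # Count stressed and unstressed variants as the same symbol.
                symbol = symbol[:-len(STRESS_MARK)]
            counts[symbol] += 1
    return counts


def underrepresented(counts, inventory, minimum=20):
    """Return the symbols of `inventory` that occur fewer than `minimum` times."""
    return sorted(symbol for symbol in inventory if counts[symbol] < minimum)
```

Here "inventory" would be the list of symbols of the chosen Mexbet level, which the user has
to supply. Each transcription should appear once per recorded utterance so that the counts
reflect the audio actually available.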

Mexbet is a phonetic alphabet created for the Spanish of Central Mexico. It has several levels 
of granularity but the two levels we work with are: the T29 or phonological level with 29 symbols, 
and the T66 or phonetic level with 66 symbols. In this edition of the CIEMPIESS COMPLEMENTARY 
we provide documentation (written in English) for learning how to produce accurate phonetic 
transcriptions using Mexbet, we provide an automatic phonetizer coded in Python 2.7, we provide 
pronouncing dictionaries in T29 and T66 for all the words in the corpus and we also show the 
one-to-one equivalence between Mexbet, the International Phonetic Alphabet (IPA) and the X-SAMPA 
alphabet.

In conclusion, the CIEMPIESS COMPLEMENTARY Corpus is "COMPLEMENTARY" because it "complements" 
datasets when training ASR systems in the Spanish of Central Mexico.

- CIEMPIESS FEM (14 hours)

Since the publication of the CIEMPIESS Corpus (LDC2015S07) in 2015 we have noticed that there
is a lack of female speakers in the sources from which we traditionally take audio to create new 
CIEMPIESS datasets. That is why we decided to create a corpus that helps to balance future
gender-unbalanced datasets.

The CIEMPIESS FEM Corpus was created from recordings and human transcripts of 21 different 
women. 16 of these women are Mexican. The others come from other Latin American countries.

The CIEMPIESS FEM Corpus is considered a CIEMPIESS dataset because it only contains audio
from the same source as the first CIEMPIESS Corpus, and it is "FEM" because it only 
contains recordings of female speakers.

- CIEMPIESS TEST (8 hours)

When developing automatic speech recognition engines, or any other machine learning system, it is
good practice to keep the test data separate from the training data and never combine them. So, the 
CIEMPIESS TEST Corpus was created to meet the need for a standard test set to measure 
the progress of the community of users of the CIEMPIESS datasets, and we strongly 
recommend not using the CIEMPIESS TEST for any other purpose.

The CIEMPIESS TEST Corpus is a gender-balanced corpus designed to test acoustic models for the
speech recognition task. It was created from recordings and human transcripts of 10 male and 10
female speakers.

The CIEMPIESS TEST Corpus is considered a CIEMPIESS dataset because it only contains audio
from the same source as the first CIEMPIESS Corpus, and it has the word "TEST" in its name 
because it is recommended for test purposes only.

-------------------------------------------------------------------------------------------------
MEXBET DOCUMENTATION
-------------------------------------------------------------------------------------------------

Mexbet is a phonetic alphabet created for the Spanish of Central Mexico. It has several levels 
of granularity but the two levels we work with are: the T29 or phonological level with 29 
symbols, and the T66 or phonetic level with 66 symbols. In this edition of the CIEMPIESS 
Experimentation Package we provide documentation (written in English) for learning how to 
produce accurate phonetic transcriptions using Mexbet, we provide an automatic phonetizer coded 
in Python 2.7, we provide pronouncing dictionaries in T29 and T66 for all the words in the 
CIEMPIESS COMPLEMENTARY corpus and we also show the one-to-one equivalence between Mexbet, the 
International Phonetic Alphabet (IPA) and the X-SAMPA alphabet.

In the "docs" directory one can find several documents that lead users to understand and to 
adopt the Mexbet phonetic alphabet as well as its transcription rules.

The following files contain useful charts that show the manner of articulation and the place
of articulation of the Mexbet phonemes and allophones. One chart also shows how to convert
the Mexbet symbols to the IPA and X-SAMPA alphabets and how the IPA equivalences in Mexbet
can be written in LaTeX. These charts are:

- Mexican_Spanish_Phonology_in_Mexbet_T29.pdf	: A chart with the place of articulation and
						  the manner of articulation of the Mexbet
						  phonemes of the T29 level.

- Mexbet_T66_Phonetic_Alphabet.pdf		: A chart with the place of articulation and
						  the manner of articulation of the Mexbet
						  phonemes and allophones of the T66 level.

- Equivalences_between_phonetic_alphabets.pdf	: A chart that shows the one-to-one equivalences
						  between the Mexbet, the IPA and the X-SAMPA
						  symbols. It also shows how to write the IPA
						  equivalences of Mexbet in LaTeX.

The following documents show the entire process that is needed to perform phonetic transcriptions
in Mexbet. First, one has to determine which is the stressed vowel of the word one wants
to transcribe. Then one has to divide the word into syllables. The next step is to use the
grapheme-to-phoneme rules of Mexbet to perform a transcription in the T29 level. Finally, one
has to use the phonetic rules of Mexbet to perform a transcription in the T66 level. The 
documents that show how to implement this whole process are:

- Rules_for_Spanish_Accent_Marks.pdf		: It shows how to determine whether a word in
						  Spanish needs an accent mark. This document
						  also gives cues on how to identify the stressed
						  syllable if the accent mark is not present.

- Syllabification_Rules_in_Spanish.pdf		: It shows the rules of how to divide a Spanish
						  word into syllables.

- Mexbet_T29_Grapheme-to-Phoneme_Rules.pdf	: It shows the grapheme-to-phoneme rules to 
						  perform transcriptions in the T29 level of
						  Mexbet.

- Mexbet_T66_Phonetic_Rules.pdf			: It shows the phonetic rules to perform 
						  transcriptions in the T66 level of Mexbet.
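
The first step of the process above (locating the stressed syllable) can be illustrated
with the general orthographic rule of Spanish. The following is only a minimal sketch, not
the implementation used by the package: the function name stressed_syllable_index() is
hypothetical, the word is assumed to be lowercase and already divided into syllables, and
special cases such as adverbs ending in "-mente" (which carry two stresses) are not handled.
The PDF documents listed above remain the reference for the complete rules.

```python
# -*- coding: utf-8 -*-
from __future__ import unicode_literals

ACCENTED_VOWELS = "áéíóú"
VOWELS = "aeiou"


def stressed_syllable_index(syllables):
    """Return the index of the stressed syllable of a lowercase Spanish word.

    `syllables` is the word already divided into syllables,
    for example ["ca", "me", "llo"].
    """
    if not syllables:
        raise ValueError("at least one syllable is required")
    # A written accent mark always identifies the stressed syllable.
    for index, syllable in enumerate(syllables):
        if any(letter in ACCENTED_VOWELS for letter in syllable):
            return index
    if len(syllables) == 1:
        return 0
    # Without an accent mark: words ending in a vowel, "n" or "s" are
    # stressed on the penultimate syllable, all other words on the last one.
    last_letter = syllables[-1][-1]
    if last_letter in VOWELS + "ns":
        return len(syllables) - 2
    return len(syllables) - 1
```

For the word "camello", divided as ["ca", "me", "llo"], the function returns the index of
"me", which agrees with the stressed vowel "e" reported for this word elsewhere in this file.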

-------------------------------------------------------------------------------------------------
THE "fonetica3 library" SOFTWARE TOOL
-------------------------------------------------------------------------------------------------

The "fonetica3 library" is a software tool written in Python 2.7 that is based on the same 
rules shown in the "docs" directory. This means that users can perform several phonetic
tasks, such as syllabification or phonetic transcription. The "fonetica3 library" comes with
a very detailed README file that describes all its functionality. 

What is worth highlighting in this section is how to use the "fonetica3 
library" to create a pronouncing dictionary in both T29 and T66 with the stressed vowel
properly indicated. This task is very important in the ASR field, and we recommend using the 
library to create a pronouncing dictionary for the CIEMPIESS LIGHT Corpus and for all future 
members of the CIEMPIESS family.

The three functions analyzed in this section are: 


- T29()		 : Performs transcriptions in Mexbet T29 
- T66()		 : Performs transcriptions in Mexbet T66
- vocal_tonica() : Indicates the stressed vowel of the word passed as its argument by using
		   a capital letter. For example: vocal_tonica("camello") produces "camEllo",
		   where "E" is the stressed vowel.

Note that all the functions in the "fonetica3 library" expect words in lowercase (apart from
the capital vowel that marks the stress, described later in this section).

The following example shows how to use the T29() function in a python code to transcribe
the word "canción":

	#-*- coding: utf-8 -*-
	import sys
	# Make the current directory importable (fonetica3 is expected to be there)
	sys.path.append(".")
	from fonetica3.T29 import T29
	print(T29("canción"))

This code produces the following output:

	k a n . s i o_7 n

The dot indicates the syllabification and the "_7" next to the "o" indicates that the 
vowel "o" is the stressed vowel. Now the following code shows how to use the T66()
function to transcribe the same word:

	#-*- coding: utf-8 -*-
	import sys
	sys.path.append(".")
	from fonetica3.T66 import T66
	print(T66("canción"))

This code produces the following output:

	k a n . s j O_7 n

One can also see the syllabification indicated by the dot and the stressed vowel 
indicated by the "_7" next to the "O".
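
The two markers described above (the dot and the "_7") make the output easy to process
with ordinary string operations. The following is only a minimal sketch with hypothetical
helper names (split_syllables() and stressed_syllables() are not part of the "fonetica3
library"); it assumes that symbols are separated by single spaces, as in the outputs shown
above.

```python
SYLLABLE_SEPARATOR = "."
STRESS_MARK = "_7"


def split_syllables(transcription):
    """Split a dotted transcription into a list of syllables (lists of symbols)."""
    syllables = [[]]
    for symbol in transcription.split():
        if symbol == SYLLABLE_SEPARATOR:
            syllables.append([])  # a dot starts a new syllable
        else:
            syllables[-1].append(symbol)
    return syllables


def stressed_syllables(transcription):
    """Return the indices of the syllables that contain a stressed symbol."""
    return [
        index
        for index, syllable in enumerate(split_syllables(transcription))
        if any(symbol.endswith(STRESS_MARK) for symbol in syllable)
    ]
```

Applied to the T29 output "k a n . s i o_7 n", split_syllables() separates the symbols of
"can" from those of "ción", and stressed_syllables() reports the second syllable (index 1).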

In both cases, it is obvious that the stressed vowel is the "o" because it has 
an accent mark or tilde (ó). If the word doesn't have an accent mark, one can
use the vocal_tonica() function to determine which is the stressed vowel.

The following code shows how to use the vocal_tonica() function to determine
which is the stressed vowel of the word "camello":

	#-*- coding: utf-8 -*-
	import sys
	sys.path.append(".")
	from fonetica3.vocal_tonica import vocal_tonica
	print(vocal_tonica("camello"))

This code produces the following output:

	camEllo

Note that the stressed vowel is indicated by the capital letter "E".

A vowel in uppercase indicates both to the T29() and to the T66() functions
that the vowel is stressed. The following code shows what would happen
if the stressed vowel is not indicated when using the T29() function:


	#-*- coding: utf-8 -*-
	import sys
	sys.path.append(".")
	from fonetica3.T29 import T29
	# All lowercase: the stressed vowel is NOT indicated
	print(T29("camello"))

It produces:

	k a . m e . Z o

Notice that there is no "_7" indicating the stressed vowel. Now let's see
what happens when the stressed vowel is properly indicated:

	#-*- coding: utf-8 -*-
	import sys
	sys.path.append(".")
	from fonetica3.T29 import T29
	# The capital "E" marks the stressed vowel
	print(T29("camEllo"))

It produces:

	k a . m e_7 . Z o

Eureka! The "_7" is next to the "e".

The three functions analyzed work without errors for most Spanish words,
but the output of the functions T29() and T66() is incompatible with the pronouncing
dictionaries used in the ASR field because of the syllabification. One
can easily notice that the pronouncing dictionaries included in the CIEMPIESS COMPLEMENTARY
Corpus do not have dots indicating the syllabification. For example, a few
lines of the T66 dictionary are:

	ababuy a V a V w_7 i(
	absolutamente a V s o l u_7 t a m e_7 n_[ t e
	académicamente a k a D e_7 m i k a m e_7 n_[ t e
	accidentalmente a k s i D e n_[ t a_2_7 l m e_7 n_[ t e
	acolchonada a k O l_j tS o n a_7 D a
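
Lines with this layout can be read with a few string operations. The following is only a
minimal sketch with hypothetical function names; it assumes that every line holds one word
followed by its transcription, separated by whitespace, as in the excerpt above. It keeps
only one pronunciation per word, and reading the file with the right text encoding is left
to the caller.

```python
def parse_dictionary_line(line):
    """Split 'word symbol symbol ...' into (word, [symbols])."""
    fields = line.split()
    if len(fields) < 2:
        raise ValueError("expected a word followed by its transcription")
    return fields[0], fields[1:]


def load_dictionary(lines):
    """Build a {word: [symbols]} mapping, skipping blank lines."""
    entries = {}
    for line in lines:
        if line.strip():
            word, symbols = parse_dictionary_line(line)
            entries[word] = symbols  # a repeated word overwrites the earlier entry
    return entries
```

Any iterable of lines can be passed to load_dictionary(), for example an open file object
or a list of strings.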

The following code shows how to eliminate the syllabification dots of a single 
transcription in T66 of the word "camello":

	#-*- coding: utf-8 -*-
	import sys
	sys.path.append(".")
	from fonetica3.T66 import T66
	word = "camello"
	transcription = T66(word)
	# Remove the syllabification dots (" . ") from the transcription
	transcription = transcription.replace(" . "," ")
	print(transcription)

It produces:

	k a m e Z o

In this case, the stressed vowel was not indicated. To do so, see the following
code, which takes the same word "camello" and uses the vocal_tonica() function
to determine the stressed vowel. At the end, the T66() function produces a 
transcription in T66 with the stressed vowel indicated and no syllabification dots.

	#-*- coding: utf-8 -*-
	import sys
	sys.path.append(".")
	from fonetica3.T66 import T66
	from fonetica3.vocal_tonica import vocal_tonica
	word = "camello"
	# Mark the stressed vowel with a capital letter ("camEllo")
	stressed = vocal_tonica(word)
	transcription = T66(stressed)
	# Remove the syllabification dots (" . ") from the transcription
	transcription = transcription.replace(" . "," ")
	print(transcription)

It produces:

	k a m e_7 Z o
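
The same steps can be applied to a whole word list to obtain dictionary lines. The
following is only a minimal sketch under stated assumptions: build_dictionary() and
strip_syllable_dots() are hypothetical names that are not part of the "fonetica3 library",
the words are assumed to be already in lowercase, and the transcription and stress functions
are passed in as arguments so that the sketch does not depend on how the library is installed.

```python
def strip_syllable_dots(transcription):
    """Remove the syllabification dots from a space-separated transcription."""
    return " ".join(symbol for symbol in transcription.split() if symbol != ".")


def build_dictionary(words, transcribe, mark_stress):
    """Return sorted 'word transcription' lines, one per distinct word.

    `transcribe` and `mark_stress` are callables that behave as this file
    describes for T66() (or T29()) and vocal_tonica() respectively.
    """
    lines = []
    for word in sorted(set(words)):
        # Mark the stressed vowel first, then transcribe the marked word.
        transcription = transcribe(mark_stress(word))
        lines.append(word + " " + strip_syllable_dots(transcription))
    return lines
```

With the library importable as in the examples above, a call such as 
build_dictionary(words, T66, vocal_tonica) would be the intended use. Words that the
functions cannot handle should be reviewed by hand.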

To check for updates to the "fonetica3 library" see:

                              http://www.ciempiess.org/downloads

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS Experimentation Package directory contains the following files and directories:

	- data	:	Contains each of the three corpora that make up the CIEMPIESS
			Experimentation Package.

	- docs	:	Contains several manuals on how to produce accurate phonetic
			transcriptions using the Mexbet phonetic alphabet. There is also
			information about Mexican Spanish phonology and phonetics.

	- tools	:	Contains a copy of the "fonetica3 library" software tool.


-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their 
support of the social service program "Desarrollo de Tecnologías del Habla". They also thank 
the social service students for all their hard work.

Thanks to Susana Alejandra Jiménez Sandoval from the "Facultad de Filosofía y Letras de la UNAM" 
for recording the utterances of the CIEMPIESS COMPLEMENTARY Corpus.

Special thanks to Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the 
"Facultad de Derecho de la UNAM" for donating most of the recordings that constitute the
CIEMPIESS TEST and the CIEMPIESS FEM datasets.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------