-------------------------------------------------------------------------------------------------
CIEMPIESS COMPLEMENTARY CORPUS
Audio and Transcripts of Spanish Isolated Words.
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

The CIEMPIESS COMPLEMENTARY is a phonetically balanced corpus of isolated Spanish words spoken 
by people from Central Mexico. It was designed to solve one particular issue when training 
automatic speech recognition (ASR) systems in the Spanish of Central Mexico: after collecting 
training data, the system may complain because it does not find enough instances of one or 
more particular phonemes.

The CIEMPIESS COMPLEMENTARY Corpus was created with the voices of 10 male and 10 female
volunteers reading isolated words. The words were chosen to guarantee at least twenty
instances of every phoneme and allophone of the Mexican phonetic alphabet called Mexbet.

Mexbet is a phonetic alphabet created for the Spanish of Central Mexico. It has several levels 
of granularity, but the two levels we work with are the T29 or phonological level, with 29 
symbols, and the T66 or phonetic level, with 66 symbols. This edition of the CIEMPIESS 
COMPLEMENTARY provides documentation (written in English) on how to produce accurate phonetic 
transcriptions using Mexbet, an automatic phonetizer coded in Python 2.7, and pronouncing 
dictionaries in T29 and T66 for all the words in the corpus. It also shows the one-to-one 
equivalence between Mexbet, the International Phonetic Alphabet (IPA) and the X-SAMPA alphabet.

In conclusion, the CIEMPIESS COMPLEMENTARY Corpus is "COMPLEMENTARY" because it "complements" 
datasets when training ASR systems in the Spanish of Central Mexico.

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

CIEMPIESS COMPLEMENTARY belongs to the "CIEMPIESS" family of corpora for Speech Recognition
in Mexican Spanish. The most distinguished member and founder of this family is The CIEMPIESS 
Corpus (LDC2015S07), published in 2015 by the Linguistic Data Consortium (LDC). In 2017,
the LDC published the CIEMPIESS LIGHT Corpus (LDC2017S23), which is a revised and augmented
version of the original CIEMPIESS. We recommend combining the CIEMPIESS COMPLEMENTARY 
and the CIEMPIESS LIGHT with modern speech recognition engines such as Kaldi, SPHINX or HTK.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica 
y Servicio Social".

The CIEMPIESS COMPLEMENTARY Corpus was created at the "Universidad Nacional Autónoma de México" 
(UNAM) between 2016 and 2017 by the social service program "Desarrollo de Tecnologías del 
Habla", which belongs to the "Facultad de Ingeniería" (FI). It was designed and edited by 
Carlos Daniel Hernández Mena and recorded by Susana Alejandra Jiménez Sandoval.

Mexbet is a phonetic alphabet designed for the Spanish of Central Mexico. It was created in 2004 
by the linguist Javier Cuétara at UNAM to do experiments with another corpus in Mexican Spanish
called DIMEx100 (see http://turing.iimas.unam.mx/~luis/DIME/CORPUS-DIMEX.html), with good 
results. For this reason, Mexbet is the preferred alphabet for the whole CIEMPIESS family.

For more information and documentation see the CIEMPIESS-UNAM Project website at:

		             http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
-------------------------------------------------------------------------------------------------

The CIEMPIESS COMPLEMENTARY Corpus has the following characteristics:

- It was recorded with a Sony ICD-PX312D recorder in a moderately noisy environment, similar 
  to a medium-sized library. The recordings were originally made in MP3 format at 44.1 kHz, 
  128 kbps, stereo.

- 10 male and 10 female volunteers from Central Mexico, aged 20 to 49, contributed 26 speech 
  files each.

- It is 56 minutes long and contains 520 speech files converted to a 16 kHz, 16-bit, PCM, mono 
  format.

- Every speaker reads the digits from zero to nine (1 speech file), the alphabet with some
  common nicknames of certain letters, like "i griega" for the "y" (3 speech files, 11 letters 
  per file) and, finally, a list of 66 words (22 speech files, 3 words per file). In general, 
  every speaker reads different words, but a few words are read by two different speakers.

- The 22 lists of Spanish words that every speaker reads are designed to ensure that each of 
  the 66 phonemes and allophones of the T66 level of Mexbet occurs at least once per speaker. 
  Note that the 29 phonemes of the T29 level of Mexbet are included in the T66 level.

- Speakers in the CIEMPIESS COMPLEMENTARY are not present in any other CIEMPIESS dataset.

- Each pronouncing dictionary has 1357 entries. One dictionary shows the pronunciations 
  in Mexbet T29 and the other one shows the pronunciations in Mexbet T66.

- Alternative pronunciations in the pronouncing dictionaries are designated with a number 
  in parentheses.

- There are two files that show the phoneme frequency of the whole corpus in T29 and T66.

- There is a file with relevant data about all the speakers, such as gender, age and place
  of birth.

- The corpus includes a transcription file with unique ids and a file with the 
  relative paths to the speech files in the "speech" directory. The file id is designed 
  to provide information about the particular speech file. This id is also compatible with 
  software like the NIST SCLITE Scoring Package or the Kaldi Speech Recognition Engine.

- Full documentation of Mexbet is provided. There are files that show how to perform
  accurate transcriptions in both T29 and T66. There are also charts that show the place 
  and the manner of articulation of every phoneme and allophone of Mexbet, the equivalences 
  between the Mexbet, IPA and X-SAMPA alphabets, and the way to write the IPA symbols for 
  Mexbet in LaTeX. Note that all this documentation is written in English and has been 
  revised and updated for the present edition of the CIEMPIESS COMPLEMENTARY corpus.

- A copy of the "fonetica3 library" is provided. It is a software tool written in Python 2.7
  and based on the transcription rules of Mexbet. With this tool one can perform automatic
  transcriptions in T29 and T66, as well as syllabification, syllable counting, stressed 
  syllable deduction, accent mark deduction and so on. Both the pronouncing dictionaries and 
  the example transcriptions in the documentation of Mexbet were made with the help of the 
  "fonetica3 library".
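
The coverage goal stated in this list (every T66 symbol at least once per speaker) can be
checked with a short script. What follows is only a minimal sketch in Python 3 and is not
part of the corpus tools. The function names count_phones() and missing_phones() are
hypothetical, the lexicon is typed by hand from two lines of the T66 dictionary quoted in
this README, and the mini-inventory is invented for illustration. Stripping the "_7" stress
marker before counting is an assumption about how one may want to count.

```python
from collections import Counter

STRESS_MARK = "_7"  # stress marker used in the transcriptions of this README


def count_phones(words, lexicon, strip_stress=True):
    """Count how many times each phone occurs in a list of words."""
    counts = Counter()
    for word in words:
        for phone in lexicon[word].split():
            # Assumption: a stressed vowel counts as the same symbol as the plain one.
            if strip_stress and phone.endswith(STRESS_MARK):
                phone = phone[:-len(STRESS_MARK)]
            counts[phone] += 1
    return counts


def missing_phones(counts, inventory, minimum=1):
    """Return the phones of the inventory seen fewer than `minimum` times."""
    return sorted(p for p in inventory if counts[p] < minimum)


# Toy lexicon: two entries copied from the T66 dictionary lines quoted in this README.
lexicon = {
    "absolutamente": "a V s o l u_7 t a m e_7 n_[ t e",
    "acolchonada": "a k O l_j tS o n a_7 D a",
}
# Hypothetical mini-inventory, for illustration only (the real T66 level has 66 symbols).
inventory = ["a", "e", "k", "tS", "x"]

counts = count_phones(["absolutamente", "acolchonada"], lexicon)
print(counts["a"])
print(missing_phones(counts, inventory))
```

With the real data, the inventory would come from one of the ".phones" files and the
lexicon from the matching ".dic" file.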

-------------------------------------------------------------------------------------------------
LISTS OF WORDS
-------------------------------------------------------------------------------------------------

- Digits (1 speech file):

Every speaker reads the digits from 0 to 9:

	01. cero uno dos tres cuatro cinco seis siete ocho nueve

- Alphabet (3 speech files, 11 letters per file):

Every speaker reads the same letters and nicknames in the same order:

	01. a b "be grande" c "che" d e f g h i
	02. j k l "doble ele" m n ñ o p q r
	03. "doble erre" s t "u be" "be chica" "doble u" "doble be" x "i griega" "ye" z

- Words (22 speech files, 3 words per file):

Every speaker reads different words, but in a few cases the same word has been read by two 
speakers. The words were chosen to be as long as possible. Every set of words contains at 
least one instance of each of the 66 phonemes and allophones of the T66 level of Mexbet. The 
T29 level of Mexbet is embedded in the T66 level. An example word list is shown below:

	01. yíu kakuy triunfo
	02. xiutlaltla bolchevique quetzitecátl
	03. tepezcuitle matacaballo confrontación
	04. ultrasensorial inyectándomelos estreñimiento
	05. inyectárnoslo autocomplaciente detalladamente
	06. confusamente esdrujulizaseis transmigración
	07. esdrujulizarías barbitúricos esquizogénesis
	08. congratulación irreprochable alfabetización
	09. atropellamiento incompatibilidad resplandeciente
	10. substancialmente cuestionamiento entrecruzamiento
	11. socioeconómico geográficamente supercomputadoras
	12. epistemológicas restablecimiento intercambiarán
	13. verdaderamente independentistas racionalización
	14. desertificación acondicionamiento improvisadamente
	15. concientización magníficamente desentendimiento
	16. latinoamericano importantemente iberoamericanos
	17. esterilización consideraciones anteroposterior
	18. conscientemente descentralizado trasdosearías
	19. revolucionarios selectivamente posteriormente
	20. preestablecido psicométricas penitenciaría
	21. judeocristiano metodológicas jurídicamente
	22. instrumentista inconvenientes hamamelidácea

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_COMPLEMENTARY directory contains the following files and directories:

	- docs	: 	One can find several manuals on how to produce accurate phonetic
		  	transcriptions using the Mexbet phonetic alphabet. There is also 
			information about Mexican Spanish phonology and phonetics.

	- files	: 	One can find the transcription file, the paths file as well as the 
			pronouncing dictionaries, the phoneme frequency files and the data 
			from all the speakers.

	- software:	Here is a copy of the "fonetica3 library" software tool.

	- speech:	One can find the speech files classified by gender and also by speaker.

	- README.txt

-------------------------------------------------------------------------------------------------
THE CORPUS FILES
-------------------------------------------------------------------------------------------------

In the "files" directory one can find the following:

- CIEMPIESS_COMPLEMENTARY.transcription	: This is the transcription file in plain text format.

- CIEMPIESS_COMPLEMENTARY.paths		: This file contains the relative paths from the
					  "speech" directory to every particular speech file.

- Speaker_info.xls			: This file contains relevant information about the 
					  speakers like: gender, age, place of birth, etc.

- CIEMPIESS_COMPLEMENTARY_T29.dic	: This is a pronouncing dictionary of the whole
					  corpus in Mexbet T29.

- CIEMPIESS_COMPLEMENTARY_T66.dic 	: This is a pronouncing dictionary of the whole
					  corpus in Mexbet T66.

- CIEMPIESS_COMPLEMENTARY_T29.freq	: This file shows the number of T29 phonemes counted
					  in the whole corpus.

- CIEMPIESS_COMPLEMENTARY_T66.freq	: This file shows the number of T66 phonemes and 
					  allophones counted in the whole corpus.

- CIEMPIESS_COMPLEMENTARY_T29.phones	: This is a list of the 29 phones of the T29 Mexbet 
					  level.

- CIEMPIESS_COMPLEMENTARY_T66.phones	: This is a list of the 66 phones and allophones of 
					  the T66 Mexbet level.
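
A pronouncing dictionary of this kind can be loaded into memory with a few lines of code.
The sketch below is written for Python 3 and rests on two assumptions that should be
checked against the actual ".dic" files: that every entry is a word followed by its phones
separated by spaces, and that an alternative pronunciation carries its number right after
the word, as in "word(2)". The function name parse_dictionary() and the "demo" entries are
made up for illustration.

```python
import re
from collections import defaultdict

# Assumed entry shape: "word phone phone ..." or "word(2) phone phone ...".
# The "(2)" suffix for alternatives is an assumption, not taken from the .dic files.
ALT_SUFFIX = re.compile(r"^(?P<word>.+?)\((?P<n>\d+)\)$")


def parse_dictionary(lines):
    """Map each word to the list of its pronunciations (lists of phones)."""
    lexicon = defaultdict(list)
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        head, phones = fields[0], fields[1:]
        match = ALT_SUFFIX.match(head)
        word = match.group("word") if match else head
        lexicon[word].append(phones)
    return dict(lexicon)


sample = [
    "acolchonada a k O l_j tS o n a_7 D a",  # line quoted in this README
    "demo x1 x2 x3",                         # placeholder entry, not real Mexbet
    "demo(2) x1 x3",                         # placeholder alternative pronunciation
]
lexicon = parse_dictionary(sample)
print(len(lexicon["demo"]))
```

Since parse_dictionary() accepts any iterable of lines, an open file object can be passed
to it directly.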

-------------------------------------------------------------------------------------------------
MEXBET DOCUMENTATION
-------------------------------------------------------------------------------------------------

In the "docs" directory one can find several documents that help users understand and 
adopt the Mexbet phonetic alphabet as well as its transcription rules.

The following files contain useful charts that show the manner of articulation and the place
of articulation of the Mexbet phonemes and allophones. One chart also shows how to convert
the Mexbet symbols to the IPA and X-SAMPA alphabets and how the IPA equivalences in Mexbet
can be written in LaTeX. These charts are:

- Mexican_Spanish_Phonology_in_Mexbet_T29.pdf	: A chart with the place of articulation and
						  the manner of articulation of the Mexbet
						  phonemes of the T29 level.

- Mexbet_T66_Phonetic_Alphabet.pdf		: A chart with the place of articulation and
						  the manner of articulation of the Mexbet
						  phonemes and allophones of the T66 level.

- Equivalences_between_phonetic_alphabets.pdf	: A chart that shows the one-to-one equivalences
						  between the Mexbet, the IPA and the X-SAMPA
						  symbols. It also shows how to write the IPA
						  equivalences of Mexbet in LaTeX.

The following documents show the entire process needed to perform phonetic transcriptions
in Mexbet. First, one has to determine the stressed vowel of the word one wants to 
transcribe. Then one has to divide the word into syllables. The next step is to use the
grapheme-to-phoneme rules of Mexbet to perform a transcription in the T29 level. Finally, one
has to use the phonetic rules of Mexbet to perform a transcription in the T66 level. The 
documents that show how to implement this whole process are:

- Rules_for_Spanish_Accent_Marks.pdf		: It shows how to determine whether a word in Spanish
						  needs an accent mark. This document also gives
						  cues on how to identify the stressed syllable
						  if the accent mark is not present.

- Syllabification_Rules_in_Spanish.pdf		: It shows the rules for dividing a Spanish
						  word into syllables.

- Mexbet_T29_Grapheme-to-Phoneme_Rules.pdf	: It shows the grapheme-to-phoneme rules to 
						  perform transcriptions in the T29 level of
						  Mexbet.

- Mexbet_T66_Phonetic_Rules.pdf			: It shows the phonetic rules to perform 
						  transcriptions in the T66 level of Mexbet.
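
As an illustration of the first step, the general orthographic stress rule of Spanish can
be sketched in a few lines. This is only a sketch in Python 3: it applies the textbook rule
(a written accent wins; otherwise words ending in a vowel, "n" or "s" are stressed on the
penultimate syllable and all other words on the last one) and is not claimed to reproduce
the exact rules of the PDF documents or of the "fonetica3 library". The function name
stressed_syllable() is hypothetical, and the input is assumed to be a lowercase word that
has already been divided into syllables.

```python
ACCENTED = set("áéíóú")


def stressed_syllable(syllables):
    """Return the index of the stressed syllable of a syllabified word."""
    # A written accent mark always wins.
    for index, syllable in enumerate(syllables):
        if ACCENTED & set(syllable):
            return index
    # A monosyllable has only one candidate.
    if len(syllables) == 1:
        return 0
    # Words ending in a vowel, "n" or "s" are stressed on the penultimate
    # syllable; any other ending puts the stress on the last syllable.
    if syllables[-1][-1] in "aeiouns":
        return len(syllables) - 2
    return len(syllables) - 1


print(stressed_syllable(["ca", "me", "llo"]))
print(stressed_syllable(["can", "ción"]))
print(stressed_syllable(["re", "loj"]))
```

All three calls select the second syllable (index 1): "me" in "camello", "ción" in
"canción" and "loj" in "reloj".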

-------------------------------------------------------------------------------------------------
THE "fonetica3 library" SOFTWARE TOOL
-------------------------------------------------------------------------------------------------

The "fonetica3 library" is a software tool written in Python 2.7 that is based on the same 
rules shown in the "docs" directory. This means that users can perform several phonetic
tasks with it, like syllabification or phonetic transcription. The "fonetica3 library" comes 
with a very detailed README file that describes all its functionality. 

What is worth highlighting in this section is how to use the "fonetica3 library" to create 
a pronouncing dictionary in both T29 and T66 with the stressed vowel properly indicated. 
This task is very important in the ASR field, and we recommend using the library to create 
a pronouncing dictionary for the CIEMPIESS LIGHT Corpus and for all future members of the 
CIEMPIESS family.

The three functions analyzed in this section are: 

- T29()		 : Performs transcriptions in Mexbet T29 
- T66()		 : Performs transcriptions in Mexbet T66
- vocal_tonica() : Marks the stressed vowel of the word passed as argument by writing it 
		   as a capital letter. For example: vocal_tonica("camello") produces 
		   "camEllo", where "E" is the stressed vowel.

Note that all the functions in the "fonetica3 library" expect words in lowercase.

The following example shows how to use the T29() function in Python code to transcribe
the word "canción":

	# -*- coding: utf-8 -*-
	import sys
	# Make the fonetica3 package in the current directory importable.
	sys.path.append(".")
	from fonetica3.T29 import T29
	print(T29("canción"))

This code produces the following output:

	k a n . s i o_7 n

The dot marks the syllable boundary and the "_7" next to the "o" indicates that the 
vowel "o" is the stressed vowel. The following code shows how to use the T66()
function to transcribe the same word:

	# -*- coding: utf-8 -*-
	import sys
	# Make the fonetica3 package in the current directory importable.
	sys.path.append(".")
	from fonetica3.T66 import T66
	print(T66("canción"))

This code produces the following output:

	k a n . s j O_7 n

Here too, the dot marks the syllable boundary and the "_7" next to the "O" marks the 
stressed vowel.

In both cases, it is obvious that the stressed vowel is the "o" because it carries 
a written accent mark (ó). If the word doesn't have an accent mark, one can
use the vocal_tonica() function to determine which vowel is stressed.

The following code shows how to use the vocal_tonica() function to determine
which is the stressed vowel of the word "camello":

	# -*- coding: utf-8 -*-
	import sys
	# Make the fonetica3 package in the current directory importable.
	sys.path.append(".")
	from fonetica3.vocal_tonica import vocal_tonica
	print(vocal_tonica("camello"))

This code produces the following output:

	camEllo

Note that the stressed vowel is indicated by the capital letter "E".

A vowel in uppercase indicates both to the T29() and to the T66() functions
that the vowel is stressed. The following code shows what would happen
if the stressed vowel is not indicated when using the T29() function:


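	# -*- coding: utf-8 -*-
	import sys
	# Make the fonetica3 package in the current directory importable.
	sys.path.append(".")
	from fonetica3.T29 import T29
	# The word is passed without any capital letter marking the stress.
	print(T29("camello"))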

It produces:

	k a . m e . Z o

Notice that there is no "_7" indicating the stressed vowel. Now let's see
what happens when the stressed vowel is properly indicated:

	# -*- coding: utf-8 -*-
	import sys
	# Make the fonetica3 package in the current directory importable.
	sys.path.append(".")
	from fonetica3.T29 import T29
	# The capital "E" tells T29() which vowel is stressed.
	print(T29("camEllo"))

It produces:

	k a . m e_7 . Z o

Eureka! The "_7" is next to the "e".
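
The output format seen in these examples (phones separated by spaces, syllables separated
by " . " and stress marked with "_7") is easy to post-process. The following sketch is not
part of the "fonetica3 library"; the function names split_transcription() and
stressed_syllables() are hypothetical, and the only thing assumed about the format is what
the outputs above show. It runs under Python 2.7 and Python 3.

```python
def split_transcription(transcription):
    """Split a T29/T66 string into syllables, each a list of phones."""
    return [syllable.split() for syllable in transcription.split(" . ")]


def stressed_syllables(syllables):
    """Return the indices of the syllables that carry a "_7" stress mark."""
    return [
        index
        for index, phones in enumerate(syllables)
        if any(phone.endswith("_7") for phone in phones)
    ]


syllables = split_transcription("k a . m e_7 . Z o")
print(len(syllables))                  # number of syllables
print(stressed_syllables(syllables))   # positions of the stressed syllables
```

A list is returned rather than a single index because some dictionary entries in the
corpus, such as the one for "absolutamente", carry more than one "_7" mark.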

The three functions analyzed above work without errors for most Spanish words, but the
output of T29() and T66() is incompatible with the pronouncing dictionaries used in the ASR
field because of the syllabification dots. One can easily notice that the pronouncing 
dictionaries included in the CIEMPIESS COMPLEMENTARY Corpus do not have dots indicating the 
syllabification. For example, a few lines of the T66 dictionary are:

	ababuy a V a V w_7 i(
	absolutamente a V s o l u_7 t a m e_7 n_[ t e
	académicamente a k a D e_7 m i k a m e_7 n_[ t e
	accidentalmente a k s i D e n_[ t a_2_7 l m e_7 n_[ t e
	acolchonada a k O l_j tS o n a_7 D a

The following code shows how to eliminate the syllabification dots of a single 
transcription in T66 of the word "camello":

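	# -*- coding: utf-8 -*-
	import sys
	# Make the fonetica3 package in the current directory importable.
	sys.path.append(".")
	from fonetica3.T66 import T66
	word = "camello"
	transcription = T66(word)
	# Remove the syllabification dots together with their surrounding spaces.
	transcription = transcription.replace(" . "," ")
	print(transcription)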

It produces:

	k a m e Z o

In this case, the stressed vowel was not indicated. To indicate it, see the following
code, which takes the same word "camello" and uses the vocal_tonica() function to determine 
the stressed vowel. At the end, the T66() function produces a transcription in T66 with the 
stressed vowel properly indicated and no syllabification dots.

	# -*- coding: utf-8 -*-
	import sys
	# Make the fonetica3 package in the current directory importable.
	sys.path.append(".")
	from fonetica3.T66 import T66
	from fonetica3.vocal_tonica import vocal_tonica
	word = "camello"
	# Mark the stressed vowel with a capital letter: "camEllo".
	stressed = vocal_tonica(word)
	transcription = T66(stressed)
	# Remove the syllabification dots together with their surrounding spaces.
	transcription = transcription.replace(" . "," ")
	print(transcription)

It produces:

	k a m e_7 Z o
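
The same steps can be wrapped in a loop to produce dictionary lines for a whole word list.
The sketch below is an illustration, not a function of the "fonetica3 library":
build_dictionary() is a hypothetical name, and the two stand-in functions only replay the
outputs documented in this README for "camello" so that the example runs without the
library. With the library available, T66 (or T29) and vocal_tonica would be passed instead.
Words are assumed to be already in lowercase.

```python
def build_dictionary(words, transcribe, mark_stress):
    """Return sorted "word phones" lines without syllabification dots."""
    lines = []
    for word in sorted(set(words)):          # one entry per distinct word
        stressed = mark_stress(word)         # e.g. "camello" -> "camEllo"
        phones = transcribe(stressed)        # e.g. "k a . m e_7 . Z o"
        phones = phones.replace(" . ", " ")  # drop the syllable boundaries
        lines.append(word + " " + phones)
    return lines


# Stand-ins that replay the outputs documented in this README for "camello".
def fake_vocal_tonica(word):
    return {"camello": "camEllo"}[word]


def fake_T66(word):
    return {"camEllo": "k a . m e_7 . Z o"}[word]


for line in build_dictionary(["camello", "camello"], fake_T66, fake_vocal_tonica):
    print(line)
```

Passing the two functions as arguments keeps the loop independent of the transcription
level, so the same code serves for a T29 and a T66 dictionary.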

To look for updates of the "fonetica3 library" see:

                              http://www.ciempiess.org/downloads

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS COMPLEMENTARY Corpus has an identification key with the 
following format:

                                 CMPC_F_01_W_0001

	CMPC               F            01              W                   0001
      Acronym          Gender of      Number        Type of          Number of the audio
        for           the Speaker:      of         recording:        file of a particular
     "CIEMPIESS       "M" for Male    Speaker    "W" for Words       type of recording from
    COMPLEMENTARY"    "F" for Female             "D" for Digits      a particular speaker
                                                 "A" for Alphabet

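Because the fields of the key are separated by underscores, a key can be decoded with a
simple split. The following is a minimal sketch that runs under Python 2.7 and Python 3.
The names SpeechId and parse_speech_id() are hypothetical, and the field labels are taken
from the chart above.

```python
from collections import namedtuple

SpeechId = namedtuple("SpeechId", "corpus gender speaker kind number")

GENDERS = {"M": "male", "F": "female"}
KINDS = {"W": "words", "D": "digits", "A": "alphabet"}


def parse_speech_id(key):
    """Split a key such as "CMPC_F_01_W_0001" into its five fields."""
    # Unpacking raises ValueError when the key does not have five fields.
    corpus, gender, speaker, kind, number = key.split("_")
    if corpus != "CMPC" or gender not in GENDERS or kind not in KINDS:
        raise ValueError("not a CIEMPIESS COMPLEMENTARY key: %r" % key)
    return SpeechId(corpus, GENDERS[gender], int(speaker), KINDS[kind], int(number))


print(parse_speech_id("CMPC_F_01_W_0001"))
```
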
-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their 
support of the social service program "Desarrollo de Tecnologías del Habla". They also thank 
the social service students for all their hard work.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------