-------------------------------------------------------------------------------------------------
CIEMPIESS BALANCE CORPUS
Audio and Transcripts of Mexican Spanish Broadcast Conversations.
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

The CIEMPIESS BALANCE Corpus is designed to match the CIEMPIESS LIGHT Corpus (LDC2017S23).

CIEMPIESS BALANCE is called "balance" because it is designed to balance the CIEMPIESS LIGHT:
if the CIEMPIESS BALANCE is combined with the CIEMPIESS LIGHT, the result is a gender
balanced corpus. To appreciate this, one needs to know that the CIEMPIESS LIGHT is, by itself,
a gender unbalanced corpus with approximately 25% female speakers and 75% male speakers.
The CIEMPIESS BALANCE is therefore unbalanced in the opposite direction, with approximately
25% male speakers and 75% female speakers.

Furthermore, the match between the two datasets goes deeper than the number of speakers.
In both corpora, speakers are numbered F_01, M_01, F_02, M_02, etc. The speakers are related
as follows: the speech of F_01 in the CIEMPIESS LIGHT has approximately the same duration as
the speech of M_01 in the CIEMPIESS BALANCE.

As a consequence of this speaker-to-speaker match, the CIEMPIESS BALANCE has a size of 18
hours and 20 minutes, against the 18 hours and 25 minutes of the CIEMPIESS LIGHT. The two
corpora are a very good match!
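
As a quick check of how close the two sizes are, the durations quoted above can be compared
in a few lines of Python. This is only an illustrative sketch; the constant and function
names are hypothetical and are not part of the corpus distribution.

```python
# Durations quoted in this README, converted to minutes.
CB_MINUTES = 18 * 60 + 20   # CIEMPIESS BALANCE: 18 h 20 min
CL_MINUTES = 18 * 60 + 25   # CIEMPIESS LIGHT:   18 h 25 min


def relative_difference(a, b):
    """Absolute difference between a and b as a fraction of the larger value."""
    return abs(a - b) / max(a, b)


diff_minutes = CL_MINUTES - CB_MINUTES
diff_percent = 100 * relative_difference(CB_MINUTES, CL_MINUTES)
print(f"Difference: {diff_minutes} minutes ({diff_percent:.2f}% of the larger corpus)")
```

The difference is 5 minutes, which is less than half a percent of the larger corpus.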

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

CIEMPIESS BALANCE belongs to the "CIEMPIESS" family of corpora for speech recognition in
Mexican Spanish. The most distinguished member and founder of this family is the CIEMPIESS
Corpus (LDC2015S07), published in 2015 by the Linguistic Data Consortium (LDC).

The CIEMPIESS BALANCE Corpus was created by the social service program "Desarrollo de Tecnologías
del Habla" of the "Facultad de Ingeniería" (FI) in the "Universidad Nacional Autónoma de México" 
(UNAM) between 2016 and 2018 by Carlos Daniel Hernández Mena, head of the program.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica 
y Servicio Social".

The CIEMPIESS BALANCE is a radio corpus designed to create acoustic models for automatic
speech recognition. It is made up of recordings of spontaneous conversations in Mexican
Spanish between a radio moderator and his guests.

Most of the recordings that constitute the CIEMPIESS BALANCE come from "RADIO-IUS"
(http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio station that belongs to
UNAM.

Other recordings, needed to match specific speakers accurately, were taken from the
following YouTube channels:

	- "IUS Canal Multimedia" 
	https://www.youtube.com/user/DEDUNAM/videos

	- "Centro Universitario de Estudios Jurídicos (CUEJ UNAM)" 
	https://www.youtube.com/channel/UCTxkzdUd0tiXT5BN5o6Xo-A/videos

For more information and documentation see the CIEMPIESS-UNAM Project website at:

		             http://www.ciempiess.org/


-------------------------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
-------------------------------------------------------------------------------------------------

The CIEMPIESS BALANCE Corpus has the following characteristics:

- The organization of the directories in the CIEMPIESS BALANCE (CB) is the same as in the
  CIEMPIESS LIGHT (CL) Corpus.

- The CL is slightly bigger (18 hours and 25 minutes) than the CB (18 hours and 20 minutes).

- The CB is made up of 8555 audio files with transcripts. 2447 of those files (28.6%)
  come from male speakers and 6108 files (71.4%) come from female speakers.

- The CB contains 34 male speakers and 53 female speakers. The recordings from female
  speakers add up to 12 hours and 40 minutes, while the recordings from male speakers
  add up to 5 hours and 40 minutes.
  
- Every audio file in the CB has a duration between 5 and 10 seconds approximately.

- Speakers in the CB and the CL are different people. In fact, speakers in the CB are not
  present in any other CIEMPIESS dataset.

- Data in the CB is classified by gender and speaker, so one can easily select audio files
  from a particular set of speakers to do experiments.

- Audio files in the CL and the CB are all of the same type. In both, speakers talk about
  legal issues and the legal profession. They also talk about matters related to UNAM and
  the "Facultad de Derecho de la UNAM".

- Audio files in the CB are distributed in a 16kHz@16bit mono format.
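
The audio format stated above (16 kHz, 16 bit, mono) can be checked with the Python standard
library. The sketch below is illustrative only and rests on an assumption: that the audio is
available as plain PCM WAV files. This README does not state the container format, so files
distributed in another format would have to be converted first. The function names are
hypothetical.

```python
import wave

# Format stated in this README: mono, 16 bit (2 bytes per sample), 16 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def read_format(wav_path):
    """Return channel count, sample width in bytes and sample rate of a WAV file."""
    with wave.open(wav_path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def duration_seconds(wav_path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(wav_path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def matches_corpus_format(wav_path):
    """True if the file has the format stated in this README (assumes PCM WAV)."""
    return read_format(wav_path) == EXPECTED
```

Calling duration_seconds() on each file is also a simple way to confirm that the files last
between 5 and 10 seconds approximately.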

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_BALANCE directory contains the following files and directories:

	- docs   : Here one can find the transcription file (CIEMPIESS_BALANCE.transcription), 
                   the audio paths file (CIEMPIESS_BALANCE.paths), the file "audiosxspk.txt"
                   that shows the number of audio files per speaker and the file 
                   "CMPL_and_CMPB_Match.xls" that shows the speaker-to-speaker match between
                   the CIEMPIESS LIGHT and the CIEMPIESS BALANCE.

	- speech : Here one can find the audio files in a 16kHz@16bit mono format. These
                   files are organized by gender and also by speaker.

	- README.txt
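
Because the audio files are organized by gender and speaker, selecting a subset of speakers
can be sketched as follows. This is a minimal example under stated assumptions: it assumes
that each audio file name starts with an identification key such as CMPB_F_51_01CAR_00012
followed by a file extension, and it reads gender and speaker from that key instead of
relying on subdirectory names, which are not listed here. The function name and the path in
the usage note are hypothetical.

```python
from collections import defaultdict
from pathlib import Path


def group_by_speaker(speech_dir):
    """Map (gender, speaker_number) -> sorted list of audio file paths.

    Assumes each file name starts with a key like CMPB_F_51_01CAR_00012.
    """
    groups = defaultdict(list)
    for path in sorted(Path(speech_dir).rglob("CMPB_*")):
        if not path.is_file():
            continue
        parts = path.stem.split("_")
        if len(parts) != 5:
            continue  # not a five-field identification key; skip it
        _, gender, speaker, _, _ = parts
        groups[(gender, speaker)].append(path)
    return dict(groups)
```

For example, group_by_speaker("path/to/speech") returns a dictionary from which all female
speakers can be selected by keeping the entries whose first key element is "F".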

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS BALANCE Corpus has an identification key with the following 
format:

                                 CMPB_F_51_01CAR_00012

      CMPB         F                51         01CAR                 00012
      Acronym      Gender of        Speaker    Internal key          Number of the
      for          the speaker:     number.    that indicates        audio file for
      "CIEMPIESS   "M" for male,               who transcribed       a particular
      BALANCE".    "F" for female.             the current           speaker.
                                               audio file.
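
The key format above can be split into its five fields with a short parser. The sketch below
is only an illustration: the field names are hypothetical, and the pattern is inferred from
the single example key shown in this section, so the accepted field widths and characters
are assumptions.

```python
import re
from typing import NamedTuple


class CorpusKey(NamedTuple):
    corpus: str        # acronym, "CMPB"
    gender: str        # "M" for male, "F" for female
    speaker: int       # speaker number, e.g. 51
    transcriber: str   # internal key of the transcriber, e.g. "01CAR"
    audio_number: int  # number of the audio file for this speaker


# Assumed pattern: five underscore-separated fields, as in CMPB_F_51_01CAR_00012.
KEY_PATTERN = re.compile(r"^(CMPB)_([MF])_(\d+)_([0-9A-Za-z]+)_(\d+)$")


def parse_key(key):
    """Split an identification key into its fields, or raise ValueError."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a CIEMPIESS BALANCE key: {key!r}")
    corpus, gender, speaker, transcriber, audio_number = match.groups()
    return CorpusKey(corpus, gender, int(speaker), transcriber, int(audio_number))


key = parse_key("CMPB_F_51_01CAR_00012")
print(key.gender, key.speaker, key.transcriber, key.audio_number)
```

Converting the speaker and audio file numbers to integers drops their leading zeros; keep
them as strings if the original zero-padded form is needed to rebuild file names.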

-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their
support of the social service program "Desarrollo de Tecnologías del Habla". They also thank
the social service students for all their hard work.

Special thanks to Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the 
"Facultad de Derecho de la UNAM" for donating most of the recordings that constitute the
CIEMPIESS BALANCE Corpus.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------