-------------------------------------------------------------------------------------------------
CIEMPIESS TEST CORPUS
Audio and Transcripts of Mexican Spanish Broadcast Conversations.
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

When developing automatic speech recognition engines, or any other machine learning system, it
is good practice to keep the test data separate from the training data and never combine them.
The CIEMPIESS TEST Corpus was created to meet the need for a standard test set with which to
measure the progress of the community of users of the CIEMPIESS datasets. We strongly recommend
not using the CIEMPIESS TEST for any other purpose.

The CIEMPIESS TEST Corpus is a gender balanced corpus designed to test acoustic models for the
speech recognition task. It was built from recordings and human transcripts of 10 male and 10
female speakers.

The CIEMPIESS TEST Corpus is considered a CIEMPIESS dataset because it contains only audio
from the same source as the first CIEMPIESS Corpus. The word "TEST" in its name indicates that
it is recommended for test purposes only.

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

CIEMPIESS TEST belongs to the "CIEMPIESS" family of corpora for speech recognition in Mexican
Spanish. The founder and most distinguished member of this family is the CIEMPIESS Corpus
(LDC2015S07), published in 2015 by the Linguistic Data Consortium (LDC).

The CIEMPIESS TEST Corpus was created between 2016 and 2018 within the social service program
"Desarrollo de Tecnologías del Habla" of the "Facultad de Ingeniería" (FI) at the "Universidad
Nacional Autónoma de México" (UNAM), by Carlos Daniel Hernández Mena, head of the program.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica 
y Servicio Social".

The CIEMPIESS TEST is a radio corpus designed to test acoustic models for automatic speech
recognition. It is made up of recordings of spontaneous conversations in Spanish between a
radio moderator and his guests. Most of the speech in these conversations has the accent of
Central Mexico.

All the recordings that constitute the CIEMPIESS TEST come from "RADIO-IUS"
(http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio station that belongs to UNAM.
They were donated by Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from
the "Facultad de Derecho de la UNAM" on the condition that they be used for academic and
research purposes only.

For more information and documentation see the CIEMPIESS-UNAM Project website at:

		             http://www.ciempiess.org/

-------------------------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
-------------------------------------------------------------------------------------------------

The CIEMPIESS TEST (CT) Corpus has the following characteristics:

- The CT has a total of 3558 audio files from 10 male speakers and 10 female speakers, with
  a total duration of 8 hours and 8 minutes.

- A total of 1694 audio files come from male speakers, with a total duration of 4 hours and
  3 minutes. A total of 1864 audio files come from female speakers, with a total duration of
  4 hours and 4 minutes. The CT is therefore closely balanced by gender in speech duration.

- All of the speakers in the CT come from Mexico, except for speaker M_09, who comes from
  El Salvador.

- Every audio file in the CT lasts approximately between 5 and 10 seconds.

- Data in the CT is classified by gender and also by speaker, so one can easily select audio
  files from a particular set of speakers for experiments.

- Audio files in the CT and in the first CIEMPIESS are all of the same type. In both, speakers
  talk about legal matters and issues concerning lawyers. They also talk about matters related
  to UNAM and the "Facultad de Derecho de la UNAM".

- As in the first CIEMPIESS Corpus, transcriptions in the CT were made by humans.

- Speakers in the CT are not present in any other CIEMPIESS dataset.

- Audio files in the CT are distributed in a 16kHz@16bit mono format.
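
As a rough consistency check on the figures above, the following Python sketch derives the
average file duration and the approximate amount of raw audio from the reported totals. It uses
only numbers stated in this README. The function and constant names are illustrative, and the
size estimate assumes uncompressed PCM samples with no container or header overhead.

```python
# Totals as reported in the list of corpus characteristics.
TOTAL_FILES = 3558
TOTAL_SECONDS = 8 * 3600 + 8 * 60          # 8 hours and 8 minutes
MALE_FILES, MALE_SECONDS = 1694, 4 * 3600 + 3 * 60
FEMALE_FILES, FEMALE_SECONDS = 1864, 4 * 3600 + 4 * 60

# Audio format: 16 kHz, 16 bit, mono.
SAMPLE_RATE_HZ = 16000
BYTES_PER_SAMPLE = 2
CHANNELS = 1


def average_seconds(total_seconds, n_files):
    """Mean duration of one audio file, in seconds."""
    return total_seconds / n_files


def raw_pcm_bytes(total_seconds):
    """Size of the audio as raw PCM (assumption: no headers, no compression)."""
    return total_seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS


if __name__ == "__main__":
    print(f"all:    {average_seconds(TOTAL_SECONDS, TOTAL_FILES):.2f} s per file")
    print(f"male:   {average_seconds(MALE_SECONDS, MALE_FILES):.2f} s per file")
    print(f"female: {average_seconds(FEMALE_SECONDS, FEMALE_FILES):.2f} s per file")
    print(f"raw PCM: {raw_pcm_bytes(TOTAL_SECONDS) / 1e6:.0f} MB")
```

The resulting averages (roughly 8.2 seconds overall) fall inside the stated range of 5 to 10
seconds per file, and the male and female file counts add up to the reported total of 3558.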

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_TEST directory contains the following files and directories:

	- files	: 	Contains the transcription file, the paths file and the
			"Speaker_Info.xls" file, which holds relevant information about
			all the speakers in the corpus.

	- speech: 	Contains the audio files in a 16kHz@16bit mono format. These
			files are organized by gender and also by speaker.

	- README.txt

-------------------------------------------------------------------------------------------------
THE CORPUS FILES
-------------------------------------------------------------------------------------------------

In the "files" directory one can find the following:

- CIEMPIESS_TEST.transcription	        : This is the transcription file in plain text format.

- CIEMPIESS_TEST.paths		        : This file contains the relative paths from the
					  "speech" directory to every particular speech file.

- Speaker_info.xls			: This file contains relevant information about the
					  speakers. Specifically: the number of audio files per
					  speaker, the total amount of speech time per speaker and
					  the match in speech time between male and female speakers.
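
Since the paths file stores paths relative to the "speech" directory, each entry has to be
joined with the location of that directory before the audio can be opened. The Python sketch
below shows one way to do this. It assumes that the paths file is UTF-8 text with one relative
path per line; this README does not specify the exact layout, so check the file before relying
on it. The function name is hypothetical.

```python
from pathlib import Path


def load_speech_paths(paths_file, speech_dir):
    """Return one Path per non-empty line, joined onto speech_dir.

    Assumption: paths_file holds one relative path per line.
    """
    speech_dir = Path(speech_dir)
    with open(paths_file, encoding="utf-8") as handle:
        return [speech_dir / line.strip() for line in handle if line.strip()]
```

For example, when run from inside the CIEMPIESS_TEST directory, one could call
load_speech_paths("files/CIEMPIESS_TEST.paths", "speech").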

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS TEST Corpus has an identification key with the following 
format:

                           CMPT_M_01_0001

      CMPT         M                01         0001
      Acronym      Gender of        Number     Number of the
      for          the Speaker:     of         audio file of
      "CIEMPIESS   "F" for Female   Speaker.   a particular
      TEST".       "M" for Male                speaker.
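
The key format above can be split into its fields programmatically, which makes it easy to
select the audio files of a particular set of speakers. The following Python sketch is one
possible way to do it. The field widths (two digits for the speaker, four for the audio file)
are inferred from the single example key and are an assumption; all names in the code are
illustrative.

```python
import re
from typing import NamedTuple

# Assumption: 2-digit speaker number and 4-digit file number, as in CMPT_M_01_0001.
KEY_PATTERN = re.compile(r"CMPT_([FM])_(\d{2})_(\d{4})")


class AudioKey(NamedTuple):
    gender: str    # "F" for female, "M" for male
    speaker: int   # number of the speaker
    index: int     # number of the audio file of that speaker

    @property
    def speaker_id(self):
        """Speaker label in the style used in this README, e.g. "M_09"."""
        return f"{self.gender}_{self.speaker:02d}"


def parse_key(key):
    """Split an identification key into its fields."""
    match = KEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"not a CIEMPIESS TEST key: {key!r}")
    gender, speaker, index = match.groups()
    return AudioKey(gender, int(speaker), int(index))


def select_by_speakers(keys, speaker_ids):
    """Keep only the keys that belong to the given speakers, in input order."""
    wanted = set(speaker_ids)
    return [key for key in keys if parse_key(key).speaker_id in wanted]
```

For instance, parse_key("CMPT_M_01_0001") yields gender "M", speaker 1 and audio file 1, and
select_by_speakers(keys, ["M_09"]) keeps only the files of speaker M_09.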
                                               
-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The authors would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their
support of the social service program "Desarrollo de Tecnologías del Habla". They also thank
the social service students for all their hard work.

Special thanks to Lic. Cesar Gabriel Alanis Merchand and Mtro. Ricardo Rojas Arevalo from the 
"Facultad de Derecho de la UNAM" for donating all the recordings that constitute the CIEMPIESS 
TEST Corpus.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------