-------------------------------------------------------------------------------------------------
CIEMPIESS LIGHT CORPUS
Audio and Transcripts of Mexican Spanish Broadcast Conversations.
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

The CIEMPIESS LIGHT Corpus is an enhanced version of the CIEMPIESS Corpus (LDC item LDC2015S07).

CIEMPIESS LIGHT is "light" because it leaves out many of the files of the first version of 
CIEMPIESS, and it is "enhanced" because it includes many improvements, some of them suggested by 
our community of users, that make this version more convenient for newer speech recognition 
engines such as Kaldi (http://kaldi-asr.org/).

-------------------------------------------------------------------------------------------------
BRIEF HISTORY
-------------------------------------------------------------------------------------------------

The CIEMPIESS LIGHT Corpus was created at the "Laboratorio de Tecnologías del Lenguaje" 
of the "Facultad de Ingeniería" (FI) in the "Universidad Nacional Autónoma de México" (UNAM)
between 2015 and 2016 by Carlos Daniel Hernández Mena, supervised by José Abel Herrera Camacho, 
head of the laboratory.

CIEMPIESS is the acronym for:

"Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica 
y Servicio Social".

The CIEMPIESS LIGHT is a radio corpus designed to create acoustic models for automatic speech 
recognition. It is made up of recordings of spontaneous conversations in Mexican Spanish 
between a radio moderator and his guests. 

These recordings were taken in MP3 format from "PODCAST UNAM" (http://podcast.unam.mx/). They 
were created by "RADIO-IUS" (http://www.derecho.unam.mx/cultura-juridica/radio.php), a radio 
station that belongs to UNAM, and by "Mirador Universitario" (http://mirador.cuaed.unam.mx/), 
a TV program that also belongs to UNAM.

For more information and documentation see the CIEMPIESS-UNAM Project website at:

		http://www.ciempiess.org/


-------------------------------------------------------------------------------------------------
COMPARISON BETWEEN THE CIEMPIESS LIGHT AND THE CIEMPIESS
-------------------------------------------------------------------------------------------------

Differences between the CIEMPIESS LIGHT (CL) and the Original CIEMPIESS (OC) include:

- The organization of the directories is simpler in the CL than in the OC.

- Unlike the OC, the CL does not include files for running experiments with Sphinx.

- Unlike the OC, the CL does not include "time labels" to indicate word beginnings and 
  word endings.

- The CL is slightly bigger (18 hours 25 minutes) than the OC (17 hours 30 minutes).

- The CL is made up of 16,663 audio files with transcripts. The OC is made up of
  17,017 audio files divided into a train set and a test set.

- Unlike the OC, the CL does not impose a predefined test set or train set.

- Data in the CL is classified by gender and speaker, so one can easily select audio files 
  from a particular set of speakers to do experiments.

- Female speakers were added to the CL, but it is still a gender-unbalanced corpus
  (24.85% of female speakers in the CL against 22.14% in the OC). 

- The CL contains 34 female speakers and 53 male speakers. 12,521 of the audio files
  belong to male speakers and 4,142 belong to female speakers. In contrast, the 
  training set of the OC has 12,407 audio files that belong to male speakers and 3,610 
  that belong to female speakers.

- Transcription files in the OC are hard to read because they are not completely 
  orthographic. For example, this is a transcription in the OC:

  <s> <sil> pEro nO podrIa eKSistIr lA sAna distAncia doctOr <sil> </s> (0166M_08ALX_15OCT12)

  This is the same transcription in the CL:

  pero no podría existir la sana distancia doctor CMPL_M_09_08ALX_00244

  As one can see, the transcription in the CL is easier to read: it is plain orthographic 
  text followed by the file key.

- As in the OC, the file keys in the CL provide a lot of useful information about each
  audio file. However, keys in the CL are numbered differently from the keys in the OC.

- Faulty audio files in the OC were removed from the CL. For example, a few files in the OC 
  contain only noise and no voice, and some others contain just one unintelligible word.

- Good quality audio files were added to the CL. These new files come from the TV show 
  "Mirador Universitario", which is produced by UNAM. Topics in the new files are similar 
  to the topics discussed in the old files, and they were also downloaded in MP3 from the 
  same page (http://podcast.unam.mx/).

- Audio files in the CL are all of the same type. Unlike the OC, the CL contains no files 
  from other corpora.
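
The CL transcription format shown above (orthographic text followed by the file key) is easy
to process. The following Python sketch assumes that every line of the transcription file
has the same shape as that example, with the key as the last whitespace-separated token. The
function name "split_transcript_line" is hypothetical and is not part of the corpus.

```python
from typing import Tuple


def split_transcript_line(line: str) -> Tuple[str, str]:
    """Split a CL transcription line into (text, key).

    Assumption: the key is the last whitespace-separated token of the
    line, as in the example given in this README.
    """
    # rsplit with maxsplit=1 cuts only at the last run of whitespace.
    text, key = line.strip().rsplit(maxsplit=1)
    return text, key


text, key = split_transcript_line(
    "pero no podría existir la sana distancia doctor CMPL_M_09_08ALX_00244"
)
print(key)   # the file key
print(text)  # the orthographic transcription
```

A line that contains a single token (a key with no text) makes the unpacking fail with a
ValueError, which may be a useful signal when checking a file for malformed lines.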

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The CIEMPIESS_LIGHT directory contains the following files and directories:

	- docs   : Here one can find the transcription file (CIEMPIESS_LIGHT.transcription), 
                   the audio paths file (CIEMPIESS_LIGHT.paths) and the file "audiosxspk.txt"
                   that shows the number of audio files per speaker.

	- data   : Here one can find the audio files in FLAC format (16 kHz, 16-bit, mono). These
                   files are organized by gender and also by speaker.

	- LICENSE.txt

	- README.txt
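
Since the audio files are organized by gender and by speaker, a subset for an experiment can
be collected with a short script. The Python sketch below makes two assumptions that this
README does not state: that each FLAC file is named after its identification key (for
example "CMPL_M_52_13ALX_00021.flac"), and that a speaker is identified by gender plus number
(for example "M_52"). The function "select_audio" is hypothetical. It searches the data
directory recursively, so it does not depend on the exact names of the subdirectories.

```python
from pathlib import Path


def select_audio(data_dir, gender=None, speakers=None):
    """Return sorted FLAC paths filtered by gender and/or speaker.

    gender   : "M", "F" or None (no gender filter).
    speakers : set of labels such as {"M_09", "F_03"} or None (no filter).
    Assumption: file names are identification keys plus ".flac".
    """
    selected = []
    for path in sorted(Path(data_dir).rglob("*.flac")):
        fields = path.stem.split("_")
        # Skip anything that does not look like a CIEMPIESS LIGHT key.
        if len(fields) != 5 or fields[0] != "CMPL":
            continue
        file_gender, number = fields[1], fields[2]
        if gender is not None and file_gender != gender:
            continue
        if speakers is not None and f"{file_gender}_{number}" not in speakers:
            continue
        selected.append(path)
    return selected


# Hypothetical usage, assuming the corpus was unpacked in the current directory:
# female_files = select_audio("CIEMPIESS_LIGHT/data", gender="F")
# two_speakers = select_audio("CIEMPIESS_LIGHT/data", speakers={"M_09", "F_03"})
```

Selecting by speaker label rather than by directory name keeps the sketch independent of
how the gender and speaker folders are actually named.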

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the CIEMPIESS LIGHT Corpus has an identification key with the following 
format:

                                 CMPL_M_52_13ALX_00021

      CMPL         M                52         13ALX                 00021
      Acronym      Gender of        Number     An internal           Number of the
      for          the Speaker:     of         key that indicates    audio file of
      "CIEMPIESS   "M" for Male     Speaker.   the show where the    a particular
      LIGHT".      "F" for Female              audio was extracted   speaker.
                                               from.
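
The key format above can be decomposed programmatically. The following Python sketch splits
a key into its five fields with a regular expression. The names "KEY_PATTERN", "CorpusKey"
and "parse_key" are hypothetical, and the pattern assumes that the show key is purely
alphanumeric and that the speaker and file numbers are decimal digits, as in the examples
shown in this README.

```python
import re
from typing import NamedTuple

# Five underscore-separated fields: acronym, gender, speaker, show, file number.
KEY_PATTERN = re.compile(r"^(CMPL)_([MF])_(\d+)_([0-9A-Za-z]+)_(\d+)$")


class CorpusKey(NamedTuple):
    corpus: str      # always "CMPL" (CIEMPIESS LIGHT)
    gender: str      # "M" for male, "F" for female
    speaker: int     # number of the speaker
    show: str        # internal key of the show the audio came from
    file_number: int # number of the audio file of that speaker


def parse_key(key: str) -> CorpusKey:
    """Parse an identification key such as 'CMPL_M_52_13ALX_00021'."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a CIEMPIESS LIGHT key: {key!r}")
    corpus, gender, speaker, show, number = match.groups()
    return CorpusKey(corpus, gender, int(speaker), show, int(number))


parsed = parse_key("CMPL_M_52_13ALX_00021")
print(parsed.gender, parsed.speaker, parsed.show, parsed.file_number)
```

Converting the two numeric fields to integers drops their leading zeros, so the original
string should be kept whenever the key is needed to locate a file or a transcription.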

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------