----------------------------------------------------------------------------------------------- PRESENTATION ----------------------------------------------------------------------------------------------- This README file aims to explain users how the CIEMPIESS Corpus is organized and what kind of files does it have. The CIEMPIESS Corpus was created at the Speech Processing Laboratory of the Faculty of Enegineering (FI) in the National Autonomous University of Mexico (UNAM) in 2012-2014 by Carlos Daniel Hernández Mena, supervised by José Abel Herrera Camacho, head of Laboratory. CIEMPIESS is the acronym for: "Corpus de Investigación en Español de México del Posgrado de Ingeniería Eléctrica y Servicio Social". The CIEMPIESS is a Radio Corpus that was mainly designed to create acoustic models for automatic speech recognition and it is made up by recordings of spontaneous conversations between a radio moderator and his guests. These recordings were taken in mp3 from "PODCAST UNAM" (http://podcast.unam.mx/) and they were created by "RADIO-IUS" (http://www.derecho.unam.mx/cultura-juridica/radio.php) that is a radio station that belongs to UNAM. For more information and documentation see the CIEMPIESS-UNAM Project website at: http://www.ciempiess.org/ ----------------------------------------------------------------------------------------------- TERMS OF USE ----------------------------------------------------------------------------------------------- CIEMPIESS Corpus by Carlos Daniel Hernández Mena is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license visit http://creativecommons.org/licenses/by-sa/4.0/. Based on a work at http://odin.fi-b.unam.mx/CIEMPIESS-UNAM/. ----------------------------------------------------------------------------------------------- GENERAL ORGANIZATION OF THE DIRECTORIES ----------------------------------------------------------------------------------------------- The CIEMPIESS directory contains the following directories: - transcriptions - train - test - textgrids - sphinx_experiments and the following files: - LICENSE.txt - README.txt The following is a detailed explanation of the files in every directory. ----------------------------------------------------------------------------------------------- NOTICE THAT ----------------------------------------------------------------------------------------------- The design of the CIEMPIESS corpus was very influenced by the CMU-SPHINX3 Speech Recognition Software. That is why you can find a directory named "sphinx_experiments" and in general, all the files of the CIEMPIESS are made to match with the format of the configuration files of the SPHINX3. Maybe you might see this online tutorial to understand with more detail our influences in the design of the CIEMPIESS: http://www.speech.cs.cmu.edu/sphinx/tutorial.html ----------------------------------------------------------------------------------------------- "transcriptions" DIRECTORY ----------------------------------------------------------------------------------------------- You will find the following files that come in pairs: - CIEMPIESS_FULL_TRAIN.transcription - CIEMPIESS_FULL_TRAIN.fileids - CIEMPIESS_test.transcription - CIEMPIESS_test.fileids - CIEMPIESS_train.transcription - CIEMPIESS_train.fileids The "FULL_TRAIN" files require an historical explanation: When the first version of the CIEMPIESS was finished on December 2013, it contained a total of 16717 audio files with their transcriptions. You can always identify these "primitive" audio files because they have identification keys like: 0173M_09ALX_22OCT12 0177M_09ALX_22OCT12 0020M_11ALX_10DIC12 0001F_12ALX_17DIC12 0009F_12ALX_17DIC12 (Where F is for "Female" and M is for "Male" voices) After that, it was decided that two audio sets must be selected for "train" and "test" stages. So we took a total of 700 files of these "primitive" audio files and then added another 300 with different identification key format, for example: F09MAY1844_0036 F09MAY1844_0038 AB01_1 AB01_4 OSC_001 OSC_003 These 300 audio files come from different sources and they are prior to the creation of the CIEMPIESS. At the end, the CIEMPIESS is divided in "train" and "test" sets, and you can manage these sets with the corresponding "_train" or "_test" files in the "transcriptions" directory, but if you want to work with only the "primitive" files of the CIEMPIESS, you have to choose the "FULL_TRAIN" files instead. ----------------------------------------------------------------------------------------------- "train" DIRECTORY ----------------------------------------------------------------------------------------------- You will find the following directories: - ALX_TRAIN - ANG_TRAIN - MAB_TRAIN The "primitive" audio files of the CIEMPIESS were extracted by three different workers: Alejandro (ALX), Angel (ANG) and Mabel (MAB) and that is the reason for naming these three directories the way they are. Inside each of them, you will find a set of recordings you may need for performing a training stage with SPHINX3. All of the files in the "train" directory has the same identification key format, that is: 0001M_01ALX_17DIC12 0001 M_ 01ALX_ 17DIC12 A relative number Gender of the Speaker This is the This is the date that identifies one certain "M" for Male "01" directory of when the entire file inside a directory "F" for Female the "ALX" recordings directory was created ----------------------------------------------------------------------------------------------- "test" DIRECTORY ----------------------------------------------------------------------------------------------- You will find the following directories: - ciempiess - description - fm - read As previously mentioned, the "test" set comes from different sources: ciempiess: Here we have the 700 "primitive" audio files extracted from the first version of the CIEMPIESS. All of these files have the identification key format shown above (see the section: "train" DIRECTORY). description: This directory contains 200 recordings of spontaneous speech of people describing paintings or answering questions. fm : This directory contains 17 recordings extracted from the FM Radio. The radio station selected for these recordings is different of the radio estations selected to create the CIEMPIESS. read: This directory contains recordings of read speech. The speaker in these recordings is a male person between 25 and 30 years who has lived in Mexico City all his life. ----------------------------------------------------------------------------------------------- "textgrids" DIRECTORY ----------------------------------------------------------------------------------------------- You will find the following directory: - full_train and the following text files: - CIEMPIESS_FULL_TRAIN.label_transcriptions - CIEMPIESS_FULL_TRAIN.label_fileids One of the reasons that the creation of the CIEMPIESS Corpus took so long was that it has "time labels" to indicate where a word begins and ends in a certain recording. This "time labels" were created with the software PRAAT www.praat.org PRAAT generates these "time labels" with the extension ".TextGrid", and that is why we call "textgrigs" to them. The "time labels" or "textgrids" are only available for the 16717 "primitive" audio files. NOTICE THAT: The file CIEMPIESS_FULL_TRAIN.label_transcriptions was taken directly from all the "time labels" with the help of a python script. It means that this file is a reflect of the "time labels". Nevertheless, the file CIEMPIESS_FULL_TRAIN.transcription was, at the beginning, equal to the CIEMPIESS_FULL_TRAIN.label_transcriptions, but we corrected spelling errors in the CIEMPIESS_FULL_TRAIN.transcription while the other file remained untouched. So, in conclusion, these two files are not exactly the same but they are very similar to each other. ----------------------------------------------------------------------------------------------- "sphinx_experiments" DIRECTORY ----------------------------------------------------------------------------------------------- You will find the following directories: - T22_NOTONIC - T22_TONIC - T50_NOTONIC - T50_TONIC All of these directories contain the files needed to perform training and recognition experiments with the SPHINX3 recognition software. The directories with the word "TONIC" have SPHINX3 files that take into account the tonic vowel of every word, and the directories with the words "NOTONIC" work with all the words in lowercase. For more information about how to deal with tonic vowels you can see this article: Carlos. D. Hernández-Mena and José. A. Herrera-Camacho, “CIEMPIESS: A new open-sourced mexican spanish radio corpus,” in Proc. LREC. European Language Resources Association, 2014. That you can download from here http://www.lrec-conf.org/proceedings/lrec2014/pdf/182_Paper.pdf Anyway, every directory have the following files: feat.params : Contains several variables to calculate the MFCC with SPHINX3 .dic : Pronouncing dictionary .filler : Filler dictionary .phone : list of phonemes .ug.lm : ASCII version of the Language Model in ARPA Format .ug.lm.DMP : Binary version of the Language Model in ARPA Format _test.fileids : List of all the paths to the audio files of the test set _test.transcription : Transcription file of the test set _train.fileids : List of all the paths to the audio files of the train set _train.transcription : Transcription file of the train set T22 directories handle only phonemes of the Mexican Spanish. T50 directories handle phonemes and allophones. NOTICE THAT: In the directories with the word "TONIC" you will find words with double letters like these: AAbre bloquEEen enIIgmas OObras ajUUsco agroeKKSSportadOOr mEEJJico SSicotEEncatl precampAANNa This is because SPHINX3 and the CMU Statistical Language Modeling Toolkit (http://www.speech.cs.cmu.edu/SLM/toolkit.html) does not distinguish between lowercase and uppercase. This represents a problem to the CIEMPIESS because it utilizes uppercase letters to indicate things (for example: tonic vowels). To handle these double letters you can do a simple "Find and Replace" and do the following substitutions: AA -> A EE -> E II -> I OO -> O UU -> U KKSS -> KS JJ -> J SS - SS NN -> N ----------------------------------------------------------------------------------------------- -----------------------------------------------------------------------------------------------