-----------------------------------------------------------------------------------------------
PRESENTATION
-----------------------------------------------------------------------------------------------

This README file explains how the CHM150 Corpus is organized, what kinds of files it contains, and what its main characteristics are.

The CHM150 Corpus was created in 2012 at the "Laboratorio de Tecnologías del Habla" of the "Facultad de Ingeniería (FI)" at the "Universidad Nacional Autónoma de México (UNAM)" by Carlos Daniel Hernández Mena, under the supervision of José Abel Herrera Camacho, head of the laboratory.

CHM150 is the acronym for "Corpus Hecho en México". The "150" means that it contains utterances collected from 150 speakers.

-----------------------------------------------------------------------------------------------
TERMS OF USE
-----------------------------------------------------------------------------------------------

CHM150 Corpus by Carlos Daniel Hernández Mena is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. To view a copy of this license, visit http://creativecommons.org/licenses/by-sa/4.0/.

-----------------------------------------------------------------------------------------------
Corpus Characteristics
-----------------------------------------------------------------------------------------------

The CHM150 is a corpus of microphone speech in Mexican Spanish, recorded from 75 male speakers and 75 female speakers in a "quiet office" noise environment, with a total duration of 1.63 hours. Speakers were asked either to answer one of several preselected open questions or to describe a particular painting shown to them on a computer monitor. As a result, the speech is completely spontaneous, as can be seen in the transcription file, which records disfluencies and mispronunciations orthographically. Only the most "clean" utterances were selected to be part of the corpus.
Here "clean" means that there is no background music, no loud noise, and no more than one person speaking at the same time.

The audio equipment used to create the corpus was modest. It consisted of:

- USB Interface (http://www.produktinfo.conrad.com/datenblaetter/300000-324999/303350-an-01-en-U_Control_UCA200.pdf)
- Analog Audio Mixer (http://www.music-group.com/Categories/Behringer/Mixers/Small-Format-Mixers/502/p/P0576)
- Dynamic Cardioid Vocal Microphone (http://www.music-group.com/Categories/Behringer/Microphones/Dynamic-Microphones/XM8500/p/P0120)

The recordings were made with Audacity (http://audacityteam.org/), and the audio was then downsampled and normalized with SoX (http://sox.sourceforge.net/).

The main characteristics of the audio files are:

- Encoding: Signed PCM
- Sample Rate: 16000 Hz
- Precision: 16 bit
- Channels: 1 (mono)

The CHM150 corpus contains a total of 2663 utterances classified by speaker, with a small vocabulary of 1898 unique words. For these reasons, the CHM150 may be too small for speech recognition, but it is suitable for spoken term detection and forensic speaker identification.

-----------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-----------------------------------------------------------------------------------------------

The CHM150 directory contains the following directories:

- docs
- transcriptions
- Speech

and the following files:

- LICENSE.txt
- README.txt

The following is a detailed explanation of the files in every directory.
-----------------------------------------------------------------------------------------------
The "docs" directory
-----------------------------------------------------------------------------------------------

The "docs" directory contains the following files:

- Speaker_Info.xlxs : An Excel document with information about the speakers: gender, age, nationality, languages other than English, and nationality of the parents.
- CHM150.wfreq : The word frequencies for the whole corpus.
- CHM150.paths : A list of relative paths to every audio file in the corpus.

-----------------------------------------------------------------------------------------------
The "transcriptions" directory
-----------------------------------------------------------------------------------------------

It contains the file CHM150.transcription, which holds the transcriptions of the whole corpus.

-----------------------------------------------------------------------------------------------
The "speech" directory
-----------------------------------------------------------------------------------------------

It contains the speech files classified by speaker. An "F" stands for "Female" and an "M" stands for "Male".

-----------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------
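
As a rough illustration of how a local copy of the corpus might be sanity-checked, the Python sketch below groups speakers by gender from the lines of CHM150.paths and tests whether an audio file has the characteristics listed above (16000 Hz, 16 bit, mono). Several details are assumptions that this README does not confirm: that the audio files are WAV files, that each path in CHM150.paths has the speaker directory as the parent of the audio file, and that speaker directory names begin with "F" or "M". The function names are hypothetical.

```python
import wave
from collections import defaultdict
from pathlib import PurePosixPath


def speakers_by_gender(path_lines):
    """Group speaker directory names by their leading letter ("F" or "M").

    Assumption: each line is a relative path whose parent directory is the
    speaker directory, e.g. "speech/F01/utt.wav" (hypothetical layout).
    """
    speakers = defaultdict(set)
    for line in path_lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        speaker = PurePosixPath(line).parent.name
        gender = speaker[:1].upper()
        if gender in ("F", "M"):
            speakers[gender].add(speaker)
    return speakers


def matches_corpus_format(wav_path):
    """Return True if a WAV file is 16000 Hz, 16 bit (2 bytes), mono.

    Assumption: the corpus audio is stored as WAV; the README only states
    the encoding, sample rate, precision, and channel count.
    """
    with wave.open(str(wav_path), "rb") as wav_file:
        return (
            wav_file.getframerate() == 16000
            and wav_file.getsampwidth() == 2
            and wav_file.getnchannels() == 1
        )
```

If these assumptions hold, passing the lines of docs/CHM150.paths to speakers_by_gender should yield 75 speaker names under "F" and 75 under "M", matching the speaker counts given in this README.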