------------------------------------------------------------------------------
                          WIKIPEDIA SPANISH CORPUS
    Audio and Transcripts in Spanish presented in a CIEMPIESS Corpus style
                taken from the WikiProject Wikipedia Grabada
------------------------------------------------------------------------------

------------------------------------------------------------------------------
PRESENTATION
------------------------------------------------------------------------------

According to the project page of the WikiProject Spoken Wikipedia, available
at:

    https://en.wikipedia.org/wiki/Wikipedia:WikiProject_Spoken_Wikipedia

    The WikiProject Spoken Wikipedia aims to produce recordings of Wikipedia
    articles being read aloud.

The WIKIPEDIA SPANISH CORPUS is a dataset created from the Spanish version of
the WikiProject Spoken Wikipedia, called Wikipedia Grabada, available at:

    https://es.wikipedia.org/wiki/Wikiproyecto:Wikipedia_grabada

The WIKIPEDIA SPANISH CORPUS is intended for the Automatic Speech Recognition
(ASR) task. Its organization is similar to that of the CIEMPIESS LIGHT CORPUS,
published in 2017 by the Linguistic Data Consortium (LDC2017S23). That is why
we say that it is presented in the style of the CIEMPIESS Corpus.

The WIKIPEDIA SPANISH CORPUS is a gender-unbalanced corpus of about 25 hours.
It contains read speech from several articles of Wikipedia Grabada; most of
these articles were recorded by male speakers. The transcriptions in this
corpus were created from scratch by native speakers.

------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
------------------------------------------------------------------------------

The WIKIPEDIA SPANISH CORPUS (WSC) has the following characteristics:

- The WSC has a total duration of 25 hours and 37 minutes. It contains 11569
  audio files.

- The WSC has 193 different speakers: 150 men and 43 women.
- Every audio file in the WSC has a duration of approximately 3 to 10
  seconds.

- Data in the WSC is classified by speaker; that is, all the recordings of a
  single speaker are stored in a single directory.

- Data is also classified according to the gender (male/female) of the
  speakers.

- The audio in the WSC was segmented and transcribed from scratch by native
  speakers of Spanish.

- Audio files in the WSC are distributed in a 16 kHz, 16-bit, mono format.

- Every audio file has an ID that is compatible with ASR engines such as
  Kaldi and CMU-Sphinx.

------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
------------------------------------------------------------------------------

The WIKIPEDIA_SPANISH directory contains the following files and directories:

- data/files: Contains the transcription file, the paths file, and the
  "Speaker_Info.xls" file, which holds relevant information about all the
  speakers in the corpus.

- data/speech: Contains the speech files, classified by gender (male/female).

- docs/README.txt: This file.

------------------------------------------------------------------------------
THE CORPUS FILES
------------------------------------------------------------------------------

The "files" directory contains the following:

- WIKIPEDIA_SPANISH.transcription : The transcription file, in plain text
  format.

- WIKIPEDIA_SPANISH.paths : The relative paths from the "speech" directory to
  every individual speech file.

- Speaker_info.xls : Relevant information about the speakers, specifically
  the number of audio files per speaker and the total amount of speech per
  speaker.

- Source_Files.list : A list of the original audio files used to create the
  corpus.
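
The audio format stated under CORPUS CHARACTERISTICS (16 kHz, 16-bit, mono)
can be checked from a file header. The following is a minimal sketch in
Python, assuming the speech files are RIFF/WAV files readable by the standard
"wave" module; this README does not state the container format, and the
function names are hypothetical, not part of the corpus.

```python
import wave

# Values taken from the CORPUS CHARACTERISTICS section of this README.
EXPECTED_RATE_HZ = 16000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1            # mono


def matches_wsc_format(wav_path):
    """Return True if the WAV header reports 16 kHz, 16-bit, mono audio.

    Assumption: the file is a RIFF/WAV file (hypothetical helper).
    """
    with wave.open(str(wav_path), "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )


def duration_seconds(wav_path):
    """Return the duration of a WAV file in seconds, computed from its header."""
    with wave.open(str(wav_path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```

Applied to every path in the corpus, the second function could also be used
to confirm that the files fall roughly in the 3 to 10 second range mentioned
above.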
------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
------------------------------------------------------------------------------

Every audio file in the WIKIPEDIA SPANISH CORPUS has an identification key
with the following format:

    WKSP_M_0010_E1_0015

    Field   Meaning
    -----   -------
    WKSP    Acronym for "WIKIPEDIA SPANISH"
    M       Gender of the speaker: "M" for male, "F" for female
    0010    Number of the speaker
    E1      Edition one
    0015    Number of the audio file of a particular speaker

------------------------------------------------------------------------------
AUTHORS
------------------------------------------------------------------------------

Corpus Creation: Carlos Daniel Hernández Mena,
                 Iván Vladimir Meza Ruiz

Final Edition:   Carlos Daniel Hernández Mena

------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
------------------------------------------------------------------------------

The authors would like to thank Alberto Templos Carbajal, Elena Vera and
Angélica Gutiérrez for their support of the social service program
"Desarrollo de Tecnologías del Habla" at the Facultad de Ingeniería (FI) of
the Universidad Nacional Autónoma de México (UNAM). We also thank the social
service students for all their hard work.

------------------------------------------------------------------------------
------------------------------------------------------------------------------
To find corpora similar to this one, visit: www.ciempiess.org
------------------------------------------------------------------------------
------------------------------------------------------------------------------
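
The identification key described under IDENTIFICATION KEY FORMAT can be split
into its five fields programmatically. The following is a minimal sketch in
Python. The regular expression is an assumption inferred from the single
example key WKSP_M_0010_E1_0015 (digit-only speaker and file numbers, an
edition written as "E" plus digits); the names KEY_PATTERN, UtteranceKey and
parse_key are hypothetical and are not shipped with the corpus.

```python
import re
from typing import NamedTuple

# Assumed pattern, inferred from the one example key in this README.
KEY_PATTERN = re.compile(r"^(WKSP)_([MF])_(\d+)_(E\d+)_(\d+)$")


class UtteranceKey(NamedTuple):
    acronym: str   # "WKSP", acronym for "WIKIPEDIA SPANISH"
    gender: str    # "M" for male, "F" for female
    speaker: int   # number of the speaker
    edition: str   # edition label, e.g. "E1"
    audio: int     # number of the audio file of that speaker


def parse_key(key):
    """Split an identification key into its fields (hypothetical helper)."""
    match = KEY_PATTERN.match(key)
    if match is None:
        raise ValueError(f"not a recognized identification key: {key!r}")
    acronym, gender, speaker, edition, audio = match.groups()
    return UtteranceKey(acronym, gender, int(speaker), edition, int(audio))


if __name__ == "__main__":
    print(parse_key("WKSP_M_0010_E1_0015"))
```

Converting the speaker and file numbers to integers drops the zero padding,
so the original key string should be kept whenever it is needed as an
utterance ID for an ASR toolkit.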