-------------------------------------------------------------------------------------------------
LIBRIVOX SPANISH CORPUS
Audio and Transcripts in Spanish with a CIEMPIESS Corpus style, taken from Librivox.org
-------------------------------------------------------------------------------------------------

-------------------------------------------------------------------------------------------------
PRESENTATION
-------------------------------------------------------------------------------------------------

Librivox is a non-commercial, non-profit and ad-free project dedicated to making all books in
the public domain available, for free, in audio format on the internet. Drawing on this
resource, we downloaded 300 titles in Spanish to create the LIBRIVOX SPANISH CORPUS, whose
format is similar to that of the CIEMPIESS LIGHT Corpus (LDC2017S23). This is why we say that
the LIBRIVOX SPANISH CORPUS has a CIEMPIESS style.

The LIBRIVOX SPANISH CORPUS has a duration of 73 hours and consists of manually segmented audio
files between 3 and 10 seconds long. The transcriptions were also produced manually by native
speakers of Spanish. The recordings are divided between male/female and native/non-native
speakers.

-------------------------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
-------------------------------------------------------------------------------------------------

The LIBRIVOX SPANISH CORPUS (LSC) has the following characteristics:

- The LSC has an exact duration of 73 hours and 1 minute. It has 36338 audio files.
- The LSC has 154 different speakers: 77 men and 77 women.
- Every audio file in the LSC has a duration of approximately 3 to 10 seconds.
- Data in the LSC is classified by speaker; that is, all the recordings of a single speaker are
  stored in a single directory.
- Data is also classified according to the gender (male/female) of the speakers and according
  to the way they speak (native/non-native).
- Audio and transcriptions in the LSC were segmented and transcribed by native speakers of
  Spanish.
- Audio files in the LSC are distributed in a 16 kHz, 16-bit mono format.
- Every audio file has an ID that is compatible with ASR engines such as Kaldi and CMU-Sphinx.

-------------------------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
-------------------------------------------------------------------------------------------------

The LIBRIVOX_SPANISH directory contains the following files and directories:

- data: It contains the "files" and "speech" directories.

    - files: It contains the transcription file, the paths file and the "Speaker_Info.xls"
      file, which holds relevant information about all the speakers in the corpus.

    - speech: It contains the speech files, classified by speaker and also by type of
      pronunciation. The "native" directory holds all the native speakers of Spanish; the
      "non_native" directory holds the speakers with a foreign accent.

- docs: It contains the README.txt file.

-------------------------------------------------------------------------------------------------
THE CORPUS FILES
-------------------------------------------------------------------------------------------------

In the "files" directory one can find the following:

- LIBRIVOX_SPANISH.transcription: This is the transcription file, in plain text format.
- LIBRIVOX_SPANISH.paths: This file contains the relative paths from the "speech" directory to
  every speech file.
- Speaker_Info.xls: This file contains relevant information about the speakers.
  Specifically: Librivox user ID, number of audio files per speaker, total speech time per
  speaker, the speaker ID in the LSC and the accent (native/non-native) of every speaker.

-------------------------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
-------------------------------------------------------------------------------------------------

Every audio file in the LIBRIVOX SPANISH CORPUS has an identification key with the following
format:

        LBVX_M_01_NAT_0001

LBVX          M                 01          NAT                 0001

Acronym       Gender of         Number      Type of             Number of the
for           the Speaker:      of          pronunciation:      audio file of
"Librivox"    "M" for Male      Speaker.    NAT = Native        a particular
              "F" for Female                NNT = Non-Native    speaker.

-------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
-------------------------------------------------------------------------------------------------

The author would like to thank Alejandro V. Mena, Elena Vera and Angélica Gutiérrez for their
support of the social service program "Desarrollo de Tecnologías del Habla." He also thanks the
social service students for all their hard work.

Special thanks to the Librivox team for publishing all the recordings that constitute the
LIBRIVOX SPANISH CORPUS.

-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------
To find corpora similar to this one, visit: www.ciempiess.org
-------------------------------------------------------------------------------------------------
-------------------------------------------------------------------------------------------------
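
-------------------------------------------------------------------------------------------------
EXAMPLE: PARSING THE IDENTIFICATION KEY
-------------------------------------------------------------------------------------------------

The identification key described in the IDENTIFICATION KEY FORMAT section can be split into its
fields programmatically. The Python sketch below is only an illustration and is not part of the
corpus distribution: the names KEY_PATTERN, LscKey and parse_key are hypothetical, and the
number of digits in the speaker and audio file fields is assumed to be variable, since this
README shows a single example key.

```python
import re
from typing import NamedTuple

# Pattern derived from the key layout LBVX_<gender>_<speaker>_<accent>_<file>.
# Assumption: digit widths are left open because only one example key is documented.
KEY_PATTERN = re.compile(r"LBVX_([MF])_(\d+)_(NAT|NNT)_(\d+)")


class LscKey(NamedTuple):
    """Hypothetical container for the fields of an LSC identification key."""
    gender: str       # "M" for male, "F" for female
    speaker: int      # number of the speaker
    native: bool      # True for NAT (native), False for NNT (non-native)
    file_number: int  # number of the audio file of that speaker


def parse_key(key: str) -> LscKey:
    """Split an identification key such as LBVX_M_01_NAT_0001 into its fields."""
    match = KEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"not a valid LSC identification key: {key!r}")
    gender, speaker, accent, file_number = match.groups()
    return LscKey(gender, int(speaker), accent == "NAT", int(file_number))


print(parse_key("LBVX_M_01_NAT_0001"))
# LscKey(gender='M', speaker=1, native=True, file_number=1)
```

A parser of this kind may be useful for grouping recordings by gender or accent without opening
the "Speaker_Info.xls" file.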