--------------------------------------------------------------------------------
                               SAMRÓMUR-SYNTHETIC
              Synthesized Speech with Transcriptions in Icelandic
     produced by the Language and Voice Laboratory at Reykjavík University
--------------------------------------------------------------------------------

Language        : Icelandic.
Authors         : Carlos Daniel Hernández Mena, Gunnar Thor Örnólfsson
                  and Jón Guðnason
Recommended use : speech recognition.

--------------------------------------------------------------------------------
PRESENTATION
--------------------------------------------------------------------------------

The SAMRÓMUR-SYNTHETIC is a corpus made of synthesized speech in Icelandic.
The text-to-speech (TTS) system used to produce the utterances was developed
by the Language and Voice Laboratory at Reykjavík University.

The sentences used to create the corpus were extracted from the Samrómur
platform (samromur.is), a web platform where any speaker of Icelandic (native
or second-language) can donate their voice to an open source data bank.

The SAMRÓMUR-SYNTHETIC CORPUS was created in June 2023.

To see more open resources developed by the Language and Voice Lab, see the
HuggingFace repository at https://huggingface.co/language-and-voice-lab

--------------------------------------------------------------------------------
DISCLAIMER AND TERMS OF USE
--------------------------------------------------------------------------------

"SAMRÓMUR-SYNTHETIC" by Carlos Daniel Hernández Mena, Gunnar Thor Örnólfsson
and Jón Guðnason is licensed under a Creative Commons Attribution 4.0
International (CC BY 4.0) License with the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.
To view a copy of this license visit:
https://creativecommons.org/licenses/by/4.0/

--------------------------------------------------------------------------------
CREATION METHODOLOGY
--------------------------------------------------------------------------------

The corpus was created following the steps below:

- Sentences from Samrómur were collected into a single plain text file.
  Samrómur includes a normalized text version of each sentence; the
  normalized versions were used in the following steps.

- Only sentences with a minimum of 8 words and a maximum of 24 were selected
  to be synthesized. The average number of words per sentence is 12.785.

- The sentences were distributed over 22 male voices and 22 female voices at
  5 different speed rates. Each combination of a voice with a particular
  speed rate is considered an individual speaker, so there are 220 (22x5x2)
  different speakers in the corpus (110 male plus 110 female). Each speaker
  was assigned 285 sentences, so the total number of sentences to be
  synthesized is 62,700 (22x5x2x285).

- Finally, the selected sentences were converted into utterances using a TTS
  system. The amount of speech per speaker is around 21 minutes.

--------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
--------------------------------------------------------------------------------

The SAMRÓMUR-SYNTHETIC CORPUS (SSC) has the following characteristics:

- The SSC has an exact duration of 72 hours and 52 minutes. It has 62,700
  audio files.

- The SSC has utterances from 44 different voices: 22 male and 22 female
  voices.

- Speakers were produced by varying the speed rate over 5 values: 0.6, 0.8,
  1.0, 1.2 and 1.4.

- Data in the SSC is organized by speaker: all the utterances belonging to a
  single speaker are stored in a single directory.

- Each speaker has been assigned 285 utterances with a minimum of 8 words and
  a maximum of 24.
  The average number of words per utterance is 12.785.

- Utterances have a length between 1 and 12 seconds each. The average length
  is 4.184 seconds.

- Utterances are also classified according to the gender (male/female) of the
  speaker.

- Audio files in the SSC are distributed in FLAC format at 22050Hz@16bit mono.

- Every audio file has an ID that is compatible with ASR engines such as Kaldi
  and CMU-Sphinx.

- Transcriptions in the SSC are lowercase with no punctuation marks.

--------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE FILES IN THE CORPUS
--------------------------------------------------------------------------------

The SAMRÓMUR-SYNTHETIC directory contains the following files and directories:

- speech       : Contains the speech files, classified by gender
                 (male/female voice).

- metadata.tsv : File in "tab separated" format containing information about
                 the audio files in the corpus, such as: audio id, audio
                 duration, transcriptions, etc.

- README.txt

--------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
--------------------------------------------------------------------------------

Each audio file in the SAMRÓMUR-SYNTHETIC corpus has an identification key
with the following format:

  SRMS_F_0065_V013S14_0045

  SRMS         F                0065      V013     S14          0045

  Acronym      Gender of        Number    Number   Speed Rate   Number of the
  for          the voice:       of        of       S06 = 0.6    audio file of
  "SAMRÓMUR-   "M" for Male     Speaker   voice    S08 = 0.8    a particular
  SYNTHETIC"   "F" for Female                      S10 = 1.0    speaker. There
                                                   S12 = 1.2    are no
                                                   S14 = 1.4    discontinuities.

There are 285 utterances per speaker.

--------------------------------------------------------------------------------
CITATION
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

Hernández Mena, Carlos Daniel; Örnólfsson, Gunnar Thor; Guðnason, Jón;
"SAMRÓMUR SYNTHETIC".
Web Download. Reykjavik University: Language and Voice Lab, 2023.

Contact: Jón Guðnason (jg@ru.is), Carlos Mena (carlosm@ru.is)

License: CC BY 4.0

--------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
--------------------------------------------------------------------------------

This project was funded by the Language Technology Programme for Icelandic
2019-2023. The programme, which is managed and coordinated by Almannarómur,
is funded by the Icelandic Ministry of Education, Science and Culture.

--------------------------------------------------------------------------------
For more information about us, visit our website https://lvl.ru.is/
--------------------------------------------------------------------------------
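
--------------------------------------------------------------------------------
EXAMPLE: PARSING AN IDENTIFICATION KEY
--------------------------------------------------------------------------------

The identification key format and the speaker arithmetic described above can
be sketched in Python as follows. This is only an illustrative sketch and is
not part of the corpus distribution. The names parse_key, UtteranceKey and
expected_corpus_size are hypothetical, and the digit widths in the pattern
(4, 3 and 4 digits) are an assumption inferred from the single example key
SRMS_F_0065_V013S14_0045.

```python
import re
from typing import NamedTuple, Tuple

# Speed-rate codes as listed in the IDENTIFICATION KEY FORMAT section.
SPEED_RATES = {"S06": 0.6, "S08": 0.8, "S10": 1.0, "S12": 1.2, "S14": 1.4}

# ASSUMPTION: the field widths (4, 3 and 4 digits) are inferred from the
# single example key SRMS_F_0065_V013S14_0045 given in this README.
KEY_PATTERN = re.compile(
    r"SRMS_(?P<gender>[MF])_(?P<speaker>\d{4})_"
    r"V(?P<voice>\d{3})(?P<speed>S\d{2})_(?P<utterance>\d{4})"
)


class UtteranceKey(NamedTuple):
    """Hypothetical container for the fields of one identification key."""
    gender: str        # "M" or "F"
    speaker: int       # speaker number
    voice: int         # voice number
    speed_rate: float  # 0.6, 0.8, 1.0, 1.2 or 1.4
    utterance: int     # number of the audio file for this speaker


def parse_key(key: str) -> UtteranceKey:
    """Split an identification key into its documented fields."""
    match = KEY_PATTERN.fullmatch(key)
    if match is None:
        raise ValueError(f"not a SAMRÓMUR-SYNTHETIC key: {key!r}")
    speed_code = match["speed"]
    if speed_code not in SPEED_RATES:
        raise ValueError(f"unknown speed-rate code: {speed_code!r}")
    return UtteranceKey(
        gender=match["gender"],
        speaker=int(match["speaker"]),
        voice=int(match["voice"]),
        speed_rate=SPEED_RATES[speed_code],
        utterance=int(match["utterance"]),
    )


def expected_corpus_size(
    voices_per_gender: int = 22,
    genders: int = 2,
    speed_rates: int = 5,
    utterances_per_speaker: int = 285,
) -> Tuple[int, int]:
    """Return (speakers, utterances) following the 22x5x2x285 arithmetic."""
    speakers = voices_per_gender * genders * speed_rates
    return speakers, speakers * utterances_per_speaker


if __name__ == "__main__":
    print(parse_key("SRMS_F_0065_V013S14_0045"))
    print(expected_corpus_size())
```

For the example key, parse_key returns gender "F", speaker 65, voice 13,
speed rate 1.4 and utterance 45; expected_corpus_size returns 220 speakers
and 62,700 utterances, matching the figures stated in this README.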