Title: Samrómur Icelandic Speech 1.0

Description: This is the first release of the Samrómur Icelandic Speech corpus, containing 100,000 validated utterances. The corpus is the result of a crowd-sourcing effort run by the Language and Voice Lab at Reykjavik University, in cooperation with Almannarómur, Center for Language Technology. Recording started in October 2019 and was still ongoing as of May 2021. This release was authorized in May 2021. The aim is to create an open-source speech corpus to enable research and development for Icelandic language technology. Please note that version 1.0 is equivalent to "Samrómur Icelandic Speech 21.05" as used by the Language Technology Programme for Icelandic 2019-2023. The corpus contains audio recordings and a metadata file that contains the prompts the participants read. A Kaldi-based script using this data can be found on the Language and Voice Lab GitHub page: https://github.com/cadia-lvl/samromur-asr

Authors: David Mollberg, Olafur Jonnson, Sunneva Thorsteinsdottir, Jóhanna Vigdís Guðmundsdóttir, Steinthor Steingrimsson, Eydis Magnusdottir, Judy Fong, Michal Borsky, Jon Gudnason

Language: Icelandic

Recommended use: speech recognition, speaker verification, speaker identification

Collection Procedure: The data was collected using the website https://samromur.is, the code of which is available at https://github.com/cadia-lvl/samromur. The participants are aged between 18 and 90. Of the recordings, 59,782 are from female speakers and 40,218 are from male speakers. Recordings were made with a smartphone or the web app. The original audio was collected as *.wav files at a 44.1 kHz or 48 kHz sampling rate, then down-sampled to 16 kHz and re-encoded as *.flac. Each recording contains one sentence read from a script. The script contains 85,080 unique sentences and 90,838 unique tokens. The participants self-reported their age group, gender, and native language.
There was no identifier other than the session ID, which is used as the speaker ID. The corpus is distributed with a metadata file containing detailed information on each utterance and speaker.

Data Format Specifics, Text: The corpus does not contain separate transcription or prompt files. The metadata file contains the prompts in their original text form, as the participants saw them, and also in their normalized form. The prompts were gathered from a variety of sources, mainly from The Icelandic Gigaword Corpus, which is available at http://clarin.is/en/resources/gigaword. The corpus includes text from novels, news, plays, and from a list of location names in Iceland. Prompts were also taken from the Icelandic Web of Science (https://www.visindavefur.is/). Prompts were pulled from these sources only if they contained no letters outside the Icelandic alphabet and if their words are listed in DIM: The Database of Icelandic Morphology [1]. Finally, there are also synthesised prompts consisting of a name followed by a question or a demand, in order to simulate a dialogue with a smart device. The content of each audio file was manually verified against its prompt by one or more listeners. The metadata file is encoded as UTF-8 Unicode.

Data Format Specifics, Audio: The corpus contains 100,000 utterances from 8,392 speakers, totalling 145 hours. The distributed audio files are encoded at a 16 kHz sampling rate, 16-bit linear PCM, 1 channel, in *.flac format. The corpus is split into train, dev, and test subsets with no speaker overlap. Each subset contains folders that correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac.

Citation: When publishing results based on the corpus, please refer to: Mollberg et al. "Samrómur: Crowdsourcing Data Collection for Icelandic Speech Recognition". Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille. 2020.

Contact: Jon Gudnason (jg@ru.is)

License: CC BY 4.0

[1] Bjarnadóttir et al. "DIM: The Database of Icelandic Morphology". Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), Finland. 2019.
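
As an illustration of the directory layout described under Data Format Specifics, Audio, the following Python sketch splits a relative audio path into subset, speaker ID, and utterance ID, and checks that no speaker appears in more than one subset. It is a minimal sketch and not part of the official distribution. It assumes each file sits at {subset}/{speaker_ID}/{speaker_ID}-{utterance_ID}.flac; the function names, example paths, and IDs are hypothetical.

```python
from pathlib import PurePosixPath
from typing import Dict, Iterable, NamedTuple, Set


class Utterance(NamedTuple):
    subset: str        # name of the subset folder, e.g. "train" (assumed)
    speaker_id: str    # session ID, which the corpus uses as the speaker ID
    utterance_id: str


def parse_audio_path(path: str) -> Utterance:
    """Parse '{subset}/{speaker_ID}/{speaker_ID}-{utterance_ID}.flac'."""
    p = PurePosixPath(path)
    if p.suffix != ".flac" or len(p.parts) < 3:
        raise ValueError(f"unexpected audio path: {path}")
    subset, speaker_id = p.parts[-3], p.parts[-2]
    prefix = speaker_id + "-"
    # The file name must repeat the speaker folder name before the hyphen.
    if not p.stem.startswith(prefix) or len(p.stem) == len(prefix):
        raise ValueError(f"file name does not match speaker folder: {path}")
    return Utterance(subset, speaker_id, p.stem[len(prefix):])


def overlapping_speakers(paths: Iterable[str]) -> Set[str]:
    """Return speaker IDs found in more than one subset (expected: none)."""
    subsets_by_speaker: Dict[str, Set[str]] = {}
    for path in paths:
        utt = parse_audio_path(path)
        subsets_by_speaker.setdefault(utt.speaker_id, set()).add(utt.subset)
    return {spk for spk, subsets in subsets_by_speaker.items() if len(subsets) > 1}


if __name__ == "__main__":
    # Hypothetical paths, for illustration only.
    example = [
        "train/000123/000123-0004567.flac",
        "train/000123/000123-0004568.flac",
        "test/000456/000456-0009999.flac",
    ]
    for path in example:
        print(parse_audio_path(path))
    print("speakers in more than one subset:", overlapping_speakers(example))
```

Taking the speaker ID from the folder name, rather than splitting the file name on the first hyphen, keeps the sketch usable even if an ID itself happened to contain a hyphen. Whether real IDs do is not stated in this description.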