Title: Samrómur Icelandic Speech 1.0

Description:
This is the first release of the Samrómur Icelandic Speech corpus, which contains 100,000 validated utterances.
The corpus is the result of a crowd-sourcing effort run by the Language and Voice Lab at Reykjavik University,
in cooperation with Almannarómur, Center for Language Technology. Recording started in October 2019 and
continues to this day (May 2021). This release was authorized in May 2021. The aim is to create
an open-source speech corpus to enable research and development for Icelandic Language Technology. Please note
that version 1.0 is equivalent to "Samrómur Icelandic Speech 21.05" as used by the Language Technology Programme
for Icelandic 2019-2023.

The corpus contains audio recordings and a metadata file that contains the prompts the participants read. A
Kaldi-based script using this data can be found on the Language and Voice Lab GitHub page: https://github.com/cadia-lvl/samromur-asr

Authors: David Mollberg, Olafur Jonsson, Sunneva Thorsteinsdottir, Jóhanna Vigdís Guðmundsdóttir, Steinthor Steingrimsson, Eydis Magnusdottir, Judy Fong, Michal Borsky, Jon Gudnason

Language: Icelandic

Recommended use: speech recognition, speaker verification, speaker identification

Collection Procedure: The data was collected using the website https://samromur.is, the source code of which is
available at https://github.com/cadia-lvl/samromur. The participants are aged between 18 and 90. Of the recordings,
59,782 are from female speakers and 40,218 are from male speakers. Recordings were made with a smartphone or the
web app. The original audio was collected as *.wav files at a 44.1 kHz or 48 kHz sampling rate, then down-sampled
to 16 kHz and re-encoded as *.flac. Each recording contains one sentence read from a script. The script contains
85,080 unique sentences and 90,838 unique tokens. The participants self-reported their age group, gender, and
native language. There was no identifier other than the session ID, which is used as the speaker ID. The corpus is
distributed with a metadata file containing detailed information on each utterance and speaker.
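
The down-sampling step above can be illustrated with a little arithmetic. The sketch below only computes the
integer up/down factors that a rational (polyphase) resampler would need to go from each source rate to 16 kHz.
The tool actually used to produce the corpus is not stated here, and the function name resample_factors is
hypothetical.

```python
from fractions import Fraction

TARGET_RATE = 16_000  # Hz, the sampling rate of the distributed audio


def resample_factors(source_rate, target_rate=TARGET_RATE):
    """Return (up, down) such that source_rate * up / down == target_rate.

    Hypothetical helper for illustration; it is not part of the corpus tooling.
    """
    ratio = Fraction(target_rate, source_rate)  # reduced to lowest terms
    return ratio.numerator, ratio.denominator


# The two source rates mentioned in the collection procedure.
for rate in (44_100, 48_000):
    up, down = resample_factors(rate)
    print(f"{rate} Hz -> {TARGET_RATE} Hz: up={up}, down={down}")
```

The 48 kHz case reduces to a plain factor of 3, while 44.1 kHz needs the ratio 160/441, which is why a rational
resampler rather than simple decimation would be required for those recordings.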

Data Format Specifics, Text: The corpus does not contain separate transcription or prompt files. The metadata file
contains the prompts in their original text form, as the participants saw them, and also in their normalized form.
The prompts were gathered from a variety of sources, mainly from The Icelandic Gigaword Corpus, which is available
at http://clarin.is/en/resources/gigaword. The corpus includes text from novels, news, plays, and from a list of
location names in Iceland. The prompts also came from the Icelandic Web of Science (https://www.visindavefur.is/).
The prompts were pulled from these corpora if they met the criteria of having only letters which are present in
the Icelandic alphabet, and if they are listed in DIM: The Database of Icelandic Morphology [1]. Finally, there are
also synthesised prompts consisting of a name followed by a question or a demand, in order to simulate a dialogue
with a smart device. The content of the audio files was manually verified against the prompts by one or more
listeners. The metadata file is encoded as UTF-8 Unicode.
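
The two selection criteria for prompts can be sketched as a simple filter. This is a minimal illustration, not
the procedure actually used: it assumes that the lexicon check applies to each word of a prompt, that matching
is done on lowercased words with surrounding ASCII punctuation stripped, and that the lexicon is available as a
plain set of word forms. The names tokenize and is_valid_prompt and the toy lexicon are hypothetical.

```python
import string
import unicodedata

# The 32 letters of the modern Icelandic alphabet, lowercase.
ICELANDIC_LETTERS = set("aábdðeéfghiíjklmnoóprstuúvxyýþæö")


def tokenize(sentence):
    """Split on whitespace, strip surrounding ASCII punctuation, lowercase.

    This tokenization is an assumption made for the example.
    """
    normalized = unicodedata.normalize("NFC", sentence)  # use precomposed letters
    tokens = (tok.strip(string.punctuation) for tok in normalized.split())
    return [tok.lower() for tok in tokens if tok]


def is_valid_prompt(sentence, lexicon):
    """Accept a sentence only if every word uses Icelandic letters and is in the lexicon."""
    tokens = tokenize(sentence)
    if not tokens:
        return False
    return all(set(tok) <= ICELANDIC_LETTERS and tok in lexicon for tok in tokens)


# Toy stand-in for a morphological lexicon such as DIM (hypothetical contents).
toy_lexicon = {"hvar", "er", "hesturinn"}
print(is_valid_prompt("Hvar er hesturinn?", toy_lexicon))   # every word passes both checks
print(is_valid_prompt("Where is the horse?", toy_lexicon))  # "w" is not an Icelandic letter
```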

Data Format Specifics, Audio: The corpus contains 100,000 utterances from 8,392 speakers, totalling 145 hours. The
distributed audio files are encoded at a 16 kHz sampling rate, 16-bit linear PCM, 1 channel, *.flac format. The
corpus is split into train, dev, and test subsets with no speaker overlap. Each subset contains folders that
correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac.
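
The directory layout and naming convention above can be walked with a short script. The sketch below is a
minimal example under stated assumptions: the subset folders are assumed to be named train, dev, and test
directly under the corpus root, and speaker IDs are assumed to contain no hyphen, so the file name can be split
at the first "-". The function names index_corpus and overlapping_speakers are hypothetical.

```python
from collections import defaultdict
from pathlib import Path

SPLITS = ("train", "dev", "test")  # assumed subset folder names


def index_corpus(root):
    """Map split -> speaker_ID -> list of utterance_IDs found on disk."""
    index = {split: defaultdict(list) for split in SPLITS}
    for split in SPLITS:
        # Expected layout: <root>/<split>/<speaker_ID>/<speaker_ID>-<utterance_ID>.flac
        for flac in sorted(Path(root, split).glob("*/*.flac")):
            speaker_id, _, utterance_id = flac.stem.partition("-")
            if speaker_id != flac.parent.name:
                raise ValueError(f"speaker ID mismatch in {flac}")
            index[split][speaker_id].append(utterance_id)
    return index


def overlapping_speakers(index):
    """Return the speakers that appear in more than one split, with those splits."""
    owners = defaultdict(set)
    for split, speakers in index.items():
        for speaker_id in speakers:
            owners[speaker_id].add(split)
    return {spk: splits for spk, splits in owners.items() if len(splits) > 1}
```

Calling overlapping_speakers(index_corpus(path_to_corpus)) should return an empty dictionary if the subsets are
speaker-disjoint as described.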

Citation: When publishing results based on the corpus please refer to:
Mollberg et al. "Samrómur: Crowdsourcing Data Collection for Icelandic Speech Recognition". Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), Marseille. 2020.

Contact: Jon Gudnason (jg@ru.is)

License: CC BY 4.0


[1] Bjarnadóttir et al. "DIM: The Database of Icelandic Morphology". Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), Finland. 2019.