--------------------------------------------------------------------------------
Samrómur Queries Icelandic Speech 1.0
--------------------------------------------------------------------------------

Language        : Icelandic
Authors         : Staffan Hedström, Judy Y. Fong, Ragnheiður Þórhallsdóttir, David Erik Mollberg, Smári Freyr Guðmundsson, Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir, Eydís Huld Magnúsdóttir, Jon Gudnason
Recommended use : speech recognition, speaker verification, speaker identification and speaker enrollment focused on queries

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

This release of data from the Samrómur collection focuses on queries. It contains 17,475 validated speech recordings (20 hours) in Icelandic.

Please note that version 1.0 is equivalent to "Samrómur Queries Icelandic Speech 21.12" as used by the Language Technology Programme for Icelandic 2019-2023.

The corpus is a result of the crowd-sourcing effort run by the Language and Voice Lab (LVL) at Reykjavik University, in cooperation with Almannarómur, the Icelandic Center for Language Technology. The recording process started in October 2019 and continues to this day (December 2021). The present edition of the corpus was authorized for release in December 2021.

The aim is to create an open-source speech corpus to enable research and development for Icelandic Language Technology. The corpus consists of audio recordings and a metadata file containing the sentences read by the participants.

For more open resources developed by the Language and Voice Lab (LVL), see the GitHub repository at https://github.com/cadia-lvl/samromur-asr

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- The utterances were recorded using a smartphone or the web app.
- Participants self-reported their age group, gender and native language.
- Participants range in age from 6 to 80+ years.
- The corpus contains 17,475 utterances from 3,809 speakers, totalling 20 hours.
- The amount of data from female speakers is 13h17m, from male speakers 6h9m, and from speakers with unknown gender information 34m.
- The number of female speakers is 2,322 and the number of male speakers is 1,294. The number of speakers with unknown gender information is 193.
- The number of utterances from female speakers is 11,320, from male speakers 5,676, and from speakers with unknown gender information 479.
- The corpus is split into train, dev, and test sets. The lengths of the sets are: train = 16h, test = 2h, dev = 2h. The sets have no speaker overlap. Test has no sentence overlap with train or dev. Dev has 242 sentences overlapping with train.
- If any of the information in the metadata is unavailable, this is indicated with NAN in the metadata file.

--------------------------------------------------------------------------------
Collection Procedure
--------------------------------------------------------------------------------

The data was collected using the website https://samromur.is, the code for which is available at https://github.com/cadia-lvl/samromur. The collection procedure is described in "Samrómur: Crowd-sourcing Data Collection for Icelandic Speech Recognition" [1]. For this corpus, the utterances selected were all queries.

The original audio was collected at a 44.1 kHz or 48 kHz sampling rate as *.wav files, which were down-sampled to 16 kHz and converted to *.flac. Each recording contains one read query from a script. The script contains 73,893 unique queries, 556,882 tokens and 37,056 word types.

The first time a device visits the website, it is assigned a client id. This client id, together with the combination of gender, age and native language, was used to assign the speaker id. If any of these variables were changed, a new speaker id was also created.

The corpus is distributed with a metadata file with detailed information on each utterance and speaker. The metadata file is encoded as UTF-8 Unicode.

The queries were gathered from a variety of sources, mainly from The Icelandic Gigaword Corpus, which is available at http://clarin.is/en/resources/gigaword. The corpus includes queries found in novels, news, and plays. The queries also came from the Icelandic Web of Science (https://www.visindavefur.is/). Queries were pulled from these sources if they met the criteria of having only letters which are present in the Icelandic alphabet, and if they were listed in DIM: The Database of Icelandic Morphology [2]. Finally, there are also synthesized queries consisting of a name followed by a question, in order to simulate a dialogue with a smart device.

The content of the audio files was manually verified against the questions by one or more listeners.

--------------------------------------------------------------------------------
Data Format Specifics
--------------------------------------------------------------------------------

- Text : The corpus does not contain separate transcription or sentence files. The metadata file contains the sentences in their original text form, as the participants saw them, and also in their normalized form.
- Audio: The distributed audio files are encoded at 16 kHz sampling rate, 16 bit linear PCM, 1 channel, *.flac format.

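The speaker id rule from the collection procedure (a client id combined with the self-reported gender, age and native language) can be illustrated with a minimal sketch. The class name, the integer ids and the example values below are assumptions made for illustration; the actual implementation used by Samrómur is not described in this README.

```python
from typing import Dict, Tuple

# Hypothetical key layout: the README only states that the client id is
# combined with gender, age and native language; the real scheme may differ.
SpeakerKey = Tuple[str, str, str, str]


class SpeakerRegistry:
    """Hands out a new speaker id for every unseen combination of fields."""

    def __init__(self) -> None:
        self._ids: Dict[SpeakerKey, int] = {}

    def speaker_id(self, client_id: str, gender: str, age: str,
                   native_language: str) -> int:
        key = (client_id, gender, age, native_language)
        if key not in self._ids:
            # Any change in one of the four fields produces a new key,
            # and therefore a new speaker id.
            self._ids[key] = len(self._ids) + 1
        return self._ids[key]


registry = SpeakerRegistry()
# Example values are made up; they are not taken from the metadata file.
first = registry.speaker_id("client-a", "female", "20-29", "Icelandic")
same = registry.speaker_id("client-a", "female", "20-29", "Icelandic")
changed = registry.speaker_id("client-a", "female", "30-39", "Icelandic")
```

Under this reading, the same person may appear under more than one speaker id, for example after recording from a second device or after changing a self-reported field.
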
The corpus is split into train, dev, and test subsets with no speaker overlap. Each subset contains folders that correspond to speaker IDs, and the audio files inside use the following naming convention: {speaker_ID}-{utterance_ID}.flac.

--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

Hedström et al. "Samrómur Queries 21.12". Web Download. Reykjavik University: Language and Voice Lab, 2021.

Contact: Jon Gudnason (jg@ru.is)
License: CC BY 4.0 (https://creativecommons.org/licenses/by/4.0/legalcode)

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was funded by the Language Technology Programme for Icelandic 2019-2023. The programme, which is managed and coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, Science and Culture.

The verification for the dataset was mainly done by the Language and Voice Lab at Reykjavik University, but part of the data was verified by summer students from the Student Summer Job Program in 2020 and 2021, funded by the Icelandic Directorate of Labour.

Special thanks to the assisting LVL members and summer students for all the hard work.

--------------------------------------------------------------------------------
Stats for the dataset
--------------------------------------------------------------------------------

Age and gender split:

|                  | Total | Test  | Dev   | Train |
| ---------------- | ----- | ----- | ----- | ----- |
| 0-19:            | 26.4% | 18.2% | 23.0% | 27.7% |
| 20-39:           | 31.8% | 43.0% | 36.5% | 30.0% |
| 40-59:           | 34.5% | 32.4% | 30.5% | 35.2% |
| 60-79:           | 5.7%  | 6.0%  | 7.4%  | 5.5%  |
| 80+:             | 0.6%  | 0.4%  | 2.6%  | 0.3%  |
| ---------------- | ----- | ----- | ----- | ----- |
| Female:          | 66.4% | 49.1% | 49.6% | 70.7% |
| Male:            | 30.7% | 50.9% | 50.4% | 25.7% |
| Other:           | 2.9%  | 0.0%  | 0.0%  | 3.6%  |
| ---------------- | ----- | ----- | ----- | ----- |
| Duration (h):    | 20.0  | 2.0   | 2.0   | 16.0  |
| Unique speakers: | 3809  | 570   | 534   | 2705  |

Number of utterances in each subset:

train: 14,140
dev:   1,728
test:  1,607

Total speakers and utterances:

Speakers:   3,809
Utterances: 17,475

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Mollberg et al. "Samrómur: Crowd-sourcing Data Collection for Icelandic Speech Recognition". 12th International Conference on Language Resources and Evaluation (LREC), France, 2020.

[2] Bjarnadóttir et al. "DIM: The Database of Icelandic Morphology". Proceedings of the 22nd Nordic Conference on Computational Linguistics (NoDaLiDa), Finland, 2019.

--------------------------------------------------------------------------------