--------------------------------------------------------------------------------
Samrómur Children Icelandic Speech 1.0
--------------------------------------------------------------------------------

Language        : Icelandic
Authors         : Carlos Mena, Michal Borsky, David Erik Mollberg,
                  Smári Freyr Guðmundsson, Staffan Hedström, Ragnar Pálsson,
                  Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir,
                  Jóhanna Vigdís Guðmundsdóttir, Eydís Huld Magnúsdóttir,
                  Ragnheiður Þórhallsdóttir, Jon Gudnason.
Recommended use : speech recognition, speaker verification,
                  speaker identification.

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

This is the first release of the Samrómur Children corpus. It contains more
than 137,000 validated speech recordings uttered by Icelandic children.

Please note that this version 1.0 is equivalent to "Samrómur Children Icelandic
Speech 21.09" as used by the Language Technology Programme for Icelandic
2019-2023.

The corpus is the result of a crowd-sourcing effort run by the Language and
Voice Lab (LVL) at Reykjavik University, in cooperation with Almannarómur,
Center for Language Technology. The recording process started in October 2019
and continues to this day (September 2021). The present edition of the corpus
was authorized for release in September 2021.

The aim is to create an open-source speech corpus to enable research and
development for Icelandic Language Technology. The corpus consists of audio
recordings and a metadata file containing the prompts read by the participants.

To see more open resources developed by the Language and Voice Lab, see the
GitHub repository at https://github.com/cadia-lvl/samromur-asr

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- The utterances were recorded with a smartphone or the web app.
- Participants self-reported their age group, gender, and native language.
- Participants are aged between 4 and 17 years.
- The corpus contains 137597 utterances from 3175 speakers, totalling 131 hours.
- Female speakers account for 73h38m of data, male speakers for 52h26m, and
  speakers whose gender is unknown for 05h02m.
- There are 1667 female speakers, 1412 male speakers, and 96 speakers whose
  gender is unknown.
- There are 78993 recordings from female speakers, 53927 from male speakers,
  and 4677 from speakers whose gender is unknown.
- The corpus is split into train, dev, and test portions. The length of each
  portion is: train = 127h25m, test = 1h50m, dev = 1h50m.

--------------------------------------------------------------------------------
Experimental Perspective
--------------------------------------------------------------------------------

In the field of Automatic Speech Recognition (ASR), it is well known that
children's speech is particularly hard to recognise because of its high
variability, which is caused by developmental changes in children's anatomy and
speech production skills [2]. For this reason, the selection criteria for the
train/dev/test portions have to take the children's age into account.
Nevertheless, Samrómur Children is an unbalanced corpus in terms of the gender
and age of the speakers.
This means that the corpus has, for example, a total of 1667 female speakers
(73h38m) versus 1412 male speakers (52h26m). These imbalances constrain the
types of experiments that can be performed with the corpus. For example, an
equal number of female and male speakers across given age ranges is impossible.
So, if one cannot have a perfectly balanced training set, one can at least have
a balanced test portion.

The test portion of Samrómur Children was meticulously selected to cover ages
between 6 and 16 years for both female and male speakers. Each age in this
range, for each gender, accounts for a total of 5 minutes of audio (11 ages x
2 genders x 5 minutes = 1h50m). The development portion of the corpus contains
only speakers whose gender is unknown. The test and dev sets each have a total
duration of 1h50m.

To allow fairer experiments, no speakers are shared between the train and test
sets. However, one speaker is shared between the train and development sets; it
can be identified by the speaker ID 010363. No audio files are shared between
these two sets.

--------------------------------------------------------------------------------
Collection Procedure
--------------------------------------------------------------------------------

The data was collected using the website https://samromur.is, the code of which
is available at https://github.com/cadia-lvl/samromur. The age range selected
for this corpus is between 4 and 17 years.

The original audio was collected at a 44.1 kHz or 48 kHz sampling rate as *.wav
files, which were down-sampled to 16 kHz and converted to *.flac. Each recording
contains one read sentence from a script. The script contains 85,080 unique
sentences and 90,838 unique tokens.

There was no identifier other than the session ID, which is used as the speaker
ID. The corpus is distributed with a metadata file containing detailed
information on each utterance and speaker. The metadata file is encoded as
UTF-8 Unicode.

The prompts were gathered from a variety of sources, mainly from The Icelandic
Gigaword Corpus, which is available at http://clarin.is/en/resources/gigaword.
The corpus includes text from novels, news, plays, and from a list of location
names in Iceland. The prompts also came from the Icelandic Web of Science
(https://www.visindavefur.is/).

Prompts were pulled from these corpora if they contained only letters that are
present in the Icelandic alphabet, and if they are listed in DIM: The Database
of Icelandic Morphology [1].

Finally, there are also synthesised prompts consisting of a name followed by a
question or a demand, in order to simulate a dialogue with a smart device.

The content of the audio files was manually verified against the prompts by one
or more listeners.

--------------------------------------------------------------------------------
Data Format Specifics
--------------------------------------------------------------------------------

- Text : The corpus does not include separate transcription or prompt files.
         The metadata file contains the prompts in their original text form,
         as the participants saw them, and also in their normalized form.

- Audio: The distributed audio files are encoded at a 16 kHz sampling rate,
         16 bit linear PCM, 1 channel, *.flac format.

The corpus is split into train, dev, and test subsets. No speakers are shared
between train and test; one speaker (ID 010363) appears in both train and dev,
as noted under Experimental Perspective. Each subset contains folders that
correspond to speaker IDs, and the audio files inside use the following naming
convention: {speaker_ID}-{utterance_ID}.flac.

--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus, please refer to:

Mena, Carlos et al. "Samrómur Children: Icelandic Speech Data 21.09".
Web Download. Reykjavik University: Language and Voice Lab, 2021.

Contact: Jon Gudnason (jg@ru.is)
License: CC BY 4.0

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was funded by the Language Technology Programme for Icelandic
2019-2023. The programme, which is managed and coordinated by Almannarómur, is
funded by the Icelandic Ministry of Education, Science and Culture.

The verification for the dataset was funded by the Icelandic Directorate of
Labour's Student Summer Job Program in 2020 and 2021.

Special thanks to the summer students for all their hard work.

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Bjarnadóttir et al. "DIM: The Database of Icelandic Morphology".
    Proceedings of the 22nd Nordic Conference on Computational Linguistics
    (NoDaLiDa), Finland. 2019.

[2] Hämäläinen, A., Candeias, S., Cho, H., Meinedo, H., Abad, A.,
    Pellegrini, T., ... & Dias, M. S. (2014, September). Correlating ASR errors
    with developmental changes in speech production: A study of 3-10-year-old
    European Portuguese children's speech. In Workshop on Child Computer
    Interaction-WOCCI 2014.

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
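
The folder layout and naming convention described under Data Format Specifics
({speaker_ID}-{utterance_ID}.flac inside one folder per speaker) can be used to
check the speaker overlap described under Experimental Perspective. The
following is only a minimal sketch, not part of the corpus distribution. It
assumes that the subset folders are literally named train, dev, and test, and
that speaker IDs contain no hyphen. The function names and the example
utterance ID are hypothetical.

```python
from pathlib import Path


def parse_flac_name(path):
    """Split '{speaker_ID}-{utterance_ID}.flac' into (speaker_id, utterance_id).

    Assumption: the speaker ID contains no hyphen, so the first hyphen in the
    file name separates the two fields.
    """
    p = Path(path)
    if p.suffix != ".flac":
        raise ValueError(f"not a .flac file: {path}")
    speaker_id, sep, utterance_id = p.stem.partition("-")
    if not sep or not speaker_id or not utterance_id:
        raise ValueError(f"unexpected file name: {p.name}")
    return speaker_id, utterance_id


def speakers_by_subset(corpus_root, subsets=("train", "dev", "test")):
    """Collect the speaker IDs found in each subset folder.

    Assumed layout: <corpus_root>/<subset>/<speaker_ID>/<file>.flac
    """
    speakers = {subset: set() for subset in subsets}
    for subset in subsets:
        for flac in Path(corpus_root, subset).glob("*/*.flac"):
            speaker_id, _ = parse_flac_name(flac)
            speakers[subset].add(speaker_id)
    return speakers


def shared_speakers(speakers, first, second):
    """Return the speaker IDs that occur in both named subsets."""
    return speakers[first] & speakers[second]


# Hypothetical file name; the real utterance ID format may differ.
print(parse_flac_name("train/010363/010363-0001.flac"))
```

If the assumptions hold, calling speakers_by_subset() on the unpacked corpus
root and then shared_speakers(..., "train", "test") should return an empty set,
while shared_speakers(..., "train", "dev") should return only the speaker
010363 mentioned above.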