Samrómur Queries Icelandic Speech 1.0

Item Name: Samrómur Queries Icelandic Speech 1.0
Author(s): Staffan Hedström, Judy Fong, Ragnheiður Þórhallsdóttir, David Mollberg, Smári Freyr Guðmundsson, Ólafur Helgi Jónsson, Sunneva Þorsteinsdóttir, Eydis Huld Magnusdottir, Jon Gudnason
LDC Catalog No.: LDC2023S05
ISLRN: 363-728-488-848-1
DOI: https://doi.org/10.35111/aq18-1540
Release Date: August 15, 2023
Member Year(s): 2023
DCMI Type(s): Sound, Text
Sample Type: flac
Sample Rate: 16000 Hz
Data Source(s): web collection
Application(s): speaker identification, speaker verification, speech recognition
Language(s): Icelandic
Language ID(s): isl
License(s): Samrómur Queries Icelandic Speech 1.0 Agreement (For-Profit Member)
Samrómur Queries Icelandic Speech 1.0 Agreement (Non-Member)
Samrómur Queries Icelandic Speech 1.0 Agreement (Not-for-Profit)
Online Documentation: LDC2023S05 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Hedström, Staffan, et al. Samrómur Queries Icelandic Speech 1.0 LDC2023S05. Web Download. Philadelphia: Linguistic Data Consortium, 2023.

Introduction

Samrómur Queries Icelandic Speech 1.0 was developed by the Language and Voice Lab, Reykjavik University, in cooperation with Almannarómur, Center for Language Technology. The corpus contains 20 hours of prompted Icelandic queries, comprising 17,475 utterances from 3,809 speakers.
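
For a rough sense of scale, the figures above can be turned into per-utterance and per-speaker averages. The sketch below uses only the numbers stated in this description; since the 20-hour total is a rounded figure, the results are approximate, and the variable names are illustrative only.

```python
# Corpus totals as stated in the description (the hour count is rounded).
TOTAL_HOURS = 20
UTTERANCES = 17_475
SPEAKERS = 3_809

# Average length of one utterance, in seconds.
avg_utterance_seconds = TOTAL_HOURS * 3600 / UTTERANCES

# Average number of utterances contributed by one speaker.
avg_utterances_per_speaker = UTTERANCES / SPEAKERS

print(f"about {avg_utterance_seconds:.1f} s per utterance")
print(f"about {avg_utterances_per_speaker:.1f} utterances per speaker")
```

This works out to roughly 4 seconds per utterance and between 4 and 5 utterances per speaker.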

This release (version 1.0) is equivalent to "Samrómur Queries Icelandic Speech 21.12" as used by the Language Technology Programme for Icelandic 2019-2023.

Data

Speech data was collected between October 2019 and December 2021 using the Samrómur website, which displayed prompts to participants. The prompts were drawn mainly from The Icelandic Gigaword Corpus, which includes text from novels, news, and plays, and from a list of location names in Iceland. Additional prompts were taken from the Icelandic Web of Science, and others were created by following a name with a question. Prompts and speaker metadata are included in the corpus.

The audio data is divided into train, dev, and test sets and is presented as FLAC-compressed, single-channel, 16 kHz, 16-bit linear PCM.
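
The stated audio format can be checked per file without a third-party decoder, because FLAC stores the sample rate, channel count, and bit depth in its leading STREAMINFO metadata block. The following is a minimal sketch using only the Python standard library; the function names and the file path passed to `read_flac_header` are hypothetical and are not part of the corpus documentation.

```python
def read_flac_header(path: str) -> bytes:
    """Read the first 42 bytes: 'fLaC' marker, block header, STREAMINFO body."""
    with open(path, "rb") as f:
        return f.read(42)


def parse_flac_streaminfo(header: bytes) -> dict:
    """Decode basic audio properties from the start of a FLAC stream."""
    if header[:4] != b"fLaC":
        raise ValueError("not a FLAC stream")
    # Low 7 bits of byte 4 give the metadata block type; 0 means STREAMINFO.
    if header[4] & 0x7F != 0:
        raise ValueError("first metadata block is not STREAMINFO")
    body = header[8:42]
    if len(body) < 34:
        raise ValueError("truncated STREAMINFO block")
    # Bytes 10..17 of the body pack four fields into 64 bits:
    # sample rate (20), channels - 1 (3), bits per sample - 1 (5), total samples (36).
    packed = int.from_bytes(body[10:18], "big")
    sample_rate = packed >> 44
    total_samples = packed & ((1 << 36) - 1)
    return {
        "sample_rate": sample_rate,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "duration_seconds": total_samples / sample_rate if sample_rate else 0.0,
    }


def matches_corpus_format(info: dict) -> bool:
    """True if the file is single-channel, 16 kHz, 16-bit, as described above."""
    return (
        info["sample_rate"] == 16000
        and info["channels"] == 1
        and info["bits_per_sample"] == 16
    )
```

A typical use would be `matches_corpus_format(parse_flac_streaminfo(read_flac_header(path)))` for each audio file in a split, flagging any file that returns `False`.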

Samples

Please view this audio sample (FLAC).

Updates

None at this time.
