Althingi Parliamentary Speech

Item Name: Althingi Parliamentary Speech
Author(s): Inga Rún Helgadóttir, Róbert Kjaran, Anna Björk Nikulásdóttir, Jon Gudnason
LDC Catalog No.: LDC2021S01
ISBN: 1-58563-956-7
ISLRN: 142-519-062-218-1
DOI: https://doi.org/10.35111/695b-6697
Release Date: February 15, 2021
Member Year(s): 2021
DCMI Type(s): Sound, Text
Sample Type: mp3
Sample Rate: 44100
Data Source(s): microphone speech
Application(s): speech recognition
Language(s): Icelandic
Language ID(s): isl
License(s): Althingi Parliamentary Speech Agreement (For-Profit)
Althingi Parliamentary Speech Agreement (Non-Member)
Althingi Parliamentary Speech Agreement (Not-For-Profit)
Online Documentation: LDC2021S01 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Helgadóttir, Inga Rún, et al. Althingi Parliamentary Speech LDC2021S01. Web Download. Philadelphia: Linguistic Data Consortium, 2021.

Introduction

Althingi Parliamentary Speech consists of approximately 542 hours of recorded speech from Althingi, the Icelandic Parliament, along with corresponding transcripts, a pronunciation dictionary and two language models. Speeches date from 2005-2016.

This dataset was collected in 2016 by the ASR for Althingi project at Reykjavik University in collaboration with the Althingi speech department. The project aimed to develop an automatic speech recognition (ASR) system for parliamentary speech that would replace the manual transcription of delivered speeches.

Data

The mean speech length is six minutes, with speeches ranging from under one minute to around thirty minutes. The corpus features 197 speakers (105 male, 92 female) and is split into training, development and evaluation sets. The language models are of two types: a pruned trigram model, used in decoding, and an unpruned constant ARPA 5-gram model, used for re-scoring decoding results.

Audio data is presented as single-channel 16-bit mp3 files; the majority of these files have a sample rate of 44.1 kHz. Transcripts and other text data are plain text encoded in UTF-8.
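
Since only the majority of the files are sampled at 44.1 kHz, it may help to locate the exceptions before feature extraction. The following is a minimal sketch, not part of the corpus distribution, that reads the sample rate and channel mode from the first MPEG audio frame header of each file using only the Python standard library. The function names and the `corpus_root` argument are hypothetical, and the sketch assumes the audio files carry an `.mp3` extension; consult the corpus documentation for the actual directory layout.

```python
from pathlib import Path

# Sample rates (Hz) by MPEG version bits, indexed by the 2-bit rate field.
MPEG_SAMPLE_RATES = {
    0b11: (44100, 48000, 32000),  # MPEG-1
    0b10: (22050, 24000, 16000),  # MPEG-2
    0b00: (11025, 12000, 8000),   # MPEG-2.5
}


def skip_id3v2(data):
    """Return the offset just past a leading ID3v2 tag, or 0 if there is none."""
    if len(data) >= 10 and data[:3] == b"ID3":
        # The tag size is a 28-bit "synchsafe" integer: 7 useful bits per byte.
        size = (data[6] << 21) | (data[7] << 14) | (data[8] << 7) | data[9]
        return 10 + size
    return 0


def first_frame_info(data):
    """Return (sample_rate_hz, is_mono) from the first plausible frame header.

    Returns None if no plausible MPEG audio frame header is found.
    """
    i = skip_id3v2(data)
    while i + 4 <= len(data):
        b0, b1, b2, b3 = data[i:i + 4]
        # A frame header starts with 11 set bits (the frame sync).
        if b0 == 0xFF and (b1 & 0xE0) == 0xE0:
            version = (b1 >> 3) & 0b11
            layer = (b1 >> 1) & 0b11
            bitrate_idx = (b2 >> 4) & 0b1111
            rate_idx = (b2 >> 2) & 0b11
            # Reject reserved values, which indicate a false sync match.
            if (version in MPEG_SAMPLE_RATES and layer != 0
                    and rate_idx != 0b11 and bitrate_idx != 0b1111):
                is_mono = (b3 >> 6) == 0b11
                return MPEG_SAMPLE_RATES[version][rate_idx], is_mono
        i += 1
    return None


def find_atypical(corpus_root, expected_rate=44100):
    """List (path, info) for .mp3 files that are not mono at expected_rate."""
    atypical = []
    for path in sorted(Path(corpus_root).rglob("*.mp3")):
        info = first_frame_info(path.read_bytes())
        if info != (expected_rate, True):
            atypical.append((path, info))
    return atypical
```

Calling `find_atypical` on the directory that holds the audio returns the files whose first frame is not mono at 44.1 kHz, which can then be resampled or handled separately. The check inspects only the first frame, so it assumes the sample rate does not change within a file.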

Samples

Please view this audio sample and transcript sample.

Updates

None at this time.

Additional Citation

When publishing results based on the texts in the corpus, please refer to:

Inga Rún Helgadóttir, Róbert Kjaran, Anna Björk Nikulásdóttir and Jón Guðnason, 2017. Building an ASR corpus using Althingi’s Parliamentary Speeches. Proceedings of Interspeech 2017.
