Althingi's Parliamentary Speeches for ASR is an aligned and segmented corpus of speech recordings.

About the Althingi Parliamentary Speech corpus
----------------------------------------------

This is an aligned and segmented corpus of 6,493 Althingi recordings with 196 speakers. The recordings consist of 199,614 segments with an average duration of 9.8 s. A file called segments links each text segment to its place in the audio files. The data set has a total duration of 542 hours and 25 minutes and contains 4,583,751 word tokens.

The corpus is split into training, development and evaluation sets. The training set contains speeches from 2005 to 2015, with a total duration of 514.5 hours. The speeches from 2016 were split evenly between the development and evaluation sets, with 14 hours each. The evaluation set is cleaner than the development set, and both are cleaner than the training set.

The pronunciation dictionary is based on an edited version of Hjal’s pronunciation dictionary (E. Rögnvaldsson, 2003), which is available at Málföng.is, plus common words from the Althingi texts and from Málrómur (J. Guðnason et al., 2012). It currently contains ~181,000 words. Sequitur’s grapheme-to-phoneme converter (M. Bisani et al., 2008), trained on the edited pronunciation dictionary from Hjal plus the Málrómur data, was used to get the phonemes for the new words from the Althingi data.

The language models were built using transcripts of Althingi speeches dating back to 2003, excluding speeches from 2016. One is a pruned trigram model, used in decoding. The other is an unpruned constant-ARPA 5-gram model, used for rescoring decoding results.

Using this data, the pronunciation dictionary and the language models, an automatic speech recognizer with a 10.23% word error rate has been developed.
This error rate was obtained using an acoustic model based on a lattice-free maximum mutual information neural network architecture with both time-delay and long short-term memory layers. It is based on the Switchboard recipe in the Kaldi toolkit (D. Povey et al., 2011) (https://github.com/kaldi-asr/kaldi/tree/master/egs/swbd). Our training recipe from start to finish will be made public soon.

This dataset was collected in 2016 by the "ASR for Althingi" project at Reykjavik University, in collaboration with the speech department of Althingi, the Icelandic Parliament.

The structure of the corpus
---------------------------

| . - docs/
| . - README.txt
| . - data/
|     .- audio/
|         .- rad20160504T163103.mp3
| . - malfong/
|     .- pron_dict.txt
|     .- lang_3gsmall/
|     .- lang_5glarge/
|     .- {dev, train, eval}/
|         .- segments
|         .- spk2gender
|         .- spk2utt
|         .- text
|         .- utt2spk
|         .- reco2audio

- audio contains the utterances as the original mp3 files. rad20160504T163103 is the speechID, formed from "rad" and a timestamp of when the speech started. These files contain speech data as well as non-speech data. The reco2audio file maps between the audio filenames and the segments files. The segments file, as explained below, lists only the segments of the audio that contain speech data.

- malfong contains the textual data.

- {dev, train, eval} contains the cleaned and segmented transcripts mentioned above. It follows the structure of data directories in the Kaldi toolkit.

- segments maps each speech segment to its time in the audio file. The format is:
    segmentID(speakerID-speechID_segment#) recordingID(speakerID-speechID) start time (seconds) end time (seconds)

- text contains the transcripts of the audio, each paired with the segmentID it came from.
    example:
        BN-rad20160504T163103_00000 hæstvirtur forseti ég heyrði áðan háttvirtan þingmann Katrínu Jakobsdóttur segja það að
    Each line has the segmentID and the transcript of that segment.

- spk2gender is a tab-separated file matching each speakerID with the speaker's gender: m for males, f for females.

- spk2utt: each line contains the speaker abbreviation followed by all the segment/utteranceIDs (speechIDs broken into segments) spoken by that speaker. This file follows the same format as in the Kaldi toolkit.

- utt2spk: each line contains the segment/utteranceID followed by the ID of its speaker.

- reco2audio: each line contains the recordingID (speakerID-speechID) and the name of the audio file in the audio directory. It is similar to wav.scp in Kaldi, but not identical.

Due to Icelandic naming conventions, most speakers' genders are easily inferred. A speaker whose last name ends in "dóttir" is female, and one whose last name ends in "son" is usually male.

- pron_dict.txt is the pronunciation dictionary, using a subset of the IPA for the phonemes.

- lang_3gsmall contains all the components of the trigram language model, to be used for ASR within Kaldi.

- lang_5glarge contains all the components of the 5-gram language model, to be used for ASR within Kaldi.

Citations
---------

When publishing results based on the texts in the corpus, please refer to:

Inga Rún Helgadóttir, Róbert Kjaran, Anna Björk Nikulásdóttir and Jón Guðnason, 2017. Building an ASR corpus using Althingi’s Parliamentary Speeches. Proceedings of Interspeech 2017.

Further information about the corpus and how it was built is in the paper.

@inproceedings{Helgadottir2017corpus,
  author={Helgad{\'o}ttir, Inga Run and Kjaran, R{\'o}bert and Nikul{\'a}sd{\'o}ttir, Anna B. and Gudnason, Jon},
  title={Building an {ASR} Corpus Using {A}lthingi's Parliamentary Speeches},
  year=2017,
  booktitle={Proc. Interspeech 2017},
  pages={2163--2167},
  doi={10.21437/Interspeech.2017-903},
  url={http://dx.doi.org/10.21437/Interspeech.2017-903},
}

Statistics
----------

recordings                 6,493
segments                   199,614
pronunciation dictionary   182,476 words
speakers                   196
duration                   542 hrs 25 mins
word tokens                4,583,751
language                   IS
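
Example: reading the segments file
----------------------------------

The segments format described above (segmentID, recordingID, start time, end time) can be read with a few lines of Python. The following is only a minimal sketch, not part of the corpus tooling. It assumes the four fields are separated by whitespace and that the speakerID itself contains no hyphen. The Segment type, the parse_segments function and the sample lines are hypothetical: the IDs follow the example in this README, but the times are invented for illustration.

```python
from typing import Iterable, List, NamedTuple


class Segment(NamedTuple):
    segment_id: str    # speakerID-speechID_segment#
    recording_id: str  # speakerID-speechID
    speaker_id: str
    speech_id: str
    start: float       # seconds from the start of the audio file
    end: float         # seconds from the start of the audio file

    @property
    def duration(self) -> float:
        return self.end - self.start


def parse_segments(lines: Iterable[str]) -> List[Segment]:
    """Parse lines of the form: segmentID recordingID start end."""
    segments = []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if len(fields) != 4:
            raise ValueError(f"expected 4 fields, got {len(fields)}: {line!r}")
        segment_id, recording_id, start, end = fields
        # Assumption: the speakerID has no hyphen, so the first hyphen
        # separates it from the speechID.
        speaker_id, speech_id = recording_id.split("-", 1)
        segments.append(
            Segment(segment_id, recording_id, speaker_id, speech_id,
                    float(start), float(end))
        )
    return segments


# Hypothetical sample lines; the start and end times are made up.
SAMPLE = [
    "BN-rad20160504T163103_00000 BN-rad20160504T163103 0.50 10.25",
    "BN-rad20160504T163103_00001 BN-rad20160504T163103 10.25 19.00",
]

segments = parse_segments(SAMPLE)
total_seconds = sum(s.duration for s in segments)
print(f"{len(segments)} segments, {total_seconds:.2f} s of speech")
```

To run this on the corpus itself, the sample list can be replaced with an open file handle for one of the segments files in the dev, train or eval directories.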