--------------------------------------------------------------------------------
RAVNURSSON FAROESE SPEECH AND TRANSCRIPTS
--------------------------------------------------------------------------------

Language : Faroese.
Authors : Carlos Daniel Hernández Mena, Annika Simonsen.
Recommended use : speech recognition.

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

The corpus "RAVNURSSON FAROESE SPEECH AND TRANSCRIPTS" (or RAVNURSSON Corpus for
short) is a collection of speech recordings with transcriptions intended for
Automatic Speech Recognition (ASR) applications in Faroese, the language spoken
in the Faroe Islands. It was curated at Reykjavík University (RU) in 2022.

The RAVNURSSON Corpus is an extract of the "Basic Language Resource Kit 1.0"
(BLARK 1.0) [1] developed by the Ravnur Project from the Faroe Islands [2]. The
name RAVNURSSON comes from Ravnur (a tribute to the Ravnur Project) and the
suffix "son", which in Icelandic means "son of". Therefore, the name
"RAVNURSSON" means "The (Icelandic) son of Ravnur". The double "ss" is just for
aesthetics.

The audio was collected by recording speakers reading texts. The participants
are aged 15-83, divided into 3 age groups: 15-35, 36-60 and 61+. The speech
files come from 249 female speakers and 184 male speakers; 433 speakers in
total.

The recordings were made on TASCAM DR-40 Linear PCM audio recorders using the
built-in stereo microphones in WAVE 16 bit with a sample rate of 48kHz, and
were then downsampled to 16kHz@16bit mono for this corpus.

--------------------------------------------------------------------------------
Disclaimer and Terms of Use
--------------------------------------------------------------------------------

"RAVNURSSON FAROESE SPEECH AND TRANSCRIPTS" by Carlos Daniel Hernández Mena and
Annika Simonsen is licensed under a Creative Commons Attribution 4.0
International (CC BY 4.0) License with the hope that it will be useful, but
WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or
FITNESS FOR A PARTICULAR PURPOSE.

To view a copy of this license visit:
https://creativecommons.org/licenses/by/4.0/

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- The utterances were recorded using a TASCAM DR-40.

- The audio files in this corpus are distributed in FLAC format at 16kHz@16bit
  mono.

- Participants self-reported their age group, gender, native language and
  dialect.

- Participants are aged between 15 and 83 years.

- The corpus contains 71949 speech files from 433 speakers, totalling 109 hours
  and 9 minutes.

- The corpus is split into train, dev, and test portions. The length of each
  portion is: train = 100h08m, test = 4h30m, dev = 4h30m.

- The development and test portions have exactly 10 male and 10 female speakers
  each, and both portions have exactly the same size in hours.

- Due to the limited number of prompts to read, only 39945 of the 71949 prompts
  in the whole corpus are unique. In other words, 44.48% of the prompts in the
  corpus are repeated at least once.

- Despite the repeated prompts in the corpus, the development and test portions
  do not share speakers with each other or with the training set.

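
The 44.48% figure quoted in the list above can be reproduced with a few lines
of Python. This is only a sketch: it assumes that "repeated" counts every
prompt occurrence beyond the unique ones (total minus unique), which is
consistent with the published numbers. The function name repeated_prompt_stats
is hypothetical and is not part of the corpus distribution.

```python
def repeated_prompt_stats(total_prompts, unique_prompts):
    """Return (repeated_count, repeated_percentage) for a set of prompts.

    Assumption: "repeated" means every occurrence beyond the unique
    prompts, i.e. repeated = total - unique.
    """
    if total_prompts <= 0 or not 0 <= unique_prompts <= total_prompts:
        raise ValueError("expected total > 0 and 0 <= unique <= total")
    repeated = total_prompts - unique_prompts
    percentage = 100.0 * repeated / total_prompts
    return repeated, round(percentage, 2)


# Whole-corpus figures taken from the list above.
repeated, percentage = repeated_prompt_stats(71949, 39945)
print(repeated, percentage)  # prints: 32004 44.48
```

The same function can be applied to the prompt counts of any single portion
(train, dev or test) to obtain its own repetition rate.
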
--------------------------------------------------------------------------------
Analysis of the Repeated Prompts
--------------------------------------------------------------------------------

As the number of reading prompts was limited, it is common in the RAVNURSSON
Corpus for one prompt to be read by more than one speaker. This is relevant
because it is common practice in ASR to create a language model using the
prompts that are found in the train portion of the corpus. That is not
recommended for the RAVNURSSON Corpus, as many prompts are shared by all the
portions, which would introduce a significant bias in the language modeling
task.

In this section we present some statistics about the repeated prompts across
all the portions of the corpus.

- In the train portion:

    * Total number of prompts = 65616
    * Number of unique prompts = 38646

  There are 26970 repeated prompts in the train portion. In other words, 41.1%
  of the prompts are repeated.

- In the test portion:

    * Total number of prompts = 3002
    * Number of unique prompts = 2887

  There are 115 repeated prompts in the test portion. In other words, 3.83% of
  the prompts are repeated.

- In the dev portion:

    * Total number of prompts = 3331
    * Number of unique prompts = 3302

  There are 29 repeated prompts in the dev portion. In other words, 0.87% of
  the prompts are repeated.

- Considering the corpus as a whole:

    * Total number of prompts = 71949
    * Number of unique prompts = 39945

  There are 32004 repeated prompts in the whole corpus. In other words, 44.48%
  of the prompts are repeated.

--------------------------------------------------------------------------------
Organization of the Speech Files
--------------------------------------------------------------------------------

The directory called "speech" contains all the speech files of the corpus. The
files in the speech directory are divided into three directories: train, dev
and test.
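
As a quick sanity check after unpacking, the train/dev/test layout can be
walked with Python's standard pathlib module. The sketch below is illustrative
only: the function name count_flac_files and the corpus_root argument are
hypothetical, and it assumes that the "speech" directory sits directly under
the folder where the corpus was unpacked.

```python
from pathlib import Path

# The three portions named in this README.
PORTIONS = ("train", "dev", "test")


def count_flac_files(corpus_root):
    """Return {portion: number of .flac files under speech/<portion>}."""
    speech_dir = Path(corpus_root) / "speech"
    counts = {}
    for portion in PORTIONS:
        portion_dir = speech_dir / portion
        if not portion_dir.is_dir():
            # A missing portion is reported as zero files.
            counts[portion] = 0
            continue
        # rglob descends into sub-folders, so nested layouts are counted too.
        counts[portion] = sum(1 for _ in portion_dir.rglob("*.flac"))
    return counts
```

For a complete copy of the corpus, the three counts would be expected to add up
to the 71949 speech files reported for the corpus.
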
The train portion is sub-divided into three types of recordings: RDATA1O,
RDATA1OP and RDATA2; this is due to the organization of the recordings in the
original BLARK 1.0. There, the recordings are divided into Rdata1 and Rdata2.

One main difference between Rdata1 and Rdata2 is that the reading environment
for Rdata2 was controlled by a software tool called "PushPrompt", which is
included in the original BLARK 1.0. Another main difference is that in Rdata1
some transcriptions labeled at the phoneme level are available. For this reason
the audio files in the speech directory of the RAVNURSSON Corpus are divided
into the folders RDATA1O, where "O" is for "Orthographic", and RDATA1OP, where
"O" is for "Orthographic" and "P" is for "Phonetic".

In the case of the dev and test portions, the data come only from Rdata2, which
does not have labels at the phonetic level.

It is important to clarify that the RAVNURSSON Corpus only includes
transcriptions at the orthographic level.

--------------------------------------------------------------------------------
The Metadata File (metadata.tsv)
--------------------------------------------------------------------------------

The metadata file is a "tab-separated values" (TSV) file containing all the
relevant information of the corpus. This file can be read using the Python
library called "Pandas" [3].

The metadata.tsv file comprises the following 12 columns:

    01.- id : Filename as explained in the section "Audio Filenames", without
        the extension ".flac".

    02.- speaker_id : The filename without the segment number. This id can be
        very useful in ASR systems like Kaldi, which performs Speaker Adaptive
        Training (SAT).

    03.- filename : Filename as explained in the section "Audio Filenames",
        with the extension ".flac".

    04.- sentence_norm : The normalized transcription: no punctuation marks, no
        digits, lower case letters, one single space between words.

    05.- gender : The gender of the speaker: male or female.

    06.- age : The age range of the speaker: 15-35, 36-60 or 61+ years old.

    07.- native_language : "Faroese" in all the cases.

    08.- dialect : The speaker dialect as explained in the section "Audio
        Filenames".

    09.- created_at : The date when the audio file was recorded.

    10.- duration : Duration of the speech file in seconds.

    11.- sample_rate : 16kHz in all the cases.

    12.- status : The portion: train, test or dev.

--------------------------------------------------------------------------------
Audio Filenames
--------------------------------------------------------------------------------

Every audio file in the RAVNURSSON Corpus has an individual filename with the
following format:

    MEY01_040319_rok0_0009.flac

MEY01 : Speaker Id.

    The speaker id can be broken down into the following:

    * M for male or K for female.

    * E is the dialect group, which can be:

        + U for Suðuroy
        + A for Sandoy
        + S for Suðurstreymoy
        + E for Norðurstreymoy/Eysturoy (exclusive of Eiði, Gjógv and Funningur)
        + V for Vágar
        + N for Norðuroyggjar (inclusive of Eiði, Gjógv and Funningur)

    * Y is the age group, which can be:

        + Y for "Younger", between 15-35 years old.
        + M for "Middle-aged", between 36-60 years old.
        + E for "Elderly", 61 years old or older.

    * 01 is a number that always consists of two digits and runs 01, 02, 03,
      etc. The first speaker in a group with the same gender, dialect group and
      age group (e.g. MEY) gets the number 01. The next speaker in the same
      group gets the number 02 (and his ID is therefore MEY02).

040319 : The date when the speech was recorded (day/month/year).

rok0 : The type of reading material.

    This code can only be found in speech files in RDATA1O and RDATA1OP. For
    more information about the types of reading material please see the
    documentation of the original BLARK 1.0 and its directory
    "readingtexts_1.0".

0009 : Segment number.

    In the original BLARK 1.0 the recording session is distributed as one audio
    file per speaker and it can be very long from the ASR perspective.
    So, the audio files are subdivided into segments of around 10 seconds to
    fit most modern ASR engines. The numbering is continuous for each speaker;
    the only exceptions are the files MUY01_180519_set4_0004 and
    MUY02_190120_eind2_0007, which we found to be empty and removed.

.flac : The corpus is distributed in FLAC format.

--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

    Mena, Carlos; Simonsen, Annika; "RAVNURSSON FAROESE SPEECH AND TRANSCRIPTS".
    Web Download. Reykjavik University: Language and Voice Lab, 2022.

Contact: Carlos Mena (carlosm@ru.is)

License: CC BY 4.0

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was made possible under the umbrella of the Language Technology
Programme for Icelandic 2019-2023. The programme, which is managed and
coordinated by Almannarómur, is funded by the Icelandic Ministry of Education,
Science and Culture.

Special thanks to Dr. Jón Guðnason, professor at Reykjavík University and head
of the Language and Voice Lab (LVL), for providing computational resources.

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Simonsen, A., Debess, I. N., Lamhauge, S. S., & Henrichsen, P. J. Creating
    a basic language resource kit for Faroese. In LREC 2022. 13th International
    Conference on Language Resources and Evaluation.

[2] Website. The Project Ravnur under the Talutøkni Foundation
    https://maltokni.fo/en/the-ravnur-project

[3] Software. Pandas (Python Library).
    https://pandas.pydata.org

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------