--------------------------------------------------------------------------------
                  RAVNURSSON FAROESE SPEECH AND TRANSCRIPTS
--------------------------------------------------------------------------------

Language        : Faroese.

Authors         : Carlos Daniel Hernández Mena, Annika Simonsen.

Recommended use : speech recognition.

--------------------------------------------------------------------------------
Description
--------------------------------------------------------------------------------

The corpus "RAVNURSSON FAROESE SPEECH AND TRANSCRIPTS" (or RAVNURSSON Corpus 
for short) is a collection of speech recordings with transcriptions intended 
for Automatic Speech Recognition (ASR) applications in Faroese, the language
spoken in the Faroe Islands. It was curated at Reykjavík University (RU) in
2022.

The RAVNURSSON Corpus is an extract of the "Basic Language Resource Kit 1.0" 
(BLARK 1.0) [1] developed by the Ravnur Project from the Faroe Islands [2]. As 
a matter of fact, the name RAVNURSSON comes from Ravnur (a tribute to the 
Ravnur Project) and the suffix "son" which in Icelandic means "son of". 
Therefore, the name "RAVNURSSON" means "The (Icelandic) son of Ravnur". The 
double "ss" is just for aesthetics.

The audio was collected by recording speakers reading texts. The participants 
are aged 15-83, divided into 3 age groups: 15-35, 36-60 and 61+.

The speech files come from 249 female speakers and 184 male speakers; 433 
speakers in total. The recordings were made on TASCAM DR-40 Linear PCM audio 
recorders using the built-in stereo microphones, in 16-bit WAVE at a sample 
rate of 48kHz, and then downsampled to 16kHz@16bit mono for this corpus.

--------------------------------------------------------------------------------
Disclaimer and Terms of Use
--------------------------------------------------------------------------------

"RAVNURSSON FAROESE SPEECH AND TRANSCRIPTS" by Carlos Daniel Hernández Mena
and Annika Simonsen is licensed under a Creative Commons Attribution 
4.0 International (CC BY 4.0) License with the hope that it will be useful, 
but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY 
or FITNESS FOR A PARTICULAR PURPOSE.  

To view a copy of this license visit:
https://creativecommons.org/licenses/by/4.0/

--------------------------------------------------------------------------------
Corpus Characteristics
--------------------------------------------------------------------------------

- The utterances were recorded using a TASCAM DR-40.

- The audio files in this corpus are distributed in flac format at 
  16kHz@16bit mono.

- Participants self-reported their age group, gender, native language and
  dialect.

- Participants are aged between 15 and 83 years.
 
- The corpus contains 71949 speech files from 433 speakers, totalling 109 hours
  and 9 minutes.

- The corpus is split into train, dev, and test portions. The lengths of the 
  portions are: train = 100h08m, test = 4h30m, dev = 4h30m.
  
- The development and test portions have exactly 10 male and 10 female
  speakers each and both portions have exactly the same size in hours.
  
- Due to the limited number of prompts to read, only 39945 of the 71949
  prompts in the whole corpus are unique. In other words, 32004 prompts
  (44.48%) are repetitions of a prompt that already occurs in the corpus.
  
- Despite the repeated prompts in the corpus, the development and test
  portions do not share speakers with each other or with the training set.
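
The speaker separation stated above can be checked from the metadata. The
following is a minimal sketch, not part of the corpus tools. It assumes that
each row is a dictionary with the columns "id" and "status" described in the
section "The Metadata File (metadata.tsv)", and that the speaker code is the
part of the id before the first underscore (e.g. MEY01). The function names
and the toy rows are hypothetical.

```python
from collections import defaultdict


def speakers_by_portion(rows):
    """Map each portion (train/dev/test) to the set of speaker codes in it."""
    portions = defaultdict(set)
    for row in rows:
        # Assumption: the speaker code is the text before the first "_".
        speaker_code = row["id"].split("_")[0]
        portions[row["status"]].add(speaker_code)
    return dict(portions)


def shared_speakers(portions):
    """Return the speaker codes that occur in more than one portion."""
    names = sorted(portions)
    shared = set()
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            shared |= portions[first] & portions[second]
    return shared


# Toy rows with invented ids, for illustration only.
rows = [
    {"id": "MEY01_040319_rok0_0009", "status": "train"},
    {"id": "KSM02_050319_0001", "status": "dev"},
    {"id": "MVE03_060319_0001", "status": "test"},
]
print(shared_speakers(speakers_by_portion(rows)))  # set()
```

An empty set means that no speaker code appears in more than one portion.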
  
--------------------------------------------------------------------------------
Analysis of the Repeated Prompts
--------------------------------------------------------------------------------

As the number of reading prompts was limited, many prompts in the RAVNURSSON
Corpus are read by more than one speaker. This is relevant because it is
common practice in ASR to create a language model using the prompts found in
the train portion of the corpus. That is not recommended for the RAVNURSSON
Corpus, as many prompts are shared by all the portions, and this would
introduce a significant bias in the language modeling task.
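
To see how strong this effect is for a given portion, one can measure how
many of its unique prompts also occur in the train portion. The following is
a minimal sketch. It assumes that each row is a dictionary with the columns
"sentence_norm" and "status" from metadata.tsv; the function name and the toy
prompts are hypothetical.

```python
def prompt_overlap(rows, portion):
    """Fraction of the unique prompts in `portion` that also occur in train."""
    train = {r["sentence_norm"] for r in rows if r["status"] == "train"}
    target = {r["sentence_norm"] for r in rows if r["status"] == portion}
    if not target:
        return 0.0
    return len(target & train) / len(target)


# Toy rows with placeholder prompts, for illustration only.
rows = [
    {"sentence_norm": "prompt a", "status": "train"},
    {"sentence_norm": "prompt b", "status": "train"},
    {"sentence_norm": "prompt a", "status": "test"},
    {"sentence_norm": "prompt c", "status": "test"},
    {"sentence_norm": "prompt d", "status": "dev"},
]
print(prompt_overlap(rows, "test"))  # 0.5
print(prompt_overlap(rows, "dev"))   # 0.0
```

A high value means that a language model built from the train prompts has
already seen a large share of the evaluation text.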

In this section we present some statistics about the repeated prompts across 
all the portions of the corpus.

- In the train portion:

	* Total number of prompts = 65616
	* Number of unique prompts = 38646

There are 26970 repeated prompts in the train portion. In other words, 41.1% of 
the prompts are repeated.

- In the test portion:

	* Total number of prompts = 3002
	* Number of unique prompts = 2887

There are 115 repeated prompts in the test portion. In other words, 3.83% of 
the prompts are repeated.

- In the dev portion:

	* Total number of prompts = 3331
	* Number of unique prompts = 3302

There are 29 repeated prompts in the dev portion. In other words, 0.87% of the 
prompts are repeated.

- Considering the corpus as a whole:

	* Total number of prompts = 71949
	* Number of unique prompts = 39945

There are 32004 repeated prompts in the whole corpus. In other words, 44.48% of 
the prompts are repeated.
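
The figures above follow from the totals and the unique counts: the number of
repeated prompts is the total minus the number of unique prompts, and the
percentage is taken over the total. The following is a minimal sketch that
reproduces this arithmetic; the function name is hypothetical.

```python
def repeated_prompt_stats(total, unique):
    """Return (number of repeated prompts, percentage of the total)."""
    repeated = total - unique
    percent = round(100 * repeated / total, 2)
    return repeated, percent


# (total prompts, unique prompts) as listed in this section.
PORTIONS = {
    "train": (65616, 38646),
    "test": (3002, 2887),
    "dev": (3331, 3302),
    "whole corpus": (71949, 39945),
}

for name, (total, unique) in PORTIONS.items():
    repeated, percent = repeated_prompt_stats(total, unique)
    print(f"{name}: {repeated} repeated ({percent}%)")
```

Note that the unique counts of the three portions add up to 44835, which is
4890 more than the 39945 unique prompts of the whole corpus. The difference
comes from prompts that occur in more than one portion.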
  
--------------------------------------------------------------------------------
Organization of the Speech Files
--------------------------------------------------------------------------------

The directory called "speech" contains all the speech files of the corpus. The 
files in the speech directory are divided into three directories: train, dev
and test. The train portion is subdivided into three types of recordings:
RDATA1O, RDATA1OP and RDATA2; this is due to the organization of the
recordings in the original BLARK 1.0, where the recordings are divided into
Rdata1 and Rdata2.

One main difference between Rdata1 and Rdata2 is that the reading environment
for Rdata2 was controlled by software called "PushPrompt", which is included
in the original BLARK 1.0. Another main difference is that some of the
transcriptions available in Rdata1 are labeled at the phoneme level. For this
reason, the Rdata1 audio files in the speech directory of the RAVNURSSON
Corpus are divided into the folders RDATA1O, where "O" stands for
"Orthographic", and RDATA1OP, where "O" stands for "Orthographic" and "P"
for "Phonetic".

In the case of the dev and test portions, the data come only from Rdata2
which does not have labels at the phonetic level.

It is important to clarify that the RAVNURSSON Corpus only includes 
transcriptions at the orthographic level.

--------------------------------------------------------------------------------
The Metadata File (metadata.tsv)
--------------------------------------------------------------------------------

The metadata file is a tab-separated values (TSV) file containing all the 
relevant information of the corpus. This file can be read using the Python 
library called "Pandas" [3]. The metadata.tsv file comprises the following
12 columns:


01.- id              : Filename as explained in the section "Audio Filenames"
                       without the extension ".flac".

02.- speaker_id      : The filename without the segment number. This id can be
                       very useful in ASR systems like Kaldi, which perform 
                       Speaker Adaptive Training (SAT).

03.- filename        : Filename as explained in the section "Audio Filenames"
                       with the extension ".flac".

04.- sentence_norm   : The normalized transcription: no punctuation marks, no 
                       digits, lower case letters, one single space between
                       words.

05.- gender          : The gender of the speaker: male or female.

06.- age             : The age range of the speaker: 15-35, 36-60, 61+ years
                       old.

07.- native_language : "Faroese" in all the cases.

08.- dialect         : The speaker dialect as explained in the section "Audio 
                       Filenames".

09.- created_at      : The date when the audio file was recorded.

10.- duration        : Duration of the speech file in seconds.

11.- sample_rate     : 16kHz in all the cases.

12.- status          : The portion: train, test or dev.
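
As an illustration of how these columns can be used, the following minimal
sketch reads the table and sums the duration per portion. It uses only the
Python standard library; with Pandas, pandas.read_csv("metadata.tsv",
sep="\t") loads the same table. The sketch assumes that the file starts with
a header row holding the 12 column names listed above. The function names and
all the values in the sample rows are invented for illustration.

```python
import csv
import io
from collections import defaultdict

COLUMNS = [
    "id", "speaker_id", "filename", "sentence_norm", "gender", "age",
    "native_language", "dialect", "created_at", "duration", "sample_rate",
    "status",
]


def read_metadata(handle):
    """Read the TSV rows from an open text handle into a list of dicts."""
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return list(reader)


def hours_per_portion(rows):
    """Sum the "duration" column (seconds) per portion, returned in hours."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["status"]] += float(row["duration"])
    return {portion: seconds / 3600 for portion, seconds in totals.items()}


# Invented sample standing in for metadata.tsv (assumed header + 3 rows).
SAMPLE = [
    COLUMNS,
    ["MEY01_040319_rok0_0009", "MEY01_040319_rok0",
     "MEY01_040319_rok0_0009.flac", "placeholder one", "male", "15-35",
     "Faroese", "E", "2019-03-04", "9.0", "16000", "train"],
    ["MEY01_040319_rok0_0010", "MEY01_040319_rok0",
     "MEY01_040319_rok0_0010.flac", "placeholder two", "male", "15-35",
     "Faroese", "E", "2019-03-04", "9.0", "16000", "train"],
    ["KSM02_050319_0001", "KSM02_050319", "KSM02_050319_0001.flac",
     "placeholder three", "female", "36-60", "Faroese", "S", "2019-03-05",
     "10.8", "16000", "dev"],
]
sample_tsv = "\n".join("\t".join(fields) for fields in SAMPLE)

rows = read_metadata(io.StringIO(sample_tsv))
print(hours_per_portion(rows))
```

For the real file, replace the in-memory sample with
open("metadata.tsv", encoding="utf-8", newline="").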

--------------------------------------------------------------------------------
Audio Filenames
--------------------------------------------------------------------------------

Every audio file in the RAVNURSSON Corpus has an individual filename with the 
following format:

                        MEY01_040319_rok0_0009.flac

MEY01           : Speaker Id. The speaker id can be broken down into the
                   following:
			                   
                   * M for male or K for female
                   * E is the dialect group that can be:
                   	
                   	+ U for Suðuroy
                   	+ A for Sandoy
                   	+ S for Suðurstreymoy
                   	+ E for Norðurstreymoy/Eysturoy (exclusive of Eiði, 
                   	    Gjógv and Funningur)
                   	+ V for Vágar
                   	+ N for Norðuroyggjar (inclusive of Eiði, Gjógv 
                   	    and Funningur)
                   	    
                   * Y is the age group that can be:
                   
                   	+ Y for "Younger" between 15-35 years old.
                   	+ M for "Middle-aged" between 36-60 years old.
                   	+ E for "Elderly" 61 years old or older.
                   	
                   * 01 is a number that always consists of two digits and 
                        starts with 01, 02, 03 etc. The first speaker in a 
                        group with the same gender, dialect group and age 
                        group (e.g. MEY) gets the number 01. The next speaker 
                        in the same group gets the number 02 (and his ID is 
                        therefore MEY02).
                   
040319          : The date when the speech was recorded (day/month/year).

rok0            : The type of reading material. This code can only be found
                  in speech files in RDATA1O and RDATA1OP. For more 
                  information about the types of reading material please
                  see the documentation of the original BLARK 1.0 and its
                  directory "readingtexts_1.0".

0009            : Segment number. In the original BLARK 1.0 the recording 
                  session is distributed as one audio file per speaker, which
                  can be very long from the ASR perspective. The audio files
                  are therefore subdivided into segments of around 10 seconds
                  to suit most modern ASR engines. The numbering is 
                  continuous for each speaker; the only exceptions are the 
                  files MUY01_180519_set4_0004 and MUY02_190120_eind2_0007,
                  which we found to be empty and removed.
                                   
.flac           : The corpus is distributed in flac format.
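
The filename format described above can be decoded with a regular expression.
The following is a minimal sketch, not an official tool of the corpus. It
assumes that the reading material code is simply absent in files without it
(e.g. a hypothetical KSM02_050319_0001.flac), and that the two-digit year
belongs to the 2000s. The function name and the dictionary keys are
hypothetical.

```python
import re
from datetime import date

GENDERS = {"M": "male", "K": "female"}
DIALECTS = {
    "U": "Suðuroy",
    "A": "Sandoy",
    "S": "Suðurstreymoy",
    "E": "Norðurstreymoy/Eysturoy",
    "V": "Vágar",
    "N": "Norðuroyggjar",
}
AGE_GROUPS = {"Y": "15-35", "M": "36-60", "E": "61+"}

FILENAME_RE = re.compile(
    r"^(?P<gender>[MK])(?P<dialect>[UASEVN])(?P<age>[YME])(?P<number>\d{2})"
    r"_(?P<date>\d{6})"
    r"(?:_(?P<material>[A-Za-z0-9]+))?"  # assumed optional material code
    r"_(?P<segment>\d+)"
    r"(?:\.flac)?$"
)


def parse_filename(filename):
    """Split a corpus filename into the fields described in this section."""
    match = FILENAME_RE.match(filename)
    if match is None:
        raise ValueError(f"unexpected filename: {filename!r}")
    raw = match["date"]  # day/month/year, two digits each
    day, month, year = int(raw[:2]), int(raw[2:4]), int(raw[4:])
    return {
        "speaker": filename[:5],
        "gender": GENDERS[match["gender"]],
        "dialect": DIALECTS[match["dialect"]],
        "age_group": AGE_GROUPS[match["age"]],
        "speaker_number": int(match["number"]),
        # Assumption: two-digit years belong to the 2000s.
        "recorded": date(2000 + year, month, day),
        "material": match["material"],  # None when the code is absent
        "segment": int(match["segment"]),
    }


info = parse_filename("MEY01_040319_rok0_0009.flac")
print(info["gender"], info["dialect"], info["age_group"], info["recorded"])
# male Norðurstreymoy/Eysturoy 15-35 2019-03-04
```

The parser accepts names with or without the ".flac" extension, so it also
works on the "id" column of metadata.tsv.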

--------------------------------------------------------------------------------
Citation
--------------------------------------------------------------------------------

When publishing results based on the corpus please refer to:

   Mena, Carlos; Simonsen, Annika; "RAVNURSSON FAROESE SPEECH AND TRANSCRIPTS". 
   Web Download. Reykjavik University: Language and Voice Lab, 2022.

Contact: Carlos Mena (carlosm@ru.is)

License: CC BY 4.0

--------------------------------------------------------------------------------
Acknowledgements
--------------------------------------------------------------------------------

This project was made possible under the umbrella of the Language Technology 
Programme for Icelandic 2019-2023. The programme, which is managed and 
coordinated by Almannarómur, is funded by the Icelandic Ministry of Education, 
Science and Culture.

Special thanks to Dr. Jón Guðnason, professor at Reykjavík University and head
of the Language and Voice Lab (LVL) for providing computational resources.

--------------------------------------------------------------------------------
References
--------------------------------------------------------------------------------

[1] Simonsen, A., Debess, I. N., Lamhauge, S. S., & Henrichsen, P. J. Creating 
    a basic language resource kit for Faroese. In LREC 2022. 13th International 
    Conference on Language Resources and Evaluation.
    
[2] Website. The Project Ravnur under the Talutøkni Foundation
    https://maltokni.fo/en/the-ravnur-project
    
[3] Software. Pandas (Python Library). https://pandas.pydata.org

--------------------------------------------------------------------------------
--------------------------------------------------------------------------------