--------------------------------------------------------------------------------
                              MASRI-SYNTHETIC
          Synthesized Speech with Transcriptions in Maltese produced by the
                   MASRI Team of the University of Malta
--------------------------------------------------------------------------------
--------------------------------------------------------------------------------
PRESENTATION
--------------------------------------------------------------------------------

MASRI-SYNTHETIC is a corpus of synthesized speech in Maltese. The 
text-to-speech (TTS) system used to produce the utterances was developed by 
the Research & Development Department of Crimsonwing p.l.c.

The sentences used to create the corpus were extracted from the MLRS Corpus, 
a corpus of written or transcribed Maltese divided into genres such as 
culture, news, academic, religion and sports. See the section 
"BACKGROUND: THE MLRS CORPUS" for more information.

MASRI stands for "Maltese Automatic Speech Recognition I". MASRI is a project 
at the University of Malta, funded by the University of Malta Research Fund 
Award Scheme.

The MASRI-SYNTHETIC CORPUS was created in June 2020 and was used in 
experiments on data augmentation techniques for improving speech recognition 
of Maltese.

--------------------------------------------------------------------------------
DISCLAIMER
--------------------------------------------------------------------------------

The MASRI team does not guarantee the accuracy of this corpus, nor its 
suitability for any specific purpose. In fact, we expect a number of errors, 
omissions and inconsistencies to remain in the corpus.

--------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
--------------------------------------------------------------------------------

We wish to thank KPMG Microsoft Business Solutions (formerly CrimsonWing) for 
providing the TTS system used in our experiments.

For more information about the CrimsonWing TTS system see:
https://pdfs.semanticscholar.org/5e5a/25e34b3c351ba0e58211a5192535e9ddea06.pdf

--------------------------------------------------------------------------------
MOTIVATION
--------------------------------------------------------------------------------

The experiments which motivated the creation of the MASRI-SYNTHETIC Corpus had
to do with data augmentation techniques for improving speech recognition of
Maltese.

We considered three different types of data augmentation: unsupervised training
(which implies the creation of automatic transcriptions by an ASR system in 
Maltese), multilingual training and the use of synthesized speech as training 
data. The goal was to determine which of these techniques, or which 
combination of them, was the most effective at improving speech recognition 
with only 7 hours of gold transcribed data in Maltese.

Our experiments suggest that multilingual training yields benefits, even when 
the transcriptions are noisy. However, gold annotations are better; in 
particular, the inclusion of English gold data with automatic transcriptions 
in Maltese (noisy transcriptions) yielded significant gains. Furthermore, we 
observed that pretraining on imperfect synthesized data in Maltese also 
improves performance, with further gains provided once more by the inclusion 
of gold English data. In sum, the combination of these three techniques led 
us to an absolute reduction of 15% in WER from the baseline system.

--------------------------------------------------------------------------------
BACKGROUND: THE MLRS CORPUS
--------------------------------------------------------------------------------

The MLRS Corpus is a text corpus of around 250 million tokens in several 
different genres, including parliamentary debates, news, law, opinion 
articles, sports articles, culture, academic, literature and religious texts. 

Tokens in the corpus are tagged with part of speech and labelled with lemmas 
and, where relevant, the consonantal root for words of Semitic origin. We use 
this text corpus for creating synthesized data as described in the section 
"CREATION METHODOLOGY".

The corpus is available on the Maltese Language Resource Server, and can also 
be searched through an online interface. For more
information, see: https://mlrs.research.um.edu.mt/index.php?page=corpora

--------------------------------------------------------------------------------
CREATION METHODOLOGY
--------------------------------------------------------------------------------

The corpus was created following the steps below:

-  All the sentences from MLRS were put in a single plain text file. The text 
   included punctuation marks.

-  To facilitate text processing, sentences were split to fit into lines of 
   30 words.

-  Punctuation marks were removed, as were sentences containing non-UTF-8 
   characters.

-  Sentences with foreign words or proper names were removed.

-  As the letters "c" and "y" do not belong to the Maltese alphabet, 
   sentences containing words with either of those letters were removed, to 
   ensure that each sentence includes only Maltese words.

-  Using Python, the words of the resulting sentences were put into a single 
   list, so that each element is a word.

-  The words of the list were then taken one by one to produce text lines of 
   exactly 13 words. This process generated only 27,714 of the 52,500 
   sentences that constitute the whole corpus.

-  To produce the remaining sentences, the words of the list were shuffled 
   and the previous step was repeated until the 52,500 sentences needed for 
   the corpus had been obtained.

-  Finally, the resulting sentences were converted into utterances using the 
   TTS system.
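
The word-level steps above (letter filter, flat word list, 13-word lines, 
shuffle and repeat) can be sketched in Python as follows. This is only an 
illustration and not the original script: the function names, the fixed 
random seed and the Unicode normalization step are assumptions, and the 
removal of punctuation, foreign words and proper names is omitted.

```python
import random
import unicodedata

WORDS_PER_LINE = 13  # line length stated in the steps above


def keep_sentence(sentence):
    """Return True if the sentence contains no plain "c" or "y".

    Assumption: text is normalized to NFC first, so that the Maltese
    letter "ċ" is a single code point and is not mistaken for "c".
    """
    text = unicodedata.normalize("NFC", sentence).lower()
    return "c" not in text and "y" not in text


def chunk_words(words, size=WORDS_PER_LINE):
    """Join consecutive words into lines of exactly `size` words.

    Leftover words that do not fill a whole line are dropped.
    """
    full = len(words) - len(words) % size
    return [" ".join(words[i:i + size]) for i in range(0, full, size)]


def build_lines(sentences, target, seed=0):
    """Produce `target` lines, shuffling the word list when it runs out."""
    words = [w for s in sentences if keep_sentence(s) for w in s.split()]
    if len(words) < WORDS_PER_LINE:
        raise ValueError("not enough words to build a single line")
    rng = random.Random(seed)  # hypothetical seed, for repeatability only
    lines = chunk_words(words)
    while len(lines) < target:
        rng.shuffle(words)
        lines.extend(chunk_words(words))
    return lines[:target]
```

With the figures given above, a call such as build_lines(sentences, 52500) 
would correspond to the final two steps: the first pass yields the lines in 
their original order, and each later pass adds lines built from the shuffled 
word list.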

--------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
--------------------------------------------------------------------------------

The MASRI-SYNTHETIC CORPUS (MSYC) has the following characteristics:

- The MSYC has an exact duration of 99 hours and 18 minutes and contains 52,500
  audio files.

- The MSYC has utterances from 210 different voices: 105 male and 105 female
  voices.

- Voices were produced by varying the pitch across 21 values (-20 to 20) and 
  the speech rate across 5 values (-2 to 2).

- Data in MSYC is classified by voice. That is, all the utterances belonging 
  to a single voice are stored in a single directory.

- Each voice is assigned 250 utterances of 13 words each.

- Utterances have a duration between 2 and 10 seconds each.

- Utterances are also classified according to the gender (male/female) of the 
  voice.

- Audio files in the MSYC are distributed in a 16 kHz, 16-bit mono format.

- Every audio file has an ID that is compatible with ASR engines such as 
  Kaldi and CMU-Sphinx.

- Transcriptions in MSYC are lowercase. No punctuation marks are permitted 
  except dashes (-) and apostrophes (') because they belong to the Maltese 
  orthography.
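
The figures in this list are mutually consistent, which the short Python 
check below illustrates. It is a sketch for the reader, not part of the 
corpus tooling. The even spacing of the pitch and speech-rate settings is an 
assumption; the list above only gives their ranges and counts.

```python
# Figures taken from the list above.
PITCH_VALUES = 21          # pitch settings, from -20 to 20
RATE_VALUES = 5            # speech-rate settings, from -2 to 2
GENDERS = 2                # male and female
UTTERANCES_PER_VOICE = 250
TOTAL_SECONDS = 99 * 3600 + 18 * 60   # 99 hours and 18 minutes

# Assumption: the settings are evenly spaced (not stated explicitly above).
pitches = list(range(-20, 21, 2))
rates = list(range(-2, 3))
assert len(pitches) == PITCH_VALUES and len(rates) == RATE_VALUES

voices = PITCH_VALUES * RATE_VALUES * GENDERS
utterances = voices * UTTERANCES_PER_VOICE
mean_seconds = TOTAL_SECONDS / utterances

print(voices)                  # 210
print(utterances)              # 52500
print(round(mean_seconds, 2))  # 6.81
```

The resulting mean of roughly 6.8 seconds per utterance falls within the 
stated range of 2 to 10 seconds.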

--------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
--------------------------------------------------------------------------------

The MASRI_SYNTHETIC directory contains the following files and directories:

        - files       : Contains the transcription files, the paths file and
                        the "Voices_Info.xls" file, which holds relevant
                        information about all the voices in the corpus.

        - speech      : Contains the speech files, classified by gender
                        (male/female voice).

        - README.txt

--------------------------------------------------------------------------------
THE CORPUS FILES
--------------------------------------------------------------------------------

In the "files" directory one can find the following:

- MASRI_SYNTHETIC.trans   :   The transcription file, in plain text format
                              with no punctuation marks.

- MASRI_SYNTHETIC.paths   :   Contains the relative paths from the "speech"
                              directory to each speech file.

- Voices_Info.xls         :   Contains relevant information about the
                              voices, specifically the number of audio files
                              per voice and the total speech time per voice.

--------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
--------------------------------------------------------------------------------

Every audio file in the MASRI-SYNTHETIC CORPUS has an identification key with 
the following format:
                            MSRSY_F_0001_RN02PN10_0005


   MSRSY        F               0001      RN02PN10             0005
   Acronym      Gender of       Number    Special key with     Number of the
   for          the voice:      of the    information about    audio file of
   "MASRI-      "M" for Male    voice.    the current voice,   a particular
   SYNTHETIC    "F" for Female            explained below.     voice. There
   Corpus"                                                     are no
                                                               discontinuities.

                                    RN02PN10


                       RN02                           PN10
                "R" is for Speech Rate         "P" is for Pitch
                "N" is for Negative            "N" is for Negative
                Possible values are:           Possible values are:
                "N" = Negative                 "N" = Negative
                "P" = Positive                 "P" = Positive
                "C" = Zero                     "C" = Zero
                "02" is the Speech Rate.       "10" is the Pitch.
                In this case, the Speech       In this case, the Pitch
                Rate is -2.                    is -10.
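
As an illustration, a key in this format could be decoded with a small 
Python function such as the one below. It is a sketch only: the function and 
field names are hypothetical, and the fixed field widths (four digits for the 
voice and audio file numbers, two digits for speech rate and pitch) are 
inferred from the single example key above.

```python
import re

# Assumption: field widths are taken from the example key
# MSRSY_F_0001_RN02PN10_0005 and may not hold for every key.
KEY_RE = re.compile(
    r"^MSRSY_(?P<gender>[MF])_(?P<voice>\d{4})_"
    r"R(?P<rate_sign>[NPC])(?P<rate>\d{2})"
    r"P(?P<pitch_sign>[NPC])(?P<pitch>\d{2})_"
    r"(?P<audio>\d{4})$"
)


def signed_value(sign, digits):
    """Apply the sign letter: "N" negative, "P" positive, "C" zero."""
    value = int(digits)
    return -value if sign == "N" else value


def parse_key(key):
    """Split an identification key into its fields (hypothetical helper)."""
    match = KEY_RE.match(key)
    if match is None:
        raise ValueError("not a MASRI-SYNTHETIC key: " + repr(key))
    return {
        "gender": match.group("gender"),
        "voice": int(match.group("voice")),
        "speech_rate": signed_value(match.group("rate_sign"),
                                    match.group("rate")),
        "pitch": signed_value(match.group("pitch_sign"),
                              match.group("pitch")),
        "audio_file": int(match.group("audio")),
    }


print(parse_key("MSRSY_F_0001_RN02PN10_0005"))
```

For the example key, the function returns gender "F", voice 1, speech rate 
-2, pitch -10 and audio file 5.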

--------------------------------------------------------------------------------
AUTHORS
--------------------------------------------------------------------------------

MASRI Team	:	Carlos Daniel Hernández Mena 
                         Albert Gatt
                         Claudia Borg
                         Andrea DeMarco
                         Lonneke van der Plas

--------------------------------------------------------------------------------
                 For more information, visit our website
                  https://www.um.edu.mt/projects/masri/
--------------------------------------------------------------------------------