--------------------------------------------------------------------------------
MASRI-SYNTHETIC
Synthetized Speech with Transcriptions in Maltese
produced by the MASRI Team of the University of Malta
--------------------------------------------------------------------------------

--------------------------------------------------------------------------------
PRESENTATION
--------------------------------------------------------------------------------

MASRI-SYNTHETIC is a corpus of synthesized speech in Maltese. The
text-to-speech (TTS) system used to produce the utterances was developed by the
Research & Development Department of Crimsonwing p.l.c.

The sentences used to create the corpus were extracted from the MLRS Corpus, a
corpus of written or transcribed Maltese divided into different genres,
including culture, news, academic, religion and sports. See the section
"BACKGROUND: THE MLRS CORPUS" for more information.

MASRI stands for "Maltese Automatic Speech Recognition I". MASRI is a project
at the University of Malta, funded by the University of Malta Research Fund
Award Scheme.

The MASRI-SYNTHETIC CORPUS was created in June 2020 and was used in
experiments on data augmentation techniques for improving speech recognition of
Maltese.

--------------------------------------------------------------------------------
DISCLAIMER
--------------------------------------------------------------------------------

The MASRI team does not guarantee the accuracy of this corpus, nor its
suitability for any specific purpose. In fact, we expect a number of errors,
omissions and inconsistencies to remain in the corpus.

--------------------------------------------------------------------------------
ACKNOWLEDGEMENTS
--------------------------------------------------------------------------------

We wish to thank KPMG Microsoft Business Solutions (formerly CrimsonWing) for
providing the TTS system used in our experiments.
For more information about the CrimsonWing TTS system see:
https://pdfs.semanticscholar.org/5e5a/25e34b3c351ba0e58211a5192535e9ddea06.pdf

--------------------------------------------------------------------------------
MOTIVATION
--------------------------------------------------------------------------------

The experiments that motivated the creation of the MASRI-SYNTHETIC Corpus
concerned data augmentation techniques for improving speech recognition of
Maltese. We considered three types of data augmentation: unsupervised training
(which involves the creation of automatic transcriptions by an ASR system for
Maltese), multilingual training, and the use of synthesized speech as training
data. The goal was to determine which of these techniques, or which combination
of them, was the most effective at improving speech recognition with only
7 hours of gold transcribed data in Maltese.

Our experiments suggest that multilingual training yields benefits, even when
the transcriptions are noisy. However, gold annotations are better; in
particular, the inclusion of English gold data with automatic (noisy)
transcriptions in Maltese yielded significant gains. Furthermore, we observed
that pretraining on imperfect synthesized data in Maltese also improves
performance, with further gains provided once more by the inclusion of gold
English data. In sum, the combination of these three techniques led to an
absolute reduction of 15% in word error rate (WER) with respect to the baseline
system.

--------------------------------------------------------------------------------
BACKGROUND: THE MLRS CORPUS
--------------------------------------------------------------------------------

The MLRS Corpus is a text corpus of around 250 million tokens in several
different genres, including parliamentary debates, news, law, opinion articles,
sports articles, culture, academic, literature and religious texts.
Tokens in the corpus are tagged with part of speech and labelled with lemmas
and, where relevant, the consonantal root for words of Semitic origin. We used
this text corpus to create the synthesized data, as described in the section
"CREATION METHODOLOGY".

The corpus is available on the Maltese Language Resource Server, and can also
be searched through an online interface. For more information, see:
https://mlrs.research.um.edu.mt/index.php?page=corpora

--------------------------------------------------------------------------------
CREATION METHODOLOGY
--------------------------------------------------------------------------------

The corpus was created following the steps below:

- All the sentences from MLRS were put in a single plain text file. At this
  stage the text still included punctuation marks.

- To facilitate text processing, sentences were split to fit into lines of
  30 words.

- Punctuation marks were removed, as were sentences containing non-UTF-8
  characters.

- Sentences with foreign words and proper names were removed.

- As the letters "c" and "y" do not really belong to the Maltese alphabet,
  sentences including words with either of those letters were removed. This
  was done to ensure that only Maltese words were included in each sentence.

- Using Python, the resulting sentences were put into a single flat list in
  which each element is a word.

- The words of the list were then taken one by one, in order, to produce text
  lines of exactly 13 words. This process generated only 27,714 of the 52,500
  sentences that constitute the whole corpus.

- To produce the remaining sentences, the words of the list were shuffled and
  the process in the previous point was repeated until we obtained the 52,500
  sentences needed for the corpus.

- Finally, the resulting sentences were converted into utterances using the
  TTS system.
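
The last steps of the methodology above (grouping a flat word list into lines
of exactly 13 words, then shuffling and repeating until enough lines exist) can
be sketched in Python as follows. This is only an illustrative sketch, not the
script actually used to build the corpus. The function names "chunk_words" and
"build_sentences", the fixed random seed, the decision to drop leftover words
that do not fill a complete line, and the toy word list are all assumptions
made for this example.

```python
import random


def chunk_words(words, size=13):
    """Return consecutive, non-overlapping lines of exactly `size` words.

    Leftover words that do not fill a complete line are dropped
    (an assumption; the README only says lines have exactly 13 words).
    """
    full = len(words) - len(words) % size
    return [" ".join(words[i:i + size]) for i in range(0, full, size)]


def build_sentences(words, target, size=13, seed=0):
    """Chunk the words in order, then shuffle and re-chunk until `target` lines exist."""
    if len(words) < size:
        raise ValueError("need at least `size` words to build one line")
    rng = random.Random(seed)  # fixed seed is an assumption, for reproducibility
    lines = chunk_words(words, size)  # first pass keeps the original word order
    pool = list(words)  # copy so the caller's list is not shuffled in place
    while len(lines) < target:
        rng.shuffle(pool)
        lines.extend(chunk_words(pool, size))
    return lines[:target]


if __name__ == "__main__":
    # Toy word list standing in for the filtered MLRS words (hypothetical data).
    toy_words = [f"w{i}" for i in range(40)]
    for line in build_sentences(toy_words, target=5):
        print(line)
```

In this sketch the first pass preserves the original word order, so those lines
are fragments of real text, while lines produced after shuffling are random
word sequences.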
--------------------------------------------------------------------------------
CORPUS CHARACTERISTICS
--------------------------------------------------------------------------------

The MASRI-SYNTHETIC CORPUS (MSYC) has the following characteristics:

- The MSYC has a total duration of 99 hours and 18 minutes and contains 52,500
  audio files.

- The MSYC has utterances from 210 different voices: 105 male and 105 female
  voices.

- Voices were produced by varying between 21 values of pitch (-20 to 20) and
  5 values of speech rate (-2 to 2), giving 21 x 5 = 105 combinations per
  gender.

- Data in the MSYC is classified by voice: all the utterances belonging to a
  single voice are stored in a single directory.

- Each voice is assigned 250 utterances of 13 words each.

- Utterances have a duration of between 2 and 10 seconds each.

- Utterances are also classified according to the gender (male/female) of the
  voice.

- Audio files in the MSYC are distributed in 16 kHz, 16-bit, mono format.

- Every audio file has an ID that is compatible with ASR engines such as Kaldi
  and CMU-Sphinx.

- Transcriptions in the MSYC are lowercase. No punctuation marks are permitted
  except dashes (-) and apostrophes ('), because they belong to Maltese
  orthography.

--------------------------------------------------------------------------------
GENERAL ORGANIZATION OF THE DIRECTORIES
--------------------------------------------------------------------------------

The MASRI_SYNTHETIC directory contains the following files and directories:

- files  : Contains the transcription file, the paths file and the
           "Voices_Info.xls" file, which holds relevant information about all
           the voices in the corpus.

- speech : Contains the speech files, classified by gender (male/female
           voice).

- README.txt

--------------------------------------------------------------------------------
THE CORPUS FILES
--------------------------------------------------------------------------------

In the "files" directory one can find the following:

- MASRI_SYNTHETIC.trans : The transcription file, in plain text format with no
                          punctuation marks (other than the dashes and
                          apostrophes of Maltese orthography).

- MASRI_SYNTHETIC.paths : The relative paths from the "speech" directory to
                          every speech file.

- Voices_Info.xls       : Relevant information about the voices, specifically
                          the number of audio files per voice and the total
                          duration of speech per voice.

--------------------------------------------------------------------------------
IDENTIFICATION KEY FORMAT
--------------------------------------------------------------------------------

Every audio file in the MASRI-SYNTHETIC CORPUS has an identification key with
the following format:

    MSRSY_F_0001_RN02PN10_0005

The fields of the key are:

    Field      Meaning
    --------   ------------------------------------------------------------
    MSRSY      Acronym for "MASRI-SYNTHETIC Corpus".
    F          Gender of the voice: "M" for Male, "F" for Female.
    0001       Number of the voice.
    RN02PN10   Special key with information about the current voice,
               explained below.
    0005       Number of the audio file of a particular voice. There are
               no discontinuities in the numbering.

The special key RN02PN10 has two parts, RN02 and PN10:

    Part   Meaning
    ----   ----------------------------------------------------------------
    RN02   "R" is for Speech Rate.
           "N" is for Negative. Possible values are:
               "N" = Negative
               "P" = Positive
               "C" = Zero
           "02" is the Speech Rate.
           In this case, the Speech Rate is -2.

    PN10   "P" is for Pitch.
           "N" is for Negative. Possible values are:
               "N" = Negative
               "P" = Positive
               "C" = Zero
           "10" is the Pitch.
           In this case, the Pitch is -10.
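
As an illustration of the key format described above, the following Python
sketch splits an identification key into its fields. It is a minimal example
and not part of the corpus tools. The names "KEY_RE", "UtteranceKey" and
"parse_key" are hypothetical, and the field widths (four digits for the voice
and audio file numbers, two digits for speech rate and pitch) are inferred from
the single example key shown in this README, so they should be treated as
assumptions.

```python
import re
from typing import NamedTuple

# Pattern inferred from the example key MSRSY_F_0001_RN02PN10_0005.
# The digit widths are an assumption based on that one example.
KEY_RE = re.compile(
    r"^MSRSY_(?P<gender>[MF])_(?P<voice>\d{4})_"
    r"R(?P<rsign>[NPC])(?P<rate>\d{2})"
    r"P(?P<psign>[NPC])(?P<pitch>\d{2})_"
    r"(?P<utt>\d{4})$"
)

# Sign letters: N = negative, P = positive, C = zero.
SIGN = {"N": -1, "P": 1, "C": 0}


class UtteranceKey(NamedTuple):
    gender: str       # "M" or "F"
    voice: int        # number of the voice
    speech_rate: int  # signed speech rate, e.g. -2
    pitch: int        # signed pitch, e.g. -10
    utterance: int    # number of the audio file within the voice


def parse_key(key):
    """Split an identification key into its fields, or raise ValueError."""
    m = KEY_RE.match(key)
    if m is None:
        raise ValueError(f"not a valid identification key: {key!r}")
    return UtteranceKey(
        gender=m["gender"],
        voice=int(m["voice"]),
        speech_rate=SIGN[m["rsign"]] * int(m["rate"]),
        pitch=SIGN[m["psign"]] * int(m["pitch"]),
        utterance=int(m["utt"]),
    )


if __name__ == "__main__":
    print(parse_key("MSRSY_F_0001_RN02PN10_0005"))
```

For the example key, the sketch yields gender "F", voice 1, speech rate -2,
pitch -10 and audio file number 5, which matches the explanation above.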

--------------------------------------------------------------------------------
AUTHORS
--------------------------------------------------------------------------------

MASRI Team:

    Carlos Daniel Hernández Mena
    Albert Gatt
    Claudia Borg
    Andrea DeMarco
    Lonneke van der Plas

--------------------------------------------------------------------------------
For more information, visit our website:
https://www.um.edu.mt/projects/masri/
--------------------------------------------------------------------------------