MASRI Synthetic

LDC2022S08

Introduction

MASRI (Maltese Automatic Speech Recognition I) Synthetic, Linguistic Data Consortium (LDC) Catalog Number LDC2022S08 and ISBN 1-58563-995-8, was developed by the MASRI team at the University of Malta and consists of approximately 99 hours of synthesized Maltese speech.

Data

Source sentences were extracted from the Maltese Language Resource Server (MLRS) corpus, which comprises written or transcribed Maltese covering various genres, including parliamentary debates, news, law, opinion, sports, culture, academic, literature and religious texts. The text was processed through the CrimsonWing text-to-speech system to generate speech files. Synthesized speech was created with 210 voices (105 male and 105 female).

Audio files are presented as 16 kHz, 16-bit, single-channel FLAC files. When decompressed, they yield PCM WAV files.
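
The audio parameters stated above can be spot-checked once a file has been decompressed to WAV. The following is a minimal sketch using only the Python standard library. The function names and the EXPECTED dictionary are illustrative and are not part of this publication; the sketch assumes the FLAC file has already been decoded to PCM WAV with a tool of your choice.

```python
import wave

# Parameters documented for this corpus: single channel, 16-bit (2 bytes), 16 kHz.
EXPECTED = {"channels": 1, "sample_width_bytes": 2, "sample_rate_hz": 16000}


def wav_format(path):
    """Return the channel count, sample width and sample rate of a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return {
            "channels": wav.getnchannels(),
            "sample_width_bytes": wav.getsampwidth(),
            "sample_rate_hz": wav.getframerate(),
        }


def matches_documented_format(path):
    """True if the decompressed file has the documented audio parameters."""
    return wav_format(path) == EXPECTED
```

Calling matches_documented_format on a decompressed file returns False if any of the three parameters differs from the documented values, and wav_format shows which one.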

Transcripts are contained in a single plain text file encoded as UTF-8.

Directory Structure

Please see file.tbl for a complete file list as well as checksums for this publication.
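
The checksums in file.tbl can be used to confirm that a download is intact. The sketch below is illustrative only: the layout of file.tbl is not described here, so the code ASSUMES each line holds a hexadecimal checksum followed by whitespace and a path relative to the corpus root, and it ASSUMES MD5 as the algorithm. Inspect file.tbl and adjust the parsing and the algorithm name before relying on it. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def file_digest(path, algorithm="md5", chunk_size=1 << 20):
    """Hash a file in chunks so large audio files are not loaded into memory at once."""
    digest = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(table_path, root, algorithm="md5"):
    """Return the relative paths whose computed checksum differs from the table.

    ASSUMPTION: each table line is "<hex checksum> <relative path>".
    Lines that do not split into two fields are skipped.
    """
    mismatches = []
    for line in Path(table_path).read_text(encoding="utf-8").splitlines():
        parts = line.split(maxsplit=1)
        if len(parts) != 2:
            continue
        expected, relative_path = parts[0].lower(), parts[1].strip()
        if file_digest(Path(root) / relative_path, algorithm) != expected:
            mismatches.append(relative_path)
    return mismatches
```

An empty list from find_mismatches means every listed file matched under the assumed layout. A file that is listed but missing raises FileNotFoundError in this sketch.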

Updates

Additional information, updates, and bug fixes may be available in the LDC catalog entry for this corpus, LDC2022S08.

Content Copyright

Portions © 2022 University of Malta, © 2022 Trustees of the University of Pennsylvania