Global TIMIT Thai

Global TIMIT Thai, also called THAIMIT, was collected by Nattanun Chanchaochai from August 5 to August 15, 2016, with support from the Linguistic Data Consortium. It consists of approximately 11.77 hours of read speech with time-aligned transcripts in Standard Thai.

The Global TIMIT Initiative

Recognizing the popularity among researchers of the original English TIMIT corpus (Garofolo et al. 1993) and the applicability of its overall structure to language documentation and human language technology development, the Global TIMIT initiative aims to create a series of corpora in many linguistic varieties that preserve the key features that make the original so useful. The TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. However, a review of the papers citing TIMIT (Chanchaochai et al. 2018) shows that it has been used for speech recognition, speaker recognition, speech synthesis, speech coding, speech enhancement, voice activity detection, speech perception, overlap detection and source separation, diagnosis of speech and language disorders, and linguistic phonetics, among other purposes. TIMIT's key features are specifically:

1) a large number of fluently read sentences
2) containing a representative sample of linguistic patterns (phonetic, lexical, syntactic, semantic, pragmatic)
3) a relatively large number of individual speakers (anonymized)
4) high-quality recordings
5) time-aligned lexical and phonetic transcription of all utterances
6) a pattern of distribution of sentences across speakers such that some sentences are read by all speakers, others are read by a few speakers, and still others are read by just one speaker
7) broad availability.
While the original TIMIT recruited 630 speakers to read 10 sentences each, Global TIMIT seeks to reduce recruiting costs while still presenting each speaker with a manageable task and yielding ~6000 read sentences.

Global TIMIT Thai

Global TIMIT Thai consists of 50 speakers each reading 120 sentences, yielding a total of 6000 read sentences and, as noted above, 11.77 hours of read speech.

Sentence Selection/Development

Sentence selection was accomplished by the following process:

1) Selecting ~10,000 sentences automatically from the Thai National Corpus II, the Thai Junior Encyclopedia, and Thai Wikipedia. 75% of the sentences were selected from the Thai National Corpus, based on searches using the most frequent words in the corpus documentation and then sub-selecting to provide representation across each of the six corpus genres: fiction, newspaper, non-academic, academic, law, and miscellaneous.
2) Manually checking the selected sentences, by a native-speaker linguist, to remove any sentences deemed too short to contain meaningful phonetic content or too long to be read comfortably, or containing inappropriate characters, rare words, foreign words, or many numeric expressions, until 2124 sentences remained.
3) Dividing those sentences randomly into groups such that 24 sentences form the 'calibration' group, to be read by all subjects; 300 form the 'common' group, each to be read by 10 subjects; and the remaining 1800 form the 'unique' group, each to be read by only one subject.
4) Rechecking the resulting sentences, by a native-speaker linguist, and lightly editing them to correct any errors and adjust spacing where needed to aid readability.

Speakers

Speakers were recruited in the Bangkok Metropolitan area and were fluent in Standard Thai, literate, and born and raised in Thailand. Given our interest in representation across regions of Thailand, we did not require that subjects be native speakers of Standard Thai.
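The distribution above implies that each speaker reads 24 calibration, 60 common, and 36 unique sentences. A quick arithmetic check (a sketch for the reader, not part of the corpus tooling):

```python
# Sanity-check the Global TIMIT Thai sentence distribution described above.
speakers = 50
calibration = 24   # each read by all 50 speakers
common = 300       # each read by 10 speakers
unique = 1800      # each read by exactly one speaker

per_speaker = calibration + common * 10 // speakers + unique // speakers
total_readings = calibration * speakers + common * 10 + unique
distinct_sentences = calibration + common + unique

print(per_speaker)        # 120 sentences per speaker
print(total_readings)     # 6000 recorded utterances
print(distinct_sentences) # 2124 distinct sentences
```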
Demographic details were collected, including sex, date and place of birth, recording date, languages spoken, and any comments the subjects made on any of the above.

Recording

The corpus was recorded in a quiet room at the Do D Foundation using a head-mounted noise-cancelling microphone with integrated A-to-D conversion and USB connection (Logitech H390) connected to a laptop (Dell XPS13-9343). The recording method yielded a single audio file for each sentence read.

QC

In general, the list of sentences presented to each speaker forms a rough transcript. In cases where the speaker varied from the prompted sentence, the script was copied and edited to form an accurate transcript. These (tran)scripts were then used to create a corpus-specific aligner and were then force-aligned. The prompting sentences for each recording are available in a 'recordings' table, while the aligner outputs, the 'words' and TextGrid files, represent what the speaker actually uttered.

Alignment

A custom HTK-based aligner was created specifically for this corpus. Inputs to the aligner, in addition to the recorded audio and corrected transcripts, were: a 'word' tokenization provided by the Smart Word Analysis for Thai (SWATH) tool, pronunciations from the Mary R. Haas Thai Dictionary Project, and ~1000 pronunciation entries added specifically for this corpus. The aligner provided timestamps at the word, phone, and tone levels and also generated Praat TextGrids.

Corpus Formats

All speech data are presented as 16kHz, 16-bit audio converted from wav to flac format with lossless compression. Each audio file has accompanying '.words', '.phones' and '.tones' segmentation files. The '.words' files contain one row for each word or extent of silence in the file, represented as a triple of start time and end time, measured in seconds from the start of the file, and either the word in standard orthography or the string "sil" for silence.
The format of the '.phones' files is identical except that the third column contains the phone uttered during that time extent, rendered in the DARPABET representation used to train the aligner. Similarly, the '.tones' file format is identical but for a representation of the tone in the third column. The resulting data are organized into 4 directories:

1) flac: containing 6000 audio files, one for each utterance
2) segmentation: containing 6000 each of words, phones and tones files
3) TextGrids: containing 6000 Praat TextGrids, one for each utterance
4) tsv: containing the segmentations converted to tab-separated label files

The latest versions of Elan can read the .flac audio and TextGrid files, meaning the corpus should be easily accessible to users of Elan as well as Praat and Audacity, and to those who prefer to interact with the data using their own tools.

Corpus metadata include the following tables:

sentence_types.tbl containing one row for each of the 2124 sentences selected to be read by the subjects, with columns for the Sentence ID, the sentence in Thai orthography, and the type: calibration, common or unique, as described above. Note that there are two instances where a sentence in the common category differs by the presence of one space from the other 9 instances. As these were exactly what was presented to the speakers, we have left the spacing variation in the table.

subjects.tbl containing 50 rows, one for each subject, with columns for Subject ID, Sex, Region and Province where the speaker was born and raised, Recording Date, Date of Birth, Height, Education, Languages spoken, and any Comments the subject offered about any of the above.
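The three-column segmentation format described above can be read with a few lines of Python. This is only a sketch: the whitespace delimiter and the example file name are assumptions, not taken from the corpus documentation.

```python
# Sketch: parse a three-column segmentation file (.words, .phones or .tones)
# into (start, end, label) triples. Whitespace-separated columns and the
# example file name below are assumptions about the format.
def read_segments(path):
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank or malformed rows
            start, end = float(parts[0]), float(parts[1])
            label = parts[2]  # Thai word, phone, tone, or "sil" for silence
            segments.append((start, end, label))
    return segments

# Example use (hypothetical file name): total speech time, excluding silence
# words = read_segments("T01_0001.words")
# speech = sum(end - start for start, end, lab in words if lab != "sil")
```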
recordings.tbl containing one row for each recording of each sentence uttered by each speaker, 6000 rows in total, with columns for the Subject ID, Prompt # (giving the order in which the prompt was presented, which differs by speaker because each speaker received a unique set of sentences in a unique randomization), Sentence ID, the sentence in Thai orthography, and the audio file name.

audio_properties.tbl containing one row for each recording of each sentence uttered by each speaker, 6000 rows in total, with columns for the number of seconds, samples, and sectors in the audio, the file size, the bit rate, and the audio file name, as reported by soxi.

audio_quality.tbl containing one row for each recording of each sentence uttered by each speaker, 6000 rows in total, with columns for the pseudoSNR (estimated by subtracting the 15th quantile from the 85th), the number of samples and of clipped samples, the maximum absolute sample value, the 15th, 85th, 90th and 95th quantiles, and the audio file name. Users should note that in collections of this type, with short spans of audio in individual files, such measures of pSNR can be more strongly affected by the amount of silence included before and after each utterance than in cases where a file contains a long span of continuous speech.

Citation

Users of the corpus should cite the corpus as immediately below, along with its reference paper in the Interspeech Proceedings:

Nattanun Chanchaochai, Mark Liberman, Jiahong Yuan, Christopher Cieri, Jonathan Wright (2022) Global TIMIT Standard Thai. Philadelphia: Linguistic Data Consortium.

References

Chanchaochai, Nattanun, Christopher Cieri, Japhet Debrah, Hongwei Ding, Yue Jiang, Sishi Liao, Mark Liberman, Jonathan Wright, Jiahong Yuan, Juhong Zhan, Yuqing Zhan (2018) GlobalTIMIT: Acoustic-Phonetic Datasets for the World's Languages. Proceedings of the 19th Annual Conference of the International Speech Communication Association (Interspeech 2018), Hyderabad, September 2-6.

Charoenpornsawat, P.
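A quantile-spread pseudoSNR of the kind described above can be sketched as follows. The framing, frame length, and dB scaling here are assumptions about the general technique, not the corpus's exact script; only the "85th minus 15th quantile" idea comes from the documentation.

```python
import numpy as np

def pseudo_snr(samples, rate=16000, frame_ms=20):
    """Sketch of a pseudoSNR estimate: the 85th minus the 15th quantile of
    per-frame power in dB. Frame length and dB scaling are assumptions."""
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame * frame
    frames = np.asarray(samples[:n], dtype=float).reshape(-1, frame)
    power_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return np.quantile(power_db, 0.85) - np.quantile(power_db, 0.15)
```

On a short file that is mostly leading and trailing silence, the 15th quantile tracks the silence floor while the 85th tracks the speech, which is why the amount of padding silence influences the estimate, as the caveat above notes.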
(1999) Feature-based Thai word segmentation. Master's thesis, Computer Engineering, Chulalongkorn University. Software by Theppitak Karoonboonyanan. Online: available from [https://linux.thai.net/projects/swath]

Department of Linguistics, Chulalongkorn University (2013) Thai National Corpus II. Online: available from [http://www.arts.chula.ac.th/ling/TNCII/]

Garofolo, John S., Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, Victor Zue (1993) TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web Download. Philadelphia: Linguistic Data Consortium.

Haas, M. R. (1951) The Mary R. Haas Thai Dictionary Project. Online: available from [http://sealang.net/thai/dictionary.htm]

Thai Junior Encyclopedia by Royal Command of His Majesty the King (1997) Thai Junior Encyclopedia. Online: available from [http://saranukromthai.or.th/sub/Ebook/Ebbok.php]

Sponsorship

The authors acknowledge the generous support of the University of Pennsylvania's Office of the Vice Provost for Global Initiatives (PennGlobal) through its Penn China Research and Engagement Fund, and the School of Arts and Sciences through its Global Engagement Fund, for their support of the Global TIMIT model, which was then applied to the Thai effort, as well as the Linguistic Data Consortium for direct support of the Thai collection.

Acknowledgements

The authors would also like to thank the collection coordinator, Vhira Bubwiyaprond; the non-profit organization that hosted the recordings, the Do D Foundation; and all the speakers in the corpus.

Updates

None at this time.

Copyright

Portions © the Thai Junior Encyclopedia Foundation under the patronage of His Majesty King Bhumibol Adulyadej the Great, the Thai National Corpus and its sources, and the Wikipedia contributors.