Global TIMIT Thai

Global TIMIT Thai, also called THAIMIT, was collected by Nattanun Chanchaochai from August 5 to August 15, 2016, with support from the Linguistic Data Consortium. It consists of approximately 11.77 hours of read speech with time-aligned transcripts in Standard Thai.

The Global TIMIT Initiative

Recognizing the popularity among researchers of the original English TIMIT corpus (Garofolo et al. 1993) and the applicability of its overall structure to language documentation and human language technology development, the Global TIMIT initiative aims to create a series of corpora in many linguistic varieties that preserve the key features that make the original so useful. The TIMIT Acoustic-Phonetic Continuous Speech Corpus (LDC93S1) was designed for acoustic-phonetic studies and for the development and evaluation of automatic speech recognition systems. However, a review of the papers citing TIMIT (Chanchaochai et al. 2018) shows that it has been used for speech recognition, speaker recognition, speech synthesis, speech coding, speech enhancement, voice activity detection, speech perception, overlap detection and source separation, diagnosis of speech and language disorders, and linguistic phonetics, among other purposes. TIMIT's key features are specifically:

1) a large number of fluently read sentences
2) containing a representative sample of linguistic patterns (phonetic, lexical, syntactic, semantic, pragmatic)
3) a relatively large number of individual speakers (anonymized)
4) high-quality recordings
5) time-aligned lexical and phonetic transcription of all utterances
6) a pattern of distribution of sentences across speakers such that some sentences are read by all speakers, others are read by a few speakers, and still others are read by just one speaker
7) broad availability.
While the original TIMIT recruited 630 speakers to read 10 sentences each, Global TIMIT seeks to reduce recruiting costs while still presenting each speaker with a manageable task and yielding ~6000 read sentences.

Global TIMIT Thai

Global TIMIT Thai consists of 50 speakers each reading 120 sentences, yielding a total of 6000 read sentences and, as noted above, 11.77 hours of read speech.

Sentence Selection/Development

Sentence selection was accomplished by the following process:

1) Selecting ~10,000 sentences automatically from the Thai National Corpus II, the Thai Junior Encyclopedia, and Thai Wikipedia. 75% of the sentences were selected from the Thai National Corpus, based on searches using the most frequent words in the corpus documentation and then sub-selecting to provide representation across each of the six corpus genres: fiction, newspaper, non-academic, academic, law, and miscellaneous.
2) Manually checking the selected sentences, by a native-speaker linguist, to remove any sentences deemed too short to contain meaningful phonetic content or too long to be read comfortably, or containing inappropriate characters, rare words, foreign words, or many numeric expressions, until 2124 sentences remained.
3) Dividing those sentences randomly into groups such that 24 sentences form the 'calibration' group, to be read by all subjects; 300 form the 'common' group, each to be read by 10 subjects; and the remaining 1800 form the 'unique' group, each to be read by only one subject.
4) Rechecking the resulting sentences, by a native-speaker linguist, and lightly editing them to correct any errors and adjust spacing where needed to aid readability.

Speakers

Speakers were recruited in the Bangkok Metropolitan area and were fluent in Standard Thai, literate, and born and raised in Thailand. Given our interest in representation across regions of Thailand, we did not require that subjects be native speakers of Standard Thai.
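The distribution above implies that each speaker reads 24 calibration, 60 common, and 36 unique sentences. A quick arithmetic check (a sketch for the reader, not part of the corpus tooling):

```python
# Sanity-check the Global TIMIT Thai sentence distribution described above.
speakers = 50
calibration = 24   # each read by all 50 speakers
common = 300       # each read by 10 speakers
unique = 1800      # each read by exactly one speaker

per_speaker = calibration + common * 10 // speakers + unique // speakers
total_readings = calibration * speakers + common * 10 + unique
distinct_sentences = calibration + common + unique

print(per_speaker)        # 120 sentences per speaker
print(total_readings)     # 6000 recorded utterances
print(distinct_sentences) # 2124 distinct sentences
```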
Demographic details were collected, including sex, date and place of birth, recording date, languages spoken, and any comments the subjects made on any of the above.

Recording

The corpus was recorded in a quiet room at the Do D Foundation using a head-mounted noise-cancelling microphone with integrated A-to-D conversion and USB connection (Logitech H390) connected to a laptop (Dell XPS13-9343). The recording method yielded a single audio file for each sentence read.

QC

In general, the list of sentences presented to each speaker forms a rough transcript. In cases where the speaker varied from the prompted sentence, the script was copied and edited to form an accurate transcript. These (tran)scripts were then used to create a corpus-specific aligner and were then force-aligned. The prompting sentences for each recording are available in a 'recordings' table, while the aligner outputs, the 'words' and TextGrid files, represent what the speaker actually uttered.

Alignment

A custom HTK-based aligner was created specifically for this corpus. Inputs to the aligner, in addition to the recorded audio and corrected transcripts, were: a 'word' tokenization provided by the Smart Word Analysis for Thai (SWATH) tool, pronunciations from the Mary R. Haas Thai Dictionary Project, and ~1000 pronunciation entries added specifically for this corpus. The aligner provided timestamps at the word, phone, and tone levels and also generated Praat TextGrids.

Corpus Formats

All speech data are presented as 16kHz, 16-bit audio converted from wav to flac format with lossless compression. Each audio file has accompanying '.words', '.phones' and '.tones' segmentation files. The '.words' files contain one row for each word or extent of silence in the file, represented as a triple of start time and end time, measured in seconds from the start of the file, and either the word in standard orthography or the string "sil" for silence.
The format of the '.phones' files is identical except that the third column contains the phone uttered during that time extent, rendered in the DARPABET representation used to train the aligner. Similarly, the '.tones' file format is identical but for a representation of the tone in the third column. The resulting data are organized into 4 directories:

1) flac: containing 6000 audio files, one for each utterance
2) segmentation: containing 6000 each of words, phones and tones files
3) TextGrids: containing 6000 Praat TextGrids, one for each utterance
4) tsv: containing the segmentations converted to tab-separated label files

The latest versions of Elan can read the .flac audio and TextGrid files, meaning the corpus should be easily accessible to users of Elan as well as Praat and Audacity, and to those who prefer to interact with the data using their own tools.

Corpus metadata include the following tables:

sentence_types.tbl containing one row for each of the 2124 sentences selected to be read by the subjects, with columns for the Sentence ID, the sentence in Thai orthography, and the type: calibration, common or unique, as described above. Note that there are two instances where a sentence in the common category differs by the presence of one space from the other 9 instances. As these were exactly what was presented to the speakers, we have left the spacing variation in the table.

subjects.tbl containing 50 rows, one for each subject, with columns for Subject ID, Sex, Region and Province where the speaker was born and raised, Recording Date, Date of Birth, Height, Education, Languages spoken, and any Comments the subject offered about any of the above.
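The three-column segmentation format described above can be read with a few lines of Python. This is only a sketch: the whitespace delimiter and the example file name are assumptions, not taken from the corpus documentation.

```python
# Sketch: parse a three-column segmentation file (.words, .phones or .tones)
# into (start, end, label) triples. Whitespace-separated columns and the
# example file name below are assumptions about the format.
def read_segments(path):
    segments = []
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.split()
            if len(parts) < 3:
                continue  # skip blank or malformed rows
            start, end = float(parts[0]), float(parts[1])
            label = parts[2]  # Thai word, phone, tone, or "sil" for silence
            segments.append((start, end, label))
    return segments

# Example use (hypothetical file name): total speech time, excluding silence
# words = read_segments("T01_0001.words")
# speech = sum(end - start for start, end, lab in words if lab != "sil")
```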
recordings.tbl containing one row for each recording of each sentence uttered by each speaker, 6000 rows in total, with columns for the Subject ID, Prompt # (giving the order in which the prompt was presented, which differs by speaker because each speaker received a unique set of sentences in a unique randomization), Sentence ID, the sentence in Thai orthography, and the audio file name.

audio_properties.tbl containing one row for each recording of each sentence uttered by each speaker, 6000 rows in total, with columns for the number of seconds, samples, and sectors in the audio, the file size, the bit rate, and the audio file name, as reported by soxi.

audio_quality.tbl containing one row for each recording of each sentence uttered by each speaker, 6000 rows in total, with columns for the pseudoSNR (estimated by subtracting the 15th quantile from the 85th), the number of samples and of clipped samples, the maximum absolute sample value, the 15th, 85th, 90th and 95th quantiles, and the audio file name. Users should note that in collections of this type, with short spans of audio in individual files, such measures of pSNR can be more strongly affected by the amount of silence included before and after each utterance than in cases where a file contains a long span of continuous speech.

Citation

Users of the corpus should cite the corpus as immediately below, along with its reference paper in the Interspeech Proceedings:

Nattanun Chanchaochai, Mark Liberman, Jiahong Yuan, Christopher Cieri, Jonathan Wright (2022) Global TIMIT Standard Thai. Philadelphia: Linguistic Data Consortium.

References

Chanchaochai, Nattanun, Christopher Cieri, Japhet Debrah, Hongwei Ding, Yue Jiang, Sishi Liao, Mark Liberman, Jonathan Wright, Jiahong Yuan, Juhong Zhan, Yuqing Zhan (2018) GlobalTIMIT: Acoustic-Phonetic Datasets for the World's Languages. Proceedings of the 19th Annual Conference of the International Speech Communication Association (Interspeech 2018), Hyderabad, September 2-6.

Charoenpornsawat, P.
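A quantile-spread pseudoSNR of the kind described above can be sketched as follows. The framing, frame length, and dB scaling here are assumptions about the general technique, not the corpus's exact script; only the "85th minus 15th quantile" idea comes from the documentation.

```python
import numpy as np

def pseudo_snr(samples, rate=16000, frame_ms=20):
    """Sketch of a pseudoSNR estimate: the 85th minus the 15th quantile of
    per-frame power in dB. Frame length and dB scaling are assumptions."""
    frame = int(rate * frame_ms / 1000)
    n = len(samples) // frame * frame
    frames = np.asarray(samples[:n], dtype=float).reshape(-1, frame)
    power_db = 10 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    return np.quantile(power_db, 0.85) - np.quantile(power_db, 0.15)
```

On a short file that is mostly leading and trailing silence, the 15th quantile tracks the silence floor while the 85th tracks the speech, which is why the amount of padding silence influences the estimate, as the caveat above notes.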
(1999) Feature-based Thai word segmentation. Master's thesis, Computer Engineering, Chulalongkorn University. Software by Theppitak Karoonboonyanan. Online: available from [https://linux.thai.net/projects/swath]

Department of Linguistics, Chulalongkorn University (2013) Thai National Corpus II. Online: available from [http://www.arts.chula.ac.th/ling/TNCII/]

Garofolo, John S., Lori F. Lamel, William M. Fisher, Jonathan G. Fiscus, David S. Pallett, Nancy L. Dahlgren, Victor Zue (1993) TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1. Web Download. Philadelphia: Linguistic Data Consortium.

Haas, M. R. (1951) The Mary R. Haas Thai Dictionary Project. Online: available from [http://sealang.net/thai/dictionary.htm]

Thai Junior Encyclopedia by Royal Command of His Majesty the King (1997) Thai Junior Encyclopedia. Online: available from [http://saranukromthai.or.th/sub/Ebook/Ebbok.php]

Sponsorship

The authors acknowledge the generous support of the University of Pennsylvania's Office of the Vice Provost for Global Initiatives (PennGlobal) through its Penn China Research and Engagement Fund, and the School of Arts and Sciences through its Global Engagement Fund, for their support of the Global TIMIT model, which was then applied to the Thai effort, as well as the Linguistic Data Consortium for direct support of the Thai collection.

Acknowledgements

The authors would also like to thank the collection coordinator, Vhira Bubwiyaprond; the non-profit organization that hosted the recordings, the Do D Foundation; and all the speakers in the corpus.

Updates

None at this time.

Copyright

Portions © the Thai Junior Encyclopedia Foundation under the patronage of His Majesty King Bhumibol Adulyadej the Great, the Thai National Corpus and its sources, and the Wikipedia contributors.