The DKU-JNU-EMA Electromagnetic Articulography Database
=======================================================

This corpus is a collection of Chinese electromagnetic articulography (EMA) data covering Mandarin Chinese and three Chinese dialects: Cantonese, Hakka, and Teochew. It contains articulography data for 3,330 read utterances, with 2-7 native speakers per language or dialect.

We used the NDI electromagnetic articulography speech research system (https://www.ndigital.com/msci/products/wave-speech-research) to capture real-time vocal tract variable trajectories. In addition, subjects wore a head-mounted close-talk microphone to record the speech signal simultaneously. Alongside the speech research system, an audio interface (MOTU MicroBook IIC) was used to capture the audio. EMA data were collected at a 100 Hz sampling rate, while the simultaneous audio was recorded at 22 kHz and then downsampled to 16 kHz. Six sensors were placed in each subject's mouth (at the upper lip, lower lip, lower incisor, tongue tip, tongue body, and tongue dorsum) and one at the bridge of the nose as a reference point.

The DKU-JNU-EMA database was collected at Jinan University, China. The purpose of this database is to provide articulography data in Chinese and, moreover, to extend the diversity of available articulography data, so that it can support a wider range of research fields, such as speech recognition, acoustic-to-articulatory inversion, speech production, and dialect recognition.

Data Composition
----------------

There are 4 sessions of utterances in the database; each language or dialect has 2-3 sessions:

- Sentence session: subjects read complete sentences or short texts.
- Consonant session: for each given consonant, subjects read words containing that consonant.
- Vowel session: for each given vowel, subjects read words containing that vowel.
- Tone session: for each given word, subjects read the word with every tone of that language or dialect.

Each language and dialect has a reference alphabet, as well as phonetically balanced texts and sentences selected for recording. The reference files can be found in PDF format in the 'docs' directory for each language. In addition, the database contains parameters of each subject's hard palate trace shape.

Data Description
----------------

All collected data (including the probe trajectory for the palate trace) use the upper part of the nose bridge, just below the midpoint of the eyebrows, as a reference point.

The pronunciation trajectory files are stored as TSV files containing 7 channels:

-------  --------------------
Channel  Corresponding Part
-------  --------------------
P01-CH0  Upper lip
P01-CH1  Lower lip
P02-CH0  Root of the tongue
P02-CH1  Middle of the tongue
P03-CH0  Tip of the tongue
P03-CH1  Gums
P04-CH0  Reference electrode
-------  --------------------

The probe trajectory files contain 6 channels:

-------  --------------------
Channel  Corresponding Part
-------  --------------------
P01-CH0  None
P01-CH1  None
P02-CH0  None
P02-CH1  None
P03-CH0  Probe
P04-CH0  Reference electrode
-------  --------------------

In the data files, Tx, Ty, and Tz represent coordinate points.

Note:

- Due to setup issues, the reference electrode coordinates must be subtracted from each channel's coordinates before use, e.g., subtract Tx of P04-CH0 from Tx of P01-CH0 to obtain the actual Tx coordinate of P01-CH0.
- The paper accompanying this dataset describes X as the front-back movement trajectory of the attached points and Y as the up-down movement trajectory. In this dataset, however, X actually corresponds to the up-down (vertical) trajectory and Y to the front-back trajectory.
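The two corrections above (reference subtraction and the X/Y swap) can be sketched as follows, assuming the (Tx, Ty, Tz) triples of a sensor channel and of the reference electrode P04-CH0 have already been loaded into NumPy arrays; the function name and array layout are illustrative, not part of the corpus tooling:

```python
import numpy as np

def correct_coordinates(channel_xyz, reference_xyz):
    """Apply the dataset's two required corrections to one sensor channel.

    Both inputs have shape (num_frames, 3), holding (Tx, Ty, Tz) per frame.
    """
    # 1. Subtract the reference electrode (P04-CH0) trajectory from the
    #    channel's trajectory to obtain the actual coordinates.
    corrected = np.asarray(channel_xyz, dtype=float) - np.asarray(reference_xyz, dtype=float)
    # 2. Per the dataset note, the stored X axis is actually the up-down
    #    trajectory and Y the front-back one; swap them so that X means
    #    front-back as in the accompanying paper.
    corrected[:, [0, 1]] = corrected[:, [1, 0]]
    return corrected
```

For example, a frame (5, 2, 1) for P01-CH0 with reference (1, 1, 1) becomes (4, 1, 0) after subtraction and (1, 4, 0) after the axis swap.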
Types of files
--------------

The directory structure is as follows:

root
├── data
│   └── language
│       ├── docs
│       └── speaker
│           └── section
│               ├── utterance.flac
│               ├── utterance.tsv
│               └── utterance.wco
└── docs

The utterances are recorded in the same order as in the documents (reference alphabet, texts, and sentences). Each utterance name therefore contains its order index, so an utterance can be matched with its text content (transcript) when needed.

- utterance.flac: 1 channel, 16000 Hz sample rate, 16-bit precision.
- utterance.wco: the parameters of the NDI speech research system.
- utterance.tsv: the data of each sensor (roll, pitch, yaw, X-Y-Z location).

For more details, please see the paper "The DKU-JNU-EMA Electromagnetic Articulography Database on Mandarin and Chinese Dialects with Tandem Feature based Acoustic-to-Articulatory Inversion".