####################################################
##### ARTICULATION INDEX CORPUS - LSCP version #####
####################################################


###############
A) INTRODUCTION
###############

The ARTICULATION INDEX CORPUS - LSCP version (AILSCP) was assembled from a subset of the original ARTICULATION INDEX corpus (AIC) distributed by the LDC (https://catalog.ldc.upenn.edu/LDC2005S22). See the online documentation for the original corpus at https://catalog.ldc.upenn.edu/docs/LDC2005S22/.

Twenty speakers of American English (12 males, 8 females) were recorded while they pronounced syllables, some of which form actual words, but most of which are nonsense syllables. All possible Consonant-Vowel (CV) and Vowel-Consonant (VC) combinations were recorded twice for each speaker:

	- once in isolation

	- once within a carrier-sentence with the following structure: WORD1 WORD2 SYLLABLE WORD3

for a total of 25768 recorded syllables.


#######################################
B) DIFFERENCES FROM THE ORIGINAL CORPUS
#######################################

1 - The original AIC contains recordings for some triphones (CVC, CCV or VCC), which are not included in the LSCP version.

2 - Time-alignments for the onset and offset of each word and syllable were obtained through forced-alignment with a standard HMM-GMM ASR system.

3 - The time-alignments for the beginning and end of the syllables (whether in isolation or within a carrier sentence) were manually adjusted. The time-alignments for the other words in carrier sentences were NOT manually adjusted.

4 - The recordings of isolated syllables were cut according to the manual time-alignments to remove the silent portions at the beginning and end (and the time-alignments were altered to correspond to the cut recordings).

5 - The naming scheme for the files was slightly altered for compatibility with the Kaldi speech recognition toolkit (http://kaldi.sourceforge.net/): the symbol indicating the type of recording (isolated or within sentence) and the speaker identifier were swapped in the filenames (see section C of this file).

6 - The original AIC contains a wide-band (16 kHz, 16-bit PCM) and a narrow-band (8 kHz, 8-bit u-law) version of the recordings, distributed in sphere (.spn) format. The LSCP version only contains the wide-band version, distributed as wavefiles (.wav).

7 - Several files from the original corpus were problematic for a variety of reasons. They were corrected when possible and otherwise removed from the corpus (see section D of this file). Some recordings (n=52) which did not conform to the standard format of the corpus, but still had a usable syllable part, were included in the corpus. These recordings are included in the data/speech and data/annotations folders without differentiating them from the others, but they are tagged as 'weird' stimuli in the data/text/weird.txt file.


####################
C) FILES AND FORMATS
####################

Organization of the corpus folder:
	doc/
 		doc/readme.txt
	data/
 		data/speech/
 			... (Contains 25768 wavefiles)
 		data/text/
 			data/text/normal.txt 
 			data/text/weird.txt
 		data/annotations/
 			data/annotations/alignments.txt


doc/readme.txt: This file.


data/speech:
	Contains all the recordings as wavefiles (mono 16KHz 16-bit PCM encoding).
	Note that both "normal" and "weird" recordings wavefiles are included in this folder (see data/text/weird.txt description below).

	Example filename: m112_s_xuxz.wav

	There are 3 parts in each filename:

		1. An identifier for the speaker ('m112' in the example)

			List of the 20 talker identifiers (beginning with 'f' for females and 'm' for males):
				
				f101
				m102
				f103
				m104
				f105
				f106
				m107
				f108
				f109
				m110
				m111
				m112
				f113
				m114
				m115
				m116
				m117
				m118
				f119
				m120

		2. An identifier for the type of recording ('s' for isolated syllables and 'p' for syllables within a carrier sentence)
		
		3. The syllable ('xuxz' in the example) encoded using the following map between phonemes and ASCII characters:
			
			ASCII	IPA		Example

			a 		ɑː 		bott
			xq 		æ 		bat
			xa 		ʌ 		but
			c 		ɔː 		bought
			xw 		aʊ 		bout
			xy 		aɪ 		bite
			xr 		ɝ 		bird
			xe 		ɛ 		bet
			e 		eɪ 		bait
			xi 		ɪ 		bit
			i 		iː 		beet
			o 		oʊ 		boat
			xo 		ɔɪ 		boy
			xu 		ʊ 		book
			u 		uː 		boot

			b 		b 		bee
			xc 		ʧ 		choke
			d 		d 		day
			xd 		ð 		then
			f 		f 		fin
			g 		g 		gay
			h 		h 		hay
			xj 		ʤ 		joke
			k 		k 		key
			l 		l 		lay
			m 		m 		mom
			n 		n 		noon
			xg 		ŋ 		sing
			p 		p 		pea
			r 		r 		ray
			s 		s 		sea
			xs 		ʃ 		she
			t 		t 		tea
			xt 		θ 		thin
			v 		v 		van
			w 		w 		way
			y 		j 		yacht
			z 		z 		zone
			xz 		ʒ 		azure
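
The three-part filename layout and the phoneme map above can be decoded mechanically. The following is a minimal sketch, not part of the corpus tools: the function names `split_phonemes` and `parse_filename` are hypothetical, and the splitting rule (an 'x' always starts a two-character code) is an assumption inferred from the table above.

```python
# Phoneme codes copied from the ASCII/IPA table in this readme.
VOWELS = {"a", "xq", "xa", "c", "xw", "xy", "xr", "xe", "e",
          "xi", "i", "o", "xo", "xu", "u"}
CONSONANTS = {"b", "xc", "d", "xd", "f", "g", "h", "xj", "k", "l", "m", "n",
              "xg", "p", "r", "s", "xs", "t", "xt", "v", "w", "y", "z", "xz"}


def split_phonemes(syllable):
    """Split an ASCII-encoded syllable into phoneme codes.

    Assumption: a code is two characters long if and only if it starts with 'x'.
    """
    codes = []
    i = 0
    while i < len(syllable):
        width = 2 if syllable[i] == "x" else 1
        code = syllable[i:i + width]
        if code not in VOWELS and code not in CONSONANTS:
            raise ValueError("unknown phoneme code %r in %r" % (code, syllable))
        codes.append(code)
        i += width
    return codes


def parse_filename(name):
    """Return speaker, recording type and phoneme codes for a wavefile name."""
    stem = name[:-4] if name.endswith(".wav") else name
    speaker, rec_type, syllable = stem.split("_")
    if rec_type not in ("s", "p"):
        raise ValueError("unexpected recording type %r" % rec_type)
    return {
        "speaker": speaker,
        "sex": "female" if speaker.startswith("f") else "male",
        "type": "isolated" if rec_type == "s" else "sentence",
        "phonemes": split_phonemes(syllable),
    }


print(parse_filename("m112_s_xuxz.wav"))
# {'speaker': 'm112', 'sex': 'male', 'type': 'isolated', 'phonemes': ['xu', 'xz']}
```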


data/text/normal.txt
	Contains the text corresponding to all wavefiles whose content conforms to the standard format of the recordings (one file per line).

	Lines for recordings of isolated syllables are in the format:

		FILE-ID SYLLABLE (for example: m112_s_ak a:k)
	
	where FILE-ID is the name of the corresponding wavefile minus the '.wav' extension and SYLLABLE is in the format S1:S2, where S1 is the ASCII code (see above) for the first phoneme of the syllable and S2 the ASCII code for the second phoneme.
	
	Lines for recordings of syllables in carrier sentences are in the format:
	
		FILE-ID WORD1 WORD2 SYLLABLE WORD3 (for example: m112_p_ak everyone study a:k nightly)
	
	where FILE-ID and SYLLABLE are as for the isolated syllable recordings and:
	
			WORD1 is one of the following words:
				[I you we they someone noone everyone people]

			WORD2 is one of the following words:
				[see saw hear perceive think say said speak pronounce write record observe try understand attempt repeat describe detect determine distinguish echo evoke produce elicit prompt suggest utter imagine ponder check monitor recall remember recognize report use utilize review sense show note notice spell read examine study propose watch view witness]

			WORD3 is one of the following words:
				[now again often today well clearly entirely nicely precisely anyway daily weekly yearly hourly monthly always easily sometime twice more evenly fluently gladly happily neatly nightly only properly first second third fourth fifth sixth seventh eighth ninth tenth steadily surely typically usually wisely]
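
The two line formats of data/text/normal.txt can be told apart by their number of fields. The following is a minimal sketch under that assumption; `parse_text_line` is a hypothetical name, and the strict field counts are only expected to hold for normal.txt (lines in weird.txt may have missing words).

```python
def parse_text_line(line):
    """Parse one line of normal.txt into file id, carrier words and phonemes.

    Assumption: 2 fields means an isolated syllable, 5 fields a carrier sentence.
    """
    fields = line.split()
    if len(fields) == 2:
        file_id, syllable = fields
        words = None  # isolated syllables have no carrier words
    elif len(fields) == 5:
        file_id, word1, word2, syllable, word3 = fields
        words = (word1, word2, word3)
    else:
        raise ValueError("unexpected number of fields: %r" % line)
    first, second = syllable.split(":")  # SYLLABLE is written S1:S2
    return {"file_id": file_id, "words": words, "phonemes": (first, second)}


print(parse_text_line("m112_s_ak a:k"))
# {'file_id': 'm112_s_ak', 'words': None, 'phonemes': ('a', 'k')}
print(parse_text_line("m112_p_ak everyone study a:k nightly"))
# {'file_id': 'm112_p_ak', 'words': ('everyone', 'study', 'nightly'), 'phonemes': ('a', 'k')}
```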


data/text/weird.txt
	Contains the text corresponding to wavefiles that do not conform to the standard format of the recordings, but whose syllable part, at least, is correct (n=52 recordings). The format is the same as for data/text/normal.txt, except that the words in carrier sentences can be non-standard, mispronounced or missing altogether (see details in section D).
	Note that both "normal" and "weird" recordings are included in the annotations and speech folders.


data/annotations/alignments.txt
	Contains the time-alignments for the onset and offset of the syllable and words in each recording. Each line corresponds to a given word or syllable in a given recording and is in the format:

		FILE-ID WORD/SYLLABLE ONSET OFFSET (for example: m112_p_ak everyone 0.004 0.331, or: m112_p_ak a:k 0.801 1.132)

	where FILE-ID and WORD/SYLLABLE are in the same format as the file identifiers, words and syllables in data/text/normal.txt, and ONSET and OFFSET are times in seconds given with three digits after the decimal point (millisecond precision).
	
	Note that only the time-alignments for syllables have been manually adjusted.
	Note that both "normal" and "weird" recordings are included in alignments.txt (see data/text/weird.txt description above).

	Some statistics about the content of this file:
		Number of word types		133
		Number of word tokens		38634
		Number of syllable types	648
		Number of syllable tokens	25768
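
The alignment format above can be read into per-recording lists of (label, onset, offset) entries. The following is a minimal sketch; `read_alignments` and `syllable_duration` are hypothetical names, and identifying the syllable as the only label containing ':' is an assumption based on the S1:S2 notation.

```python
from collections import defaultdict


def read_alignments(lines):
    """Group (label, onset, offset) entries by FILE-ID, keeping the line order."""
    by_file = defaultdict(list)
    for line in lines:
        if not line.strip():
            continue  # tolerate blank lines
        file_id, label, onset, offset = line.split()
        by_file[file_id].append((label, float(onset), float(offset)))
    return dict(by_file)


def syllable_duration(entries):
    """Duration in seconds of the syllable entry (assumed to contain ':')."""
    for label, onset, offset in entries:
        if ":" in label:
            return round(offset - onset, 3)  # times have millisecond precision
    raise ValueError("no syllable entry found")


# The two example lines given above; any iterable of lines works,
# including an open file handle on data/annotations/alignments.txt.
example = [
    "m112_p_ak everyone 0.004 0.331",
    "m112_p_ak a:k 0.801 1.132",
]
alignments = read_alignments(example)
print(syllable_duration(alignments["m112_p_ak"]))  # 0.331
```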


###################################
D) MISSING AND CORRECTED RECORDINGS 
###################################

Missing and corrected recordings can be classified into the following 5 categories.

1. Recordings missing from the original corpus by design.

	The original corpus contains all possible CV and VC combinations except for the following, which were not considered possible American English syllables (they are written using the ASCII encoding for phonemes described in section C, plus 'C' or 'V' to refer to any consonant or any vowel):

		V+h, V+w, V+y, xg+V,
		V+r except for ar, er, ir, or, ur which are present in the corpus,
		rxr, yxu
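
As a consistency check, the exclusions above can be applied to the full CV and VC grid built from the phoneme map in section C. The following is a minimal sketch; `is_excluded` is a hypothetical helper name. The resulting count agrees with the number of syllable types reported for alignments.txt in section C.

```python
# Phoneme codes from the ASCII/IPA table in section C.
VOWELS = ["a", "xq", "xa", "c", "xw", "xy", "xr", "xe", "e",
          "xi", "i", "o", "xo", "xu", "u"]
CONSONANTS = ["b", "xc", "d", "xd", "f", "g", "h", "xj", "k", "l", "m", "n",
              "xg", "p", "r", "s", "xs", "t", "xt", "v", "w", "y", "z", "xz"]


def is_excluded(first, second):
    """True if the two-phoneme syllable is excluded by design."""
    if first in VOWELS and second in ("h", "w", "y"):
        return True  # V+h, V+w, V+y
    if first == "xg" and second in VOWELS:
        return True  # xg+V
    if first in VOWELS and second == "r":
        return first not in ("a", "e", "i", "o", "u")  # only ar, er, ir, or, ur kept
    return (first, second) in {("r", "xr"), ("y", "xu")}  # rxr, yxu


cv = [(c, v) for c in CONSONANTS for v in VOWELS if not is_excluded(c, v)]
vc = [(v, c) for v in VOWELS for c in CONSONANTS if not is_excluded(v, c)]
print(len(cv), len(vc), len(cv) + len(vc))  # 343 305 648
```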
	
2. Other recordings missing from the original corpus (n=146).

	f101_p_al.wav
	f101_p_cb.wav
	f101_p_cn.wav
	f101_p_cp.wav
	f101_p_cxd.wav
	f101_p_cxs.wav
	f101_p_fxo.wav
	f101_p_on.wav
	f101_p_xdxq.wav
	f101_p_xrxs.wav
	f101_p_xum.wav
	f101_p_xun.wav
	f101_p_xus.wav
	f101_p_xuxd.wav
	f101_p_xuxj.wav
	f101_p_xwxd.wav
	f101_p_yxr.wav
	f101_p_yxy.wav
	f101_s_al.wav
	f101_s_cb.wav
	f101_s_cn.wav
	f101_s_cp.wav
	f101_s_cxd.wav
	f101_s_cxs.wav
	f101_s_fxo.wav
	f101_s_on.wav
	f101_s_xdxq.wav
	f101_s_xrxs.wav
	f101_s_xum.wav
	f101_s_xun.wav
	f101_s_xus.wav
	f101_s_xuxd.wav
	f101_s_xuxj.wav
	f101_s_xwxd.wav
	f101_s_yxr.wav
	f101_s_yxy.wav

	f103_p_xum.wav
	f103_p_xun.wav
	f103_p_xuxt.wav
	f103_s_xum.wav
	f103_s_xun.wav
	f103_s_xuxt.wav

	m104_p_cm.wav
	m104_s_cm.wav

	f106_p_mxu.wav
	f106_p_oxs.wav
	f106_p_xtxu.wav
	f106_p_xul.wav
	f106_s_mxu.wav
	f106_s_oxs.wav
	f106_s_xtxu.wav
	f106_s_xul.wav

	m107_p_axz.wav
	m107_p_tc.wav
	m107_p_xus.wav
	m107_p_xuxd.wav
	m107_p_xzxq.wav
	m107_s_axz.wav
	m107_s_tc.wav
	m107_s_xus.wav
	m107_s_xuxd.wav
	m107_s_xzxq.wav

	m110_p_xjxu.wav
	m110_s_xjxu.wav

	m111_p_xqxz.wav
	m111_s_xqxz.wav

	f113_p_axz.wav
	f113_p_exc.wav
	f113_p_exs.wav
	f113_p_ext.wav
	f113_p_exz.wav
	f113_p_ob.wav
	f113_p_oxg.wav
	f113_p_rxu.wav
	f113_p_rxw.wav
	f113_p_uxd.wav
	f113_p_vxr.wav
	f113_p_xcxu.wav
	f113_p_xda.wav
	f113_p_xdxu.wav
	f113_p_xip.wav
	f113_p_xjxu.wav
	f113_p_xon.wav
	f113_p_xoxj.wav
	f113_p_xrxd.wav
	f113_p_xrxg.wav
	f113_p_xrxz.wav
	f113_p_xtxi.wav
	f113_p_xun.wav
	f113_p_xup.wav
	f113_p_xuv.wav
	f113_p_xuxt.wav
	f113_p_xuz.wav
	f113_p_xwm.wav
	f113_p_xwn.wav
	f113_p_xwxg.wav
	f113_p_xwxs.wav
	f113_p_xyxt.wav
	f113_p_xyxz.wav
	f113_p_xzxe.wav
	f113_p_ya.wav
	f113_p_yxr.wav
	f113_p_za.wav
	f113_s_axz.wav
	f113_s_exc.wav
	f113_s_exs.wav
	f113_s_ext.wav
	f113_s_exz.wav
	f113_s_ob.wav
	f113_s_oxg.wav
	f113_s_rxu.wav
	f113_s_rxw.wav
	f113_s_uxd.wav
	f113_s_vxr.wav
	f113_s_xcxu.wav
	f113_s_xda.wav
	f113_s_xdxu.wav
	f113_s_xip.wav
	f113_s_xjxu.wav
	f113_s_xon.wav
	f113_s_xoxj.wav
	f113_s_xrxd.wav
	f113_s_xrxg.wav
	f113_s_xrxz.wav
	f113_s_xtxi.wav
	f113_s_xun.wav
	f113_s_xup.wav
	f113_s_xuv.wav
	f113_s_xuxt.wav
	f113_s_xuz.wav
	f113_s_xwm.wav
	f113_s_xwn.wav
	f113_s_xwxg.wav
	f113_s_xwxs.wav
	f113_s_xyxt.wav
	f113_s_xyxz.wav
	f113_s_xzxe.wav
	f113_s_ya.wav
	f113_s_yxr.wav
	f113_s_za.wav

	m116_p_axs.wav
	m116_s_axs.wav

	m117_p_cm.wav
	m117_p_cxj.wav
	m117_s_cm.wav
	m117_s_cxj.wav

3. Recordings present in the original corpus but removed from the LSCP version.

	- Any syllable that was not a CV or VC syllable was removed

	- An xh phoneme corresponding to IPA /ʍ/ (for example, as at the beginning of the word 'what') was sometimes used in the original corpus but was not recorded systematically, so we did not include it in the LSCP version
	
	- Recordings where the syllable part was not correct (n=6):	
		m102_p_xsxi.wav
		m107_p_al.wav
		m107_p_gxo.wav
		m107_p_xtc.wav
		m107_p_xyxt.wav
		f113_p_ib.wav
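
The counts given in this file can be reconciled with simple arithmetic. The following is a minimal sketch, assuming that the 648 syllable types from section C were each planned once per speaker and per recording type, and that the losses listed in categories 2 and 3 are the only ones affecting CV and VC syllables (the variable names are illustrative).

```python
syllable_types = 648       # syllable types reported in section C
speakers = 20
recording_types = 2        # isolated ('s') and carrier sentence ('p')

planned = syllable_types * speakers * recording_types  # 25920
missing_from_original = 146   # category 2
removed_bad_syllable = 6      # category 3, syllable part not correct

remaining = planned - missing_from_original - removed_bad_syllable
print(remaining)  # 25768
```

The result equals the number of wavefiles and syllable tokens stated elsewhere in this file.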

4. Recordings present in the LSCP version but categorized as weird (see description of data/text/weird.txt in section C). These do not match the standard format for recordings in the corpus but still have a usable syllable part.

	Weird recordings (n=52):
		- wrong grammar (n=3):
			- f106_p_ixd everyone view i:xd
			- m104_p_gc g:c always
			- m117_p_xyxg think xy:xg again
		- mispronunciation/hesitation/cutting-problem (n=14):
			- f101_p_rxo everyoone watch r:xo usually ('everyoone')
			- f105_p_la everyonre review l:a today ('everyonre')
			- m112_p_dxw noone pronounce d:xw insteadily ('insteadily')
			- m102_p_gxe someonoo propose g:xe steadily ('someonoo')
			- m107_p_wa you distinguish w:a centh ('centh')
			- m112_p_oxz everyone try o:xz asecond ('asecond')
			- m112_p_xixt noone elicit xi:xt oonicely ('oonicely')
			- m114_p_axd we yeview a:xd nightly ('yeview')
			- m114_p_xal noone imagineh xa:l today ('imagineh')
			- m114_p_xep i promptu xe:p today ('promptu')
			- m114_p_xis i recall xi:s tclearly ('tclearly')
			- m115_p_cg wev remember c:g properly ('wev')
			- m116_p_fxo i think f:xo seighth ('seighth')
			- m116_p_wa they notice w:a stea ('stea')
		- out-of-grammar word (n=11):	
			- f113_p_ku someone echo k:u instead ('instead')
			- f113_p_po they sneak p:o daily ('sneak')
			- f113_p_vxi people saw v:xi early ('early')
			- f113_p_xed everyone echo xe:d early ('early')
			- m111_p_xexg they fear xe:xg often ('fear')
			- m112_p_xuf we describe xu:f frequently ('frequently')
			- m115_p_xwn they ponder xw:n quickly ('quickly')
			- m116_p_fu someone monitor f:u lately ('lately')
			- m116_p_lxy some distinguish l:xy now ('some')	
			- m116_p_uxg people smell u:xg typically ('smell')
			- m116_p_xep they cheer xe:p eighth ('cheer')	
		- added 's' or 'ed' (n=4):
			- f106_p_bxy someone writes b:xy always ('writes')
			- m107_p_ct people detect c:t anyways ('anyways')
			- m112_p_tc noone pronounces t:c fifth ('pronounces')
			- m116_p_fxu i remember f:xu sometimes ('sometimes')
		- unwanted past tense (n=2):
			- m112_p_hxw people noticed h:xw hourly ('noticed')
			- m117_p_cg they understood c:g typically ('understood')
		- weird pronunciation 'noone' as 'noon' (n=18):
			- m107_p_am noon perceive a:m ninth
			- m107_p_ek noon evoke e:k hourly
			- m107_p_et noon show e:t now
			- m107_p_oz noon ponder o:z second
			- m107_p_po noon prompt p:o often
			- m107_p_wxw noon observe w:xw often
			- m107_p_xes noon read xe:s seventh
			- m107_p_xet noon pronounce xe:t weekly
			- m107_p_xoxs noon say xo:xs always
			- m107_p_xys noon review xy:s clearly
			- m107_p_zc noon record z:c properly
			- m116_p_ra noon detect r:a fluently
			- m116_p_ul noon write u:l happily
			- m116_p_vo noon echo v:o neatly
			- m116_p_xcu noon recall xc:u eighth
			- m116_p_xet noon prompt xe:t properly
			- m116_p_xja noon use xj:a well
			- m116_p_xrxj noon note xr:xj sixth

5. Recordings present in the LSCP version, categorized as normal, but for which a correction was made (mainly wrong words in carrier sentences, complete description not available).

	- A large number (300+) of syllables within carrier sentences had the syllable part correctly labeled but not the word part (probably due to some mixup in prompt generation), while still conforming to the standard format for sentences (see entry for data/text/normal.txt in section C). We relabeled these recordings and included them with the 'normal' part of the corpus.