####################################################
##### ARTICULATION INDEX CORPUS - LSCP version #####
####################################################

###############
A) INTRODUCTION
###############

The ARTICULATION INDEX CORPUS - LSCP version (AILSCP) was assembled from a subset of the original ARTICULATION INDEX corpus (AIC) distributed by the LDC (https://catalog.ldc.upenn.edu/LDC2005S22). See the online documentation for the original corpus at https://catalog.ldc.upenn.edu/docs/LDC2005S22/.

20 speakers of American English (12 males, 8 females) were recorded while they pronounced syllables, some of which form actual words, but most of which are nonsense syllables. All possible Consonant-Vowel (CV) and Vowel-Consonant (VC) combinations were recorded twice for each speaker:
  - once in isolation
  - once within a carrier sentence with the following structure: WORD1 WORD2 SYLLABLE WORD3
for a total of 25768 recorded syllables.

#######################################
B) DIFFERENCES FROM THE ORIGINAL CORPUS
#######################################

1 - The original AIC contains recordings of some triphones (CVC, CCV or VCC), which are not included in the LSCP version.

2 - Time-alignments for the onset and offset of each word and syllable were obtained through forced alignment with a standard HMM-GMM ASR system.

3 - The time-alignments for the beginning and end of the syllables (whether in isolation or within a carrier sentence) were manually adjusted. The time-alignments for the other words in carrier sentences were NOT manually adjusted.

4 - The recordings of isolated syllables were cut according to the manual time-alignments to remove the silent portions at the beginning and end (and the time-alignments were altered to correspond to the cut recordings).

5 - The naming scheme for the files was slightly altered for compatibility with the kaldi speech recognition toolkit (http://kaldi.sourceforge.net/): the symbol indicating the type of recording (isolated or within sentence) and the speaker identifier were swapped in the filenames (see section C of this file).

6 - The original AIC contains a wide-band (16 kHz, 16-bit PCM) and a narrow-band (8 kHz, 8-bit u-law) version of the recordings, distributed in sphere (.spn) format. The LSCP version only contains the wide-band version, distributed as wavefiles (.wav).

7 - Several files from the original corpus were problematic for a variety of reasons; they were corrected when possible and otherwise removed from the corpus (see section D of this file). Some recordings (n=52) which did not conform to the standard format of the corpus, but still had a usable syllable part, were included in the corpus. These recordings are included in the data/speech and data/annotations folders without differentiating them from the others, but they are tagged as 'weird' stimuli in the data/text/weird.txt file.

####################
C) FILES AND FORMATS
####################

Organization of the corpus folder:

  doc/
  doc/readme.txt
  data/
  data/speech/
  ... (Contains 25768 wavefiles)
  data/text/
  data/text/normal.txt
  data/text/weird.txt
  data/annotations/
  data/annotations/alignments.txt

doc/readme.txt:
This file.

data/speech:
Contains all the recordings as wavefiles (mono, 16 kHz, 16-bit PCM encoding). Note that the wavefiles for both "normal" and "weird" recordings are included in this folder (see data/text/weird.txt description below).

Example filename: m112_s_xuxz.wav

There are 3 parts in each filename:

  1. An identifier for the speaker ('m112' in the example).
     List of the 20 speaker identifiers (beginning with 'f' for females and 'm' for males):
       f101 m102 f103 m104 f105 f106 m107 f108 f109 m110
       m111 m112 f113 m114 m115 m116 m117 m118 f119 m120

  2. An identifier for the type of recording ('s' for isolated syllables and 'p' for syllables within a carrier sentence).

  3. The syllable ('xuxz' in the example), encoded using the following map between phonemes and ASCII characters:

       ASCII   IPA   Example
       a       ɑː    bott
       xq      æ     bat
       xa      ʌ     but
       c       ɔː    bought
       xw      aʊ    bout
       xy      aɪ    bite
       xr      ɝ     bird
       xe      ɛ     bet
       e       eɪ    bait
       xi      ɪ     bit
       i       iː    beet
       o       oʊ    boat
       xo      ɔɪ    boy
       xu      ʊ     book
       u       uː    boot
       b       b     bee
       xc      ʧ     choke
       d       d     day
       xd      ð     then
       f       f     fin
       g       g     gay
       h       h     hay
       xj      ʤ     joke
       k       k     key
       l       l     lay
       m       m     mom
       n       n     noon
       xg      ŋ     sing
       p       p     pea
       r       r     ray
       s       s     sea
       xs      ʃ     she
       t       t     tea
       xt      θ     thin
       v       v     van
       w       w     way
       y       j     yacht
       z       z     zone
       xz      ʒ     azure

data/text/normal.txt:
Contains the text corresponding to all wavefiles whose content conforms to the standard format of the recordings (one file per line).

Lines for isolated syllable recordings are in the format:
  FILE-ID SYLLABLE
  (for example: m112_s_ak a:k)
where FILE-ID is the name of the corresponding wavefile minus the '.wav' extension and SYLLABLE is in the format S1:S2, where S1 is the ASCII code (see above) for the first phoneme of the syllable and S2 the ASCII code for the second phoneme.
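
The filename scheme, the phoneme map and the isolated-syllable line format described above can be sketched in Python as follows. This is only an illustrative sketch and not part of the corpus distribution: the function names (parse_file_id, parse_isolated_line) and the returned dictionary layout are assumptions made for the example. The phoneme map is copied from the table above.

```python
# Illustrative sketch only: helper names and dict layout are hypothetical.
# The ASCII-to-IPA map is copied from the table in this readme.
ASCII_TO_IPA = {
    "a": "ɑː", "xq": "æ", "xa": "ʌ", "c": "ɔː", "xw": "aʊ", "xy": "aɪ",
    "xr": "ɝ", "xe": "ɛ", "e": "eɪ", "xi": "ɪ", "i": "iː", "o": "oʊ",
    "xo": "ɔɪ", "xu": "ʊ", "u": "uː",
    "b": "b", "xc": "ʧ", "d": "d", "xd": "ð", "f": "f", "g": "g",
    "h": "h", "xj": "ʤ", "k": "k", "l": "l", "m": "m", "n": "n",
    "xg": "ŋ", "p": "p", "r": "r", "s": "s", "xs": "ʃ", "t": "t",
    "xt": "θ", "v": "v", "w": "w", "y": "j", "z": "z", "xz": "ʒ",
}

RECORDING_TYPES = {"s": "isolated", "p": "sentence"}


def parse_file_id(file_id):
    """Split a FILE-ID such as 'm112_s_xuxz' into its three parts."""
    speaker, rec_type, syllable = file_id.split("_")
    return {
        "speaker": speaker,
        "sex": "female" if speaker.startswith("f") else "male",
        "type": RECORDING_TYPES[rec_type],
        "syllable": syllable,
    }


def parse_isolated_line(line):
    """Parse one isolated-syllable line ('FILE-ID S1:S2') of normal.txt."""
    file_id, syllable = line.split()
    s1, s2 = syllable.split(":")
    info = parse_file_id(file_id)
    info["phonemes"] = (s1, s2)
    # Decode both ASCII phoneme codes into IPA using the table above.
    info["ipa"] = ASCII_TO_IPA[s1] + ASCII_TO_IPA[s2]
    return info


entry = parse_isolated_line("m112_s_ak a:k")
print(entry["speaker"], entry["type"], entry["ipa"])
```

Using the S1:S2 field rather than the filename to recover the two phonemes avoids having to segment strings such as 'xuxz' into 'xu' and 'xz'.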

Lines for recordings of syllables in carrier sentences are in the format:
  FILE-ID WORD1 WORD2 SYLLABLE WORD3
  (for example: m112_p_ak everyone study a:k nightly)
where FILE-ID and SYLLABLE are as for the isolated syllable recordings and:

  WORD1 is one of the following words:
  [I you we they someone noone everyone people]

  WORD2 is one of the following words:
  [see saw hear perceive think say said speak pronounce write record observe try understand attempt repeat describe detect determine distinguish echo evoke produce elicit prompt suggest utter imagine ponder check monitor recall remember recognize report use utilize review sense show note notice spell read examine study propose watch view witness]

  WORD3 is one of the following words:
  [now again often today well clearly entirely nicely precisely anyway daily weekly yearly hourly monthly always easily sometime twice more evenly fluently gladly happily neatly nightly only properly first second third fourth fifth sixth seventh eighth ninth tenth steadily surely typically usually wisely]

data/text/weird.txt:
Contains the text corresponding to wavefiles that do not conform to the standard format of the recordings, but whose syllable part, at least, is correct (n=52 recordings). The format is the same as for data/text/normal.txt, except that the words in carrier sentences can be non-standard, mispronounced or missing altogether (see details in section D). Note that both "normal" and "weird" recordings are included in the annotations and speech folders.

data/annotations/alignments.txt:
Contains the time-alignments for the onset and offset of the syllable and words in each recording.
Each line corresponds to a given word or syllable in a given recording and is in the format:
  FILE-ID WORD/SYLLABLE ONSET OFFSET
  (for example: m112_p_ak everyone 0.004 0.331, or: m112_p_ak a:k 0.801 1.132)
where FILE-ID and WORD/SYLLABLE are in the same format as file identifiers, words and syllables in data/text/normal.txt, and ONSET and OFFSET are times given in seconds with three digits after the decimal point (millisecond precision).
Note that only the time-alignments for syllables have been manually adjusted.
Note that both "normal" and "weird" recordings are included in alignments.txt (see data/text/weird.txt description above).

Some statistics about the content of this file:
  Number of word types        133
  Number of word tokens       38634
  Number of syllable types    648
  Number of syllable tokens   25768

###################################
D) MISSING AND CORRECTED RECORDINGS
###################################

Missing and corrected recordings can be classified into the following 5 categories.

1. Recordings missing from the original corpus by design.

The original corpus contains all possible CV and VC combinations except for the following, which were considered not to be possible American English syllables (they are written using the ASCII encoding for phonemes described in section C, plus 'C' or 'V' to refer to any consonant or any vowel):
  V+h, V+w, V+y, xg+V, V+r (except for ar, er, ir, or, ur, which are present in the corpus), rxr, yxu

2. Other recordings missing from the original corpus (n=146):
f101_p_al.wav f101_p_cb.wav f101_p_cn.wav f101_p_cp.wav f101_p_cxd.wav f101_p_cxs.wav f101_p_fxo.wav f101_p_on.wav f101_p_xdxq.wav f101_p_xrxs.wav f101_p_xum.wav f101_p_xun.wav f101_p_xus.wav f101_p_xuxd.wav f101_p_xuxj.wav f101_p_xwxd.wav f101_p_yxr.wav f101_p_yxy.wav f101_s_al.wav f101_s_cb.wav f101_s_cn.wav f101_s_cp.wav f101_s_cxd.wav f101_s_cxs.wav f101_s_fxo.wav f101_s_on.wav f101_s_xdxq.wav f101_s_xrxs.wav f101_s_xum.wav f101_s_xun.wav f101_s_xus.wav f101_s_xuxd.wav f101_s_xuxj.wav f101_s_xwxd.wav f101_s_yxr.wav f101_s_yxy.wav f103_p_xum.wav f103_p_xun.wav f103_p_xuxt.wav f103_s_xum.wav f103_s_xun.wav f103_s_xuxt.wav m104_p_cm.wav m104_s_cm.wav f106_p_mxu.wav f106_p_oxs.wav f106_p_xtxu.wav f106_p_xul.wav f106_s_mxu.wav f106_s_oxs.wav f106_s_xtxu.wav f106_s_xul.wav m107_p_axz.wav m107_p_tc.wav m107_p_xus.wav m107_p_xuxd.wav m107_p_xzxq.wav m107_s_axz.wav m107_s_tc.wav m107_s_xus.wav m107_s_xuxd.wav m107_s_xzxq.wav m110_p_xjxu.wav m110_s_xjxu.wav m111_p_xqxz.wav m111_s_xqxz.wav f113_p_axz.wav f113_p_exc.wav f113_p_exs.wav f113_p_ext.wav f113_p_exz.wav f113_p_ob.wav f113_p_oxg.wav f113_p_rxu.wav f113_p_rxw.wav f113_p_uxd.wav f113_p_vxr.wav f113_p_xcxu.wav f113_p_xda.wav f113_p_xdxu.wav f113_p_xip.wav f113_p_xjxu.wav f113_p_xon.wav f113_p_xoxj.wav f113_p_xrxd.wav f113_p_xrxg.wav f113_p_xrxz.wav f113_p_xtxi.wav f113_p_xun.wav f113_p_xup.wav f113_p_xuv.wav f113_p_xuxt.wav f113_p_xuz.wav f113_p_xwm.wav f113_p_xwn.wav f113_p_xwxg.wav f113_p_xwxs.wav f113_p_xyxt.wav f113_p_xyxz.wav f113_p_xzxe.wav f113_p_ya.wav f113_p_yxr.wav f113_p_za.wav f113_s_axz.wav f113_s_exc.wav f113_s_exs.wav f113_s_ext.wav f113_s_exz.wav f113_s_ob.wav f113_s_oxg.wav f113_s_rxu.wav f113_s_rxw.wav f113_s_uxd.wav f113_s_vxr.wav f113_s_xcxu.wav f113_s_xda.wav f113_s_xdxu.wav f113_s_xip.wav f113_s_xjxu.wav f113_s_xon.wav f113_s_xoxj.wav f113_s_xrxd.wav f113_s_xrxg.wav f113_s_xrxz.wav f113_s_xtxi.wav f113_s_xun.wav f113_s_xup.wav f113_s_xuv.wav f113_s_xuxt.wav f113_s_xuz.wav f113_s_xwm.wav 
f113_s_xwn.wav f113_s_xwxg.wav f113_s_xwxs.wav f113_s_xyxt.wav f113_s_xyxz.wav f113_s_xzxe.wav f113_s_ya.wav f113_s_yxr.wav f113_s_za.wav m116_p_axs.wav m116_s_axs.wav m117_p_cm.wav m117_p_cxj.wav m117_s_cm.wav m117_s_cxj.wav

3. Recordings present in the original corpus but removed from the LSCP version.

  - Any syllable that was not a CV or VC syllable was removed.
  - An xh phoneme corresponding to IPA /ʍ/ (for example, as in the beginning of the word 'what') was sometimes used in the original corpus, but it was not recorded systematically and we did not include it in the LSCP version.
  - Recordings where the syllable part was not correct (n=6):
      m102_p_xsxi.wav m107_p_al.wav m107_p_gxo.wav m107_p_xtc.wav m107_p_xyxt.wav f113_p_ib.wav

4. Recordings present in the LSCP version but categorized as weird (see description of data/text/weird.txt in section C). These do not match the standard format for recordings in the corpus but still have a usable syllable part.

Weird recordings (n=52):

  - wrong grammar (n=3):
    - f106_p_ixd everyone view i:xd
    - m104_p_gc g:c always
    - m117_p_xyxg think xy:xg again

  - mispronunciation/hesitation/cutting-problem (n=14):
    - f101_p_rxo everyoone watch r:xo usually ('everyoone')
    - f105_p_la everyonre review l:a today ('everyonre')
    - m112_p_dxw noone pronounce d:xw insteadily ('insteadily')
    - m102_p_gxe someonoo propose g:xe steadily ('someonoo')
    - m107_p_wa you distinguish w:a centh ('centh')
    - m112_p_oxz everyone try o:xz asecond ('asecond')
    - m112_p_xixt noone elicit xi:xt oonicely ('oonicely')
    - m114_p_axd we yeview a:xd nightly ('yeview')
    - m114_p_xal noone imagineh xa:l today ('imagineh')
    - m114_p_xep i promptu xe:p today ('promptu')
    - m114_p_xis i recall xi:s tclearly ('tclearly')
    - m115_p_cg wev remember c:g properly ('wev')
    - m116_p_fxo i think f:xo seighth ('seighth')
    - m116_p_wa they notice w:a stea ('stea')

  - out-of-grammar word (n=11):
    - f113_p_ku someone echo k:u instead ('instead')
    - f113_p_po they sneak p:o daily ('sneak')
    - f113_p_vxi people saw v:xi early ('early')
    - f113_p_xed everyone echo xe:d early ('early')
    - m111_p_xexg they fear xe:xg often ('fear')
    - m112_p_xuf we describe xu:f frequently ('frequently')
    - m115_p_xwn they ponder xw:n quickly ('quickly')
    - m116_p_fu someone monitor f:u lately ('lately')
    - m116_p_lxy some distinguish l:xy now ('some')
    - m116_p_uxg people smell u:xg typically ('smell')
    - m116_p_xep they cheer xe:p eighth ('cheer')

  - added 's' or 'ed' (n=4):
    - f106_p_bxy someone writes b:xy always ('writes')
    - m107_p_ct people detect c:t anyways ('anyways')
    - m112_p_tc noone pronounces t:c fifth ('pronounces')
    - m116_p_fxu i remember f:xu sometimes ('sometimes')

  - unwanted past tense (n=2):
    - m112_p_hxw people noticed h:xw hourly ('noticed')
    - m117_p_cg they understood c:g typically ('understood')

  - weird pronunciation of 'noone' as 'noon' (n=18):
    - m107_p_am noon perceive a:m ninth
    - m107_p_ek noon evoke e:k hourly
    - m107_p_et noon show e:t now
    - m107_p_oz noon ponder o:z second
    - m107_p_po noon prompt p:o often
    - m107_p_wxw noon observe w:xw often
    - m107_p_xes noon read xe:s seventh
    - m107_p_xet noon pronounce xe:t weekly
    - m107_p_xoxs noon say xo:xs always
    - m107_p_xys noon review xy:s clearly
    - m107_p_zc noon record z:c properly
    - m116_p_ra noon detect r:a fluently
    - m116_p_ul noon write u:l happily
    - m116_p_vo noon echo v:o neatly
    - m116_p_xcu noon recall xc:u eighth
    - m116_p_xet noon prompt xe:t properly
    - m116_p_xja noon use xj:a well
    - m116_p_xrxj noon note xr:xj sixth

5. Recordings present in the LSCP version, categorized as normal, but for which a correction was made (mainly wrong words in carrier sentences; a complete description is not available).

  - A large number (300+) of syllables within carrier sentences had the syllable part correctly labeled but not the word part (probably due to some mix-up in prompt generation), while still conforming to the standard format for sentences (see entry for data/text/normal.txt in section C). We relabeled these recordings and included them in the 'normal' part of the corpus.
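
As a usage note for the data/annotations/alignments.txt format described in section C, the following Python sketch extracts the manually adjusted syllable interval of each recording. It is an illustration under stated assumptions, not a tool shipped with the corpus: the function names are hypothetical, and syllable entries are identified by the ':' in their S1:S2 label, on the assumption that no word label contains a colon.

```python
# Illustrative sketch only: function names are hypothetical.
def parse_alignment_line(line):
    """Split one alignments.txt line into (file_id, label, onset, offset)."""
    file_id, label, onset, offset = line.split()
    return file_id, label, float(onset), float(offset)


def syllable_intervals(lines):
    """Return {file_id: (onset, offset)} for the syllable of each recording."""
    intervals = {}
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        file_id, label, onset, offset = parse_alignment_line(line)
        # Assumption: only syllable labels (S1:S2) contain a colon.
        if ":" in label:
            intervals[file_id] = (onset, offset)
    return intervals


# The two example lines given in section C of this file.
example = [
    "m112_p_ak everyone 0.004 0.331",
    "m112_p_ak a:k 0.801 1.132",
]
intervals = syllable_intervals(example)
onset, offset = intervals["m112_p_ak"]
print(f"syllable duration: {offset - onset:.3f} s")
```

Selecting the syllable by its label rather than by its position in the sentence also works for the 'weird' recordings, where carrier words can be missing.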