FORCED ALIGNMENT AND ANOMALY DETECTION
(notes by Jack Mostow, revised 6/19/97)

Utterances were time-aligned against transcripts to find segment boundaries,
both to locate spoken words and phones, and to detect anomalous data.


ALIGNMENT METHOD

Each label file contains the forced alignment, produced by Sphinx-II, of its
signal file against its transcript file.  Alignment was performed using the
forced-alignment mode of the Sphinx-II speech recognizer, developed by Eric
Thayer and Ravishankar Mosur.  The acoustic models used in alignment were
trained on adult female speech for the ATIS task.  To improve accuracy,
codebook means were adapted by Alex Hauptmann using a small pilot corpus of
fluent oral reading by twelve children, collected by Maxine Eskenazi.


LEXICON

The pronunciation dictionary used in the time alignment was adapted from the
CMU Speech Group's cmudict.0.4 dictionary, which can be obtained free of
charge via the World Wide Web or anonymous ftp.

The web URL for the CMU Pronouncing Dictionary is:

    http://www.speech.cs.cmu.edu/cgi-bin/cmudict

The address for anonymous ftp access is:

    ftp://ftp.cs.cmu.edu/project/fgdata/dict/cmudict.0.4.Z

(That is, connect to host "ftp.cs.cmu.edu", use "anonymous" as the login name
and your email address as the password, then go to the directory
"project/fgdata/dict" and retrieve the file cmudict.0.4.Z.)

By either method, additional information about the dictionary and other
resources available from CMU is provided in accompanying readme files and
supplemental web pages.

For the purpose of word alignment on the KIDS speech collection, the CMUDICT
lexicon was augmented with entries for 20 Weekly Reader words it did not
include at the time the alignment was carried out, together with definitions
of the phone and noise symbols used in the transcripts.  The adapted version
of the lexicon is provided as part of the present corpus, in the "tables"
directory (file name: alignmnt.dic).

The word entries added were:

    ASTRONAUTS'     AE S T R AH N AO T S
    CHEETAHS        CH IY T AH Z
    CHIPETAS        CH IH P AY T AH S
    FROG'S          F R AA G Z
    GET-WELL        G EH T W EH L
    HABS            HH AE B Z
    HIPBONE         HH IH P B OW N
    HOME-SCHOOL     HH OW M S K UW L
    ICEFISH         AY S F IH SH
    MEAT-EATING     M IY T IY T IH NG
    MUNCHED         M AH N CH TD
    PUPA            P Y UW P AX
    SHEATHBILLS     SH IY TH B IH L Z
    STEGOSAURUS     S T EH G AA S AO R AX S
    SUPERHEROES     S UW P AXR HH IH R OW Z
    SUPERHEROES(2)  S UW P AXR HH IY R OW Z
    TLINGIT         T L IH NG G IH T
    TREETOPS        T R IY T AA P S
    TRICERATOPS     T R AY S EH R AX T AA P S
    TV              T IY V IY
    TYRANNOSAURUS   T AY R AE N AA S AO R AX S

The following set of phones was used:

    AA AE AH AO AW AX AXR AY B BD CH D DD DH DX EH ER EY F G GD HH IH IX
    IY JH K KD L M N NG OW OY P PD R S SH T TD TH TS UH UW V W Y Z ZH

The following phones denote utterance-initial, -medial, and -final silences:

    SILb SIL SILe


LEXICAL ENTRIES FOR PHONES

To facilitate forced alignment of transcripts combining words, phones, and
noises, the lexicon was augmented with entries for phones and noises.  Since
phonetically spelled transcriptions were delimited by the "/" character,
lexical items were added for each phone according to where it could occur.
To illustrate, here are the lexical items added for occurrences of the phone
AX at the start, middle, or end of a phone sequence, or by itself (a short
generation sketch follows the table):

    /AX         AX
    AX(/AX/)    AX
    AX/         AX
    /AX/        AX
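These entries follow a fixed pattern, so the corresponding items for the
entire phone set could be produced by a short script.  The Python sketch
below illustrates the pattern only; it is not the original lexicon-building
tool (which is not included in the corpus), and the names PHONES and
phone_entries are illustrative.

    # Sketch: generate the four lexical entries added for each phone,
    # covering sequence-initial, -medial, -final, and isolated occurrences
    # of phonetically spelled transcriptions.
    PHONES = ["AA", "AE", "AH", "AO", "AW", "AX"]  # extend with the full phone set listed above

    def phone_entries(phone):
        """Return (lexicon item, pronunciation) pairs for one phone."""
        return [
            ("/" + phone,                 phone),  # sequence-initial, e.g. /AX
            (phone + "(/" + phone + "/)", phone),  # sequence-medial,  e.g. AX(/AX/)
            (phone + "/",                 phone),  # sequence-final,   e.g. AX/
            ("/" + phone + "/",           phone),  # isolated,         e.g. /AX/
        ]

    for p in PHONES:
        for item, pron in phone_entries(p):
            print(item + "\t" + pron)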
Parenthesized identifiers in the lexicon distinguish alternative
pronunciations for a given lexicon entry.  In the label files, these
identifiers show which pronunciation Sphinx chose as the best match for a
given transcribed word, phone, or noise event.  Thus the parenthesized
annotation in "AX(/AX/)" distinguishes the lexical entry for the phone /AX/
from the English word "ax":

    AX          AE K S

For implementation reasons (namely, that Sphinx requires the lexicon to
include a non-annotated base pronunciation for each word), the dummy
impossible pronunciation "K P T" was added for phones that are not words,
e.g.:

    AA          K P T
    ...
    ZH          K P T


LEXICAL ENTRIES FOR EVENTS AND REGIONS

Bracketed symbols in transcripts represent various types of events and
regions.  For example, "[noise]" denotes an individual noise, while
"[begin_noise] ... [end_noise]" indicates a noisy region.

Event symbols include phenomena not simultaneous with transcribed speech:

    [crosstalk]        -- off-microphone speech by another speaker
    [human_noise]      -- non-speech sound produced by the speaker
    [microphone_noise] -- noise produced by touching the microphone
    [noise]            -- any noise
    [see_transcript]   -- used in label files to indicate other transcribed events
    [sil]              -- silence as long as two or more typical syllables for this speaker
    [whisper]          -- whispered speech not otherwise transcribed

Region delimiters denote noise and other phenomena concurrent with speech:

    [begin_crosstalk_noise] ... [end_crosstalk_noise]   -- someone else speaking too
    [begin_microphone_noise] ... [end_microphone_noise] -- microphone being touched
    [begin_noise] ... [end_noise]                       -- interval of noise simultaneous with speech
    [begin_whisper] ... [end_whisper]                   -- whispered interval of transcribed speech

Transcribers were also allowed to label noise events and regions they could
identify, e.g. [lipsmack].

To support forced alignment of noise events and regions, lexical entries for
these symbols were added as shown below.  For convenience of implementation,
region delimiters such as "[begin_noise]" and "[end_noise]", which strictly
speaking ought to have zero duration, were defined as silences so that they
appear in the label files produced by forced alignment.

To accommodate additional transcript symbols besides those listed above
without continually expanding the lexicon, all other transcript symbols were
translated to the catch-all symbol "[see_transcript]" prior to alignment.
Where [SEE_TRANSCRIPT] occurs in the label files, see the corresponding
transcript for the original symbol.  There are fewer than 200 such
occurrences.
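The translation to [see_transcript] is a simple symbol-rewriting step.  The
Python sketch below shows one way it could be done; it is not the original
preprocessing code, and KNOWN_SYMBOLS and normalize_transcript are
illustrative names.

    import re

    # Sketch: map any bracketed transcript symbol not in the alignment
    # lexicon to the catch-all [see_transcript] before alignment.
    KNOWN_SYMBOLS = {
        "[crosstalk]", "[human_noise]", "[microphone_noise]", "[noise]",
        "[see_transcript]", "[sil]", "[whisper]",
        "[begin_crosstalk_noise]", "[end_crosstalk_noise]",
        "[begin_microphone_noise]", "[end_microphone_noise]",
        "[begin_noise]", "[end_noise]",
        "[begin_whisper]", "[end_whisper]",
    }

    def normalize_transcript(line):
        """Replace unknown bracketed symbols with [see_transcript]."""
        def replace(match):
            symbol = match.group(0)
            return symbol if symbol.lower() in KNOWN_SYMBOLS else "[see_transcript]"
        return re.sub(r"\[[^\]]+\]", replace, line)

    # e.g. normalize_transcript("[lipsmack] butterflies")
    #      -> "[see_transcript] butterflies"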
The lexicon was augmented with the following entries for events and regions:

    [BEGIN_CROSSTALK_NOISE]         SIL
    [BEGIN_MICROPHONE_NOISE]        SIL
    [BEGIN_NOISE]                   SIL
    [BEGIN_WHISPER]                 SIL
    [CROSSTALK]                     SIL
    [CROSSTALK](+EXHALE+)           +EXHALE+
    [CROSSTALK](+INHALE+)           +INHALE+
    [CROSSTALK](+NOISE+)            +NOISE+
    [CROSSTALK](+RUSTLE+)           +RUSTLE+
    [CROSSTALK](+SMACK+)            +SMACK+
    [CROSSTALK](+SWALLOW+)          +SWALLOW+
    [END_CROSSTALK_NOISE]           SIL
    [END_MICROPHONE_NOISE]          SIL
    [END_NOISE]                     SIL
    [END_WHISPER]                   SIL
    [HUMAN_NOISE]                   SIL
    [HUMAN_NOISE](+EXHALE+)         +EXHALE+
    [HUMAN_NOISE](+INHALE+)         +INHALE+
    [HUMAN_NOISE](+NOISE+)          +NOISE+
    [HUMAN_NOISE](+RUSTLE+)         +RUSTLE+
    [HUMAN_NOISE](+SMACK+)          +SMACK+
    [HUMAN_NOISE](+SWALLOW+)        +SWALLOW+
    [MICROPHONE_NOISE]              SIL
    [MICROPHONE_NOISE](+EXHALE+)    +EXHALE+
    [MICROPHONE_NOISE](+INHALE+)    +INHALE+
    [MICROPHONE_NOISE](+NOISE+)     +NOISE+
    [MICROPHONE_NOISE](+RUSTLE+)    +RUSTLE+
    [MICROPHONE_NOISE](+SMACK+)     +SMACK+
    [MICROPHONE_NOISE](+SWALLOW+)   +SWALLOW+
    [NOISE]                         SIL
    [NOISE](+EXHALE+)               +EXHALE+
    [NOISE](+INHALE+)               +INHALE+
    [NOISE](+NOISE+)                +NOISE+
    [NOISE](+RUSTLE+)               +RUSTLE+
    [NOISE](+SMACK+)                +SMACK+
    [NOISE](+SWALLOW+)              +SWALLOW+
    [SEE_TRANSCRIPT]                SIL
    [SEE_TRANSCRIPT](+EXHALE+)      +EXHALE+
    [SEE_TRANSCRIPT](+INHALE+)      +INHALE+
    [SEE_TRANSCRIPT](+NOISE+)       +NOISE+
    [SEE_TRANSCRIPT](+RUSTLE+)      +RUSTLE+
    [SEE_TRANSCRIPT](+SMACK+)       +SMACK+
    [SEE_TRANSCRIPT](+SWALLOW+)     +SWALLOW+
    [SIL]                           SIL
    [WHISPER]                       SIL
    [WHISPER](+EXHALE+)             +EXHALE+
    [WHISPER](+INHALE+)             +INHALE+
    [WHISPER](+NOISE+)              +NOISE+
    [WHISPER](+RUSTLE+)             +RUSTLE+
    [WHISPER](+SMACK+)              +SMACK+
    [WHISPER](+SWALLOW+)            +SWALLOW+


LABEL FILE FORMAT

The example below shows the forced alignment for the following transcript
(from fabm/trans/fabm2as2.trn), chosen to illustrate various notations:

    fabm2as2:  [noise] [begin_noise] butterflies [end_noise] are
               [begin_noise] /IH N S EH K S/ [end_noise]

The alignment is given first at the word level, then at the phone level.
The format of each line is as follows:

    UttID:Level> Item StartFrame EndFrame AcousticScore

where Level is word or phone (a small parsing sketch follows the word-level
example below).  These two levels are described and illustrated by the
corresponding sections of data/fabm/label/fabm2as2.lbl.

WORD ITEMS

Items at the word level include the following:

    words, e.g. BUTTERFLIES, ARE(2)
    noise symbols, e.g. [NOISE](+INHALE+)
    phonetic spellings, e.g. the item sequence /IH N S EH K S/
    start of utterance, denoted <s>
    end of utterance, denoted </s>
    silence, denoted SIL
    region markers, e.g. [BEGIN_NOISE], [END_NOISE]

    fabm2as2:word> <s> 0 2 -691946
    fabm2as2:word> [NOISE](+INHALE+) 3 6 -735484
    fabm2as2:word> [BEGIN_NOISE] 7 26 -3040007
    fabm2as2:word> BUTTERFLIES 27 84 -9332849
    fabm2as2:word> [END_NOISE] 85 87 -1107432
    fabm2as2:word> ARE(2) 88 96 -1706023
    fabm2as2:word> [BEGIN_NOISE] 97 99 -1039848
    fabm2as2:word> /IH 100 105 -1068902
    fabm2as2:word> N(/N/) 106 114 -1349913
    fabm2as2:word> S(/S/) 115 121 -1165079
    fabm2as2:word> EH 122 137 -3086438
    fabm2as2:word> K(/K/) 138 146 -1525586
    fabm2as2:word> S/ 147 167 -3266485
    fabm2as2:word> [END_NOISE] 168 170 -603624
    fabm2as2:word> </s> 171 182 -1655149
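Because every alignment line follows the pattern UttID:Level> Item
StartFrame EndFrame AcousticScore, label files are easy to read
programmatically.  The Python sketch below is not part of the corpus
software; Segment and parse_lbl are illustrative names.

    from collections import namedtuple

    # Sketch: read a .lbl file in the format
    #   UttID:Level> Item StartFrame EndFrame AcousticScore
    Segment = namedtuple("Segment", "utt level item start_frame end_frame score")

    def parse_lbl(path):
        """Return a list of Segment records, one per alignment line."""
        segments = []
        with open(path) as lbl:
            for line in lbl:
                fields = line.split()
                # Alignment lines have exactly five whitespace-separated fields.
                if len(fields) != 5 or not fields[0].endswith(">"):
                    continue
                utt, level = fields[0][:-1].split(":", 1)
                segments.append(Segment(utt, level, fields[1],
                                        int(fields[2]), int(fields[3]), int(fields[4])))
        return segments

    # Example: extract the word-level items of the utterance shown above.
    # words = [s for s in parse_lbl("data/fabm/label/fabm2as2.lbl") if s.level == "word"]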
Optional parenthesized identifiers distinguish which alternative Sphinx-II
chose from its pronunciation dictionary when it had a choice.  For example,
ARE(2) denotes the second of the following two pronunciations:

    ARE         AA R
    ARE(2)      AXR

Likewise, S(/S/) distinguishes the phone /S/ from the letter name /EH S/:

    S           EH S
    S(/S/)      S

[NOISE](+INHALE+) shows which of the following noise models Sphinx-II chose
as the best match (as opposed to a human classification of the type of
noise):

    [NOISE]             SIL
    [NOISE](+EXHALE+)   +EXHALE+
    [NOISE](+INHALE+)   +INHALE+
    [NOISE](+NOISE+)    +NOISE+
    [NOISE](+RUSTLE+)   +RUSTLE+
    [NOISE](+SMACK+)    +SMACK+
    [NOISE](+SWALLOW+)  +SWALLOW+

Note that the region markers themselves are mapped onto silence intervals,
as in:

    fabm2as2:word> [BEGIN_NOISE] 7 26 -3040007

PHONE ITEMS

Items at the phone level include the following:

    phones, e.g. AXR, AH(B,DX), B(SIL,AH)b, Z(AY,SIL)e
    utterance-initial silence, denoted SILb
    utterance-final silence, denoted SILe
    other silence, denoted SIL
    noise symbols, e.g. +INHALE+

    fabm2as2:phone> SILb 0 2 -691946
    fabm2as2:phone> +INHALE+ 3 6 -735484
    fabm2as2:phone> SIL 7 26 -3040007
    fabm2as2:phone> B(SIL,AH)b 27 32 -1043872
    fabm2as2:phone> AH(B,DX) 33 36 -666868
    fabm2as2:phone> DX(AH,AXR) 37 40 -793937
    fabm2as2:phone> AXR 41 48 -1563626
    fabm2as2:phone> F 49 58 -1786204
    fabm2as2:phone> L(F,AY) 59 64 -881577
    fabm2as2:phone> AY(L,Z) 65 79 -1649754
    fabm2as2:phone> Z(AY,SIL)e 80 84 -947011
    fabm2as2:phone> SIL 85 87 -1107432
    fabm2as2:phone> AXR 88 96 -1706023
    fabm2as2:phone> SIL 97 99 -1039848
    fabm2as2:phone> IH 100 105 -1068902
    fabm2as2:phone> N(IH,S)e 106 114 -1349913
    fabm2as2:phone> S(N,EH)e 115 121 -1165079
    fabm2as2:phone> EH(S,K) 122 137 -3086438
    fabm2as2:phone> K(EH,S) 138 146 -1525586
    fabm2as2:phone> S(K,SIL)e 147 167 -3266485
    fabm2as2:phone> SIL 168 170 -603624
    fabm2as2:phone> SILe 171 182 -1655149

Phones may be annotated to show which triphone model was used.  For example,
AH(B,DX) denotes the phone /AH/ preceded by /B/ and followed by /DX/.  The
suffixes b and e distinguish word-initial and word-final versions of phone
models, respectively.  Thus B(SIL,AH)b denotes a word-initial /B/ preceded
by silence and followed by /AH/, and Z(AY,SIL)e denotes a word-final /Z/
preceded by /AY/ and followed by silence.


ALIGNMENT ANOMALIES

A few .sph files have no corresponding .lbl files because Sphinx-II failed
to find forced alignments for them, for reasons that remain uncertain.

Anomalies were detected both from failures of utterances to align and by
finding statistical outliers among successful alignments.  The latter method
revealed many errors that were subsequently corrected in the distributed
database.  Remaining sources of anomaly include the acoustic models, noise,
natural variation in reading behavior (such as a very long /R/ or /AH/), and
silence absorption by deletable stops.  For example, some label files assign
excessive durations to words ending in D, where SF and EF are the start and
end frames:

    utterance:unit   segment   SF    EF    score
    fcmm1cu2:word>   BAD       993   1954  -128156352
    mdpj1bd2:word>   FOOD      693   1128  -68727231
    mbak2by2:word>   DIED      167   600   -58038940
    ...

In the worst case, fcmm1cu2.lbl assigns "BAD" a duration of nearly 10
seconds.  Such misalignments are caused by a deletable /DD/ consuming the
ensuing silence:

    fcmm1cu2:phone> DD(AE,SIL)e 1021 1954 -122851054
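These notes do not spell out the exact statistical screen that was used.  As
one illustration of the idea, the Python sketch below flags word segments
whose duration is far above the average observed for the same word, using
Segment records of the form produced by the parsing sketch earlier (pooled
across all label files).  The per-word z-score test and its threshold are
illustrative choices, not the original procedure.

    import statistics

    def flag_long_words(segments, min_samples=5, z_threshold=4.0):
        """Flag word segments whose duration (in frames) is more than
        z_threshold standard deviations above the mean duration observed for
        that word.  Durations stay in frames; judging from the BAD example
        above, one frame appears to correspond to roughly 10 ms."""
        durations = {}
        for s in segments:
            if s.level == "word":
                durations.setdefault(s.item, []).append(s.end_frame - s.start_frame + 1)

        flagged = []
        for s in segments:
            if s.level != "word":
                continue
            observed = durations[s.item]
            if len(observed) < min_samples:
                continue
            mean = statistics.mean(observed)
            sd = statistics.pstdev(observed)
            duration = s.end_frame - s.start_frame + 1
            if sd > 0 and (duration - mean) / sd > z_threshold:
                flagged.append((s.utt, s.item, duration))
        return flagged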