FORCED ALIGNMENT AND ANOMALY DETECTION
(notes by Jack Mostow, revised 6/19/97)
Utterances were time-aligned against transcripts to find segment
boundaries, both to locate spoken words and phones, and to detect
anomalous data.
ALIGNMENT METHOD
Each label file contains the forced alignment, produced by Sphinx-II,
of the utterance in its signal file against its transcript file.
Alignment was performed by the Sphinx-II speech recognizer's forced
alignment mode, developed by Eric Thayer and Ravishankar Mosur. The
acoustic models used in alignment were trained on adult female speech
for the ATIS task. To improve accuracy, codebook means were adapted
by Alex Hauptmann using a small pilot corpus of twelve children's
fluent oral reading collected by Maxine Eskenazi.
LEXICON
The pronunciation dictionary used in the time-alignment was adapted
from the CMU Speech Group's cmu_dict.04 dictionary, which can be
obtained free of charge via the World Wide Web or anonymous ftp. The
web URL for the CMU Pronouncing Dictionary is:
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
The address for anonymous ftp access is:
ftp://ftp.cs.cmu.edu/project/fgdata/dict/cmudict.0.4.Z
(That is, connect to host "ftp.cs.cmu.edu", use "anonymous" as the
login name and your email address as password, then go to the
directory "project/fgdata/dict" and retrieve the file cmudict.0.4.Z)
Either source also provides additional information (in readme files or
supplemental web pages) about the dictionary and other resources
available from CMU.
For the purpose of doing word alignment on the KIDS speech collection,
the CMUDICT lexicon was augmented by 20 Weekly Reader word entries it
did not include at the time the alignment was carried out, together
with definitions of the phone and noise symbols used in the
transcripts.
The adapted version of the lexicon is provided as part of the present
corpus, in the "tables" directory (file name: alignmnt.dic).
The 20 words added (21 entries, since SUPERHEROES has two
pronunciations) were:
ASTRONAUTS' AE S T R AH N AO T S
CHEETAHS CH IY T AH Z
CHIPETAS CH IH P AY T AH S
FROG'S F R AA G Z
GET-WELL G EH T W EH L
HABS HH AE B Z
HIPBONE HH IH P B OW N
HOME-SCHOOL HH OW M S K UW L
ICEFISH AY S F IH SH
MEAT-EATING M IY T IY T IH NG
MUNCHED M AH N CH TD
PUPA P Y UW P AX
SHEATHBILLS SH IY TH B IH L Z
STEGOSAURUS S T EH G AA S AO R AX S
SUPERHEROES S UW P AXR HH IH R OW Z
SUPERHEROES(2) S UW P AXR HH IY R OW Z
TLINGIT T L IH NG G IH T
TREETOPS T R IY T AA P S
TRICERATOPS T R AY S EH R AX T AA P S
TV T IY V IY
TYRANNOSAURUS T AY R AE N AA S AO R AX S
The following set of phones was used:
AA
AE
AH
AO
AW
AX
AXR
AY
B
BD
CH
D
DD
DH
DX
EH
ER
EY
F
G
GD
HH
IH
IX
IY
JH
K
KD
L
M
N
NG
OW
OY
P
PD
R
S
SH
T
TD
TH
TS
UH
UW
V
W
Y
Z
ZH
The following phones denote utterance-initial, -medial, and -final silences:
SILb
SIL
SILe
LEXICAL ENTRIES FOR PHONES
To facilitate forced alignment of transcripts combining words, phones,
and noises, the lexicon was augmented with entries for phones and
noises.
Since phonetically spelled transcriptions were delimited by the "/"
character, a lexical item was added for each phone for each position
in which it could occur. To illustrate, here are the lexical items
added for occurrences of the phone AX at the start, middle, or end of
a phone sequence, or by itself:
/AX AX
AX(/AX/) AX
AX/ AX
/AX/ AX
Parenthesized identifiers in the lexicon distinguish alternative
pronunciations for a given lexicon entry. In the label files, these
identifiers show which pronunciation Sphinx chose as the best match
for a given transcribed word, phone, or noise event. Thus the
parenthesized annotation in "AX(/AX/)" distinguishes the lexical entry
for the phone /AX/ from the English word "ax":
AX AE K S
For implementation reasons (namely that Sphinx requires the lexicon to
include a non-annotated base pronunciation for each word), the dummy
impossible pronunciation "K P T" was added for phones that are not
words, e.g.:
AA K P T
...
ZH K P T
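The pattern above is mechanical, so the per-phone entries could be
generated programmatically. The following Python sketch is an
illustration, not the original lexicon-building tool; note that it
emits the dummy "K P T" base pronunciation unconditionally, whereas
phones that are also English words (such as AX) kept their real
pronunciation as the base entry.

```python
# Illustrative sketch, not the original lexicon-building tool.
# Generate the positional lexicon entries for one phone, plus the dummy
# "K P T" base pronunciation Sphinx-II requires for phones that are not
# themselves English words.
def phone_lexicon_entries(phone):
    return [
        (phone, "K P T"),                      # dummy base pronunciation
        ("/" + phone, phone),                  # start of a /.../ sequence
        (phone + "(/" + phone + "/)", phone),  # middle of a sequence
        (phone + "/", phone),                  # end of a sequence
        ("/" + phone + "/", phone),            # phone by itself
    ]

for word, pron in phone_lexicon_entries("ZH"):
    print(word, pron)
```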
LEXICAL ENTRIES FOR EVENTS AND REGIONS
Bracketed symbols in transcripts represent various types of events and
regions. For example, "[noise]" denotes an individual noise, while
"[begin_noise] ... [end_noise]" indicates a noisy region.
Event symbols include phenomena not simultaneous with transcribed speech:
[crosstalk] -- off-microphone speech by another speaker
[human_noise] -- non-speech sound produced by the speaker
[microphone_noise] -- noises produced by touching the microphone
[noise] -- any noise
[see_transcript] -- used in label files to indicate other transcribed events
[sil] -- silence lasting as long as two or more typical syllables for this speaker
[whisper] -- whispered speech not otherwise transcribed
Region delimiters denote noise and other phenomena concurrent with speech:
[begin_crosstalk_noise] ... [end_crosstalk_noise] -- someone else speaking too
[begin_microphone_noise] ... [end_microphone_noise] -- microphone being touched
[begin_noise] ... [end_noise] -- interval of noise simultaneous with speech
[begin_whisper] ... [end_whisper] -- whispered interval of transcribed speech
Transcribers were also allowed to label other noise events and regions
they could identify, e.g. [lipsmack].
To support forced alignment of noise events and regions, lexical
entries for these symbols were added, as listed below. For convenience
of implementation, region delimiters such as "[begin_noise]" and
"[end_noise]", which strictly speaking ought to have zero duration,
were defined as silences so that they appear in the label files
produced by forced alignment.
To accommodate additional transcript symbols besides those listed
above without having to continually expand the lexicon, all additional
transcript symbols were translated to the catch-all symbol
"[see_transcript]" prior to alignment. Where [SEE_TRANSCRIPT] occurs
in the label files, see the corresponding transcript for the original
symbol. There are fewer than 200 such occurrences.
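This translation pre-pass can be sketched as follows. This is an
assumed reconstruction; the actual preprocessing script is not part of
the corpus, and the set of known symbols below is taken from the lists
above.

```python
import re

# Assumed reconstruction of the pre-alignment symbol translation pass;
# the actual preprocessing script is not distributed with the corpus.
KNOWN_SYMBOLS = {
    "[crosstalk]", "[human_noise]", "[microphone_noise]", "[noise]",
    "[sil]", "[whisper]", "[see_transcript]",
    "[begin_crosstalk_noise]", "[end_crosstalk_noise]",
    "[begin_microphone_noise]", "[end_microphone_noise]",
    "[begin_noise]", "[end_noise]",
    "[begin_whisper]", "[end_whisper]",
}

def normalize_symbols(transcript):
    """Replace any bracketed symbol outside the known set with the
    catch-all [see_transcript]."""
    def repl(match):
        sym = match.group(0).lower()
        return sym if sym in KNOWN_SYMBOLS else "[see_transcript]"
    return re.sub(r"\[[^\]]+\]", repl, transcript)

print(normalize_symbols("[lipsmack] butterflies [begin_noise] are [end_noise]"))
# -> [see_transcript] butterflies [begin_noise] are [end_noise]
```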
The lexicon was augmented with the following entries for events and
regions:
[BEGIN_CROSSTALK_NOISE] SIL
[BEGIN_MICROPHONE_NOISE] SIL
[BEGIN_NOISE] SIL
[BEGIN_WHISPER] SIL
[CROSSTALK] SIL
[CROSSTALK](+EXHALE+) +EXHALE+
[CROSSTALK](+INHALE+) +INHALE+
[CROSSTALK](+NOISE+) +NOISE+
[CROSSTALK](+RUSTLE+) +RUSTLE+
[CROSSTALK](+SMACK+) +SMACK+
[CROSSTALK](+SWALLOW+) +SWALLOW+
[END_CROSSTALK_NOISE] SIL
[END_MICROPHONE_NOISE] SIL
[END_NOISE] SIL
[END_WHISPER] SIL
[HUMAN_NOISE] SIL
[HUMAN_NOISE](+EXHALE+) +EXHALE+
[HUMAN_NOISE](+INHALE+) +INHALE+
[HUMAN_NOISE](+NOISE+) +NOISE+
[HUMAN_NOISE](+RUSTLE+) +RUSTLE+
[HUMAN_NOISE](+SMACK+) +SMACK+
[HUMAN_NOISE](+SWALLOW+) +SWALLOW+
[MICROPHONE_NOISE] SIL
[MICROPHONE_NOISE](+EXHALE+) +EXHALE+
[MICROPHONE_NOISE](+INHALE+) +INHALE+
[MICROPHONE_NOISE](+NOISE+) +NOISE+
[MICROPHONE_NOISE](+RUSTLE+) +RUSTLE+
[MICROPHONE_NOISE](+SMACK+) +SMACK+
[MICROPHONE_NOISE](+SWALLOW+) +SWALLOW+
[NOISE] SIL
[NOISE](+EXHALE+) +EXHALE+
[NOISE](+INHALE+) +INHALE+
[NOISE](+NOISE+) +NOISE+
[NOISE](+RUSTLE+) +RUSTLE+
[NOISE](+SMACK+) +SMACK+
[NOISE](+SWALLOW+) +SWALLOW+
[SEE_TRANSCRIPT] SIL
[SEE_TRANSCRIPT](+EXHALE+) +EXHALE+
[SEE_TRANSCRIPT](+INHALE+) +INHALE+
[SEE_TRANSCRIPT](+NOISE+) +NOISE+
[SEE_TRANSCRIPT](+RUSTLE+) +RUSTLE+
[SEE_TRANSCRIPT](+SMACK+) +SMACK+
[SEE_TRANSCRIPT](+SWALLOW+) +SWALLOW+
[SIL] SIL
[WHISPER] SIL
[WHISPER](+EXHALE+) +EXHALE+
[WHISPER](+INHALE+) +INHALE+
[WHISPER](+NOISE+) +NOISE+
[WHISPER](+RUSTLE+) +RUSTLE+
[WHISPER](+SMACK+) +SMACK+
[WHISPER](+SWALLOW+) +SWALLOW+
LABEL FILE FORMAT
The example below shows the forced alignment for the following
transcript (from fabm/trans/fabm2as2.trn), chosen to
illustrate various notations:
fabm2as2: [noise] [begin_noise] butterflies [end_noise] are
[begin_noise] /IH N S EH K S/ [end_noise]
The alignment is given first at the word level, then at the phone level.
The format of each line is as follows:
UttID:Level> Item StartFrame EndFrame AcousticScore
where Level is word or phone. These two levels are now described and
illustrated by corresponding sections of data/fabm/label/fabm2as2.lbl:
WORD ITEMS
Items at the word level include the following:
words, e.g. BUTTERFLIES, ARE(2)
noise symbols, e.g. [NOISE](+INHALE+)
phonetic spellings, e.g. the item sequence /IH N S EH K S/
start of utterance, denoted <s>
end of utterance, denoted </s>
silence, denoted SIL
region markers, e.g. [BEGIN_NOISE], [END_NOISE]
fabm2as2:word> <s> 0 2 -691946
fabm2as2:word> [NOISE](+INHALE+) 3 6 -735484
fabm2as2:word> [BEGIN_NOISE] 7 26 -3040007
fabm2as2:word> BUTTERFLIES 27 84 -9332849
fabm2as2:word> [END_NOISE] 85 87 -1107432
fabm2as2:word> ARE(2) 88 96 -1706023
fabm2as2:word> [BEGIN_NOISE] 97 99 -1039848
fabm2as2:word> /IH 100 105 -1068902
fabm2as2:word> N(/N/) 106 114 -1349913
fabm2as2:word> S(/S/) 115 121 -1165079
fabm2as2:word> EH 122 137 -3086438
fabm2as2:word> K(/K/) 138 146 -1525586
fabm2as2:word> S/ 147 167 -3266485
fabm2as2:word> [END_NOISE] 168 170 -603624
fabm2as2:word> </s> 171 182 -1655149
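Label lines like those above are straightforward to parse. A minimal
Python sketch (assumed reader code; no .lbl parser is distributed with
the corpus):

```python
# Assumed reader code; a .lbl parser is not distributed with the corpus.
def parse_label_line(line):
    """Parse one line of the form
    'UttID:Level> Item StartFrame EndFrame AcousticScore'."""
    head, rest = line.split("> ", 1)   # split off the UttID:Level prefix
    utt_id, level = head.split(":")
    fields = rest.split()
    start, end, score = (int(f) for f in fields[-3:])
    item = " ".join(fields[:-3])       # everything before the three numbers
    return {"utt": utt_id, "level": level, "item": item,
            "start": start, "end": end, "score": score}

print(parse_label_line("fabm2as2:word> BUTTERFLIES 27 84 -9332849"))
```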
Optional parenthesized identifiers distinguish which alternative was
chosen by Sphinx-II from its pronunciation dictionary when it had a
choice.
For example, ARE(2) denotes the second of the following two pronunciations:
ARE AA R
ARE(2) AXR
Likewise, S(/S/) distinguishes the phone /S/ from the letter name /EH S/:
S EH S
S(/S/) S
[NOISE](+INHALE+) shows which of the following noise models Sphinx-II
chose as the best match (as opposed to a human classification of the
type of noise):
[NOISE] SIL
[NOISE](+EXHALE+) +EXHALE+
[NOISE](+INHALE+) +INHALE+
[NOISE](+NOISE+) +NOISE+
[NOISE](+RUSTLE+) +RUSTLE+
[NOISE](+SMACK+) +SMACK+
[NOISE](+SWALLOW+) +SWALLOW+
Note that region markers themselves are mapped onto silence intervals,
as in:
fabm2as2:word> [BEGIN_NOISE] 7 26 -3040007
PHONE ITEMS
Phone items include the following:
phone, e.g. AXR, AH(B,DX), B(SIL,AH)b, Z(AY,SIL)e
utterance-initial silence, denoted SILb
utterance-final silence, denoted SILe
other silence, denoted SIL
noise symbol, e.g. +INHALE+
fabm2as2:phone> SILb 0 2 -691946
fabm2as2:phone> +INHALE+ 3 6 -735484
fabm2as2:phone> SIL 7 26 -3040007
fabm2as2:phone> B(SIL,AH)b 27 32 -1043872
fabm2as2:phone> AH(B,DX) 33 36 -666868
fabm2as2:phone> DX(AH,AXR) 37 40 -793937
fabm2as2:phone> AXR 41 48 -1563626
fabm2as2:phone> F 49 58 -1786204
fabm2as2:phone> L(F,AY) 59 64 -881577
fabm2as2:phone> AY(L,Z) 65 79 -1649754
fabm2as2:phone> Z(AY,SIL)e 80 84 -947011
fabm2as2:phone> SIL 85 87 -1107432
fabm2as2:phone> AXR 88 96 -1706023
fabm2as2:phone> SIL 97 99 -1039848
fabm2as2:phone> IH 100 105 -1068902
fabm2as2:phone> N(IH,S)e 106 114 -1349913
fabm2as2:phone> S(N,EH)e 115 121 -1165079
fabm2as2:phone> EH(S,K) 122 137 -3086438
fabm2as2:phone> K(EH,S) 138 146 -1525586
fabm2as2:phone> S(K,SIL)e 147 167 -3266485
fabm2as2:phone> SIL 168 170 -603624
fabm2as2:phone> SILe 171 182 -1655149
Phones may be annotated to show which triphone model was used. E.g.,
AH(B,DX) denotes the phone /AH/ preceded by /B/ and followed by /DX/.
The suffixes b and e distinguish word-initial and word-final versions
of phone models, respectively. Thus B(SIL,AH)b denotes a word-initial
/B/ preceded by silence and followed by /AH/. Similarly, Z(AY,SIL)e
denotes a word-final /Z/ preceded by /AY/ and followed by silence.
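An annotated phone item can be decomposed with a short sketch like the
one below. This is an assumed decomposition, not part of the corpus
tools; it presumes phone and context symbols consist of uppercase
letters or "+", with an optional trailing "b" or "e".

```python
import re

# Assumed decomposition of annotated phone items such as "B(SIL,AH)b";
# presumes phone/context symbols are uppercase letters or "+", with an
# optional trailing "b" (word-initial) or "e" (word-final).
PHONE_RE = re.compile(r"^([A-Z+]+?)(?:\(([A-Z+]+),([A-Z+]+)\))?([be]?)$")

def parse_phone_item(item):
    """Split e.g. 'B(SIL,AH)b' into (base, left, right, position);
    left/right are None for context-independent items like 'AXR'."""
    match = PHONE_RE.match(item)
    if match is None:
        raise ValueError("not a phone item: " + item)
    return match.groups()

print(parse_phone_item("B(SIL,AH)b"))   # ('B', 'SIL', 'AH', 'b')
```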
ALIGNMENT ANOMALIES
A few .sph files have no corresponding .lbl files because Sphinx-II
failed to find forced alignments for them; the reasons are uncertain.
Anomalies were detected both by failures of utterances to align, and
by finding statistical outliers in successful alignments. The latter
method revealed many errors subsequently corrected in the distributed
database. Remaining sources of anomaly include acoustic models,
noise, natural variation in reading behavior (such as very long /R/ or
/AH/), and silence absorption by deletable stops. For example, some
label files assign excessive durations to words ending in D:
UttID:Level> Item StartFrame EndFrame AcousticScore
fcmm1cu2:word> BAD 993 1954 -128156352
mdpj1bd2:word> FOOD 693 1128 -68727231
mbak2by2:word> DIED 167 600 -58038940
...
In the worst case, fcmm1cu2.lbl assigns "BAD" a duration of nearly 10 seconds.
Such misalignments are caused by deletable /DD/ consuming the ensuing silence:
fcmm1cu2:phone> DD(AE,SIL)e 1021 1954 -122851054
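An outlier screen of the kind described above can be sketched as
follows. This is an assumed method, not the original screening code;
the 10-ms frame rate is an inference from the example above (BAD spans
frames 993-1954, about 9.6 seconds, consistent with "nearly 10
seconds").

```python
# Assumed screening method, not the original anomaly-detection code.
FRAME_SEC = 0.01  # assumption: 10-ms frames (100 frames per second)

def duration_outliers(segments, num_sd=3.0):
    """segments: list of (utt_id, word, start_frame, end_frame).
    Return the segments whose duration lies more than num_sd standard
    deviations above the mean duration."""
    durations = [(end - start + 1) * FRAME_SEC
                 for _, _, start, end in segments]
    mean = sum(durations) / len(durations)
    sd = (sum((d - mean) ** 2 for d in durations) / len(durations)) ** 0.5
    return [seg for seg, d in zip(segments, durations)
            if sd > 0 and (d - mean) / sd > num_sd]
```

For example, among fifty 0.3-second words and one 9.6-second word, only
the latter is flagged at three standard deviations.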