FORCED ALIGNMENT AND ANOMALY DETECTION
(notes by Jack Mostow, revised 6/19/97)
Utterances were time-aligned against transcripts to find segment
boundaries, both to locate spoken words and phones, and to detect
anomalous data.
ALIGNMENT METHOD
Each label file contains the forced alignment, produced by Sphinx-II,
of the utterance in its signal file against its transcript file.
Alignment was performed by the Sphinx-II speech recognizer's forced
alignment mode, developed by Eric Thayer and Ravishankar Mosur. The
acoustic models used in alignment were trained on adult female speech
for the ATIS task. To improve accuracy, codebook means were adapted
by Alex Hauptmann using a small pilot corpus of twelve children's
fluent oral reading collected by Maxine Eskenazi.
LEXICON
The pronunciation dictionary used in the time-alignment was adapted
from the CMU Speech Group's cmu_dict.04 dictionary, which can be
obtained free of charge via the World Wide Web or anonymous ftp. The
web URL for the CMU Pronouncing Dictionary is:
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
The address for anonymous ftp access is:
ftp://ftp.cs.cmu.edu/project/fgdata/dict/cmudict.0.4.Z
(That is, connect to host "ftp.cs.cmu.edu", use "anonymous" as the
login name and your email address as password, then go to the
directory "project/fgdata/dict" and retrieve the file cmudict.0.4.Z)
Either source also provides additional information (in readme files or
supplemental web pages) about the dictionary and other resources
available from CMU.
For the purpose of doing word alignment on the KIDS speech collection,
the CMUDICT lexicon was augmented by 20 Weekly Reader word entries it
did not include at the time the alignment was carried out, together
with definitions of the phone and noise symbols used in the
transcripts.
The adapted version of the lexicon is provided as part of the present
corpus, in the "tables" directory (file name: alignmnt.dic).
The 20 words added (21 entries, since SUPERHEROES has two
pronunciations) were:
ASTRONAUTS' AE S T R AH N AO T S
CHEETAHS CH IY T AH Z
CHIPETAS CH IH P AY T AH S
FROG'S F R AA G Z
GET-WELL G EH T W EH L
HABS HH AE B Z
HIPBONE HH IH P B OW N
HOME-SCHOOL HH OW M S K UW L
ICEFISH AY S F IH SH
MEAT-EATING M IY T IY T IH NG
MUNCHED M AH N CH TD
PUPA P Y UW P AX
SHEATHBILLS SH IY TH B IH L Z
STEGOSAURUS S T EH G AA S AO R AX S
SUPERHEROES S UW P AXR HH IH R OW Z
SUPERHEROES(2) S UW P AXR HH IY R OW Z
TLINGIT T L IH NG G IH T
TREETOPS T R IY T AA P S
TRICERATOPS T R AY S EH R AX T AA P S
TV T IY V IY
TYRANNOSAURUS T AY R AE N AA S AO R AX S
The following set of phones was used:
AA
AE
AH
AO
AW
AX
AXR
AY
B
BD
CH
D
DD
DH
DX
EH
ER
EY
F
G
GD
HH
IH
IX
IY
JH
K
KD
L
M
N
NG
OW
OY
P
PD
R
S
SH
T
TD
TH
TS
UH
UW
V
W
Y
Z
ZH
The following phones denote utterance-initial, -medial, and -final silences:
SILb
SIL
SILe
LEXICAL ENTRIES FOR PHONES
To facilitate forced alignment of transcripts combining words, phones,
and noises, the lexicon was augmented with entries for phones and
noises.
Since phonetically spelled transcriptions were delimited by the "/"
character, a lexical item was added for each phone for each position
in which it could occur. To illustrate, here are the lexical items
added for occurrences of the phone AX at the start, middle, or end of
a phone sequence, or by itself:
/AX AX
AX(/AX/) AX
AX/ AX
/AX/ AX
Parenthesized identifiers in the lexicon distinguish alternative
pronunciations for a given lexicon entry. In the label files, these
identifiers show which pronunciation Sphinx chose as the best match
for a given transcribed word, phone, or noise event. Thus the
parenthesized annotation in "AX(/AX/)" distinguishes the lexical entry
for the phone /AX/ from the English word "ax":
AX AE K S
For implementation reasons (namely that Sphinx requires the lexicon to
include a non-annotated base pronunciation for each word), the dummy
impossible pronunciation "K P T" was added for phones that are not
words, e.g.:
AA K P T
...
ZH K P T
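The pattern above is mechanical, so the per-phone entries could be
generated programmatically. The following Python sketch is an
illustration, not the original lexicon-building tool; note that it
emits the dummy "K P T" base pronunciation unconditionally, whereas
phones that are also English words (such as AX) kept their real
pronunciation as the base entry.

```python
# Illustrative sketch, not the original lexicon-building tool.
# Generate the positional lexicon entries for one phone, plus the dummy
# "K P T" base pronunciation Sphinx-II requires for phones that are not
# themselves English words.
def phone_lexicon_entries(phone):
    return [
        (phone, "K P T"),                      # dummy base pronunciation
        ("/" + phone, phone),                  # start of a /.../ sequence
        (phone + "(/" + phone + "/)", phone),  # middle of a sequence
        (phone + "/", phone),                  # end of a sequence
        ("/" + phone + "/", phone),            # phone by itself
    ]

for word, pron in phone_lexicon_entries("ZH"):
    print(word, pron)
```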
LEXICAL ENTRIES FOR EVENTS AND REGIONS
Bracketed symbols in transcripts represent various types of events and
regions. For example, "[noise]" denotes an individual noise, while
"[begin_noise] ... [end_noise]" indicates a noisy region.
Event symbols include phenomena not simultaneous with transcribed speech:
[crosstalk] -- off-microphone speech by another speaker
[human_noise] -- non-speech sound produced by the speaker
[microphone_noise] -- noises produced by touching the microphone
[noise] -- any noise
[see_transcript] -- used in label files to indicate other transcribed events
[sil] -- silence lasting as long as two or more typical syllables for this speaker
[whisper] -- whispered speech not otherwise transcribed
Region delimiters denote noise and other phenomena concurrent with speech:
[begin_crosstalk_noise] ... [end_crosstalk_noise] -- someone else speaking too
[begin_microphone_noise] ... [end_microphone_noise] -- microphone being touched
[begin_noise] ... [end_noise] -- interval of noise simultaneous with speech
[begin_whisper] ... [end_whisper] -- whispered interval of transcribed speech
Transcribers were also allowed to label other noise events and regions
they could identify, e.g. [lipsmack].
To support forced alignment of noise events and regions, lexical
entries for these symbols were added, as listed below. For convenience
of implementation, region delimiters such as "[begin_noise]" and
"[end_noise]", which strictly speaking ought to have zero duration,
were defined as silences so that they appear in the label files
produced by forced alignment.
To accommodate additional transcript symbols besides those listed
above without having to continually expand the lexicon, all additional
transcript symbols were translated to the catch-all symbol
"[see_transcript]" prior to alignment. Where [SEE_TRANSCRIPT] occurs
in the label files, see the corresponding transcript for the original
symbol. There are fewer than 200 such occurrences.
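This translation pre-pass can be sketched as follows. This is an
assumed reconstruction; the actual preprocessing script is not part of
the corpus, and the set of known symbols below is taken from the lists
above.

```python
import re

# Assumed reconstruction of the pre-alignment symbol translation pass;
# the actual preprocessing script is not distributed with the corpus.
KNOWN_SYMBOLS = {
    "[crosstalk]", "[human_noise]", "[microphone_noise]", "[noise]",
    "[sil]", "[whisper]", "[see_transcript]",
    "[begin_crosstalk_noise]", "[end_crosstalk_noise]",
    "[begin_microphone_noise]", "[end_microphone_noise]",
    "[begin_noise]", "[end_noise]",
    "[begin_whisper]", "[end_whisper]",
}

def normalize_symbols(transcript):
    """Replace any bracketed symbol outside the known set with the
    catch-all [see_transcript]."""
    def repl(match):
        sym = match.group(0).lower()
        return sym if sym in KNOWN_SYMBOLS else "[see_transcript]"
    return re.sub(r"\[[^\]]+\]", repl, transcript)

print(normalize_symbols("[lipsmack] butterflies [begin_noise] are [end_noise]"))
# -> [see_transcript] butterflies [begin_noise] are [end_noise]
```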
The lexicon was augmented with the following entries for events and
regions:
[BEGIN_CROSSTALK_NOISE] SIL
[BEGIN_MICROPHONE_NOISE] SIL
[BEGIN_NOISE] SIL
[BEGIN_WHISPER] SIL
[CROSSTALK] SIL
[CROSSTALK](+EXHALE+) +EXHALE+
[CROSSTALK](+INHALE+) +INHALE+
[CROSSTALK](+NOISE+) +NOISE+
[CROSSTALK](+RUSTLE+) +RUSTLE+
[CROSSTALK](+SMACK+) +SMACK+
[CROSSTALK](+SWALLOW+) +SWALLOW+
[END_CROSSTALK_NOISE] SIL
[END_MICROPHONE_NOISE] SIL
[END_NOISE] SIL
[END_WHISPER] SIL
[HUMAN_NOISE] SIL
[HUMAN_NOISE](+EXHALE+) +EXHALE+
[HUMAN_NOISE](+INHALE+) +INHALE+
[HUMAN_NOISE](+NOISE+) +NOISE+
[HUMAN_NOISE](+RUSTLE+) +RUSTLE+
[HUMAN_NOISE](+SMACK+) +SMACK+
[HUMAN_NOISE](+SWALLOW+) +SWALLOW+
[MICROPHONE_NOISE] SIL
[MICROPHONE_NOISE](+EXHALE+) +EXHALE+
[MICROPHONE_NOISE](+INHALE+) +INHALE+
[MICROPHONE_NOISE](+NOISE+) +NOISE+
[MICROPHONE_NOISE](+RUSTLE+) +RUSTLE+
[MICROPHONE_NOISE](+SMACK+) +SMACK+
[MICROPHONE_NOISE](+SWALLOW+) +SWALLOW+
[NOISE] SIL
[NOISE](+EXHALE+) +EXHALE+
[NOISE](+INHALE+) +INHALE+
[NOISE](+NOISE+) +NOISE+
[NOISE](+RUSTLE+) +RUSTLE+
[NOISE](+SMACK+) +SMACK+
[NOISE](+SWALLOW+) +SWALLOW+
[SEE_TRANSCRIPT] SIL
[SEE_TRANSCRIPT](+EXHALE+) +EXHALE+
[SEE_TRANSCRIPT](+INHALE+) +INHALE+
[SEE_TRANSCRIPT](+NOISE+) +NOISE+
[SEE_TRANSCRIPT](+RUSTLE+) +RUSTLE+
[SEE_TRANSCRIPT](+SMACK+) +SMACK+
[SEE_TRANSCRIPT](+SWALLOW+) +SWALLOW+
[SIL] SIL
[WHISPER] SIL
[WHISPER](+EXHALE+) +EXHALE+
[WHISPER](+INHALE+) +INHALE+
[WHISPER](+NOISE+) +NOISE+
[WHISPER](+RUSTLE+) +RUSTLE+
[WHISPER](+SMACK+) +SMACK+
[WHISPER](+SWALLOW+) +SWALLOW+
LABEL FILE FORMAT
The example below shows the forced alignment for the following
transcript (from fabm/trans/fabm2as2.trn), chosen to
illustrate various notations:
fabm2as2: [noise] [begin_noise] butterflies [end_noise] are
[begin_noise] /IH N S EH K S/ [end_noise]
The alignment is given first at the word level, then at the phone level.
The format of each line is as follows:
UttID:Level> Item StartFrame EndFrame AcousticScore
where Level is word or phone. These two levels are now described and
illustrated by corresponding sections of data/fabm/label/fabm2as2.lbl:
WORD ITEMS
Items at the word level include the following:
words, e.g. BUTTERFLIES, ARE(2)
noise symbols, e.g. [NOISE](+INHALE+)
phonetic spellings, e.g. the item sequence /IH N S EH K S/
start of utterance, denoted <s>
end of utterance, denoted </s>
silence, denoted SIL
region markers, e.g. [BEGIN_NOISE], [END_NOISE]
fabm2as2:word> <s> 0 2 -691946
fabm2as2:word> [NOISE](+INHALE+) 3 6 -735484
fabm2as2:word> [BEGIN_NOISE] 7 26 -3040007
fabm2as2:word> BUTTERFLIES 27 84 -9332849
fabm2as2:word> [END_NOISE] 85 87 -1107432
fabm2as2:word> ARE(2) 88 96 -1706023
fabm2as2:word> [BEGIN_NOISE] 97 99 -1039848
fabm2as2:word> /IH 100 105 -1068902
fabm2as2:word> N(/N/) 106 114 -1349913
fabm2as2:word> S(/S/) 115 121 -1165079
fabm2as2:word> EH 122 137 -3086438
fabm2as2:word> K(/K/) 138 146 -1525586
fabm2as2:word> S/ 147 167 -3266485
fabm2as2:word> [END_NOISE] 168 170 -603624
fabm2as2:word> </s> 171 182 -1655149
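Label lines like those above are straightforward to parse. A minimal
Python sketch (assumed reader code; no .lbl parser is distributed with
the corpus):

```python
# Assumed reader code; a .lbl parser is not distributed with the corpus.
def parse_label_line(line):
    """Parse one line of the form
    'UttID:Level> Item StartFrame EndFrame AcousticScore'."""
    head, rest = line.split("> ", 1)   # split off the UttID:Level prefix
    utt_id, level = head.split(":")
    fields = rest.split()
    start, end, score = (int(f) for f in fields[-3:])
    item = " ".join(fields[:-3])       # everything before the three numbers
    return {"utt": utt_id, "level": level, "item": item,
            "start": start, "end": end, "score": score}

print(parse_label_line("fabm2as2:word> BUTTERFLIES 27 84 -9332849"))
```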
Optional parenthesized identifiers distinguish which alternative was
chosen by Sphinx-II from its pronunciation dictionary when it had a
choice.
For example, ARE(2) denotes the second of the following two pronunciations:
ARE AA R
ARE(2) AXR
Likewise, S(/S/) distinguishes the phone /S/ from the letter name /EH S/:
S EH S
S(/S/) S
[NOISE](+INHALE+) shows which of the following noise models Sphinx-II
chose as the best match (as opposed to a human classification of the
type of noise):
[NOISE] SIL
[NOISE](+EXHALE+) +EXHALE+
[NOISE](+INHALE+) +INHALE+
[NOISE](+NOISE+) +NOISE+
[NOISE](+RUSTLE+) +RUSTLE+
[NOISE](+SMACK+) +SMACK+
[NOISE](+SWALLOW+) +SWALLOW+
Note that region markers themselves are mapped onto silence intervals,
as in:
fabm2as2:word> [BEGIN_NOISE] 7 26 -3040007
PHONE ITEMS
Phone items include the following:
phone, e.g. AXR, AH(B,DX), B(SIL,AH)b, Z(AY,SIL)e
utterance-initial silence, denoted SILb
utterance-final silence, denoted SILe
other silence, denoted SIL
noise symbol, e.g. +INHALE+
fabm2as2:phone> SILb 0 2 -691946
fabm2as2:phone> +INHALE+ 3 6 -735484
fabm2as2:phone> SIL 7 26 -3040007
fabm2as2:phone> B(SIL,AH)b 27 32 -1043872
fabm2as2:phone> AH(B,DX) 33 36 -666868
fabm2as2:phone> DX(AH,AXR) 37 40 -793937
fabm2as2:phone> AXR 41 48 -1563626
fabm2as2:phone> F 49 58 -1786204
fabm2as2:phone> L(F,AY) 59 64 -881577
fabm2as2:phone> AY(L,Z) 65 79 -1649754
fabm2as2:phone> Z(AY,SIL)e 80 84 -947011
fabm2as2:phone> SIL 85 87 -1107432
fabm2as2:phone> AXR 88 96 -1706023
fabm2as2:phone> SIL 97 99 -1039848
fabm2as2:phone> IH 100 105 -1068902
fabm2as2:phone> N(IH,S)e 106 114 -1349913
fabm2as2:phone> S(N,EH)e 115 121 -1165079
fabm2as2:phone> EH(S,K) 122 137 -3086438
fabm2as2:phone> K(EH,S) 138 146 -1525586
fabm2as2:phone> S(K,SIL)e 147 167 -3266485
fabm2as2:phone> SIL 168 170 -603624
fabm2as2:phone> SILe 171 182 -1655149
Phones may be annotated to show which triphone model was used. E.g.,
AH(B,DX) denotes the phone /AH/ preceded by /B/ and followed by /DX/.
The suffixes b and e distinguish word-initial and word-final versions
of phone models, respectively. Thus B(SIL,AH)b denotes a word-initial
/B/ preceded by silence and followed by /AH/. Similarly, Z(AY,SIL)e
denotes a word-final /Z/ preceded by /AY/ and followed by silence.
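An annotated phone item can be decomposed with a short sketch like the
one below. This is an assumed decomposition, not part of the corpus
tools; it presumes phone and context symbols consist of uppercase
letters or "+", with an optional trailing "b" or "e".

```python
import re

# Assumed decomposition of annotated phone items such as "B(SIL,AH)b";
# presumes phone/context symbols are uppercase letters or "+", with an
# optional trailing "b" (word-initial) or "e" (word-final).
PHONE_RE = re.compile(r"^([A-Z+]+?)(?:\(([A-Z+]+),([A-Z+]+)\))?([be]?)$")

def parse_phone_item(item):
    """Split e.g. 'B(SIL,AH)b' into (base, left, right, position);
    left/right are None for context-independent items like 'AXR'."""
    match = PHONE_RE.match(item)
    if match is None:
        raise ValueError("not a phone item: " + item)
    return match.groups()

print(parse_phone_item("B(SIL,AH)b"))   # ('B', 'SIL', 'AH', 'b')
```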
ALIGNMENT ANOMALIES
A few .sph files have no corresponding .lbl files because Sphinx-II
failed to find forced alignments for them; the reasons are uncertain.
Anomalies were detected both by failures of utterances to align, and
by finding statistical outliers in successful alignments. The latter
method revealed many errors subsequently corrected in the distributed
database. Remaining sources of anomaly include acoustic models,
noise, natural variation in reading behavior (such as very long /R/ or
/AH/), and silence absorption by deletable stops. For example, some
label files assign excessive durations to words ending in D:
UttID:Level> Item StartFrame EndFrame AcousticScore
fcmm1cu2:word> BAD 993 1954 -128156352
mdpj1bd2:word> FOOD 693 1128 -68727231
mbak2by2:word> DIED 167 600 -58038940
...
In the worst case, fcmm1cu2.lbl assigns "BAD" a duration of nearly 10 seconds.
Such misalignments are caused by deletable /DD/ consuming the ensuing silence:
fcmm1cu2:phone> DD(AE,SIL)e 1021 1954 -122851054
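An outlier screen of the kind described above can be sketched as
follows. This is an assumed method, not the original screening code;
the 10-ms frame rate is an inference from the example above (BAD spans
frames 993-1954, about 9.6 seconds, consistent with "nearly 10
seconds").

```python
# Assumed screening method, not the original anomaly-detection code.
FRAME_SEC = 0.01  # assumption: 10-ms frames (100 frames per second)

def duration_outliers(segments, num_sd=3.0):
    """segments: list of (utt_id, word, start_frame, end_frame).
    Return the segments whose duration lies more than num_sd standard
    deviations above the mean duration."""
    durations = [(end - start + 1) * FRAME_SEC
                 for _, _, start, end in segments]
    mean = sum(durations) / len(durations)
    sd = (sum((d - mean) ** 2 for d in durations) / len(durations)) ** 0.5
    return [seg for seg, d in zip(segments, durations)
            if sd > 0 and (d - mean) / sd > num_sd]
```

For example, among fifty 0.3-second words and one 9.6-second word, only
the latter is flagged at three standard deviations.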