COMLEX README file.

PRONLEX TRANSCRIPTION

Our view is that the best current basis for speech recognition is a simple and internally consistent surface phonemic (allophonic) representation of citation forms in standard American dialect(s). Predictable variation due to dialect, reduction, or transcription uncertainty is to be added in a second stage. For each such source of variation, we have tried to define a standard transcription suitable for generating the set of variant forms.

An illustrative example: some American dialects distinguish the vowels in "sawed" and "sod", while others do not; the ending "-ing" can be pronounced with a vowel more like "heed" or one more like "hid", and with a final consonant like that of "sing" or like that of "sin". This does not take account of considerable variation of actual quality in these sounds: thus some (New Yorkers) pronounce the vowel of "sawed" as a sequence of a vowel like that in "Sue" followed by one like that in "Bud", while in less stigmatized dialects it is a single vowel (that may or may not be like that in "sod").

Combining all these variants for the transcription of the word "dogging" we would get 12 pronunciations -- three versions of the first vowel, two versions of the second vowel, and two versions of the final consonant. Then someone else comes along to tell us that some Chicagoans not only merge the vowels in "sawed" and "sod" but also move both of them towards the front of the mouth, with a sound similar (in extreme cases) to the more standard pronunciation of "sad". Now we have 4 X 2 X 2 = 16 pronunciations for the simple word "dogging" -- with a comparable 16 available for "logging" and "hogging" and so forth, and plenty of variants yet to catalogue.
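The count above can be checked by enumerating the combinations directly. The following is a minimal sketch only: the symbols come from the SHORT symbol set given later in this file, but the particular spelling chosen for each variant (for example "uA" for the New York vowel sequence and "@" for the fronted Chicago vowel) is an illustrative assumption, not a set of lexicon entries.

```python
from itertools import product

# Illustrative variant spellings for "dogging" (assumptions, not lexicon entries).
FIRST_VOWEL = ["c", "a", "uA", "@"]  # sawed-like, sod-like, New York sequence, fronted
SECOND_VOWEL = ["i", "I"]            # heed-like, hid-like
FINAL_CONSONANT = ["G", "n"]         # sing-like, sin-like


def dogging_variants():
    """Return every combination of the three independent choices."""
    return [
        "d" + v1 + "g" + v2 + c
        for v1, v2, c in product(FIRST_VOWEL, SECOND_VOWEL, FINAL_CONSONANT)
    ]


variants = dogging_variants()
print(len(variants))
```

Each further independent source of variation multiplies the count again, which is the motivation for listing a single pronunciation per word.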

Our approach is to give just one pronunciation in such a case. Some speech recognition researchers will want to use our lexicon to generate a network of predictable alternative transcriptions, taking account of dialect variation and reduction phenomena. Others may prefer to let statistical modeling of acoustic correlates handle some or all of such variation.

We want to present a consistent transcription for each lexical set -- so that in our example, "dogging" is not transcribed in one of the 16 ways while a second, different choice is made for "logging," and a third one for "hogging." We also want to choose a transcription that will support generation of all variants, so that distinctions made in some dialects should be made in our transcription if possible. Finally, we do want the transcription to indicate those variants that are lexically specific. Thus many cases of the prefix "re-" have both reduced and full variants (e.g. "reduction"), but many others do not (e.g. "recapitalization"). The difference apparently depends on how separable the prefix is from the rest of the word, but our lexicon simply has to list explicitly the cases that permit reduction.

In order to produce a consistent transcription, especially in a lexicon produced by several different people, we have had to develop a set of explicit principles for the many cases that are left unclear by a simple specification of an allophone set. What follows is the current draft, at the end of a brief but intensive effort to produce a Release 0 WSJ30 vocabulary. The principles are still under development, and comments are welcome.

Here is the symbol set we are using. The "LONG" form is a modified arpabet designed by Bill Fisher at NIST. The "SHORT" form is a single-character-per-allophone version that we developed to reduce wear and tear on our transcribers' fingers.

LONG  SHORT   EXAMPLES             COMMENT
____________________________________________________________________
iy     i       heed, heat, he
ux     u        ?              sometimes used by TI for /u/ -- ignore
ih     I       hid, hit
ey     e       aid, hate, hay
eh     E       head, bet
ae     @       had, hat
aa     a       hod, hot
aax    a        ?               probably Brit: father, alms (vs. pot, botch)
ao     c       law, awe
ow     o       hoed, oats, owe
uh     U       could, hood
uw     u       who'd, hoot, who
ay     Y       hide, height, high
oy     O       Boyd, boy
aw     W       how'd, out, how
er     R       father(2); herd, hurt, her
ax     x       data (2);
ah     A       cud, bud
ix     X       credit(2)?	not used by us
wh     H       which
w      w       witch
y      y       yes
r      r       Ralph
l      l       lawn
m      m       me
em     M       ?                 syllabic m
n      n       no
en     N       button(2)
nx     G       hang
p      p       pot
b      b       bed
t      t       tone
d      d       done
dx     ?       Peter(2)           flap -- not used by us
k      k       kid
g      g       gaff
q      q       ?               Glottal stop -- not used by us
ch     C       check
jh     J       judge
f      f       fix
v      v       vex
th     T       thin
dh     D       this
s      s       six
z      z       zoo
sh     S       shin
zh     Z       pleasure(2)
hh     h       help
'1     '                           main stress
'2     +                           non-main (secondary) stress
'3     +                           tertiary stress; not distinguished from '2
'0     .                           no stress
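The correspondence between the two forms can be written as a lookup table. The sketch below assumes that a LONG transcription is written as whitespace-separated symbols without stress marks; that input format is an assumption, since this file does not show a LONG-form example. The symbols marked "ignore", "not used by us", or with a "?" example (ux, aax, ix, dx, q) are left out, which also keeps the mapping reversible.

```python
# LONG -> SHORT mapping copied from the symbol table above.
LONG_TO_SHORT = {
    "iy": "i", "ih": "I", "ey": "e", "eh": "E", "ae": "@", "aa": "a",
    "ao": "c", "ow": "o", "uh": "U", "uw": "u", "ay": "Y", "oy": "O",
    "aw": "W", "er": "R", "ax": "x", "ah": "A", "wh": "H", "w": "w",
    "y": "y", "r": "r", "l": "l", "m": "m", "em": "M", "n": "n",
    "en": "N", "nx": "G", "p": "p", "b": "b", "t": "t", "d": "d",
    "k": "k", "g": "g", "ch": "C", "jh": "J", "f": "f", "v": "v",
    "th": "T", "dh": "D", "s": "s", "z": "z", "sh": "S", "zh": "Z",
    "hh": "h",
}

# Every SHORT symbol kept here is unique, so the table can be inverted.
SHORT_TO_LONG = {short: long for long, short in LONG_TO_SHORT.items()}


def long_to_short(long_form):
    """Convert e.g. 'hh eh l p' to 'hElp'; raises KeyError on unknown symbols."""
    return "".join(LONG_TO_SHORT[symbol] for symbol in long_form.split())


def short_to_long(short_form):
    """Convert a stress-free SHORT string back to space-separated LONG symbols."""
    return " ".join(SHORT_TO_LONG[symbol] for symbol in short_form)
```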

A note on stress and syllabification:

We distinguish main stress, non-main-stress, and lack-of-stress. For the convenience of the transcribers in entering and checking material, the stress marks may be put between the syllables. However, we have not tried to enforce a consistent set of principles for syllabification, and so the lexicon will be delivered with the stress marks preceding their vowels. Software is available from Bill Fisher that will syllabify arbitrary entries about as well as human annotators can do it, and more consistently.
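Moving a stress mark from a syllable boundary to the position directly before its vowel needs no syllabification, as the sketch below illustrates. It assumes SHORT-form input, and it treats the syllabic consonants N, M, and L as stress-bearing nuclei alongside the vowels; the function name and the nucleus set are assumptions made for illustration.

```python
STRESS_MARKS = set("'+.")
# SHORT vowel symbols plus syllabic consonants (assumed to carry stress marks).
NUCLEI = set("iIeE@acoUuYOWRxAXNML")


def marks_before_vowels(pron):
    """Shift each stress mark rightward until it directly precedes a nucleus."""
    out = []
    pending = ""  # stress mark waiting for the next nucleus
    for ch in pron:
        if ch in STRESS_MARKS:
            pending = ch
        elif ch in NUCLEI:
            out.append(pending + ch)
            pending = ""
        else:
            out.append(ch)  # consonants pass through unchanged
    # A trailing mark with no following nucleus is dropped.
    return "".join(out)


print(marks_before_vowels(".En'kAm.px.sIG"))
```

A transcription that already has its marks before the vowels passes through unchanged.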

PRINCIPLES FOR TRANSCRIBING ENGLISH WORDS

(1) Certain classes of words may contain exceptions to the rest of these principles. In the current release, we've tagged most instances; comprehensive tagging will be provided in the next release.

(a)	function words, e.g.:
 
	the     D'i     #FUNC
	am      '@m     #FUNC
	anyhow  'En.ih+W        #FUNC
	but     b'At    #FUNC
 
(b)	names, e.g.:
 
 	ditka   d'Itk.x	#NAME
	cadbury k'@db.xr.i      #NAME
	equicor 'Ekw.Ik+or      #NAME
 	tiananmen       t+i'an.xm'En    ty'En.xm.En     #NAME

(c)     foreign words, e.g.:

	calabasas       k+@l.xb'@s.xs   #FOR
	valenzano       v+@l.Enz'an.o   #FOR
	sumitomo        s+um.it'om.o    #FOR

(d)     abbreviations, e.g.:

 	calif.  k+@l.If'orny.x  #ABBREV
	corp.   k+orp.xr'eS.In  #ABBREV
	oct.    .akt'ob.R       #ABBREV

(e)     acronyms, e.g.:

	cmos    s'im+cs s'im+os #ACRO
	afscme  '@fskm+i        #ACRO
	sids    s'Idz   #ACRO

(f)     words with unclear status, possible typos:

	allegis .xl'EJ.Iz       #?
	attact  .xt'@kt #?
	wal     w'cl    #?
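The example entries above share a simple layout: the word, one or more pronunciations, and an optional tag field introduced by '#'. A minimal parser for that layout might look like the following sketch. The layout is inferred from the examples in this file rather than from a format specification, and the names Entry and parse_entry are hypothetical.

```python
from typing import List, NamedTuple


class Entry(NamedTuple):
    word: str
    prons: List[str]  # one or more SHORT-form pronunciations
    tags: List[str]   # e.g. ["NAME"], ["FOR", "NAME"], ["?"]


def parse_entry(line):
    """Split one whitespace-delimited lexicon line into word, prons, and tags."""
    body, _, tag_part = line.partition("#")  # tags follow the first '#'
    fields = body.split()
    if len(fields) < 2:
        raise ValueError("expected a word and at least one pronunciation")
    return Entry(fields[0], fields[1:], tag_part.split())


entry = parse_entry("tiananmen       t+i'an.xm'En    ty'En.xm.En     #NAME")
print(entry.word, entry.prons, entry.tags)
```

Splitting on the first '#' relies on '#' never appearing inside a word or a pronunciation, which holds for the symbol set listed earlier.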

(2) DIALECTAL DIFFERENCES: Distinctions made by some dialects but not others are transcribed if possible; alternate pronunciations that reflect mergers can be derived by rule. Some examples:

In/En

Pronunciations that merge 'I' and 'E' before 'n' can be derived by replacing all instances of 'En' with 'In'

		pen	p'En
		pin	p'In
 		accent  '@ks+Ent        .@ks'Ent
		adventure       .@dv'EnC.R
		expend  .Eksp'End

er/@r/Er

Pronunciations that merge 'e', '@', and 'E' into 'E' before 'r' can be derived by replacing 'er' and '@r' sequences with 'Er'.

        	mary    m'er.i  #NAME
        	marry   m'@r.i
        	merry   m'Er.i
        	fare    f'er
        	garrett g'@r.It #NAME
        	guarantee       g+@r.xnt'i

c/a

Pronunciations from dialects that don't have 'open o' (low back rounded vowel, as in 'caught' vs. 'cot') can be derived by replacing all instances of 'c' with 'a'.

        	smaller sm'cl.R
        	smog    sm'cg
        	abroad  .xbr'cd
        	bylaws  b'Yl+cz
        	sausalito       s+cs.xl'it.o    #NAME

The various /o/ and /c/-like vowels before 'r' in words like "hoarse," "horse," "boring," "Maureen," etc. are all transcribed with /o/, although we recognize that the actual quality of such vowels is highly variable, and is not much like /o/ for most speakers.

		quarrel kw'or.xl
 		moral   m'or.xl
		maureen	m.or'in	#NAME
		workhorse	w'Rkh+ors

H/w

Pronunciations that merge 'H' and 'w' can be derived by replacing any 'H' with 'w':

        	buckwheat       b'AkH+it
       	 	meanwhile       m'inH+Yl
        	wharton	H'ort.N #NAME
        	where   H'er	#FUNC
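Because stress marks precede vowels, each of the merger rules in this section reduces to a plain substring replacement on the SHORT form. The sketch below collects them in one table; the rule names and the function name are hypothetical labels chosen for illustration, and the replacements are exactly those stated in the text.

```python
# Merger name -> list of (old, new) substring replacements on SHORT forms.
MERGER_RULES = {
    "pin_pen": [("En", "In")],
    "mary_marry_merry": [("er", "Er"), ("@r", "Er")],
    "cot_caught": [("c", "a")],
    "which_witch": [("H", "w")],
}


def apply_mergers(pron, mergers):
    """Derive a merged-dialect pronunciation from the standard transcription."""
    for name in mergers:
        for old, new in MERGER_RULES[name]:
            pron = pron.replace(old, new)  # case-sensitive: 'c' is not 'C'
    return pron


print(apply_mergers("H'er", ["mary_marry_merry", "which_witch"]))
```

Keeping the distinctions in the base transcription and deriving the merged forms by rule is what lets one lexicon serve several dialects; the reverse derivation would not be possible.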

(3) SCHWA and REDUCED VOWELS:

x

is used for unstressed vowels reduced to schwa, except for the specific environments in which 'I' is used, as explained below.

A

Wedge is used only for the stressed vowel in words like 'hut', 'mud'.

I

is used in the following environments:

+coronal or +palatal ____ +coronal or +palatal

	  sausage s'cs.IJ
	  wanted  w'cnt.Id
          amazes  .xm'ez.Iz
          argentines      'arJ.Int+inz    'arJ.Int+Ynz	#NAME
	  brokerages      br'ok.xr.IJ.Iz  br'okr.IJ.Iz

'l' is not included in the above environment, as it tends to have a lowering effect on I (to x).

          bechtel b'Ekt.xl

corresponding to orthographic 'i' in most cases

          antacid	+@nt'@s.Id
          aeronautical    +er.xn'ct.Ik.xl
          yetnikoff       y'Etn.Ik+cf     #NAME
          anthropologist  +@nTr.xp'al.xJ.Ist
          arithmetic      .xr'ITm.xt.Ik   .@r.ITm'Et.Ik

NOTE: We consider this representation of schwa to be somewhat problematic, as all treatments of schwa have turned out to be, to some extent; and we certainly welcome input on this issue. The practices described here at least have the virtue of making a fairly accurate phonetic distinction that's consistent enough to be changed if necessary.

(4) SYLLABIC CONSONANTS and engma (ng):

R

R is used for tautosyllabic /xr/ and the stressed rhotic vowel in words like 'sir'

x.r

x.r is used otherwise

        	formerly        f'orm.Rl.i
        	overarching     'ov.R+arC.IG
        	undergo +And.Rg'o
        	thunderbird     T'And.Rb+Rd
        	glamorous       gl'@m.xr.xs
        	literally       l'It.xr.xl.i
        	winery  w'Yn.xr.i

N

Syllabic N is used when it's possible and natural to pronounce the word without a release to schwa.

        	ardent  'ard.Nt
        	written	r'It.N
        	satin   s'@t.N

L

Syllabic L will be used when it's possible and natural to pronounce the word without a release to schwa. In this release, all such cases are transcribed as /.xl/.

	bartlesville    b'art.Lzv+Il
        	gittleman       g'It.Lm.xn
        	littlebrook     l'It.Lbr+Uk

G/ng/nk

G is used for velar nasals, and followed by 'g' or 'k' if there's a release; 'ng' and 'nk' sequences are used when the two segments can be pronounced separately (even if they may be co-articulated as a velar nasal). The latter case occurs most frequently when the two segments span a syllable or morpheme boundary.

        	throwing        Tr'o.IG
        	tongs   t'cGz
        	torrington      t'or.IGt.xn     #NAME
        	bangkok b+@Gk'ak        #NAME
        	banks   b'@Gks
        	bilingual  	b+Yl'IGgw.xl
        	conclusion      k.xnkl'uZ.xn
        	dyncorp 'dYn+korp       #NAME
        	encompassing    .En'kAm.px.sIG

(5) STRESS:

The treatment of stress and reduction is problematic. We've eliminated tertiary stress in order to reduce the uncertainty; still, the question of where to mark secondary stress remains unclear. Our principles are intended to reflect one traditional mode of description -- alternative proposals are welcome.

As a general rule of thumb, we've only notated secondary stress on a syllable adjacent to the syllable with primary stress when it's clear that reduction isn't possible. Many of the syllables/morphemes that wouldn't normally get stress may carry secondary stress when they fall in an alternating pattern with the primary stressed syllable; and syllables/morphemes that normally carry secondary stress may be unstressed and reduced when adjacent to another stressed syllable, e.g.:

        	disabled        d.Is'eb.xld
        	dismissing      d.Ism'Is.IG
        	disaffected     d+Is.xf'Ekt.Id
        	disappearance   d+Is.xp'ir.Ins
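Since every stress mark precedes its vowel, the stress pattern of a transcription can be read off by keeping only the marks. The sketch below does that and flags transcriptions in which a secondary stress sits next to the primary stress, so that a transcriber can confirm reduction really is impossible there. It assumes that every syllable nucleus carries exactly one mark; the function names are hypothetical.

```python
STRESS_MARKS = "'+."


def stress_pattern(pron):
    """Return one mark per syllable, e.g. "d+Is.xf'Ekt.Id" -> "+.'."."""
    return "".join(ch for ch in pron if ch in STRESS_MARKS)


def has_adjacent_stresses(pron):
    """True if a secondary stress is adjacent to the primary stress."""
    pattern = stress_pattern(pron)
    return "'+" in pattern or "+'" in pattern


for pron in ["d.Is'eb.xld", "d+Is.xf'Ekt.Id"]:
    print(pron, stress_pattern(pron), has_adjacent_stresses(pron))
```

A flagged transcription is not necessarily wrong; the check only selects candidates for review under the rule of thumb above.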

Tense full vowels (i, e, u, o, Y, W) get some degree of stress, except that the following principles take precedence:

V.V

When vowels are adjacent in a word, the first V is tense; secondary stress should not be marked on the first vowel just because it's tense.

        radio   r'ed.i+o
        ambiguous       .@mb'Igy.u.xs
        delineated      d.xl'In.i+et.Id
        media   m'id.i.x
        factual f'@kC.u.xl
       	arsenio .ars'En.i+o     #NAME

Note that compound words may contain exceptions; a final V in the first part of a compound may be tense but unstressed, e.g.,

        petrochemical   p+Etr.ok'Em.Ik.xl
        antitrust       +@nt.itr'Ast    +@nt.Ytr'Ast

.i

Word-final /i/ (e.g., as in /-li/, /-ski/, /-ri/) usually doesn't get secondary stress. Exceptions include cases where the syllable containing /i/ is in an alternating stress pattern with the primary stressed syllable and doesn't easily reduce.

        shortly S'ortl.i
        happily h'@p.xl.i
        lasky   l'@sk.i #NAME
        Comanche        k.xm'@nC.i      #NAME
        reentry	r.i'Entr.i
        Cherokee        C'Er.xk+i       #NAME
        dutifully       d'ut.Ifl+i
        uzi     'uz+i

.o

Word-final reducible /o/, as in 'yellow', doesn't get secondary stress; non-reducible syllables with /o/ do.

        yellow  y'El.o
        amarillo        +@m.xr'Il.o     #NAME
        echoes  'Ek+oz
        hero    h'ir+o
        anglo   '@Ggl+o
        amoco   '@m.xk+o        #NAME

#FOR

Foreign words and names are often exceptions to our practices for marking stress, e.g.

       	konimoru        k+on.im'or.u    #FOR NAME
        mitsuzuka       m+Its.uz'uk.x   #FOR NAME
        fiero   f+i'Er+o        #FOR NAME
        peroni  p+er'on+i       #FOR NAME

OTHER STRESS ISSUES:

.@, .a

Word-initial [@] and [a] don't necessarily get secondary stress, although in case of an alternating stress pattern, they might.

        accept          .@ks'Ept
        ambassadors     .@mb'@s.Id.Rz
        alberto .al'bR+to       .@lb'Rt+o	#NAME
        admiring        .@dm'Yr.IG
        adheres .@dh'irz
        abdulla .abd'Al.x	#FOR NAME
        arsenio .ars'En.i+o	#NAME
        absolutely      +@bs.xl'utl.i

(6) MISC OTHER issues:

We haven't transcribed flapping,

        better  b'Et.R
        shutter S'At.R
        shudder S'Ad.R

and we're not transcribing intrusive /t/,

        tents   t'Ents
        tense   t'Ens

Comments are welcome.

Regards,
Cynthia McLemore