Han, Na-Rae. 2003
     Korean Telephone Conversations Lexicon. Philadelphia:  Linguistic Data
     Consortium, University of Pennsylvania.


        -----------------------------------------------------------
        Description of the Korean Telephone Conversations Lexicon
        -----------------------------------------------------------

CONTENTS

        1. Summary abstract
        2. Token coverage
        3. Lexicon information fields
        4. Orthographic convention and romanization
        5. Pronunciation
        6. Frequency counts and corpus
        7. Morphological Analysis

-----------------------------------------------------------------------
1.  Summary abstract

        The Korean Telephone Conversations Lexicon consists of 25251 words, and
contains separate fields with phonological, morphological, and frequency
information for each word.

-----------------------------------------------------------------------
2.  Token coverage

        The token coverage of the words occurring in the 100 Korean Telephone
Conversations Transcripts is 100%.


-----------------------------------------------------------------------
3. Lexicon information fields

	The LDC lexicon contains five tab-separated information fields:

Field 1: orthographic form in Hangul (headword)
Field 2: orthographic form in Yale romanization
Field 3: pronunciation
Field 4: frequency of the word in Korean Telephone Conversations Transcripts
Field 5: morphological analysis of the word

Each of these fields is described in the sections below.
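
As an illustration of the five-field layout, the following Python sketch
splits one tab-separated lexicon line into named fields. The class name, the
function names, and the sample line are assumptions made for this example
(the frequency and analysis values in the sample are made up); the sketch
assumes the lexicon file is EUC-KR encoded text with one entry per line.

```python
from typing import List, NamedTuple


class LexiconEntry(NamedTuple):
    hangul: str         # field 1: headword in Hangul
    yale: str           # field 2: Yale romanization
    pronunciation: str  # field 3: pronunciation string(s)
    frequency: int      # field 4: frequency in the transcripts
    morphology: str     # field 5: morphological analysis string


def parse_entry(line: str) -> LexiconEntry:
    """Split one lexicon line into its five tab-separated fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, found {len(fields)}")
    hangul, yale, pron, freq, morph = fields
    return LexiconEntry(hangul, yale, pron, int(freq), morph)


def load_lexicon(path: str) -> List[LexiconEntry]:
    """Read a whole lexicon file; 'path' is a hypothetical file name."""
    with open(path, encoding="euc_kr") as handle:
        return [parse_entry(line) for line in handle if line.strip()]


# Constructed sample line: the frequency (12) and the analysis are placeholders.
sample = "한번\than.pen.\t/hanpen/hampen/\t12\t한번/NNC"
entry = parse_entry(sample)
print(entry.yale, entry.frequency)
```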

-----------------------------------------------------------------------
4. Orthographic convention and romanization

Korean text is written in Hangul, encoded in the KSC5601 (Wansung)
system. The romanization given in field 2 follows the Yale Romanization system,
which allows Hangul-romanization conversion in both directions. The 
romanized words were obtained using the "yale-roman.pl" script included 
in the release.

The Yale system is the preferred romanization scheme for Korean in the
linguistics community. Each Korean letter is mapped to a single roman letter
or a sequence of roman letters, as follows:
 
  single consonants 
  (19: onset and coda)
  (exceptions: ㅃ, ㅉ, ㄸ do not appear in coda)
	KIYEOK          ㄱ      k
	SSANGKIYEOK     ㄲ      k k
	NIEUN           ㄴ      n
	TIKEUT          ㄷ      t
	SSANGTIKEUT     ㄸ      t t
	RIEUL           ㄹ      l
	MIEUM           ㅁ      m
	PIEUP           ㅂ      p
	SSANGPIEUP      ㅃ      p p
	SIOS            ㅅ      s
	SSANGSIOS       ㅆ      s s
	IEUNG           ㅇ      n g
	CIEUC           ㅈ      c
	SSANGCIEUC      ㅉ      c c
	CHIEUCH         ㅊ      c h
	KHIEUKH         ㅋ      k h
	THIEUTH         ㅌ      t h
	PHIEUPH         ㅍ      p h
	HIEUH           ㅎ      h

  multiple consonants (11: only in coda)
	KIYEOKSIOS      ㄱㅅ    k s
	NIEUNCIEUC      ㄴㅈ    n c
	NIEUNHIEUH      ㄴㅎ    n h
	RIEULKIYEOK     ㄹㄱ    l k
	RIEULMIEUM      ㄹㅁ    l m
	RIEULPIEUP      ㄹㅂ    l p
	RIEULSIOS       ㄹㅅ    l s
	RIEULTHIEUTH    ㄹㅌ    l t h
	RIEULPHIEUPH    ㄹㅍ    l p h
	RIEULHIEUH      ㄹㅎ    l h
	PIEUPSIOS       ㅂㅅ    p s

  vowels (21)
	A               ㅏ      a
	AE              ㅐ      a y
	YA              ㅑ      y a
	YAE             얘      y a y
	EO              ㅓ      e
	E               ㅔ      e y
	YEO             ㅕ      y e
	YE              ㅖ      y e y
	O               ㅗ      o
	WA              와      w a
	WAE             왜      w a y
	OE              외      o y
	YO              ㅛ      y o
	U               ㅜ      w u
	WEO             워      w e
	WE              웨      w e y
	WI              위      w i
	YU              ㅠ      y u
	EU              ㅡ      u
	YI              의      u y
	I               ㅣ      i

Some modifications are made to the original Yale system:

a. Syllabic boundaries
The end of a syllable is marked with ".", to avoid ambiguous mappings. 
Example:

	kakkak --> could be 각각 or 가깍
	kak.kak. --> unambiguously 각각

b. Marking valueless onset "ㅇ"
The valueless onset "ㅇ" is marked by "ng" rather than being left empty,
which gives a strict letter-to-letter transliteration:

	 안녕:  annyeng --> ngan.nyeng.

c. "yu" instead of "ywu"
Yale allows "ywu" for "ㅠ" in addition to "yu". To avoid confusion, only
"yu" is used.

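The mapping and the three modifications described in this section can be
sketched in Python as follows. This is not the "yale-roman.pl" script from the
release; it is a minimal illustration that assumes the input has already been
decoded to Unicode precomposed Hangul syllables (U+AC00..U+D7A3) and uses the
standard Unicode arithmetic to split each syllable into onset, vowel and
coda. The function name hangul_to_yale is hypothetical.

```python
# Romanization of the 19 onsets, 21 vowels and 28 codas (including "no coda"),
# listed in the order in which Unicode arranges precomposed Hangul syllables.
ONSETS = ["k", "kk", "n", "t", "tt", "l", "m", "p", "pp", "s", "ss",
          "ng", "c", "cc", "ch", "kh", "th", "ph", "h"]
VOWELS = ["a", "ay", "ya", "yay", "e", "ey", "ye", "yey", "o", "wa", "way",
          "oy", "yo", "wu", "we", "wey", "wi", "yu", "u", "uy", "i"]
CODAS = ["", "k", "kk", "ks", "n", "nc", "nh", "t", "l", "lk", "lm", "lp",
         "ls", "lth", "lph", "lh", "m", "p", "ps", "s", "ss", "ng", "c",
         "ch", "kh", "th", "ph", "h"]


def hangul_to_yale(word: str) -> str:
    """Romanize precomposed Hangul syllables, one '.' after each syllable."""
    pieces = []
    for char in word:
        index = ord(char) - 0xAC00
        if not 0 <= index < 19 * 21 * 28:
            raise ValueError(f"not a precomposed Hangul syllable: {char!r}")
        onset, rest = divmod(index, 21 * 28)
        vowel, coda = divmod(rest, 28)
        # The valueless onset is written "ng" (modification b), and every
        # syllable is closed with "." (modification a).
        pieces.append(ONSETS[onset] + VOWELS[vowel] + CODAS[coda] + ".")
    return "".join(pieces)


print(hangul_to_yale("안녕"))
```

Because every syllable is terminated explicitly, the two readings of "kakkak"
come out as distinct strings ("kak.kak." and "ka.kkak.").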

-----------------------------------------------------------------------
5.  Pronunciation
5.1. Overview

The pronunciation field of the lexicon was first produced by a Perl script
("kor2pron.pl", written by David Graff and Na-Rae Han), which converts Korean
words in EUC-KR encoding into their romanized form and then into a phonetic
string based on Korean phonological/phonetic rules. Because pronunciation is
predictable from the orthography most of the time, automatic conversion
produces reliable output. Some pronunciations are not predictable, owing to
idiosyncrasies of individual words or to morphological information (for
example, gemination of consonants at morphological boundaries); these were
manually checked and corrected. For words classified as having a dialectal
pronunciation, the pronunciation was provided manually.

-----------------------------------------------------------------------
5.2. Phonetic alphabets

The phonetic alphabet used for the lexicon is based on the letter set
used in the Yale romanization system. For a complete transliteration table,
please refer to the document on Korean romanization.

Some modifications were made, however, to make it more suitable as a 
phonetic alphabet. 

- aspirated consonants:
	kh -> K
	th -> T
	ph -> P
	ch -> C

- engma:
	ng -> G		in coda position
	ng -> null	in onset position

- liquid:
	l  -> r		word-initially; between vowels

- digraphic vowels:
	ay -> A		(애)
	ey -> E		(에)
	oy -> O		(외)
	uy -> U		(의)

- double-consonants such as "ks", "nh", "lth" are ultimately broken into 
  their component phonemes, and therefore are not phonemic units themselves.

- geminates/fortis consonants ("kk", "pp", "tt", "cc", "ss") are phonemic 
  units: they contrast with plain stops ("k", "p", etc) in onset 
  positions. 

Entire inventory of vowels and consonants:

- vowels (21):
  a  e  i  o  u  A  E  O  U
  ya ye yo yu yA yE 
  wa we wu wi wA wE

- consonants (20)
  k  t  p  c  s  h  r  l  m  n  
  kk tt pp cc ss
  K  T  P  C  
  G

	phonetic	Yale		IPA or
	alphabet  	romanization	description
	----		----		----
	a				IPA a
	e				unrounded mid central vowel
	i				IPA i
	o				IPA o
	u				unrounded high back vowel
	A		ay		unrounded front low vowel
	E		ey		unrounded front mid vowel
	O		oy		rounded front mid vowel
	U		uy		glide: unrounded high back to front
	wa				glide: rounded 'a'
	we				glide: rounded 'e'
	wu				rounded high back vowel
	wi				glide: rounded 'i'
	wA		way		glide: rounded 'ay'
	wE		wey		glide: rounded 'ey'
	k				IPA k; g (voiced environment)
	t				IPA t; d (voiced environment)
	p				IPA p; b (voiced environment)
	c				palatal affricate
	s				IPA s
	h				IPA h
	r		NONE		coronal flap
	l				IPA light l
	m				IPA m
	n				IPA n
	kk				tense 'k'
	tt				tense 't'
	pp				tense 'p'
	cc				tense 'c'
	ss				tense 's'
	K		kh		aspirated 'k'
	T		th		aspirated 't'
	P		ph		aspirated 'p'
	C		ch		aspirated 'c'
	G		ng		engma
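
The symbol substitutions listed in this section can be sketched in Python as
below. This covers only the renaming of symbols (aspirates, engma, the liquid,
and digraphic vowels); it does not apply the phonological rules that
"kor2pron.pl" implements, so its output matches the lexicon only for words
that need no further rules. The function name and the parsing regular
expression are assumptions made for this example; the input is the
dot-terminated romanization of field 2.

```python
import re

# One romanized syllable: onset, vowel, then whatever remains as the coda.
_SYLLABLE = re.compile(
    r"(kk|tt|pp|cc|ss|kh|th|ph|ch|ng|[ktplnmsch])"
    r"(yay|yey|way|wey|ay|ey|oy|uy|ya|ye|yo|yu|wa|we|wu|wi|[aeiou])"
    r"([a-z]*)"
)
_ASPIRATED = {"kh": "K", "th": "T", "ph": "P", "ch": "C"}
_DIGRAPH_ENDINGS = {"ay": "A", "ey": "E", "oy": "O", "uy": "U"}


def yale_to_symbols(yale: str) -> str:
    """Rename Yale letters to the phonetic symbols; no phonological rules."""
    pieces = []
    for syllable in (s for s in yale.split(".") if s):
        match = _SYLLABLE.fullmatch(syllable)
        if match is None:
            raise ValueError(f"cannot parse syllable: {syllable!r}")
        onset, vowel, coda = match.groups()
        # Engma is dropped in onset position and written G in coda position.
        onset = "" if onset == "ng" else _ASPIRATED.get(onset, onset)
        coda = "G" if coda == "ng" else _ASPIRATED.get(coda, coda)
        # Digraphic vowels: ay -> A, way -> wA, yey -> yE, and so on.
        if vowel[-2:] in _DIGRAPH_ENDINGS:
            vowel = vowel[:-2] + _DIGRAPH_ENDINGS[vowel[-2:]]
        pieces.append(onset + vowel + coda)
    word = "".join(pieces)
    # Liquid: l -> r word-initially and between vowels (a following glide
    # y/w is treated as part of the vowel).
    word = re.sub(r"(?:^|(?<=[aeiouAEOU]))l(?=[ywaeiouAEOU])", "r", word)
    return "/" + word + "/"


print(yale_to_symbols("ku.le.ni.kka."))
```

Multiple-consonant codas are left unsplit here, since reducing them depends
on the following syllable and belongs to the phonological rules.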

---------------------------------------------------------------------------
5.3. Phonological rules of Korean 

Much of Korean pronunciation is governed by the grammar of syllabic
structure. A well-formed syllable in Korean satisfies the following:

- a syllable consists of an optional onset consonant, one vowel, and an
  optional coda consonant:
  (onset)?(vowel)(coda)?

- the set of vowels includes the 21 vowels defined above:
  [yw]?[aeiouAEOU]

- consonants allowed in onset position (all except l and G, plus ll):
  ([ktprnmKTPCsch]|kk|tt|pp|cc|ss|ll)

- consonants allowed in coda position:
  [ktplnmG]
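
The three patterns above can be combined into a single syllable test. The
Python sketch below simply composes them as written; the function name is an
assumption for this example. Note that the vowel pattern, taken literally,
admits a few glide-vowel combinations (27 in total) beyond the 21 vowels of
the inventory, so it is a loose check rather than an exact one.

```python
import re

# Patterns copied from the syllable grammar above.
ONSET = r"(?:[ktprnmKTPCsch]|kk|tt|pp|cc|ss|ll)"
VOWEL = r"[yw]?[aeiouAEOU]"
CODA = r"[ktplnmG]"

# (onset)?(vowel)(coda)?
SYLLABLE = re.compile(f"{ONSET}?{VOWEL}{CODA}?")


def is_well_formed_syllable(syllable: str) -> bool:
    """Return True if the phonetic string is one well-formed syllable."""
    return SYLLABLE.fullmatch(syllable) is not None


for candidate in ["hAG", "wat", "kke", "la", "Ga", "kaks"]:
    print(candidate, is_well_formed_syllable(candidate))
```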

---------------------------------------------------------------------------
5.4. Multiple pronunciation

A phonetic string is enclosed in "/ /": 

Korean			  Yale-romanization	 	 pronunciation
그러니까                  ku.le.ni.kka.                  /kurenikka/

Some phonetic/phonological processes seem to be optional, occurring in fast
speech only. When such an alternative pronunciation is available, it is given
alongside the preferred pronunciation, e.g.:

왔거든                    ngwass.ke.tun.                 /watkketun/wakketun/
한번                      han.pen.                       /hanpen/hampen/
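
A pronunciation field in this format can be split into its alternatives with
a few lines of Python. This is a minimal sketch; the function name is
hypothetical, and it assumes that the first alternative listed is the
preferred pronunciation, as the examples above suggest.

```python
from typing import List


def split_pronunciations(field: str) -> List[str]:
    """Split '/p1/p2/' into ['p1', 'p2']; the first item is the preferred one."""
    if len(field) < 2 or not (field.startswith("/") and field.endswith("/")):
        raise ValueError(f"pronunciation not enclosed in slashes: {field!r}")
    return field[1:-1].split("/")


print(split_pronunciations("/watkketun/wakketun/"))
print(split_pronunciations("/kurenikka/"))
```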

---------------------------------------------------------------------------
5.5. Evaluation

1,000 entries were randomly sampled from the lexicon for evaluation
purposes. Of these, 14 words had an incorrect pronunciation assigned,
yielding an accuracy of 98.6%.

All of these errors involved "tensification" or "gemination" of stop
consonants such as /k/, /c/ and /p/. This phonological process in Korean
is either specific to lexical items or involves a morphological
boundary. Neither piece of information is always available from the
orthographic representation, which makes it difficult for an automated
phoneticization script to handle such cases correctly. A manual
scan was therefore done in the post-processing stage to correct these
items, but not all of them were caught. Examples of such
errors are:

-- tensification on morphological boundary
    행복할거야    /hAGpoKalkeya/ --> should be /hAGpoKalkkeya/
                           ^                             ^^
    발바닥        /palpatak/     --> should be /palppatak/
                      ^                            ^^ 
-- morpheme-internal arbitrary tensification
     절제         /celcE/        --> should be /celccE/
                      ^                            ^^
     효과         /hyokwa/       --> should be /hyokkwa/
                      ^                            ^^

-----------------------------------------------------------------------
6. Frequency counts and corpus

Field 4 of the lexicon shows the frequency of the headword in the Korean
Telephone Conversations Transcripts. The corpus has 25,251 words. Frequency
counts are based on the representation of the word in Korean characters, not
on pronunciation or morphological analysis. For example, if one Korean string has
two or more distinct meanings, each with its own morphological analysis, all
the analyses are included in the morphological analysis field, and all of the
occurrences are reflected in a single frequency count.

--------------------------------------------------------------------------
7. Morphological Analysis
7.1. Overview

- corpus size:    25,709 words
- morphological analyses: 51,014
  -- average number of morphological analyses per word: 1.98
- number of unique morphemes: 9,821

Morphological analyses are in the following format:

	사귀/VV+기/ENM+도/PAU

where each morpheme is followed by its part-of-speech tag, and morphemes are
separated by "+". Many words have more than one possible morphological
analysis, as no attempt was made to disambiguate with regard to the
context in which they appear. Multiple analyses are separated by ";". 
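
The format can be read back into a nested structure with the Python sketch
below. The function name and the sample strings are assumptions for this
example. The sketch splits naively on ";" and "+", so it assumes that these
characters do not occur inside a morpheme; how the lexicon represents symbol
morphemes containing them is not covered here.

```python
from typing import List, Tuple

Morpheme = Tuple[str, str]  # (form, part-of-speech tag)


def parse_analyses(field: str) -> List[List[Morpheme]]:
    """Parse 'm1/T1+m2/T2;m3/T3' into a list of analyses."""
    analyses = []
    for analysis in field.split(";"):
        analysis = analysis.strip()
        if not analysis:
            continue
        morphemes = []
        for unit in analysis.split("+"):
            # Split on the last "/" so a "/" inside the form is preserved.
            form, _, tag = unit.rpartition("/")
            morphemes.append((form, tag))
        analyses.append(morphemes)
    return analyses


# Illustrative input in the format described above.
print(parse_analyses("사귀/VV+기/ENM+도/PAU"))
```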

These analyses were obtained using Klex, a finite-state lexical transducer
of Korean built by Na-Rae Han (nrh@ling.upenn.edu).  After the entries
were automatically produced, words that the analyzer failed to
recognize were given analyses by hand.  Finally, entries were checked
semi-manually, by producing a frequency histogram of morphemes and ruling
out those that were deemed highly unlikely by a human annotator. 


--------------------------------------------------------------------------
7.2. Part-of-speech tags

Klex uses a Part-of-Speech tag set which is based on the one
employed by the Korean Treebank Project with slight modification. The POS 
tagging guideline for the Korean Treebank can be found at:
ftp://ftp.cis.upenn.edu/pub/ircs/tr/01-09/

  Noun      NNC      common noun
            NNU      numeric noun
            NNX      dependent noun
            NPN      pronoun
            NPR      proper noun
            NFW      foreign word       

  Post-     PCA      case postposition
 position   PAD      adverbial
            PAN      adnominal          * not included in Korean Treebank
            PAU      auxiliary
            PCJ      conjunctive

 Predicate  VV       verb
            VJ       adjective
            VX       auxiliary predicate

  Verbal    EPF      pre-final ending
  ending    EFN      final ending
            ECS      non-final ending   * ECS and EAU in Korean Treebank
            EAN      adnominal ending
            ENM      nominalization ending

  Etc       CO       copula
            ADV      adverb
            ADC      conjunctive adverb
            DAN      adnominal modifier
            XSF      suffix
            XPF      prefix
            XSV      verbalization suffix
            XSJ      adjectivization suffix
            IJ       interjection

  Symbol    SFN      sentence-final symbols: . ? !! ......
            SCM      comma: ,
            SLQ      left delimiters: " ' ( < [ {
            SRQ      right delimiters: " ' ) > ] }
            SSY      symbol


--------------------------------------------------------------------------
7.3. Allomorphs

Allomorphs are morphemes that have exactly the same meaning and function
but are realized in different forms, usually conditioned by the phonological
environment. A large number of inflectional suffixes in Korean display this
property. For example, "은" and "는", "로" and "으로" are allomorphs of 
each other.
 
Klex treats such allomorphs as having a single underlying form.  All
allomorphs therefore take a single form in the upper (analyzed) string. 
Through application of a sequence of phonological rules, the correct
inflected forms (lower strings) are derived. This way, the topic markers
in "학교-는", "학생-은" and "너-ㄴ" are equally assigned "은/PAU" in
"학교/NNC+은/PAU", "학생/NNC+은/PAU" and "너/NPN+은/PAU". 

      학교/NNC+은/PAU     학생/NNC+은/PAU      너/NPN+은/PAU
             |                    |                  |
          학교는               학생은                넌     

The criteria used in determining the representative form among allomorphs
are as follows:

- the form should be fully syllabic, i.e. "음" is chosen and not "ㅁ" 
  (as in "예쁨").
 
- the form for the post-consonantal environment is chosen, 
  i.e. "이" instead of "가". 
 
- epenthetic vowels are included, i.e. "으로" and not "로". (This 
  criterion mostly overlaps with the first two above, as epenthetic vowels
  are used in post-consonantal environments.) 

-  for vowel harmony, "어" is chosen and not "아", i.e. "어서" and not 
   "아서".
 

--------------------------------------------------------------------------
7.4. Evaluation

2,500 word entries (1/10 of corpus) were randomly extracted for 
evaluation of the morphological analysis. The findings are:

	29	correct analysis missing
	9	ungrammatical analysis
	233	dubious analysis
     		(not downright ungrammatical, but highly dubious analysis
     		--many due to unlikely noun-compounding or to archaic entries
		in the dictionary on which the analyzer is based)

Measured precision/recall per word:

- recall 	2471/2500 = 0.9884
- precision 	2491/2500 = 0.9964
- conservative precision (counting "dubious analyses" as incorrect)
		2258/2500 = 0.9032

In conclusion, 99% of the time a word in the lexicon is expected to have 
all possible morphological analyses. Also, either 99% or 90% of 
the words are free of incorrect parses, depending on the criteria.
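
The per-word figures above follow directly from the three error counts. The
short Python sketch below reproduces the arithmetic, assuming (as the reported
numerators imply) that each counted problem affects a distinct word in the
2,500-word sample. The variable names are chosen for this example.

```python
SAMPLE_SIZE = 2500

missing = 29         # words whose correct analysis is missing
ungrammatical = 9    # words with an ungrammatical analysis
dubious = 233        # words with a dubious analysis

# Recall: share of words that have their correct analysis.
recall = (SAMPLE_SIZE - missing) / SAMPLE_SIZE

# Precision: share of words free of ungrammatical analyses.
precision = (SAMPLE_SIZE - ungrammatical) / SAMPLE_SIZE

# Conservative precision: dubious analyses are also counted as incorrect.
conservative = (SAMPLE_SIZE - ungrammatical - dubious) / SAMPLE_SIZE

print(f"recall                 {recall:.4f}")
print(f"precision              {precision:.4f}")
print(f"conservative precision {conservative:.4f}")
```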