Klex: Finite-State Lexical Transducer for Korean

               April 12, 2004

               Na-Rae Han
               University of Pennsylvania
               nrh@babel.ling.upenn.edu

               special thanks to:
               Ken Beesley (beesley@xrce.xerox.com, XRCE)
               Lauri Karttunen (karttunen@parc.xerox.com, XRCE)
               Martha Palmer (mpalmer@central.cis.upenn.edu, UPenn)
               Mike Maxwell (maxwell@ldc.upenn.edu, LDC)

* Note: this document is in ksc-5601 encoding. Characters in Hangul (the Korean
alphabet) can be displayed by selecting Korean encoding in common browsers such
as Netscape or Internet Explorer, or on X terminals.
-----------------------------------------------------------------------------
TABLE OF CONTENTS

   INTRODUCTION
   RELEASE CONTENTS
   COMPILING klex.fst
   MORPHOLOGICAL ANALYSIS/GENERATION USING lookup
   TOKENIZATION OF INPUT
   SPECIAL SYMBOLS IN klex.fst
   ABSTRACT ALPHABETS IN UPPER STRINGS
   ALLOMORPHY IN Klex
   PART-OF-SPEECH TAGS

-----------------------------------------------------------------------------
INTRODUCTION

Klex is a finite-state lexical transducer for the Korean language,
with the lexical string on the upper side and the inflected surface
string on the lower side. Klex was developed on the XFST (Xerox Finite
State Tool) software platform, developed and distributed by the Xerox
Corporation.

A lexicon in the form of a transducer has the following basic structure:

           fly/VV+s/ECS         돕/VV+었/EPF+다/EFN
                 |                       |
               flies                   도왔다

A sequence of morphemes along with their respective part-of-speech tags
constitutes the upper string; a fully lexicalized form constitutes the
lower string. A transducer network as a whole consists of all such
possible morpheme sequence / word pairs in the language. Given the
lower lexicalized form, the transducer can produce the analyzed
morpheme sequence (the process of "looking-up"); conversely, the
transducer can be used to produce the fully inflected surface form
of a grammatical sequence of morphemes (the opposite of "looking-up", hence
Xerox's terminology of "looking-down"). These two operations are the
most typical applications of such lexical transducers, namely
morphological analysis and generation.
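
As a rough illustration only, the two directions can be mimicked in Python
by treating the network as a plain list of (upper, lower) pairs. This is not
how XFST stores a network; the names PAIRS, apply_up and apply_down are
made up for this sketch, and the pairs are the examples shown in this
document.

```python
# Toy model of a lexical transducer: a list of (upper, lower) string pairs.
# A real network encodes these pairs compactly as a finite-state machine.
PAIRS = [
    ("돕/VV+었/EPF+다/EFN", "도왔다"),
    ("듣/VV+었/EPF+던/EAN", "들었던"),
    ("들/VV+었/EPF+던/EAN", "들었던"),
    ("들/VX+었/EPF+던/EAN", "들었던"),
]


def apply_up(surface):
    """Analysis ("looking-up"): return every upper string paired with surface."""
    return [upper for upper, lower in PAIRS if lower == surface]


def apply_down(analysis):
    """Generation ("looking-down"): return every lower string paired with analysis."""
    return [lower for upper, lower in PAIRS if upper == analysis]


if __name__ == "__main__":
    for upper in apply_up("들었던"):
        print(upper)
    print(apply_down("돕/VV+었/EPF+다/EFN"))
```

Note that analysis can be ambiguous (one surface form, several upper
strings), while generation from a full analysis normally is not.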

-----------------------------------------------------------------------------
RELEASE CONTENTS

This release of Klex contains a set of source files. When compiled with
Xerox's XFST software, they produce a set of binary files containing
transducers that can be run with the same software.

To compile the source code and to load and use the resulting transducer
network, you need Xerox's finite-state software tools, which include XFST.
The Xerox tools run under the Solaris, Linux, Windows and Mac OS X operating
systems, and are distributed with the following book:

    Beesley, Kenneth R., and Lauri Karttunen.  2003.  "Finite
    State Morphology." CSLI Studies in Computational Linguistics.
    Chicago: University of Chicago Press.

At present, the book is available for $40 (paperback).  The license for
the tools included with the book covers non-commercial use; they can
also be licensed for commercial use by contacting Xerox.  For further
information, see the following link:

    http://www.stanford.edu/~laurik/fsmbook/home.html

Files and directories included in the release (the suffix ".scr" is
used for xfst script files):

makefile             #sample makefile script
readme.txt           #this document
encoding/            #contains auxiliary .scr files for encoding conversion
   realroman.scr     #maps abstract Roman character '^r' to real Roman 'r'
   rom2kor.scr       #Romanization <--> Korean conversion transducer net
   rom2kor.8parts.scr          #same as rom2kor.scr, broken into 8 pieces
   rom2kor.unicode.scr         #uses Unicode 1.0 instead of ksc-5601
   rom2kor.unicode.8parts.scr  #same as above, broken into 8 pieces
hangul_docs/         #contains documents on Korean encoding and Romanization
   hangul-codes.txt  #table of various Hangul (Korean alphabet) encodings
   hangul-def.txt    #Hangul alphabets and Roman transliteration defined
   hangul-camo.txt   #supplement to hangul-def.txt
   yale-mapping.txt  #description of Yale Romanization system
lexicon/             #contains script files for compiling initial lexicon
   all-lexicon.scr   #the big lexicon file, concatenation of all others
   lex_header.scr    #header portion of the lexicon script
   lex_vend.scr      #part containing (mostly) verbal inflections
   lex_nend.scr      #part containing (mostly) noun inflections
   lex_ADC.scr, lex_ADV.scr, ... lex_VX.scr
                     #parts containing individual groups of POS roots
   convert_flag.sh   #converts the "@D.V.Y@" flag notation, which is
                     # incompatible with older versions of XFST, to %@D%.V%.Y%@
rewrite_rules/       #rewrite rules directory
   rules.scr         #morpho-phonological/orthographical alternation rules
examples/            #contains input files for test purposes
   prince.txt        #small text from "Little Prince"
   prince.tok        #tokenized text
   prince.rom        #tokenized, and romanized
   prince.anl        #sample output of morphological analyzer

-----------------------------------------------------------------------------
COMPILING klex.fst

We suggest using the included makefile to compile the system.  You must
first edit the two paths set near the top of the supplied makefile: the
location of the xfst program, and the location of the directory where your
klex files reside.  If you are running Microsoft Windows, path names which
contain a space character will probably need to be quoted.

We have tested compilation using the modified makefile under Solaris,
Linux, and Cygwin running under MS Windows.  If you experience
difficulties running the make process under these or other operating
systems, please let us know.  Be warned that compilation requires large
amounts of memory (you should not attempt compilation on a system with
less than 512 megabytes of memory) and time (allow several hours).

Compilation produces the following files:

klex.fst             #Korean finite-state lexicon
klex-rom-kor.fst     #lower side the same as klex.fst, but upper side is 
                      in Roman transliteration
klex-rab-kor.fst     #like -rom-kor, but includes abstract alphabets on 
                      upper side
klex-rab-rom.fst     #upper side the same as -rab-kor, but with Roman 
                      transliteration on lower side

encoding/
   realroman.fst     #maps abstract Roman character '^r' to real Roman 'r'
   rom2kor.fst       #Romanization <--> Korean conversion transducer net
   rom2kor.k-kh-kk.fst, rom2kor.t-th-tt.fst, ... rom2kor.ng.fst
                     #broken-down chunks of Romanization <--> Korean net

rewrite_rules/
   rules.fst         #morpho-phonological/orthographical alternation rules

There are several key steps in building klex.fst, each of which produces
an intermediate version of the transducer network.

(1) all-lexicon.fst
    compiled from all-lexicon.scr, it has romanized characters on both upper
    and lower sides. The upper side contains part-of-speech tags; morpheme
    representations on lower sides are still in abstract forms.

-> rules.fst composed at bottom ->

(2) klex-rab-rom.fst
    now the morpheme representations on the lower sides are fully rewritten
    following morpho-phonological/orthographical rules to reflect surface forms.

-> Romanization to Hangul encoding mapping on the bottom ->

(3) klex-rab-kor.fst
    lower side is now in Hangul encoding; upper side is still in romanized
    form, with abstract alphabets present.

-> abstract alphabets are mapped onto non-abstract alphabets ->

(4) klex-rom-kor.fst
    upper side is cleared of abstract alphabets; keT.+ta. '걷다;걸어서 to
    walk' is converted to ket.+ta., so it can no longer be distinguished
    from the other ket.+ta. '걷다;걷어서 to collect'.

-> Romanization to Hangul encoding mapping on the top
   note that this step uses 8 smaller parts of the converter net: this is
   to circumvent a compilation error ("maximum number of symbols has been
   exceeded") within XFST which results when composition with the larger
   code conversion network as a whole is attempted. ->

(5) klex.fst
    Hangul (ksc-5601) encoding on both upper and lower sides.

The makefile contains a sample script used in compiling these
various transducer networks, using xfst's "-e" flag which lets the user
script the entire compiling process. To use Unicode encoding instead of
the ksc-5601 Korean encoding, substitute rom2kor.unicode.scr and
rom2kor.unicode.8parts.scr from the encoding/ directory.
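
The chain of compositions in steps (1) to (3) can be pictured with a small
Python sketch in which each network is a set of (upper, lower) pairs and
composing "at the bottom" replaces the lower side. The function name
compose and the three toy networks are hypothetical; in particular the
tagged upper string is a simplified stand-in and is not copied from
all-lexicon.scr. The step (5) mapping on the upper side is left out.

```python
def compose(upper_net, lower_net):
    """Relational composition: keep the upper side of upper_net and replace
    its lower side with whatever lower_net maps that lower side to."""
    return {
        (upper, new_lower)
        for (upper, middle) in upper_net
        for (match, new_lower) in lower_net
        if middle == match
    }


# Hypothetical one-entry stand-ins for the real networks.
ALL_LEXICON = {("toP/VV+E/ECS", "to.P+E.")}  # lower side still abstract
RULES = {("to.P+E.", "to.ngwa.")}            # alternation rules
ROM2KOR = {("to.ngwa.", "도와")}              # Romanization -> Hangul

if __name__ == "__main__":
    rab_rom = compose(ALL_LEXICON, RULES)    # like step (2)
    rab_kor = compose(rab_rom, ROM2KOR)      # like step (3)
    print(rab_rom)
    print(rab_kor)
```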

-----------------------------------------------------------------------------
MORPHOLOGICAL ANALYSIS/GENERATION USING lookup

There are two ways of using klex.fst as a morphological
analyzer/generator. First, one can load the transducer network inside
the XFST interface and use the apply up/apply down commands to find the
upper/lower counterpart of the given string:

========================================
-output from a Unix prompt:
% xfst
Copyright Xerox Corporation 1997-2003
Xerox Finite-State Tool, version 8.1.4

Type "help" to list all commands available or "help help" for further
help.

xfst[0]: load klex.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
xfst[1]: apply up 들었던
듣/VV+었/EPF+던/EAN
들/VV+었/EPF+던/EAN
들/VX+었/EPF+던/EAN
xfst[1]: apply down 듣/VV+었/EPF+던/EAN
들었던

========================================

For more information on working in XFST, refer to the following XRCE web
page:

http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/fst-97/xfst97.html

For automated operations on large input data, "lookup", a utility
for morphological analysis using XFST, should be used. It is included in
the Xerox release of the software suite. Detailed documentation can be
found here:

http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/lookup-97/lookup97.html

For Korean input/output, a flag option of "-flags mb" should be used to
allow for multi-byte characters on both sides. Some sample invocations:

/pkg/ldc/Xerox/Sun/lookup klex.fst -flags mbTT
/pkg/ldc/Xerox/Sun/lookup -f lookupscript -flags mbTT
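
For batch use from a script, the first invocation above could be wrapped
roughly as follows. This is a sketch under assumptions: that lookup reads
one token per line from standard input, that it is reachable as "lookup"
on your PATH, and that Python's "euc_kr" codec is an adequate stand-in for
ksc-5601. The function names are made up for this example; only the
"klex.fst -flags mbTT" arguments come from the sample above.

```python
import subprocess


def build_lookup_cmd(fst_path="klex.fst", lookup_bin="lookup"):
    """Argument list mirroring the first sample invocation above."""
    return [lookup_bin, fst_path, "-flags", "mbTT"]


def encode_tokens(tokens, encoding="euc_kr"):
    """One token per line, encoded for a ksc-5601 network (assumed codec)."""
    return ("\n".join(tokens) + "\n").encode(encoding)


def run_lookup(tokens, fst_path="klex.fst", lookup_bin="lookup",
               encoding="euc_kr"):
    """Feed tokens to lookup on stdin and return its raw output as text.
    The output is returned unparsed, since its exact layout depends on
    the flags and lookup version in use."""
    result = subprocess.run(
        build_lookup_cmd(fst_path, lookup_bin),
        input=encode_tokens(tokens, encoding),
        stdout=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode(encoding, errors="replace")


if __name__ == "__main__":
    print(build_lookup_cmd())
```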

-----------------------------------------------------------------------------
TOKENIZATION OF INPUT

Klex expects tokenized words as input, one word per line. Prior to
morphological analysis, all symbols such as punctuation marks and parentheses
must be split from the words they are attached to. There are some exceptions:

  multi-symbols
  ...             can be considered one symbol, not 3 periods
  ---             can be considered one symbol, not three dashes
  !!, ?!, etc.    can be considered one symbol

  arithmetic expressions
  . ,  - : /      between digits are considered part of number, i.e.:
                  12.00/NNU
                  12-13/NNU
                  12,000/NNU
                  12:00/NNU
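
The splitting rules above can be approximated with one regular expression.
The following Python sketch is not the tokenizer used to produce
examples/prince.tok; TOKEN_RE and tokenize are names invented here, and
the treatment of cases the rules do not mention (for example a hyphen
between letters) is a guess.

```python
import re

TOKEN_RE = re.compile(
    r"""
      (?:\w|(?<=\d)[.,\-:/](?=\d))+   # word characters; . , - : / only between digits
    | \.{2,}                          # "..." and longer runs of periods
    | -{2,}                           # "---" and other runs of dashes
    | [!?]{2,}                        # "!!", "?!", etc.
    | [^\w\s]                         # any other symbol, one at a time
    """,
    re.VERBOSE,
)


def tokenize(text):
    """Split text into tokens, detaching symbols from the words they touch."""
    return TOKEN_RE.findall(text)


if __name__ == "__main__":
    # One token per line, as Klex expects.
    print("\n".join(tokenize("유엔(UN)이 12,000원을 냈다...")))
```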

-----------------------------------------------------------------------------
SPECIAL SYMBOLS IN klex.fst

Klex recognizes a few special symbols, all beginning with "^".  They
are used to preserve certain information on a word or a morpheme, which
can be lost in the course of tokenization.

^CNT   continued on left edge

Sometimes, suffixes and prefixes need to be separated from the
root in the course of tokenization, mostly due to the presence of
intervening symbols or parenthetical phrases. Klex does not
recognize such a detached suffix as a full word, so the desired
analysis is not obtained. To avoid this, the tokenization script
should preserve pre-tokenization spacing information with ^CNT,
which marks the left edge of a newly separated morpheme as originally
attached to the previous token. When a suffix is marked with ^CNT, Klex
recognizes it as originally being attached to a root, and provides
an analysis accordingly.

	유엔(UN)이
	       --> tokenized --> 
	유엔
	^CNT+(
	^CNT+UN
	^CNT+)
	^CNT+이       	

^EOS   end of sentence

A mock-morpheme signifying end of sentence, for sentence-tokenized
text inputs.
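
A tokenization script might add the ^CNT marks as in the Python sketch
below, which reproduces the 유엔(UN)이 example above. The splitting rule
here is deliberately simplified (it ignores the multi-symbol and number
exceptions), the function names are made up, and whether every detached
symbol should carry ^CNT in your data is a decision this sketch does not
settle.

```python
import re


def split_word(word):
    """Simplified split: runs of word characters, or single symbols."""
    return re.findall(r"\w+|[^\w\s]", word)


def tokenize_with_cnt(text):
    """Tokenize text; every piece that was attached to the piece before it
    in the original spacing is prefixed with ^CNT+."""
    tokens = []
    for word in text.split():
        pieces = split_word(word)
        tokens.append(pieces[0])
        tokens.extend("^CNT+" + piece for piece in pieces[1:])
    return tokens


if __name__ == "__main__":
    print("\n".join(tokenize_with_cnt("유엔(UN)이")))
```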

-----------------------------------------------------------------------------
ABSTRACT ALPHABETS IN UPPER STRINGS

Up until klex-rab-kor.fst, abstract alphabets are present on the upper side
which mark irregular verbal and adjectival stems. They are:

     P  'ㅂ' irregular as in '돕다' 'toP.ta.'
     T  'ㄷ' irregular as in '걷다' 'keT.ta.'
     S  'ㅅ' irregular as in '짓다' 'ciS.ta.'
     H  'ㅎ' irregular as in '하얗다' 'ha.ngyaH.ta.'
     L  'ㄹ' irregular as in '멀다' 'meL.ta.'
     R  '르' irregular as in '모르다' 'mo.Ru.ta.'

These abstract alphabets are subject to various transformations when
suffixed with particular morphemes that provide the environment. For
example, compare

     '돕+어 -> 도와' 'to.P+E. -> to.ngwa.' 
and
     '좁+어 -> 좁아' 'cop.+E -> cop.nga.' 

Therefore, they are used as an indicator in the rules.fst component in
deducing the correctly inflected surface forms for those stems with
irregularity.

When klex-rab-kor.fst is converted to klex-rom-kor.fst, such abstract/non-abstract
distinctions are lost, rendering the two homonyms 'keT.ta.' '걷다; 걸어 to walk'
and 'ket.ta.' '걷다; 걷어 to collect' indistinguishable. Alternatively, one can
opt to preserve such distinctions by leaving the abstract alphabets marked in
the course of compiling, by introducing different routines in the makefile.
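
The loss of information can be shown with a small Python sketch that
replaces each abstract letter in a romanized stem with a plain one. Only
the T -> t case is stated in this document; the other entries follow the
usual Yale values of the underlying consonants and are assumptions, as is
the name strip_abstract. The real mapping is done by encoding/realroman.scr.

```python
# Abstract letter -> plain Roman letter. Only "T" -> "t" is taken from this
# document; the remaining entries are assumed from Yale romanization.
ABSTRACT_TO_PLAIN = {
    "P": "p",
    "T": "t",
    "S": "s",
    "H": "h",
    "L": "l",
    "R": "l",
}


def strip_abstract(stem):
    """Map abstract letters in a romanized stem to plain letters.
    Apply this to the romanized morpheme only, never to POS tags
    such as EPF or ECS, which also contain capital letters."""
    return "".join(ABSTRACT_TO_PLAIN.get(ch, ch) for ch in stem)


if __name__ == "__main__":
    walk, collect = "keT.ta.", "ket.ta."
    # After the mapping, 'to walk' and 'to collect' are the same string.
    print(strip_abstract(walk), strip_abstract(collect))
```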

-----------------------------------------------------------------------------
ALLOMORPHY IN Klex

Allomorphs are morphemes that have exactly the same meaning and
function but are realized in different forms, usually conditioned by
the phonological environment. A large number of inflectional suffixes in
Korean display this property. For example, "은" and "는", "로" and
"으로" are allomorphs of each other.

Klex treats such allomorphs as having a single underlying form. All
allomorphs therefore take a single form in the upper (analyzed)
string. Through application of a sequence of phonological rules, the
correct inflected forms (lower strings) are derived. This way, the
topic markers in "학교-는", "학생-은" and "너-ㄴ" are equally assigned
"은/PAU" in "학교/NNC+은/PAU", "학생/NNC+은/PAU" and "너/NPN+은/PAU".

      학교/NNC+은/PAU     학생/NNC+은/PAU      너/NPN+은/PAU
             |                    |                  |
          학교는               학생은                넌     

The criteria used in determining the representative form among
allomorphs are as follows:

- The form should be fully syllabic, i.e. "음" is chosen and not "ㅁ" 
  (as in "예쁨"). 
- The form for the post-consonantal environment is chosen, i.e. "이" 
  instead of "가".  
- Epenthetic vowels are included, i.e. "으로" and not "로". 
  (this criterion mostly overlaps with the first two above, as epenthetic
  vowels are used in post-consonantal environments) 
- For vowel harmony, "어" is chosen and not "아", i.e. "어서" and not "아서". 

Here are some examples of common allomorphy:

	allomorphs	usage 			representative form
	었/았/ㅆ	먹었고/잡았고/샀고	었
	어/아/null	먹어/잡아/사		어
	은/ㄴ		먹은/산			은
	음/ㅁ		먹음/삼			음
	으시/시		잡으시고/오시고		으시
	으니/니		잡으니/오니		으니
	을까/ㄹ까	먹을까/살까		을까
	습니다/ㅂ니다	먹습니다/삽니다		습니다
	은/는/ㄴ	학생은/교수는/넌	은
	이/가		학생이/교수가		이
	을/를		학생을/교수를		을
	과/와		학생과/교수와		과
	으로/로		학생으로/교수로		으로
	이라도/라도	학생이라도/교수라도	이라도
	이야/야		학생이야/교수야		이야
	아/야		복순아/영희야		아
	이/null		학생이다/교수다		이
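
For a few of the nominal particles in the table, the choice of surface
allomorph can be sketched in Python by checking whether the preceding
syllable ends in a consonant. This is not how Klex derives the forms: Klex
applies the rewrite rules in rules.scr to romanized strings, whereas this
sketch uses Unicode Hangul syllable arithmetic. The names below are
invented, and contractions such as 너 + 은 -> 넌 are not handled.

```python
HANGUL_BASE, HANGUL_LAST = 0xAC00, 0xD7A3  # precomposed Hangul syllables
RIEUL = 8                                   # final-consonant index of ㄹ

# representative form -> (form after a consonant, form after a vowel)
PARTICLES = {
    "은": ("은", "는"),
    "이": ("이", "가"),
    "을": ("을", "를"),
    "과": ("과", "와"),
    "으로": ("으로", "로"),
}


def final_consonant(syllable):
    """Index of the final consonant of a Hangul syllable (0 means none)."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    return (code - HANGUL_BASE) % 28


def surface_particle(noun, representative):
    """Pick the surface allomorph of a representative particle after noun."""
    after_consonant, after_vowel = PARTICLES[representative]
    final = final_consonant(noun[-1])
    if final == 0:
        return after_vowel
    if representative == "으로" and final == RIEUL:
        return after_vowel  # 으로 also surfaces as 로 after a final ㄹ
    return after_consonant


if __name__ == "__main__":
    for noun in ("학생", "교수"):
        print(noun + surface_particle(noun, "은"))
```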

-----------------------------------------------------------------------------
PART-OF-SPEECH TAGS

Klex uses a Part-of-Speech tag set which is based on the one employed by 
the Korean Treebank Project with slight modification. The POS tagging 
guideline for the Korean Treebank can be found at: 
ftp://ftp.cis.upenn.edu/pub/ircs/tr/01-09/. 

  Noun      NNC      common noun	학교/NNC 학생/NNC
            NNU      numeric noun	1/NNU 하나/NNU 한/NNU
            NNX      dependent noun	것/NNX 마리/NNX
            NPN      pronoun		나/NPN 너/NPN 우리/NPN 그녀/NPN
            NPR      proper noun	김대중/NPR 한국/NPR 클린턴/NPR
            NFW      foreign word       Clinton/NFW UN/NFW

  Post-     PCA      case postposition  이/PCA 을/PCA 
 position   PAD      adverbial		으로/PAD 에서/PAD
            PAN*     adnominal 		의/PAN 이라는/PAN
            PAU      auxiliary		은/PAU 도/PAU 마저/PAU
            PCJ      conjunctive	과/PCJ

 Predicate  VV       verb		먹/VV+다/EFN 가/VV+다/EFN
            VJ       adjective		예쁘/VJ+다/EFN 작/VJ+다/EFN
            VX       auxiliary predicate 않/VX+다/EFN 말/VX+다/EFN

  Verbal    EPF      pre-final ending	먹/VV+으시/EPF+었/EPF+다/EFN
  ending    EFN      final ending	먹/VV+었/EPF+다/EFN 먹/VV+냐/EFN
            ECS**    non-final ending   먹/VV+고/ECS 먹/VV+어서/ECS
            EAN      adnominal ending   먹/VV+는/EAN 먹/VV+던/EAN
            ENM      nominalization ending 먹/VV+음/ENM 먹/VV+기/ENM

  Etc       CO       copula		학생/NNC+이/CO+다/EFN
            ADV      adverb		빨리/ADV 또/ADV
            ADC      conjunctive adverb	그러나/ADC 그리고/ADC
            DAN      adnominal modifier	이런/DAN 그/DAN 모든/DAN
            XSF      suffix		영희/NPR+씨/XSF 우리/NPN+들/XSF
            XPF      prefix		제/XPF+5/NNU+분대/NNC
            XSV      verbalization suffix 	공부/NNC+하/XSV+다/EFN
            XSJ      adjectivization suffix	침착/NNC+하/XSJ+다/EFN
            IJ       interjection	아/IJ 예/IJ 그래/IJ

  Symbol    SFN      sentence-final symbols . ? !! ......
            SCM      comma ,
            SLQ      left delimiters: " ' ( < [ {
            SRQ      right delimiters: " ' ) > ] }
            SSY      symbol

* not included in Korean Treebank phase 1
** corresponds to ECS and EAU in Korean Treebank phase 1
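
Analyses in the morpheme/TAG+morpheme/TAG notation used throughout this
document can be split into (morpheme, tag) pairs with a few lines of
Python. The function name parse_analysis is made up for this sketch, and
it assumes that "+" occurs only as a morpheme separator; an analysis of a
literal "+" symbol would need special handling.

```python
def parse_analysis(analysis):
    """Split an upper string into (morpheme, tag) pairs.
    The tag is taken after the LAST "/" so that tokens such as 12/13/NNU
    keep their internal slash; pieces without a tag (e.g. ^CNT) get None."""
    pairs = []
    for part in analysis.split("+"):
        if "/" in part:
            morpheme, tag = part.rsplit("/", 1)
        else:
            morpheme, tag = part, None
        pairs.append((morpheme, tag))
    return pairs


if __name__ == "__main__":
    for morpheme, tag in parse_analysis("돕/VV+었/EPF+다/EFN"):
        print(morpheme, tag)
```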