Klex: Finite-State Lexical Transducer for Korean

April 12, 2004

Na-Rae Han
University of Pennsylvania
nrh@babel.ling.upenn.edu

special thanks to:
  Ken Beesley (beesley@xrce.xerox.com, XRCE)
  Lauri Karttunen (karttunen@parc.xerox.com, XRCE)
  Martha Palmer (mpalmer@central.cis.upenn.edu, UPenn)
  Mike Maxwell (maxwell@ldc.upenn.edu, LDC)

* Note: this document was originally written in ksc-5601 encoding. If
  characters in Hangul (Korean alphabet) do not display correctly, select
  Korean encoding in common browsers such as Netscape or Internet Explorer,
  or on xeterminals.

-----------------------------------------------------------------------------
TABLE OF CONTENTS

  INTRODUCTION
  RELEASE CONTENTS
  COMPILING klex.fst
  MORPHOLOGICAL ANALYSIS/GENERATION USING lookup
  TOKENIZATION OF INPUT
  SPECIAL SYMBOLS IN klex.fst
  ABSTRACT ALPHABETS IN UPPER STRINGS
  ALLOMORPHY IN Klex
  PART-OF-SPEECH TAGS

-----------------------------------------------------------------------------
INTRODUCTION

Klex is a finite-state lexical transducer for the Korean language, with the
lexical string on the upper side and the inflected surface string on the
lower side. Klex was developed on the XFST (Xerox Finite State Tool) software
platform, developed and distributed by the Xerox Corporation.

A lexicon in the form of a transducer has the following basic structure:

    fly/VV+s/ECS        돕/VV+었/EPF+다/EFN
         |                      |
       flies                  도왔다

A sequence of morphemes, each with its part-of-speech tag, constitutes the
upper string; the fully lexicalized form constitutes the lower string. The
transducer network as a whole consists of all such possible morpheme
sequence / word pairs in the language. Given the lower lexicalized form, the
transducer can produce the analyzed morpheme sequence (the process of
"looking-up"); conversely, the transducer can produce the fully inflected
surface form of a grammatical sequence of morphemes (the opposite of
"looking-up", hence Xerox's terminology of "looking-down").
These two operations, morphological analysis and generation, are the most
typical applications of such lexical transducers.

-----------------------------------------------------------------------------
RELEASE CONTENTS

This release of Klex contains a set of source files which, when compiled with
Xerox's XFST software, produce a set of binary transducer files that can be
run with the same software. Compiling the source code, as well as reading in
and using the resulting transducer network, requires Xerox's finite-state
software tools, which include XFST. The Xerox tools run under the Solaris,
Linux, Windows and Macintosh OS X operating systems, and are distributed with
the following book:

    Beesley, Kenneth R., and Lauri Karttunen. 2003. "Finite State
    Morphology." CSLI Studies in Computational Linguistics. Chicago:
    University of Chicago Press.

At present, the book is available for $40 (paperback). The license for the
tools included with the book covers non-commercial use; they can also be
licensed for commercial use by contacting Xerox.
For further information, see the following link:

    http://www.stanford.edu/~laurik/fsmbook/home.html

Files and directories included in the release (the suffix ".scr" is used for
xfst script files):

  makefile                       #sample makefile script
  readme.txt                     #this document
  encoding/                      #contains auxiliary .scr files for encoding conversion
    realroman.scr                #maps abstract Roman character '^r' to real Roman 'r'
    rom2kor.scr                  #Romanization <--> Korean conversion transducer net
    rom2kor.8parts.scr           #same as rom2kor.scr, broken into 8 pieces
    rom2kor.unicode.scr          #uses Unicode1.0 instead of ksc-5601
    rom2kor.unicode.8parts.scr   #same as above, broken into 8 pieces
  hangul_docs/                   #contains documents on Korean encoding and Romanization
    hangul-codes.txt             #table of various Hangul (Korean alphabet) encodings
    hangul-def.txt               #Hangul alphabets and Roman transliteration defined
    hangul-camo.txt              #supplement to hangul-def.txt
    yale-mapping.txt             #description of Yale Romanization system
  lexicon/                       #contains script files for compiling initial lexicon
    all-lexicon.scr              #the big lexicon file, concatenation of all others
    lex_header.scr               #header portion of the lexicon script
    lex_vend.scr                 #part containing (mostly) verbal inflections
    lex_nend.scr                 #part containing (mostly) noun inflections
    lex_ADC.scr, lex_ADV.scr, ... lex_VX.scr
                                 #parts containing individual groups of POS roots
    convert_flag.sh              #"@D.V.Y@" flag notation is incompatible with older
                                 # versions of XFST: converts it to %@D%.V%.Y%@
  rewrite_rules/                 #rewrite rules directory
    rules.scr                    #morpho-phonological/orthographical alternation rules
  examples/                      #contains input files for test purposes
    prince.txt                   #small text from "Little Prince"
    prince.tok                   #tokenized text
    prince.rom                   #tokenized, and romanized
    prince.anl                   #sample output of morphological analyzer

-----------------------------------------------------------------------------
COMPILING klex.fst

We suggest using the included makefile to compile the system.
You must edit the two paths set near the top of the supplied makefile: the
location of the xfst program, and the location of the directory where your
klex files reside. If you are running Microsoft Windows, path names which
contain a space character will probably need to be quoted.

We have tested compilation using the modified makefile under Solaris, Linux,
and Cygwin running under MS Windows. If you experience difficulties running
the make process under these or other operating systems, please let us know.

Be warned that compilation requires large amounts of memory (you should not
attempt compilation on a system with less than 512 megabytes of memory) and
time (allow several hours).

Compilation produces the following files:

  klex.fst             #Korean finite-state lexicon
  klex-rom-kor.fst     #lower side the same as klex.fst, but upper side is in
                       # Roman transliteration
  klex-rab-kor.fst     #like -rom-kor, but includes abstract alphabets on
                       # upper side
  klex-rab-rom.fst     #upper side the same as -rab-kor, but with Roman
                       # transliteration on lower side
  encoding/
    realroman.fst      #maps abstract Roman character '^r' to real Roman 'r'
    rom2kor.fst        #Romanization <--> Korean conversion transducer net
    rom2kor.k-kh-kk.fst, rom2kor.t-th-tt.fst, ... rom2kor.ng.fst
                       #broken-down chunks of Romanization <--> Korean net
  rewrite_rules/
    rules.fst          #morpho-phonological/orthographical alternation rules

There are several key steps in building klex.fst, each of which produces an
intermediate version of the transducer network.

(1) all-lexicon.fst
    compiled from all-lexicon.scr, it has romanized characters on both upper
    and lower sides. The upper side contains part-of-speech tags; morpheme
    representations on the lower side are still in abstract forms.

    -> rules.fst composed at bottom ->

(2) klex-rab-rom.fst
    now the morpheme representations on the lower side are fully rewritten
    following morpho-phonological/orthographical rules to reflect surface
    forms.
    -> Romanization to Hangul encoding mapping on the bottom ->

(3) klex-rab-kor.fst
    lower side is now in Hangul encoding; upper side is still in romanized
    form, with abstract alphabets present.

    -> abstract alphabets are mapped onto non-abstract alphabets ->

(4) klex-rom-kor.fst
    upper side is cleared of abstract alphabets; keT.+ta. '걷다;걸어서 to
    walk' is converted to ket.+ta. so it cannot be distinguished from the
    other ket.+ta. '걷다;걷어서 to collect'.

    -> Romanization to Hangul encoding mapping on the top
       note that it involves 8 smaller parts of the converter net: this is
       to circumvent a compilation error ("maximum number of symbols has
       been exceeded") within XFST, which results when composition of the
       larger code conversion network as a whole is attempted. ->

(5) klex.fst
    Hangul (ksc-5601) encoding on both upper and lower sides.

The 'makefile' file contains a sample script used in compiling these various
transducer networks, using xfst's "-e" flag, which lets the user script the
entire compiling process. Also, in case one wants to use Unicode encoding
instead of ksc-5601 Korean encoding, rom2kor.unicode.scr and
rom2kor.unicode.8parts.scr in the encoding/ directory can be substituted.

-----------------------------------------------------------------------------
MORPHOLOGICAL ANALYSIS/GENERATION USING lookup

There are two ways of using klex.fst as a morphological analyzer/generator.
First, one can load the transducer network inside the XFST interface and use
the apply up/apply down commands to find the upper/lower counterpart of a
given string:

========================================
-output from a Unix prompt:

% xfst
Copyright Xerox Corporation 1997-2003
Xerox Finite-State Tool, version 8.1.4

Type "help" to list all commands available or "help help" for further help.
xfst[0]: load klex.fst
0%>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>100%
xfst[1]: apply up 들었던
듣/VV+었/EPF+던/EAN
들/VV+었/EPF+던/EAN
들/VX+었/EPF+던/EAN
xfst[1]: apply down 듣/VV+었/EPF+던/EAN
들었던
========================================

For more information on working in XFST, refer to the following XRCE web
page:

    http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/fst-97/xfst97.html

Second, for automated operations on large input data, "lookup", a utility for
morphological analysis using XFST, should be used. It is included in the
Xerox release of the software suite. Detailed documentation can be found
here:

    http://www.xrce.xerox.com/competencies/content-analysis/fssoft/docs/lookup-97/lookup97.html

For Korean input/output, a flag option of "-flags mb" should be used to allow
for multi-byte characters on both sides. Some sample invocations:

    /pkg/ldc/Xerox/Sun/lookup klex.fst -flags mbTT
    /pkg/ldc/Xerox/Sun/lookup -f lookupscript -flags mbTT

-----------------------------------------------------------------------------
TOKENIZATION OF INPUT

Klex expects tokenized words as input, one word per line. Prior to
morphological analysis, all symbols such as punctuation marks and parentheses
must be split from the words they are attached to. There are some exceptions:

  multi-symbols
      ...             can be considered one symbol, not 3 periods
      ---             can be considered one symbol, not three dashes
      !!, ?!, etc.    can be considered one symbol

  arithmetic expressions
      . , - : /       between digits are considered part of the number, i.e.:
                      12.00/NNU  12-13/NNU  12,000/NNU  12:00/NNU

-----------------------------------------------------------------------------
SPECIAL SYMBOLS IN klex.fst

Klex recognizes a few special symbols, all beginning with "^". They are used
to preserve certain information on a word or a morpheme which can be lost in
the course of tokenization.
^CNT    continued on left edge

    Sometimes, suffixes and prefixes need to be separated from the root in
    the course of tokenization, mostly due to the presence of intervening
    symbols or parenthetical phrases. Klex does not recognize such a
    separated suffix as a full word, so the desired analysis is not achieved.
    To avoid this, the tokenization script should preserve pre-tokenization
    spacing information with ^CNT, which marks the left edge of a newly
    separated morpheme as originally attached to the previous token. When a
    suffix is marked with ^CNT, Klex recognizes it as originally being
    attached to a root, and provides an analysis accordingly.

        유엔(UN)이  --> tokenized -->  유엔 ^CNT+( ^CNT+UN ^CNT+) ^CNT+이

^EOS    end of sentence

    A mock-morpheme signifying end of sentence, for sentence-tokenized text
    inputs.

-----------------------------------------------------------------------------
ABSTRACT ALPHABETS IN UPPER STRINGS

Up until klex-rab-kor.fst, abstract alphabets are present on the upper side
which mark irregular verbal and adjectival stems. They are:

    P    'ㅂ' irregular as in '돕다'    'toP.ta.'
    T    'ㄷ' irregular as in '걷다'    'keT.ta.'
    S    'ㅅ' irregular as in '짓다'    'ciS.ta.'
    H    'ㅎ' irregular as in '하얗다'  'ha.ngyaH.ta.'
    L    'ㄹ' irregular as in '멀다'    'meL.ta.'
    R    '르' irregular as in '모르다'  'mo.Ru.ta.'

These abstract alphabets are subject to various transformations when suffixed
with particular morphemes that provide the environment. For example, compare
'돕+어 -> 도와' 'to.P+E. -> to.ngwa.' and '좁+어 -> 좁아' 'cop.+E ->
cop.nga.' Therefore, they are used as an indicator in the rules.fst component
in deducing the correctly inflected surface forms for those stems with
irregularity.

When klex-rab-kor.fst is converted to klex-rom-kor.fst, such
abstract/non-abstract distinctions are lost, rendering the two homonyms
'keT.ta.' '걷다; 걸어 to walk' and 'ket.ta.' '걷다; 걷어 to collect'
indistinguishable.
Alternatively, one can opt to preserve such distinctions by leaving the
abstract alphabets marked in the course of compiling, by introducing
different routines in the makefile.

-----------------------------------------------------------------------------
ALLOMORPHY IN Klex

Allomorphs are morphemes that have exactly the same meaning and function but
are realized in different forms, usually conditioned by phonological
environment. A large number of inflectional suffixes in Korean display this
property. For example, "은" and "는", "로" and "으로" are allomorphs of each
other.

Klex treats such allomorphs as having a single underlying form. All
allomorphs therefore take a single form in the upper (analyzed) string.
Through application of a sequence of phonological rules, the correct
inflected forms (lower strings) are derived. This way, the topic markers in
"학교-는", "학생-은" and "너-ㄴ" are equally assigned "은/PAU" in
"학교/NNC+은/PAU", "학생/NNC+은/PAU" and "너/NPN+은/PAU".

    학교/NNC+은/PAU     학생/NNC+은/PAU     너/NPN+은/PAU
           |                   |                  |
        학교는              학생은               넌

The criteria used in determining the representative form among allomorphs
are as follows:

  - The form should be fully syllabic, i.e. "음" is chosen and not "ㅁ"
    (as in "예쁨").
  - The form for the post-consonantal environment is chosen, i.e. "이"
    instead of "가".
  - Epenthetic vowels are included, i.e. "으로" and not "로". (this clause
    mostly overlaps with the first two criteria above, as epenthetic vowels
    are used in post-consonantal environments)
  - For vowel harmony, "어" is chosen and not "아", i.e. "어서" and not
    "아서".
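
The post-consonantal criterion can be illustrated with a minimal Python
sketch. It is not how Klex derives surface forms (Klex uses the rewrite rules
compiled into rules.fst); it only shows the idea of starting from the
representative form and choosing the surface allomorph from the final sound
of the stem. The names attach, final_index and POST_VOCALIC are hypothetical,
the table covers only five postpositions, and the sketch assumes Unicode
(UTF-8) Hangul strings rather than ksc-5601.

```python
# Minimal sketch (an assumption, not Klex's rules.fst): pick the surface
# allomorph of a postposition, starting from its representative form.

HANGUL_BASE = 0xAC00  # first precomposed Hangul syllable in Unicode
HANGUL_LAST = 0xD7A3  # last precomposed Hangul syllable in Unicode
FINALS = 28           # final-consonant slots per syllable (0 = no final)
RIEUL = 8             # index of the final consonant "ㄹ"

# representative (post-consonantal) form -> post-vocalic form
POST_VOCALIC = {"은": "는", "이": "가", "을": "를", "과": "와", "으로": "로"}


def final_index(syllable):
    """Return the final-consonant index of one Hangul syllable (0 = none)."""
    code = ord(syllable)
    if not HANGUL_BASE <= code <= HANGUL_LAST:
        raise ValueError("not a precomposed Hangul syllable: %r" % syllable)
    return (code - HANGUL_BASE) % FINALS


def attach(stem, representative):
    """Attach a postposition, given in representative form, to a stem."""
    final = final_index(stem[-1])
    vowel_like = final == 0
    # "으로" also loses its epenthetic vowel after a stem ending in "ㄹ".
    if representative == "으로" and final == RIEUL:
        vowel_like = True
    suffix = POST_VOCALIC[representative] if vowel_like else representative
    return stem + suffix


if __name__ == "__main__":
    pairs = [("학교", "은"), ("학생", "은"), ("교수", "이"), ("교수", "으로")]
    for stem, rep in pairs:
        print(stem + "+" + rep, "->", attach(stem, rep))
```

Contracted forms such as "너+은 -> 넌", and the verbal endings that depend on
vowel harmony or on the abstract alphabets, are outside the scope of this
sketch; they need context-sensitive rewrite rules of the kind held in
rules.scr.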

Here are some examples of common allomorphy:

    allomorphs        usage                       representative form
    었/았/ㅆ          먹었고/잡았고/샀고          었
    어/아/null        먹어/잡아/사                어
    은/ㄴ             먹은/산                     은
    음/ㅁ             먹음/삼                     음
    으시/시           잡으시고/오시고             으시
    으니/니           오니/잡으니                 으니
    을까/ㄹ까         먹을까/살까                 을까
    습니다/ㅂ니다     먹습니다/삽니다             습니다
    은/는/ㄴ          학생은/교수는/넌            은
    이/가             학생이/교수가               이
    을/를             학생을/교수를               을
    과/와             학생과/교수와               과
    으로/로           학생으로/교수로             으로
    이라도/라도       학생이라도/교수라도         이라도
    이야/야           학생이야/교수야             이야
    아/야             복순아/영희야               아
    이/null           학생이다/교수다             이

-----------------------------------------------------------------------------
PART-OF-SPEECH TAGS

Klex uses a Part-of-Speech tag set which is based on the one employed by the
Korean Treebank Project, with slight modification. The POS tagging guideline
for the Korean Treebank can be found at:

    ftp://ftp.cis.upenn.edu/pub/ircs/tr/01-09/

  Noun           NNC    common noun               학교/NNC 학생/NNC
                 NNU    numeric noun              1/NNU 하나/NNU 한/NNU
                 NNX    dependent noun            것/NNX 마리/NNX
                 NPN    pronoun                   나/NPN 너/NPN 우리/NPN 그녀/NPN
                 NPR    proper noun               김대중/NPR 한국/NPR 클린턴/NPR
                 NFW    foreign word              Clinton/NFW UN/NFW

  Postposition   PCA    case postposition         이/PCA 을/PCA
                 PAD    adverbial                 으로/PAD 에서/PAD
                 PAN*   adnominal                 의/PAN 이라는/PAN
                 PAU    auxiliary                 은/PAU 도/PAU 마저/PAU
                 PCJ    conjunctive               과/PCJ

  Predicate      VV     verb                      먹/VV+다/EFN 가/VV+다/EFN
                 VJ     adjective                 예쁘/VJ+다/EFN 작/VJ+다/EFN
                 VX     auxiliary predicate       않/VX+다/EFN 말/VX+다/EFN

  Verbal ending  EPF    pre-final ending          먹/VV+으시/EPF+었/EPF+다/EFN
                 EFN    final ending              먹/VV+었/EPF+다/EFN 먹/VV+냐/EFN
                 ECS**  non-final ending          먹/VV+고/ECS 먹/VV+어서/ECS
                 EAN    adnominal ending          먹/VV+는/EAN 먹/VV+던/EAN
                 ENM    nominalization ending     먹/VV+음/ENM 먹/VV+기/ENM

  Etc            CO     copula                    학생/NNC+이/CO+다/EFN
                 ADV    adverb                    빨리/ADV 또/ADV
                 ADC    conjunctive adverb        그러나/ADC 그리고/ADC
                 DAN    adnominal modifier        이런/DAN 그/DAN 모든/DAN
                 XSF    suffix                    영희/NPR+씨/XSF 우리/NPN+들/XSF
                 XPF    prefix                    제/XPF+5/NNU+분대/NNC
                 XSV    verbalization suffix      공부/NNC+하/XSV+다/EFN
                 XSJ    adjectivization suffix    침착/NNC+하/XSJ+다/EFN
                 IJ     interjection              아/IJ 예/IJ 그래/IJ

  Symbol         SFN    sentence-final symbols    . ? !! ......
                 SCM    comma                     ,
                 SLQ    left delimiters           " ' ( < [ {
                 SRQ    right delimiters          " ' ) > ] }
                 SSY    symbol

   * not included in Korean Treebank phase 1
  ** ECS and EAU in Korean Treebank phase 1
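
The analyses shown throughout this document pair each morpheme with one of
the tags above, in the form morpheme/TAG+morpheme/TAG. As a minimal sketch of
post-processing such output in Python (this is not part of the Klex release;
the function name parse_analysis is hypothetical), an upper string can be
split into (morpheme, tag) pairs. The sketch assumes Unicode strings and
assumes that no morpheme itself contains "+", so it would need extra care for
symbol tokens and for the ^CNT marker.

```python
from typing import List, Tuple


def parse_analysis(analysis: str) -> List[Tuple[str, str]]:
    """Split 'morph/TAG+morph/TAG...' into a list of (morph, TAG) pairs.

    Assumption: '+' only ever separates morphemes. The tag is taken to be
    everything after the LAST '/' of each chunk, so a morpheme that itself
    contains '/' (for example a slash symbol) is still handled.
    """
    pairs = []
    for chunk in analysis.split("+"):
        morph, sep, tag = chunk.rpartition("/")
        if not sep or not morph or not tag:
            raise ValueError("not in morph/TAG form: %r" % chunk)
        pairs.append((morph, tag))
    return pairs


if __name__ == "__main__":
    for morph, tag in parse_analysis("돕/VV+었/EPF+다/EFN"):
        print(tag, morph)
```

Keeping the result as a list of pairs preserves morpheme order, which matters
when the same tag occurs more than once in a word, as with two pre-final
endings in a row.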