Han, Na-Rae. 2003 Korean Telephone Conversations Lexicon.
Philadelphia: Linguistic Data Consortium, University of Pennsylvania.

-----------------------------------------------------------
Description of the Korean Telephone Conversations Lexicon
-----------------------------------------------------------

CONTENTS

1. Summary abstract
2. Token coverage
3. Lexicon information fields
4. Orthographic convention and romanization
5. Pronunciation
6. Frequency counts and corpus
7. Morphological Analysis

-----------------------------------------------------------------------
1. Summary abstract

The Korean Telephone Conversations Lexicon consists of 25,251 words, and
contains separate fields with phonological, morphological, and frequency
information for each word.

-----------------------------------------------------------------------
2. Token coverage

The token coverage of the words occurring in the 100 Korean Telephone
Conversations Transcripts is 100%.

-----------------------------------------------------------------------
3. Lexicon information fields

The LDC lexicon contains five tab-separated information fields:

Field 1: orthographic form in Hangul (headword)
Field 2: orthographic form in Yale romanization
Field 3: pronunciation
Field 4: frequency of the word in Korean Telephone Conversations
         Transcripts
Field 5: morphological analysis of the word

Each of these fields is described in the sections below.

-----------------------------------------------------------------------
4. Orthographic convention and romanization

Orthographic Korean characters are in Hangul, encoded in the KSC5601
(Wansung) system.

The romanization given in field 2 follows the Yale Romanization system,
which allows Hangul-romanization conversion in both directions. The
romanized words were obtained using the "yale-roman.pl" script included
in the release. The Yale system is the preferred romanization scheme for
Korean in the linguistics community.
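The five-field layout and the KSC5601 encoding described above can be
read with a few lines of Python. This is only a sketch under stated
assumptions: it assumes one entry per line, uses Python's "euc_kr" codec
for the KSC5601 (Wansung) bytes, and the names LexiconEntry, parse_entry
and read_lexicon are hypothetical. In the sample line, fields 1-3 are
taken from the example in section 5.4, while the frequency and the
analysis are made-up placeholders, not values from the lexicon.

```python
from typing import Iterator, NamedTuple


class LexiconEntry(NamedTuple):
    hangul: str          # field 1: headword in Hangul
    yale: str            # field 2: Yale romanization
    pronunciation: str   # field 3: phonetic string(s)
    frequency: int       # field 4: count in the transcripts
    analysis: str        # field 5: morphological analysis, kept raw


def parse_entry(line: str) -> LexiconEntry:
    """Split one tab-separated lexicon line into its five fields."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) != 5:
        raise ValueError(f"expected 5 fields, got {len(fields)}")
    hangul, yale, pronunciation, frequency, analysis = fields
    return LexiconEntry(hangul, yale, pronunciation, int(frequency), analysis)


def read_lexicon(path: str) -> Iterator[LexiconEntry]:
    """Yield entries from a lexicon file (path is supplied by the caller)."""
    with open(path, encoding="euc_kr") as handle:
        for line in handle:
            if line.strip():
                yield parse_entry(line)


# Placeholder line: frequency 12 and the analysis are illustrative only.
SAMPLE_LINE = "한번\than.pen.\t/hanpen/hampen/\t12\t한번/NNC\n"
raw_bytes = SAMPLE_LINE.encode("euc_kr")      # bytes as stored on disk
entry = parse_entry(raw_bytes.decode("euc_kr"))
print(entry.hangul, entry.yale, entry.frequency)
```

Keeping field 5 as a raw string leaves the choice of how to split
multiple analyses (section 7.1) to the caller.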
Each Korean letter is mapped to a single roman letter or a sequence of
roman letters as follows:

single consonants (19: onset and coda)
(exceptions: ㅃ, ㅉ, ㄸ do not appear in coda)

KIYEOK          ㄱ      k
SSANGKIYEOK     ㄲ      kk
NIEUN           ㄴ      n
TIKEUT          ㄷ      t
SSANGTIKEUT     ㄸ      tt
RIEUL           ㄹ      l
MIEUM           ㅁ      m
PIEUP           ㅂ      p
SSANGPIEUP      ㅃ      pp
SIOS            ㅅ      s
SSANGSIOS       ㅆ      ss
IEUNG           ㅇ      ng
CIEUC           ㅈ      c
SSANGCIEUC      ㅉ      cc
CHIEUCH         ㅊ      ch
KHIEUKH         ㅋ      kh
THIEUTH         ㅌ      th
PHIEUPH         ㅍ      ph
HIEUH           ㅎ      h

multiple consonants (11: only in coda)

KIYEOKSIOS      ㄱㅅ    ks
NIEUNCIEUC      ㄴㅈ    nc
NIEUNHIEUH      ㄴㅎ    nh
RIEULKIYEOK     ㄹㄱ    lk
RIEULMIEUM      ㄹㅁ    lm
RIEULPIEUP      ㄹㅂ    lp
RIEULSIOS       ㄹㅅ    ls
RIEULTHIEUTH    ㄹㅌ    lth
RIEULPHIEUPH    ㄹㅍ    lph
RIEULHIEUH      ㄹㅎ    lh
PIEUPSIOS       ㅂㅅ    ps

vowels (21)

A               ㅏ      a
AE              ㅐ      ay
YA              ㅑ      ya
YAE             얘      yay
EO              ㅓ      e
E               ㅔ      ey
YEO             ㅕ      ye
YE              ㅖ      yey
O               ㅗ      o
WA              와      wa
WAE             왜      way
OE              외      oy
YO              ㅛ      yo
U               ㅜ      wu
WEO             워      we
WE              웨      wey
WI              위      wi
YU              ㅠ      yu
EU              ㅡ      u
YI              의      uy
I               ㅣ      i

Some modifications are made to the original Yale system:

a. Syllabic boundaries
   The end of a syllable is marked with ".", to avoid ambiguous
   mappings. Example:
      kakkak     --> could be 각각 or 가깍
      kak.kak.   --> unambiguously 각각

b. Marking valueless onset "ㅇ"
   Valueless onset "ㅇ" is marked by "ng" rather than being left empty,
   giving a strict letter-to-letter transliteration:
      안녕: annyeng --> ngan.nyeng.

c. "yu" instead of "ywu"
   Yale allows "ywu" for "ㅠ" in addition to "yu". To avoid confusion,
   only "yu" is used.

-----------------------------------------------------------------------
5. Pronunciation

5.1. Overview

The pronunciation field of the lexicon was produced first by a Perl
script ("kor2pron.pl", written by David Graff and Na-Rae Han) which
converts Korean words in EUC-KR encoding into their romanized form, and
then into a phonetic string based on Korean phonological/phonetic rules.
As pronunciation is predictable from the orthography most of the time,
automatic conversion produces reliable output.
Some pronunciations are not predictable, owing to idiosyncrasies of
individual words or to morphological information (for example,
gemination of consonants at morphological boundaries); these were
manually checked and corrected. Also, for words which were classified as
having a dialectal pronunciation, the pronunciation was provided
manually.

-----------------------------------------------------------------------
5.2. Phonetic alphabets

The phonetic alphabet used for the lexicon is based on the alphabet
used in the Yale romanization system. For a complete transliteration
table, please refer to the document on Korean romanization. Some
modifications were made, however, to make it more suitable as a phonetic
alphabet.

- aspirated consonants:
    kh -> K
    th -> T
    ph -> P
    ch -> C
- engma:
    ng -> G      in coda position
    ng -> null   in onset position
- liquid:
    l -> r       word-initially; between vowels
- digraphic vowels:
    ay -> A (애)
    ey -> E (에)
    oy -> O (외)
    uy -> U (의)
- double-consonants such as "ks", "nh", "lth" are ultimately broken
  into their component phonemes, and therefore are not phonemic units
  themselves.
- geminates/fortis consonants ("kk", "pp", "tt", "cc", "ss") are
  phonemic units: they contrast with plain stops ("k", "p", etc.) in
  onset positions.
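The symbol substitutions listed above can be sketched in Python as
follows. This is not the "kor2pron.pl" script and it applies none of the
phonological rules of Korean (no tensification, assimilation or
aspiration across syllables); it only rewrites the dot-delimited Yale
string of section 4 into the phonetic symbol set. The function names and
the syllable regular expression are assumptions made for illustration.

```python
import re

# Yale syllable = onset + vowel + optional coda, in the dot-delimited
# romanization of section 4 (valueless onset written as "ng").
_SYLLABLE = re.compile(
    r"(ng|kk|tt|pp|ss|cc|kh|th|ph|ch|[kntlmpsch])"
    r"(yay|yey|way|wey|ay|ey|oy|uy|ya|ye|yo|yu|wa|we|wu|wi|[aeiou])"
    r"([kntlmpshcg]*)"
)
_ASPIRATES = {"kh": "K", "th": "T", "ph": "P", "ch": "C"}
_VOWELS = "aeiouAEOU"


def _convert_syllable(syllable: str) -> str:
    match = _SYLLABLE.fullmatch(syllable)
    if match is None:
        raise ValueError(f"not a Yale syllable: {syllable!r}")
    onset, vowel, coda = match.groups()
    # Onset: "ng" is the valueless onset and maps to nothing.
    onset = "" if onset == "ng" else _ASPIRATES.get(onset, onset)
    # Vowel: digraphs ending in "y" become one capital letter (ay -> A).
    if len(vowel) > 1 and vowel.endswith("y"):
        vowel = vowel[:-2] + vowel[-2].upper()
    # Coda: engma becomes G; aspirates inside clusters are capitalized.
    coda = coda.replace("ng", "G")
    for digraph, symbol in _ASPIRATES.items():
        coda = coda.replace(digraph, symbol)
    return onset + vowel + coda


def yale_to_phonetic_symbols(yale: str) -> str:
    """Rewrite e.g. 'ku.le.ni.kka.' into the phonetic symbol set."""
    syllables = [s for s in yale.split(".") if s]
    flat = "".join(_convert_syllable(s) for s in syllables)
    # Liquid: l -> r word-initially and between vowels.
    flat = re.sub(r"^l", "r", flat)
    flat = re.sub(rf"(?<=[{_VOWELS}])l(?=[yw{_VOWELS}])", "r", flat)
    return flat


print(yale_to_phonetic_symbols("ku.le.ni.kka."))  # kurenikka
```

For "ku.le.ni.kka." the result agrees with the pronunciation
/kurenikka/ given in section 5.4. Words whose pronunciation depends on
phonological rules will differ from the lexicon's pronunciation field.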
Entire inventory of vowels and consonants:

- vowels (21):
    a e i o u A E O U ya ye yo yu yA yE wa we wu wi wA wE
- consonants (20):
    k t p c s h r l m n kk tt pp cc ss K T P C G

phonetic    Yale            IPA or
alphabet    romanization    description
--------    ------------    -----------
a                           IPA a
e                           unrounded mid central vowel
i                           IPA i
o                           IPA o
u                           unrounded high back vowel
A           ay              unrounded front low vowel
E           ey              unrounded front mid vowel
O           oy              rounded front mid vowel
U           uy              glide: unrounded high back to front
wa                          glide: rounded 'a'
we                          glide: rounded 'e'
wu                          rounded high back vowel
wi                          glide: rounded 'i'
wA          way             glide: rounded 'ay'
wE          wey             glide: rounded 'ey'

k                           IPA k; g (voiced environment)
t                           IPA t; d (voiced environment)
p                           IPA p; b (voiced environment)
c                           palatal affricate
s                           IPA s
h                           IPA h
r           NONE            coronal flap
l                           IPA light l
m                           IPA m
n                           IPA n
kk                          tense 'k'
tt                          tense 't'
pp                          tense 'p'
cc                          tense 'c'
ss                          tense 's'
K           kh              aspirated 'k'
T           th              aspirated 't'
P           ph              aspirated 'p'
C           ch              aspirated 'c'
G           ng              engma

-----------------------------------------------------------------------
5.3. Phonological rules of Korean

Much of Korean pronunciation is governed by the grammar of syllabic
structure. A well-formed syllable in Korean meets the following
requirements:

- a syllable consists of one or no onset consonant, one vowel, and one
  or no coda consonant:
    (onset)?(vowel)(coda)?
- the set of vowels includes the 21 vowels defined above:
    [yw]?[aeiouAEOU]
- consonants allowed in onset position (all except l and G, plus ll):
    ([ktprnmKTPCsch]|kk|tt|pp|cc|ss|ll)
- consonants allowed in coda position:
    [ktplnmG]

-----------------------------------------------------------------------
5.4. Multiple pronunciation

A phonetic string is enclosed in "/ /":

Korean      Yale-romanization    pronunciation
그러니까    ku.le.ni.kka.        /kurenikka/

Some phonetic/phonological processes seem to be optional, occurring in
fast speech only. When such an alternative pronunciation is available,
it is given alongside the preferred pronunciation, e.g.:

왔거든      ngwass.ke.tun.       /watkketun/wakketun/
한번        han.pen.             /hanpen/hampen/

-----------------------------------------------------------------------
5.5. Evaluation

1,000 samples were randomly extracted from the lexicon for evaluation
purposes. 14 words were found to have an incorrect pronunciation
assigned, yielding an accuracy of 98.6%. All of these errors involved
"tensification" or "geminization" of stop consonants such as /k/, /c/
and /p/. This phonological process in Korean is rather specific to
lexical items, or otherwise involves a morphological boundary. Neither
kind of information is always recoverable from the orthographic
representation, which makes it difficult for an automated
phoneticization script to handle these cases correctly. A manual scan
was therefore done in the post-processing stage to correct these items,
but evidently not all of them were caught. Examples of such errors are:

-- tensification on morphological boundary

   행복할거야
      /hAGpoKalkeya/ --> should be /hAGpoKalkkeya/
               ^                            ^^
   발바닥
      /palpatak/ --> should be /palppatak/
          ^                        ^^

-- morpheme-internal arbitrary tensification

   절제
      /celcE/ --> should be /celccE/
          ^                     ^^
   효과
      /hyokwa/ --> should be /hyokkwa/
          ^                      ^^

-----------------------------------------------------------------------
6. Frequency counts and corpus

Field 4 of the lexicon shows the frequency of the headword in the
Korean Telephone Conversations Transcripts. The corpus has 25,251
words.

The frequency counts provided are by representation of the word in
Korean characters, not pronunciation or morphological analysis. For
example, even if one Korean string has two or more distinct meanings
and morphological analyses to go with them, they are all included in
the morphological analysis field, and all of the occurrences are
reflected in its frequency count.

-----------------------------------------------------------------------
7. Morphological Analysis

7.1. Overview

- corpus size: 25,709 words
- morphological analyses: 51,014
  -- average number of morphological analyses per word: 1.98
- number of unique morphemes: 9,821

Morphological analyses are in the following format:

    사귀/VV+기/ENM+도/PAU

where each morpheme is followed by "/" and its part-of-speech tag, and
morphemes are separated by "+". Many words have more than one possible
morphological analysis, as no attempt was made to disambiguate with
regard to the context in which they appear. Multiple analyses are
separated by ";".

These analyses were obtained using Klex, a finite-state lexical
transducer of Korean built by Na-Rae Han (nrh@ling.upenn.edu). After
the entries were automatically produced, those words that the analyzer
failed to recognize were given analyses by hand. Finally, entries were
checked semi-manually, by producing a frequency histogram of morphemes
and ruling out those that were deemed highly unlikely by a human
annotator.

-----------------------------------------------------------------------
7.2. Part-of-speech tags

Klex uses a part-of-speech tag set which is based on the one employed
by the Korean Treebank Project, with slight modifications. The POS
tagging guideline for the Korean Treebank can be found at:

    ftp://ftp.cis.upenn.edu/pub/ircs/tr/01-09/

Noun        NNC   common noun
            NNU   numeric noun
            NNX   dependent noun
            NPN   pronoun
            NPR   proper noun
            NFW   foreign word

Post-       PCA   case postposition
position    PAD   adverbial
            PAN   adnominal            * not included in Korean Treebank
            PAU   auxiliary
            PCJ   conjunctive

Predicate   VV    verb
            VJ    adjective
            VX    auxiliary predicate

Verbal      EPF   pre-final ending
ending      EFN   final ending
            ECS   non-final ending     * ECS and EAU in Korean Treebank
            EAN   adnominal ending
            ENM   nominalization ending

Etc         CO    copula
            ADV   adverb
            ADC   conjunctive adverb
            DAN   adnominal modifier
            XSF   suffix
            XPF   prefix
            XSV   verbalization suffix
            XSJ   adjectivization suffix
            IJ    interjection

Symbol      SFN   sentence-final symbols: . ? !! ......
            SCM   comma: ,
            SLQ   left delimiters: " ' ( < [ {
            SRQ   right delimiters: " ' ) > ] }
            SSY   symbol

-----------------------------------------------------------------------
7.3. Allomorphs

Allomorphs are those morphemes that have exactly the same meaning and
function but are realized in different forms, usually conditioned by
phonological environment. A large number of inflectional suffixes in
Korean display this property. For example, "은" and "는", and "로" and
"으로", are allomorphs of each other.

Klex treats such allomorphs as having a single underlying form. All
allomorphs therefore take a single form in the upper (analyzed) string.
Through application of a sequence of phonological rules, the correct
inflected forms (lower strings) are derived. This way, the topic
markers in "학교-는", "학생-은" and "너-ㄴ" are equally assigned
"은/PAU" in "학교/NNC+은/PAU", "학생/NNC+은/PAU" and "너/NPN+은/PAU".

    학교/NNC+은/PAU      학생/NNC+은/PAU      너/NPN+은/PAU
           |                    |                   |
        학교는               학생은                넌

The criteria used in determining the representative form among
allomorphs are as follows:

- the form should be fully syllabic, e.g. "음" is chosen and not "ㅁ"
  (as in "예쁨").
- the form for the post-consonantal environment is chosen, e.g. "이"
  instead of "가".
- epenthetic vowels are included, e.g. "으로" and not "로".
  (this criterion mostly overlaps with the first two above, as
  epenthetic vowels are used in post-consonantal environments)
- for vowel harmony, "어" is chosen and not "아", e.g. "어서" and not
  "아서".

-----------------------------------------------------------------------
7.4. Evaluation

2,500 word entries (1/10 of corpus) were randomly extracted for
evaluation of the morphological analysis.
The findings are:

     29   correct analysis missing
      9   ungrammatical analysis
    233   dubious analysis
          (not downright ungrammatical, but highly dubious analysis;
          many are due to unlikely noun-compounding or to archaic
          entries in the dictionary on which the analyzer is based)

Measured precision/recall per word:

- recall                   2471/2500 = 0.9884
- precision                2491/2500 = 0.9964
- conservative precision (counting "dubious analyses" as incorrect)
                           2258/2500 = 0.9032

In conclusion, 99% of the time a word in the lexicon is expected to have
all possible morphological analyses. Also, either 99% or 90% of the
words are free of incorrect parses, depending on the criteria.
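The per-word figures above follow directly from the three counts. The
short Python check below reproduces them. It assumes, as the ratios in
the text imply, that each count refers to distinct words in the
2,500-word sample; the variable and function names are illustrative.

```python
SAMPLE_SIZE = 2500      # evaluated word entries
MISSING = 29            # words lacking a correct analysis
UNGRAMMATICAL = 9       # words with an ungrammatical analysis
DUBIOUS = 233           # words with a dubious analysis


def unaffected_share(affected: int, total: int = SAMPLE_SIZE) -> float:
    """Share of sampled words not affected by the given problem."""
    return (total - affected) / total


recall = unaffected_share(MISSING)
precision = unaffected_share(UNGRAMMATICAL)
conservative_precision = unaffected_share(UNGRAMMATICAL + DUBIOUS)

print(f"recall                 {recall:.4f}")
print(f"precision              {precision:.4f}")
print(f"conservative precision {conservative_precision:.4f}")
```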