Morphologically Annotated Korean Text

September 17, 2003

Na-Rae Han
University of Pennsylvania
nrh@babel.ling.upenn.edu

Contributors:
  Seung-yun Yang (syyang@unagi.cis.upenn.edu)
  Mike Maxwell (maxwell@ldc.upenn.edu)

** Note: this document is in ksc-5601 encoding. Hangul (Korean alphabet)
   characters can be displayed with a Korean X terminal such as hanterm,
   or by selecting the Korean encoding in a common web browser such as
   Netscape or Internet Explorer. The file readme.html contains the same
   information, but uses Unicode character entities for Hangul
   characters.

-----------------------------------------------------------------------

- Morphologically analyzed and part-of-speech annotated Korean corpus

- Encoded in ksc-5601 character set

- Size of corpus
    1,574 sentences
    41,024 words (with tokenized symbols)
    77,173 morphemes

- Data format: one head word per line; the word and its morphological
  analysis are separated by a tab. Within the analysis, each morpheme is
  followed by "/" and its part-of-speech tag, and morphemes are
  separated by "+", e.g., 유엔/NPR+은/PAU. ^EOS is a special symbol
  denoting the end of a sentence.

- The morphologically tagged output is compatible with Klex, the
  Finite-State Transducer Lexicon of Korean (LDC2004L01).

- The morphologically tagged output conforms to the Korean Treebank POS
  annotation standards, as found in
  ftp://ftp.cis.upenn.edu/pub/ircs/tr/01-09/.

- The original text of the corpus is a part of the Korean Newswire
  corpus (LDC2000T45). The newswire corpus is a collection of Korean
  Press Agency news articles from June 2, 1994, to March 20, 2000. The
  portion included in this release consists of a small number of
  hand-picked articles.

- The corpus is a part of the Korean Treebank Phase 2. Between 2001 and
  2002 the project was conducted under subcontract from Cogentex Inc,
  sponsor number Cogentex 5-33436. The original text was tokenized with
  a tokenization script, and the tokenized text was then automatically
  analyzed using Klex.
  Since a word can have multiple possible morphological analyses, the
  output was passed through a statistical ranking system to select the
  best analysis for each word in its context. The part-of-speech tagged
  result was then manually corrected by Seung-yun Yang and Na-Rae Han,
  graduate students in the University of Pennsylvania Linguistics
  Department.

- The corpus uses a part-of-speech tag set based on the one employed by
  the Korean Treebank Project, with slight modifications. The POS
  tagging guidelines for the Korean Treebank can be found at:
  ftp://ftp.cis.upenn.edu/pub/ircs/tr/01-09/.

  Noun          NNC    common noun                학교/NNC 학생/NNC
                NNU    numeric noun               1/NNU 하나/NNU 한/NNU
                NNX    dependent noun             것/NNX 마리/NNX
                NPN    pronoun                    나/NPN 너/NPN 우리/NPN 그녀/NPN
                NPR    proper noun                김대중/NPR 한국/NPR 클린턴/NPR
                NFW    foreign word               Clinton/NFW UN/NFW

  Postposition  PCA    case postposition          이/PCA 을/PCA
                PAD    adverbial                  으로/PAD 에서/PAD
                PAN*   adnominal                  의/PAN 이라는/PAN
                PAU    auxiliary                  은/PAU 도/PAU 마저/PAU
                PCJ    conjunctive                과/PCJ

  Predicate     VV     verb                       먹/VV+다/EFN 가/VV+다/EFN
                VJ     adjective                  예쁘/VJ+다/EFN 작/VJ+다/EFN
                VX     auxiliary predicate        않/VX+다/EFN 말/VX+다/EFN

  Verbal ending EPF    pre-final ending           먹/VV+으시/EPF+었/EPF+다/EFN
                EFN    final ending               먹/VV+었/EPF+다/EFN 먹/VV+냐/EFN
                ECS**  non-final ending           먹/VV+고/ECS 먹/VV+어서/ECS
                EAN    adnominal ending           먹/VV+는/EAN 먹/VV+던/EAN
                ENM    nominalization ending      먹/VV+음/ENM 먹/VV+기/ENM

  Etc           CO     copula                     학생/NNC+이/CO+다/EFN
                ADV    adverb                     빨리/ADV 또/ADV
                ADC    conjunctive adverb         그러나/ADC 그리고/ADC
                DAN    adnominal modifier         이런/DAN 그/DAN 모든/DAN
                XSF    suffix                     영희/NPR+씨/XSF 우리/NPN+들/XSF
                XPF    prefix                     제/XPF+5/NNU+분대/NNC
                XSV    verbalization suffix       공부/NNC+하/XSV+다/EFN
                XSJ    adjectivization suffix     침착/NNC+하/XSJ+다/EFN
                IJ     interjection               아/IJ 예/IJ 그래/IJ

  Symbol        SFN    sentence-final symbols     . ? !! ......
                SCM    comma                      ,
                SLQ    left delimiters            " ' ( < [ {
                SRQ    right delimiters           " ' ) > ] }
                SSY    symbol

  *  not included in Korean Treebank phase 1
  ** corresponds to ECS and EAU in Korean Treebank phase 1
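
The data format described above (head word, tab, then "+"-separated
morpheme/tag pairs, with ^EOS closing each sentence) can be read with a
few lines of code. The following is a minimal Python sketch, not part of
the release: the names parse_analysis and read_sentences are
hypothetical, the sample head words are made up for illustration, and
the code assumes that ^EOS appears at the start of its own line and that
tags consist of uppercase letters only, as in the tag table above.

```python
import re
from typing import Iterable, Iterator, List, Tuple

Morpheme = Tuple[str, str]         # (surface form, POS tag)
Word = Tuple[str, List[Morpheme]]  # (head word, its analyzed morphemes)

# One "form/TAG" unit, terminated by "+" or by the end of the analysis.
# Assumption: tags are uppercase ASCII letters, as in the tag table.
_MORPH_RE = re.compile(r"(.+?)/([A-Z]+)(?:\+|$)")


def parse_analysis(analysis: str) -> List[Morpheme]:
    """Split an analysis such as '유엔/NPR+은/PAU' into (form, tag) pairs."""
    return _MORPH_RE.findall(analysis)


def read_sentences(lines: Iterable[str]) -> Iterator[List[Word]]:
    """Group corpus lines into sentences, using ^EOS as the boundary."""
    sentence: List[Word] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        # Assumption: the end-of-sentence marker starts its own line.
        if line.startswith("^EOS"):
            if sentence:
                yield sentence
            sentence = []
            continue
        word, _, analysis = line.partition("\t")
        sentence.append((word, parse_analysis(analysis)))
    if sentence:  # keep a trailing sentence that lacks a final ^EOS
        yield sentence


if __name__ == "__main__":
    # Illustrative lines only; they are not taken from the corpus.
    sample = ["유엔은\t유엔/NPR+은/PAU\n", ".\t./SFN\n", "^EOS\n"]
    for sent in read_sentences(sample):
        print(sent)
```

To read an actual corpus file, the lines could be supplied by
open(path, encoding="euc-kr"), where path is a placeholder for the file
name; Python's euc-kr codec covers the ksc-5601 character set named
above.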