Morphologically Annotated Korean Text
|Item Name:||Morphologically Annotated Korean Text|
|LDC Catalog No.:||LDC2004T03|
|Release Date:||February 16, 2004|
|Application(s):||morphology learning, morphology, finite state technology, natural language processing, parsing|
|License(s):||LDC User Agreement for Non-Members|
|Online Documentation:||LDC2004T03 Documents|
|Licensing Instructions:||Subscription & Standard Members, and Non-Members|
|Citation:||Han, Na-Rae. Morphologically Annotated Korean Text LDC2004T03. Web Download. Philadelphia: Linguistic Data Consortium, 2004.|
Morphologically Annotated Korean Text was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2004T03 and its ISBN is 1-58563-284-8.
This is a collection of Korean text with annotated morphological analysis and part-of-speech tags. The source text was extracted from the Korean Newswire corpus. The newswire corpus is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. The portion included in this release consists of a small number of hand-picked articles.
The corpus is part of the Korean Treebank Phase 2. Between 2001 and 2002, the project was conducted under subcontract from Cogentex Inc., sponsor number Cogentex 5-33436. The text was tokenized and then automatically analyzed using Klex. Because a word can have multiple possible morphological analyses, the output was passed through a statistical ranking system to select the best analysis for each word in its context. The part-of-speech tagged result was then manually corrected by Seung-yun Yang and Na-Rae Han, graduate students in the University of Pennsylvania Linguistics Department.
Data
The data consists of a single file, totalling approximately 880KB in uncompressed form.
The text contains 1,574 sentences with 41,024 words and 77,173 morphemes in total. The text file is in ksc-5601 encoding. Characters in Hangul (Korean alphabet) can be displayed with Korean X-terminals such as hanterm, or by selecting Korean encoding in common web browsers such as Netscape or Internet Explorer.
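For tools that expect Unicode input, the file can be re-encoded before further processing. The following is a minimal sketch, assuming the ksc-5601 text is stored in the common EUC-KR byte encoding (Python's `euc_kr` codec); the documentation does not confirm this, and the function name `ksc5601_to_utf8` is hypothetical.

```python
def ksc5601_to_utf8(raw: bytes) -> bytes:
    """Re-encode KS C 5601 (assumed EUC-KR) bytes as UTF-8."""
    # Assumption: the corpus bytes are EUC-KR; decoding raises
    # UnicodeDecodeError if that assumption does not hold.
    text = raw.decode("euc_kr")
    return text.encode("utf-8")


# Illustrative input only, not taken from the corpus.
sample = "한국어".encode("euc_kr")
converted = ksc5601_to_utf8(sample)
```

Decoding strictly (the default) is a deliberate choice here: a decoding error is a useful signal that the encoding assumption is wrong.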
The data is formatted as follows: there is one head word per line, and the word and its morphological analysis are separated by a tab. Each morpheme is followed by "/" and its part-of-speech tag; morphemes are separated by "+". ^EOS is a special symbol denoting the end of a sentence.
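The format described above could be read with a short parser along the following lines. This is a sketch under stated assumptions, not a reference implementation: the names `parse_analysis` and `read_sentences` are hypothetical, the sample lines and tag names are illustrative rather than taken from the corpus, and ^EOS is assumed to appear as the first field of its own line. Splitting on "+" is a simplification that would mishandle a literal "+" occurring as a morpheme.

```python
from typing import Iterable, Iterator, List, Tuple

Morpheme = Tuple[str, str]          # (morpheme, part-of-speech tag)
Word = Tuple[str, List[Morpheme]]   # (head word, its morphemes)

EOS = "^EOS"  # end-of-sentence symbol named in the documentation


def parse_analysis(analysis: str) -> List[Morpheme]:
    """Split 'm1/TAG1+m2/TAG2' into [(m1, TAG1), (m2, TAG2)]."""
    morphemes = []
    for chunk in analysis.split("+"):
        # Split on the last "/" so a "/" inside the morpheme survives.
        form, sep, tag = chunk.rpartition("/")
        if not sep:  # no tag present; keep the chunk as the form
            form, tag = chunk, ""
        morphemes.append((form, tag))
    return morphemes


def read_sentences(lines: Iterable[str]) -> Iterator[List[Word]]:
    """Group tab-separated word/analysis lines into sentences."""
    sentence: List[Word] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line:
            continue
        word, _, analysis = line.partition("\t")
        if word == EOS:  # assumption: ^EOS is the first field on its line
            yield sentence
            sentence = []
            continue
        sentence.append((word, parse_analysis(analysis)))
    if sentence:  # trailing words without a closing ^EOS
        yield sentence


# Illustrative lines only; the tags are placeholders.
sample = [
    "학교에\t학교/NNC+에/PAD\n",
    "갔다\t가/VV+았/EPF+다/EFN\n",
    "^EOS\n",
]
sentences = list(read_sentences(sample))
```

With the real file, `lines` would be an open text file handle using the corpus encoding instead of the in-memory list.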
Morphologically analyzed and part-of-speech tagged data can be useful for training statistical morphological analyzers and part-of-speech taggers, and for evaluating existing ones.
There are no updates available at this time.
The Morphologically Annotated Korean Text corpus was funded in part through a 5-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via TalkBank, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by Linguistic Data Consortium.