Morphologically Annotated Korean Text


Item Name: Morphologically Annotated Korean Text
Authors: Na-Rae Han
LDC Catalog No.: LDC2004T03
ISBN: 1-58563-284-8
Release Date: Feb 16, 2004
Data Type: text
Data Source(s): newswire
Project(s): Talkbank
Application(s): finite state technology, morphology, morphology learning, natural language processing, parsing
Language(s): Korean
Language ID(s): kor
Distribution: Web Download
Member fee: $0 for 2004 members
Non-member Fee: US $300.00
Reduced-License Fee: US $150.00
Extra-Copy Fee: N/A
Non-member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Na-Rae Han
2004
Morphologically Annotated Korean Text
Linguistic Data Consortium, Philadelphia

Introduction

Morphologically Annotated Korean Text was produced by the Linguistic Data Consortium (LDC). Its catalog number is LDC2004T03 and its ISBN is 1-58563-284-8.

This is a collection of Korean text with annotated morphological analysis and part-of-speech tags. The source text was extracted from the Korean Newswire corpus. The newswire corpus is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. The portion included in this release consists of a small number of hand-picked articles.

The corpus is part of the Korean Treebank Phase 2. Between 2001 and 2002, the project was conducted under subcontract from Cogentex Inc., sponsor number Cogentex 5-33436. The text was tokenized and then automatically analyzed using Klex. Because a word can have multiple possible morphological analyses, the output was fed through a statistical ranking system to select the best analysis for each word in its context. The part-of-speech tagged result was then manually corrected by Seung-yun Yang and Na-Rae Han, graduate students in the University of Pennsylvania Linguistics Department.

Data

The data consists of a single file of approximately 880 KB in uncompressed form.

The text contains 1,574 sentences with 41,024 words and 77,173 morphemes in total. The text file is in ksc-5601 encoding. Characters in Hangul (Korean alphabet) can be displayed with Korean X-terminals such as hanterm, or by selecting Korean encoding in common web browsers such as Netscape or Internet Explorer.
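For readers who want to work with the text in a Unicode environment, the following is a minimal sketch of converting the file to UTF-8. It assumes that the KS C 5601 text is stored in the EUC-KR byte layout, which Python's built-in `euc_kr` codec decodes; the function names and file paths are hypothetical and not part of the release.

```python
def ksc5601_to_utf8(raw: bytes) -> bytes:
    """Decode KS C 5601 (EUC-KR) bytes and re-encode them as UTF-8."""
    # Assumption: the corpus uses the EUC-KR byte layout of KS C 5601.
    # errors="replace" substitutes U+FFFD for any bytes the codec rejects,
    # so a stray byte does not abort the whole conversion.
    text = raw.decode("euc_kr", errors="replace")
    return text.encode("utf-8")


def convert_file(src_path: str, dst_path: str) -> None:
    """Convert a whole file; both paths are hypothetical placeholders."""
    with open(src_path, "rb") as src:
        data = src.read()
    with open(dst_path, "wb") as dst:
        dst.write(ksc5601_to_utf8(data))
```

If replacement characters show up in the output, the encoding assumption is probably wrong for that part of the file and should be checked against the online documentation.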

The data is formatted as follows: there is one head word per line, and the word and its morphologically analyzed output are separated by a tab. Each morpheme is followed by "/" and its part-of-speech tag; morphemes are separated by "+". ^EOS is a special symbol denoting the end of a sentence.
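The format described above can be read with a few lines of code. The sketch below is an illustration under stated assumptions, not a tool shipped with the corpus: it assumes the input has already been decoded to text, that ^EOS appears as the first field of its own line, and that no morpheme form itself contains "+". The function names are hypothetical, and the sample lines in the usage note are invented for illustration rather than taken from the data.

```python
from typing import Iterable, Iterator, List, Tuple

Morpheme = Tuple[str, str]            # (form, part-of-speech tag)
Token = Tuple[str, List[Morpheme]]    # (head word, its morphemes)


def parse_analysis(analysis: str) -> List[Morpheme]:
    """Split 'form/TAG+form/TAG' into a list of (form, tag) pairs."""
    morphemes: List[Morpheme] = []
    # Assumption: "+" only ever separates morphemes.
    for chunk in analysis.split("+"):
        # Split on the last "/" so a "/" inside the form is preserved.
        form, sep, tag = chunk.rpartition("/")
        if not sep:                   # no tag present on this chunk
            form, tag = chunk, ""
        morphemes.append((form, tag))
    return morphemes


def read_sentences(lines: Iterable[str]) -> Iterator[List[Token]]:
    """Group tab-separated lines into sentences, ending each at ^EOS."""
    sentence: List[Token] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line.strip():
            continue                  # skip blank lines
        word, _, analysis = line.partition("\t")
        if word == "^EOS":            # assumed sentence-boundary line
            if sentence:
                yield sentence
            sentence = []
            continue
        sentence.append((word, parse_analysis(analysis)))
    if sentence:                      # text that lacks a final ^EOS
        yield sentence
```

For example, passing the invented lines `"학교에\t학교/NNC+에/PAD"` and `"^EOS"` yields one sentence containing the head word `학교에` with the morphemes `("학교", "NNC")` and `("에", "PAD")`.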

Morphologically analyzed and part-of-speech tagged data can be useful in the following applications: training of statistical morphological analyzers and part-of-speech taggers, and evaluation of pre-existing morphological analyzers and part-of-speech taggers.
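As one possible way to use the hand-corrected annotation for evaluation, the sketch below computes word-level accuracy: the fraction of head words for which an analyzer's morpheme and tag sequence exactly matches the gold analysis. The metric choice and the function name are assumptions made for illustration; the corpus documentation does not prescribe an evaluation procedure.

```python
from typing import List, Sequence, Tuple

Morpheme = Tuple[str, str]  # (form, part-of-speech tag)


def word_accuracy(gold: Sequence[List[Morpheme]],
                  predicted: Sequence[List[Morpheme]]) -> float:
    """Fraction of words whose full analysis matches the gold analysis."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")
    if not gold:
        return 0.0
    # A word counts as correct only if every morpheme and tag agrees.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)
```

Exact match per word is a strict measure; morpheme-level precision and recall would give partial credit and may suit some comparisons better.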

The morphologically tagged output is compatible with Klex: Finite-State Lexical Transducer for Korean. It also conforms to the Korean Treebank POS annotation standards.

Updates

There are no updates available at this time.

Sponsorship

The Morphologically Annotated Korean Text corpus was funded in part through a 5-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via TalkBank, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by Linguistic Data Consortium.

Note

The cost of the first 50 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, so these copies are free of charge. After the first 50 copies are distributed, additional copies will be available at a cost of $300.

Content Copyright

Portions © 1994-2000 Korean Press Agency, © 2004 Trustees of the University of Pennsylvania