Morphologically Annotated Korean Text

Item Name:	Morphologically Annotated Korean Text
Author(s):	Na-Rae Han
LDC Catalog No.:	LDC2004T03
ISBN:	1-58563-284-8
ISLRN:	338-479-223-657-5
DOI:	https://doi.org/10.35111/gah6-2c23
Release Date:	February 16, 2004
Member Year(s):	2004
DCMI Type(s):	Text
Data Source(s):	newswire
Project(s):	Talkbank
Application(s):	morphology learning, morphology, finite state technology, natural language processing, parsing
Language(s):	Korean
Language ID(s):	kor
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2004T03 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Han, Na-Rae. Morphologically Annotated Korean Text LDC2004T03. Web Download. Philadelphia: Linguistic Data Consortium, 2004.
Related Works:	isAnnotationOf LDC2000T45 Korean Newswire; isSimilarWith LDC2006T09 Korean Treebank Annotations Version 2.0; relatesTo LDC2004L01 Klex: Finite-State Lexical Transducer for Korean

Introduction

Morphologically Annotated Korean Text was produced by the Linguistic Data Consortium (LDC) and is distributed under catalog number LDC2004T03 and ISBN 1-58563-284-8.

This is a collection of Korean text with annotated morphological analysis and part-of-speech tags. The source text was extracted from the Korean Newswire corpus. The newswire corpus is a collection of Korean Press Agency news articles from June 2, 1994 to March 20, 2000. The portion included in this release consists of a small number of hand-picked articles.

The corpus is part of the Korean Treebank Phase 2. Between 2001 and 2002, the project was conducted under subcontract from Cogentex Inc., sponsor number Cogentex 5-33436. The text was tokenized and then automatically analyzed using Klex. Because a word can have multiple possible morphological analyses, the output was fed through a statistical ranking system to select the best analysis for each word in its textual context. The part-of-speech tagged result was then manually corrected by Seung-yun Yang and Na-Rae Han, graduate students in the University of Pennsylvania Linguistics Department.

Data

The data consists of a single file of approximately 880 KB uncompressed.

The text contains 1,574 sentences with 41,024 words and 77,173 morphemes in total. The text file is in ksc-5601 encoding. Characters in Hangul (Korean alphabet) can be displayed with Korean X-terminals such as hanterm, or by selecting Korean encoding in common web browsers such as Netscape or Internet Explorer.
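On current systems it is often simpler to convert the file to UTF-8 than to configure a terminal or browser for a legacy Korean encoding. The following is a minimal sketch, assuming the file's ksc-5601 encoding can be read with Python's `euc_kr` codec (the usual byte-level form of KS C 5601 text); the function name and the file paths passed to it are hypothetical.

```python
from pathlib import Path


def convert_to_utf8(src: Path, dst: Path, encoding: str = "euc_kr") -> None:
    """Re-encode a legacy Korean text file as UTF-8.

    Assumption: the source file uses the EUC-KR form of KS C 5601.
    Decoding raises UnicodeDecodeError if that assumption is wrong.
    """
    text = src.read_bytes().decode(encoding)
    dst.write_text(text, encoding="utf-8")


# In-memory round trip showing that the codec handles Hangul text.
sample = "한국어"
raw = sample.encode("euc_kr")
assert raw.decode("euc_kr") == sample
```

If decoding fails on some bytes, the file may use an extended variant of the encoding, in which case a different codec would have to be tried; that is not something this listing specifies.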

The data is formatted as follows: there is one head word per line, and the word and its morphologically analyzed output are separated by a tab. Each morpheme is followed by "/" and its part-of-speech tag; morphemes are separated by "+". ^EOS is a special symbol denoting the end of a sentence.
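The format described above can be read with a few lines of code. The sketch below is illustrative only: the sample lines are invented for this example rather than taken from the corpus, the tags in them are placeholders, and the function names are hypothetical. It assumes that ^EOS appears alone on its own line and that "+" never occurs as a literal morpheme; splitting each morpheme on its last "/" allows a literal "/" form.

```python
from typing import Iterable, Iterator, List, Tuple

Morpheme = Tuple[str, str]          # (form, part-of-speech tag)
Word = Tuple[str, List[Morpheme]]   # (head word, its morphemes)

EOS = "^EOS"  # sentence-end symbol described in the text


def parse_analysis(analysis: str) -> List[Morpheme]:
    """Split 'form/TAG+form/TAG' into (form, tag) pairs."""
    morphemes: List[Morpheme] = []
    for chunk in analysis.split("+"):
        # Split on the LAST slash so that a literal "/" form still parses.
        form, sep, tag = chunk.rpartition("/")
        if not sep:  # no slash at all: keep the text, leave the tag empty
            form, tag = chunk, ""
        morphemes.append((form, tag))
    return morphemes


def read_sentences(lines: Iterable[str]) -> Iterator[List[Word]]:
    """Group word lines into sentences, ending each one at ^EOS."""
    sentence: List[Word] = []
    for line in lines:
        line = line.rstrip("\r\n")
        if not line:
            continue
        if line.strip() == EOS:
            if sentence:
                yield sentence
            sentence = []
            continue
        word, _, analysis = line.partition("\t")
        sentence.append((word, parse_analysis(analysis) if analysis else []))
    if sentence:  # trailing words without a final ^EOS
        yield sentence


# Invented sample lines (not corpus data).
sample = [
    "학교에\t학교/NNC+에/PAD",
    "갔다\t가/VV+았/EPF+다/EFN",
    ".\t./SFN",
    "^EOS",
]
sentences = list(read_sentences(sample))
```

Here `sentences` holds one sentence of three words, and each word carries its list of (form, tag) pairs.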

Morphologically analyzed and part-of-speech tagged data can be useful for training statistical morphological analyzers and part-of-speech taggers, and for evaluating existing ones.
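As an illustration of the evaluation use case, the sketch below scores an analyzer's output against hand-corrected analyses using exact-match accuracy per word. This metric is an assumption chosen for simplicity, not one prescribed by the corpus documentation, and the function name and sample data are hypothetical.

```python
from typing import List, Sequence, Tuple

Morpheme = Tuple[str, str]  # (form, part-of-speech tag)


def word_accuracy(
    gold: Sequence[List[Morpheme]],
    predicted: Sequence[List[Morpheme]],
) -> float:
    """Fraction of words whose full analysis matches the gold analysis."""
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must cover the same words")
    if not gold:
        return 0.0
    # A word counts as correct only if every morpheme and tag agrees.
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


# Invented two-word example: the second word has one wrong tag.
gold = [[("a", "N"), ("b", "P")], [("c", "V"), ("d", "E")]]
predicted = [[("a", "N"), ("b", "P")], [("c", "N"), ("d", "E")]]
score = word_accuracy(gold, predicted)  # 0.5
```

A stricter or looser measure, such as accuracy per morpheme, would need an alignment step when the predicted segmentation differs from the gold one.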

The morphologically tagged output is compatible with Klex: Finite-State Lexical Transducer for Korean. It also conforms to the Korean Treebank POS annotation standards.

Samples

Please view this sample.

Updates

There are no updates available at this time.

Sponsorship

The Morphologically Annotated Korean Text corpus was funded in part through a 5-year grant (BCS-998009, KDI, SBE) from the National Science Foundation via TalkBank, an interdisciplinary project to foster research and development in communicative behavior by providing tools and standards for analysis and distribution of language data. Additional funding was provided by Linguistic Data Consortium.

Note

The cost of the first 50 copies of this publication (not counting the copies distributed to LDC members) is covered by NSF Grant Number BCS-998009, so those copies are free of charge. After these first 50 copies are distributed, additional copies will be available at a cost of $300.