
Penn Korean Universal Dependency Treebank

Item Name:	Penn Korean Universal Dependency Treebank
Author(s):	Jinho D. Choi, Na-Rae Han, Jena D. Hwang, Hansaem Kim
LDC Catalog No.:	LDC2023T05
ISLRN:	522-574-570-040-8
DOI:	https://doi.org/10.35111/d63z-aw81
Release Date:	April 17, 2023
Member Year(s):	2023
DCMI Type(s):	Text
Data Source(s):	newswire
Application(s):	automatic content extraction, discourse analysis, information detection, information extraction, morphology learning, parsing, part of speech tagging, syntactic parsing
Language(s):	Korean
Language ID(s):	kor
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2023T05 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Choi, Jinho D., et al. Penn Korean Universal Dependency Treebank LDC2023T05. Web Download. Philadelphia: Linguistic Data Consortium, 2023.
Related Works:	isAnnotationOf LDC2000T45 Korean Newswire, LDC2006T09 Korean Treebank Annotations Version 2.0

Introduction

Penn Korean Universal Dependency Treebank contains 5,010 sentences and 132,041 tokens annotated in dependency format under the Universal Dependencies framework. It is a conversion of Korean Treebank Annotations Version 2.0 (LDC2006T09), which was produced in constituency format.

In general, dependency grammar is based on the idea that the verb is the center of the clause structure and that the other units in the sentence are connected to the verb by directed links, or dependencies. The correspondence is one-to-one: for every element in the sentence there is exactly one node in the sentence structure. Constituency (phrase structure) grammars, by contrast, divide clauses into noun phrases and verb phrases, so one or more nodes may correspond to a single element.
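
The contrast can be made concrete with a small Python sketch. The English sentence, its head indices, and the bracketed tree below are illustrative assumptions and are not taken from the corpus; the relation names (det, nsubj, root, obj) are standard Universal Dependencies labels.

```python
# Toy sentence (an assumption for illustration, not corpus data).
tokens = ["The", "dog", "chased", "the", "cat"]

# Dependency view: one node per token. heads[i] is the 1-based index of
# the head of token i+1; 0 marks the root, which here is the verb.
heads = [2, 3, 0, 5, 3]
deprels = ["det", "nsubj", "root", "det", "obj"]

# Constituency view: nested (label, child, ...) tuples; leaves are strings.
constituency = (
    "S",
    ("NP", ("DT", "The"), ("NN", "dog")),
    ("VP", ("VBD", "chased"), ("NP", ("DT", "the"), ("NN", "cat"))),
)


def count_nodes(tree):
    """Count labeled nodes (phrases and preterminals) in a nested tuple tree."""
    children = tree[1:]
    return 1 + sum(count_nodes(c) for c in children if isinstance(c, tuple))


dependency_nodes = len(heads)                  # equals the number of tokens
constituency_nodes = count_nodes(constituency)  # more nodes than tokens
root_word = tokens[heads.index(0)]              # the verb heads the clause
```

In this toy case the dependency structure has exactly as many nodes as tokens, while the constituency tree adds phrase-level nodes above them.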

Data

The source text consists of newswire stories from the Linguistic Data Consortium's Korean Press Agency collection, contained in Korean Newswire (LDC2000T45).

Sentences were automatically converted for dependency annotation; the output was manually checked. The corpus contains 112 files in CoNLL-U format, the Universal Dependencies standard, with a mapping to their counterparts in LDC2006T09.
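
For readers unfamiliar with CoNLL-U, the following is a minimal Python sketch of reading the format, assuming the standard layout: ten tab-separated columns per token line, comment lines beginning with `#`, and a blank line between sentences. The function name `parse_conllu` and the two-word English sample are hypothetical and are not drawn from the corpus or its documentation.

```python
# The ten CoNLL-U columns, in order.
FIELDS = ("id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc")


def parse_conllu(text):
    """Return a list of sentences; each sentence is a list of token dicts."""
    sentences, current = [], []
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends the current sentence.
            if current:
                sentences.append(current)
                current = []
            continue
        if line.startswith("#"):
            # Sentence-level comments (e.g. "# text = ...") are skipped here.
            continue
        cols = line.split("\t")
        if len(cols) != len(FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        token = dict(zip(FIELDS, cols))
        # Skip multiword-token ranges ("1-2") and empty nodes ("1.1").
        if "-" in token["id"] or "." in token["id"]:
            continue
        token["id"] = int(token["id"])
        token["head"] = int(token["head"])
        current.append(token)
    if current:
        sentences.append(current)
    return sentences


# Hypothetical sample, not corpus data.
SAMPLE = (
    "# text = Dogs bark\n"
    "1\tDogs\tdog\tNOUN\t_\t_\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
    "\n"
)

sentences = parse_conllu(SAMPLE)
root = next(tok for tok in sentences[0] if tok["head"] == 0)
```

Each token dict carries its own head index, so the dependency tree can be rebuilt directly from the `id` and `head` fields.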

Samples

Please view the following sample:

CoNLL-U

Updates

None at this time.
