Korean Propbank

Item Name: Korean Propbank
Author(s): Martha Palmer, Shijong Ryu, Jinyoung Choi, Sinwon Yoon, Yeongmi Jeon
LDC Catalog No.: LDC2006T03
ISBN: 1-58563-374-7
ISLRN: 815-941-649-807-9
DOI: https://doi.org/10.35111/j0yk-ph77
Release Date: March 24, 2006
Member Year(s): 2006
DCMI Type(s): Text
Data Source(s): newswire
Application(s): discourse analysis, information extraction, language identification, language modeling, language teaching, natural language processing, parsing
Language(s): Korean
Language ID(s): kor
License(s): Korean Propbank, LDC User Agreement for Non-Members
Online Documentation: LDC2006T03 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Palmer, Martha, et al. Korean Propbank LDC2006T03. Web Download. Philadelphia: Linguistic Data Consortium, 2006.
Related Works: View

Introduction

Korean Propbank was developed by the Computer and Information Sciences Department at the University of Pennsylvania and comprises approximately 33,300 predicates annotated in 186,300 words of Korean text. The text used in Propbank comes from Korean English Treebank Annotations (LDC2002T26) and Korean Treebank Version 2.0 (LDC2006T09). Each verb and adjective occurring in the Treebank has been treated as a semantic predicate, and the surrounding text has been annotated for the arguments and adjuncts of that predicate. The verbs and adjectives have also been tagged with coarse-grained senses.

Data

The following table gives the number of words (in thousands) and the approximate number of annotated predicates in the corpus, by source:

Source             K-words    Predicates Annotated
Virginia Corpus       54.5                   9,590
Newswire Corpus      131.8                  23,700
Total                186.3                  33,300
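
The predicate counts in the table are rounded. A minimal sketch in Python, assuming the exact token counts stated in the annotation file descriptions of this entry (9,588 and 23,707), shows how the rounded total is reached and how dense the annotation is per source. The names K_WORDS, EXACT_PREDICATES, totals and predicates_per_k_words are illustrative and not part of the corpus.

```python
# Figures copied from this catalog entry; the variable and function
# names are illustrative only.
K_WORDS = {"Virginia Corpus": 54.5, "Newswire Corpus": 131.8}
EXACT_PREDICATES = {"Virginia Corpus": 9588, "Newswire Corpus": 23707}


def totals():
    """Return (total K-words, total exact predicate tokens)."""
    return sum(K_WORDS.values()), sum(EXACT_PREDICATES.values())


def predicates_per_k_words(source):
    """Annotated predicate tokens per thousand words for one source."""
    return EXACT_PREDICATES[source] / K_WORDS[source]


if __name__ == "__main__":
    k_words, predicates = totals()
    print(f"Total K-words: {k_words:.1f}")
    print(f"Total predicates (exact): {predicates}")
    print(f"Total predicates (nearest hundred): {round(predicates, -2)}")
    for source in K_WORDS:
        density = predicates_per_k_words(source)
        print(f"{source}: {density:.1f} predicates per K-words")
```

Both sources come out at roughly the same density, a little under 180 annotated predicates per thousand words.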

There are two basic components to Korean Propbank:

  • The Verb Lexicon: A frames file, consisting of one or more frame sets, has been created for each predicate occurring in the Treebank. These files serve as a reference for the annotators and for users of the data. 2,749 such files have been created, totaling about 10 MB of uncompressed data. The frames files use the XML format and the KS C 5601 character set encoding.
  • The Annotation: There are two annotation files. The virginia-verbs.pb file (~791 KB uncompressed) has 9,588 annotated predicate tokens, covering all predicates occurring in the 54.5 K-words of the Korean English Treebank Annotations. The newswire-verbs.pb file (~2,054 KB uncompressed) has 23,707 annotated predicate tokens, covering all predicates occurring in the 131.8 K-words of the Korean Treebank Version 2.0.
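
Because the frames files are XML in a legacy Korean encoding, reading them with modern tools may need an explicit decoding step. The following is a minimal sketch in Python, assuming the files can be decoded with Python's "euc_kr" codec (which lists ksc5601 among its aliases); whether that codec matches every file in the release is an assumption. The element names in the sample (frameset, predicate, roleset) are hypothetical placeholders, since this entry does not describe the frames file schema, and parse_frames_bytes and tag_counts are illustrative helpers.

```python
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Matches a leading XML declaration such as <?xml version="1.0" encoding="..."?>
_XML_DECL = re.compile(r"^\s*<\?xml[^>]*\?>")


def parse_frames_bytes(raw: bytes, encoding: str = "euc_kr") -> ET.Element:
    """Decode raw frames-file bytes and parse them as XML.

    The bytes are decoded in Python first, and the XML declaration is
    dropped so the parser does not try to apply the declared encoding
    to text that is already decoded.
    """
    text = raw.decode(encoding)
    text = _XML_DECL.sub("", text, count=1)
    return ET.fromstring(text)


def tag_counts(root: ET.Element) -> Counter:
    """Count how often each element tag occurs, to explore the schema."""
    return Counter(el.tag for el in root.iter())


if __name__ == "__main__":
    # Hypothetical sample: the element and attribute names are placeholders,
    # not the actual Korean Propbank schema.
    sample = (
        '<?xml version="1.0" encoding="euc-kr"?>'
        '<frameset><predicate lemma="가다">'
        '<roleset id="가다.01"/></predicate></frameset>'
    ).encode("euc_kr")
    root = parse_frames_bytes(sample)
    print(root.tag, dict(tag_counts(root)))
```

Running tag_counts over one real frames file is a quick way to discover the actual element names before writing any schema-specific code.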

Samples

For an example of this corpus, please view this sample (TXT).

Updates

None at this time.

Available Media

View Fees