Korean English Treebank Annotations


Item Name: Korean English Treebank Annotations
Authors: Martha Palmer, Chung-Hye Han, Na-Rae Han, Eon-Suk Ko, Hee-Jong Yi, Alan Lee, Chris Walker, John Duda, and Nianwen Xue
LDC Catalog No.: LDC2002T26
ISBN: 1-58563-236-8
Release Date: May 13, 2002
Data Type: text
Data Source(s): varied
Application(s): natural language processing, parsing, tagging
Language(s): English, Korean
Language ID(s): eng, kor
Distribution: Web Download
Member Fee: $0 for 2002 members
Non-member Fee: US $1000.00
Reduced-License Fee: US $500.00
Extra-Copy Fee: N/A
Non-member License: yes
Member License: yes
Online documentation: yes
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Martha Palmer, et al.
2002
Korean English Treebank Annotations
Linguistic Data Consortium, Philadelphia

Introduction

This file contains documentation on the Korean English Treebank Annotations, Linguistic Data Consortium (LDC) catalog number LDC2002T26 and ISBN 1-58563-236-8.

This corpus consists of 33 texts originally written in Korean and translated into English for the purpose of language training in a military setting. The conversations are not authentic dialogues but were constructed for pedagogical purposes. The texts were made available for linguistic research by the Defense Language Institute (DLI). They were delivered on paper to the Institute for Research in Cognitive Science (IRCS) at the University of Pennsylvania, where they were converted to digital form using the KSC 5601 character set encoding (also known as KS X 1001 Wansung).
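As an illustration of working with this encoding, the following minimal Python sketch decodes KSC 5601 bytes into Unicode text. It assumes the files use the common EUC-KR byte serialization of KS X 1001, which is what Python's built-in "euc_kr" codec handles; this documentation does not confirm the exact serialization. The function names and the file path parameter are hypothetical.

```python
def decode_ksc5601(data: bytes) -> str:
    """Decode KSC 5601 (KS X 1001 Wansung) bytes into a Unicode string.

    Assumption: the bytes are in the EUC-KR serialization, which Python
    exposes as the "euc_kr" codec.
    """
    return data.decode("euc_kr")


def read_korean_file(path: str) -> str:
    """Read one Korean data file as text (path is a hypothetical example)."""
    with open(path, "rb") as handle:
        return decode_ksc5601(handle.read())
```

If decoding raises UnicodeDecodeError, the files may use a different serialization, and the codec name would need to be adjusted.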

Both the Korean and English texts are presented with complete Treebank annotation, comprising syntactic constituent bracketing and part-of-speech (POS) tagging, all of which was done manually at IRCS. Further documentation about the parsing and POS specifications used in these annotations can be found on the Korean NLP web site.

Data

There are 66 data files: 33 for Korean and 33 for English. The text files mostly contain sets of question-and-answer sentences. Each sentence appears first in full, unannotated form on a single line that begins with a semicolon character (";"). The first token on such a line (the string preceding the first space character) is a sentence-identifier tag that matches the English and Korean versions of the sentence. The parsed, POS-tagged annotation of the sentence follows on subsequent lines.
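The file layout described above could be read along the following lines. This is a minimal Python sketch, not a tool distributed with the corpus. It assumes the identifier tag directly follows the semicolon (optional spaces after the semicolon are tolerated) and that every non-blank line up to the next semicolon line belongs to the annotation. The names Sentence, parse_treebank, and pair_by_id are hypothetical.

```python
from typing import Iterable, List, NamedTuple, Optional, Tuple


class Sentence(NamedTuple):
    sent_id: str      # identifier tag shared by the English and Korean versions
    text: str         # the full, unannotated sentence
    annotation: str   # the bracketed, POS-tagged lines that follow


def parse_treebank(lines: Iterable[str]) -> List[Sentence]:
    """Group each ';' header line with the annotation lines that follow it."""
    sentences: List[Sentence] = []
    header: Optional[Tuple[str, str]] = None
    body: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line.startswith(";"):
            # A new header closes the previous sentence, if there was one.
            if header is not None:
                sentences.append(Sentence(header[0], header[1], "\n".join(body)))
            # Assumption: the identifier is the first token after the ';'.
            parts = line[1:].lstrip().split(None, 1)
            sent_id = parts[0] if parts else ""
            text = parts[1] if len(parts) > 1 else ""
            header = (sent_id, text)
            body = []
        elif header is not None and line.strip():
            body.append(line)
    if header is not None:
        sentences.append(Sentence(header[0], header[1], "\n".join(body)))
    return sentences


def pair_by_id(english: List[Sentence],
               korean: List[Sentence]) -> List[Tuple[Sentence, Sentence]]:
    """Align English and Korean sentences that share an identifier tag."""
    korean_by_id = {s.sent_id: s for s in korean}
    return [(e, korean_by_id[e.sent_id]) for e in english
            if e.sent_id in korean_by_id]
```

Passing an open file object (opened with the appropriate encoding) to parse_treebank yields one Sentence per header line; pair_by_id then matches the two language versions through their shared identifier tags.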

Updates

There are no updates at this time.

Content Copyright

Portions (c) 2001-2002 CoGenTex, Inc., Trustees of the University of Pennsylvania