Ancient Chinese Corpus
Item Name: | Ancient Chinese Corpus |
Author(s): | Xiaohe Chen, Bin Li, Minxuan Feng, Chao Xu, Runhua Xu, Min Shi, Lili Yu, Lei Xiao, Qingqing Wang |
LDC Catalog No.: | LDC2017T14 |
ISBN: | 1-58563-816-1 |
ISLRN: | 924-985-704-453-5 |
DOI: | https://doi.org/10.35111/ctjv-ez04 |
Release Date: | October 18, 2017 |
Member Year(s): | 2017 |
DCMI Type(s): | Text |
Data Source(s): | non-fiction |
Application(s): | machine learning, part of speech tagging, historical linguistics |
Language(s): | Literary Chinese |
Language ID(s): | lzh |
License(s): | LDC User Agreement for Non-Members |
Online Documentation: | LDC2017T14 Documents |
Licensing Instructions: | Subscription & Standard Members, and Non-Members |
Citation: | Chen, Xiaohe, et al. Ancient Chinese Corpus LDC2017T14. Web Download. Philadelphia: Linguistic Data Consortium, 2017. |
Introduction
Ancient Chinese Corpus was developed at Nanjing Normal University. It contains word-segmented and part-of-speech tagged text from Zuozhuan, an ancient Chinese work believed to date from the Warring States Period (475-221 BC). Zuozhuan is a commentary on the Chunqiu, a history of the Chinese Spring and Autumn period (770-476 BC). This release is part of a continuing project to develop a large, part-of-speech tagged ancient Chinese corpus.
Data
Ancient Chinese Corpus consists of 180,000 Chinese characters and 195,000 segment units (including words and punctuation). The part-of-speech tag set was developed by Nanjing Normal University and contains 17 tags.
This release contains two text files comprising 268 paragraphs and 10,560 lines. Each line is one sentence, and paragraphs are separated by one empty line. Words are separated by spaces, and each word is tagged with its part of speech.
The files are UTF-8 plain text in traditional Chinese script.
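The layout described above (one sentence per line, a blank line between paragraphs, space-separated tagged words) can be read with a few lines of Python. The following is a minimal sketch, not the corpus's official tooling. It assumes each unit is written as word, delimiter, tag, with "/" as the delimiter; this page does not state the delimiter, so check the corpus documentation and adjust the `sep` argument. The function name `parse_corpus` and the sample text are illustrative, and the sample's tags are placeholders rather than the actual 17-tag set.

```python
from typing import Iterable, List, Tuple

Token = Tuple[str, str]          # (word, part-of-speech tag)
Sentence = List[Token]           # one line of the file
Paragraph = List[Sentence]       # lines between blank lines


def parse_corpus(lines: Iterable[str], sep: str = "/") -> List[Paragraph]:
    """Group tagged lines into paragraphs of sentences of (word, tag) pairs.

    ASSUMPTION: each unit looks like word + sep + tag. The "/" default is a
    guess, not something confirmed by the catalog description.
    """
    paragraphs: List[Paragraph] = []
    current: Paragraph = []
    for raw in lines:
        line = raw.strip()
        if not line:
            # A blank line closes the current paragraph, if any.
            if current:
                paragraphs.append(current)
                current = []
            continue
        sentence: Sentence = []
        for unit in line.split():
            # Split on the last delimiter so a word containing sep survives.
            word, found, tag = unit.rpartition(sep)
            if not found:
                # No delimiter present: keep the unit as the word, empty tag.
                word, tag = unit, ""
            sentence.append((word, tag))
        current.append(sentence)
    if current:
        paragraphs.append(current)
    return paragraphs


# Invented sample for illustration only; the tags are placeholders.
sample = "鄭伯/n 克/v 段/n\n于/p 鄢/n 。/w\n\n公/n 曰/v\n"
paragraphs = parse_corpus(sample.splitlines())
```

To read an actual file, pass an open handle instead of the sample, for example `parse_corpus(open(path, encoding="utf-8"))`, where `path` is whichever of the two text files you are loading.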
Samples
Please view this sample.
Updates
None at this time.