Chinese Sentence Pattern Structure Treebank

Item Name: Chinese Sentence Pattern Structure Treebank
Author(s): Weiming Peng, Min Zhao, Jing He, Yuchen Song, Tianbao Song, Dongdong Guo, Jingbo Sun, Shuqin Zhu, Yinbin Zhang, Zuntian Wei, Jiajia Hu, Jihua Song, Zhifang Sui, Ning Wang
LDC Catalog No.: LDC2025T06
ISLRN: 916-484-709-412-8
DOI: https://doi.org/10.35111/hx6v-6p30
Release Date: June 16, 2025
Member Year(s): 2025
DCMI Type(s): Text
Data Source(s): essays, fiction, non-fiction
Application(s): historical linguistics, information extraction, linguistic analysis, natural language processing, syntactic parsing
Language(s): Mandarin Chinese, Chinese
Language ID(s): cmn, zho
License(s): LDC User Agreement for Non-Members
Online Documentation: LDC2025T06 Documents
Licensing Instructions: Subscription & Standard Members, and Non-Members
Citation: Peng, Weiming, et al. Chinese Sentence Pattern Structure Treebank LDC2025T06. Web Download. Philadelphia: Linguistic Data Consortium, 2025.

Introduction

Chinese Sentence Pattern Structure Treebank (the SPS Treebank) was developed at Beijing Normal University and Peking University. It contains 5,016 sentences and 119,267 tokens, syntactically annotated following the concept of sentence constituent analysis, which emphasizes sentence pattern structure. This concept is based on linguist Jinxi Li's The New Chinese Grammar. The source data consists of 27 chapters extracted from modern Mandarin and ancient Chinese works.

Data

The SPS Treebank has three annotation layers: lexical sense and structural mode for dynamic words; syntactic structure for clauses; and inter-clause relations within complex sentences and sentence clusters. These structures can be visualized using the Jbw-viewer tool.

Below are the text data sources and volumes contained in this release:

 
Book Name | Chapters | Characters | Sentences
Selected Work of Luxun (《鲁迅全集》) | 8 | 25,545 | 948
Selected Work of Mao Zedong (《毛泽东选集》) | 2 | 32,454 | 771
From the Soil: The Foundations of Chinese Society (《乡土中国》) | 4 | 16,018 | 532
A Dream in Red Mansions (《红楼梦》) | 5 | 33,087 | 1,781
The Analects of Confucius (《论语》) | 6 | 5,392 | 517
Mencius (《孟子》) | 2 | 6,771 | 467
Total | 27 | 119,267 | 5,016


The data is presented in UTF-8 encoding. Each file contains the three-layer annotation stored in XML format. All files were automatically verified and manually checked.
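
Because the annotation is distributed as UTF-8 XML, a quick first look at a file can be had by counting which element tags it contains. The sketch below is only an illustration using Python's standard library: the element names in the sample string (doc, para, sent) and the function names are made up for the example and are not the treebank's actual schema, which is described in the release documentation.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def tally_tags(root: ET.Element) -> Counter:
    """Count how often each element tag occurs under (and including) root."""
    return Counter(elem.tag for elem in root.iter())


def tally_directory(directory: str) -> Counter:
    """Sum tag counts over every .xml file directly inside a directory."""
    totals = Counter()
    for path in sorted(Path(directory).glob("*.xml")):
        # ET.parse reads the file as bytes and honors its declared encoding.
        totals += tally_tags(ET.parse(path).getroot())
    return totals


# Hypothetical sample: these tag names are placeholders, not the real schema.
SAMPLE = "<doc><para><sent>学而时习之</sent><sent>不亦说乎</sent></para></doc>"
sample_counts = tally_tags(ET.fromstring(SAMPLE))
print(sample_counts)
```

Running tally_directory on the folder that holds the release files (the path depends on where the corpus was unpacked) gives an inventory of the tags actually used, which is a reasonable starting point before writing a schema-specific reader.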

Updates

None at this time.
