
The Xi’an Multi-Language Learner Corpus

Item Name:	The Xi’an Multi-Language Learner Corpus
Author(s):	Xiao Zhang, Ling Zhang, Tian Dang, Yuanzhao Feng, Yujing Ji, Xiaohui Jiang, Zhewen Kang, Yan Lu, Wen Nie, Hanyu Ren, Canjun Wang, Jiayi Wang, Yu Wang, Chen Wu, Mei Wu, Tingting Xu, Ruhai Yang, Kai Zhao, Ran Zhao, Quanjie Zhou, Lei Zhu
LDC Catalog No.:	LDC2025T03
ISLRN:	615-404-265-320-6
DOI:	https://doi.org/r333-vr13
Release Date:	March 17, 2025
Member Year(s):	2025
DCMI Type(s):	Text
Data Source(s):	essays
Application(s):	cross-linguistic comparison, language learning
Language(s):	Arabic, Filipino, English, French, German, Hindi, Indonesian, Korean, Malay, Persian, Russian, Swahili, Thai, Turkish, Urdu
Language ID(s):	ara, fil, eng, fra, deu, hin, ind, kor, msa, fas, rus, swa, tha, tur, urd
License(s):	LDC User Agreement for Non-Members
Online Documentation:	LDC2025T03 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Zhang, Xiao, et al. The Xi’an Multi-Language Learner Corpus LDC2025T03. Web Download. Philadelphia: Linguistic Data Consortium, 2025.
Related Works:	relatesTo: LDC2014T06 ETS Corpus of Non-Native Written English; LDC2015S10 Arabic Learner Corpus; LDC2022T04 Qatari Corpus of Argumentative Writing. isProcessedBy: AntConc 4.2.4 (https://www.laurenceanthony.net/software/antconc/); LancsBox X 4.0 (https://lancsbox.lancs.ac.uk/)

Introduction

The Xi’an Multi-Language Learner Corpus was developed by Xi’an International Studies University (XISU). It comprises 526 argumentative essays in 15 languages written by Chinese L1 university students studying second languages, along with student metadata and writing prompts. It was developed to support second language learner research and to provide a database for cross-linguistic comparison of second languages.

Data

The essays were produced by undergraduate students at XISU and Yunnan Minzu University (YMU) in response to writing prompts prepared by the corpus development team. Data was collected in 2023 and 2024. Participating students were linguistics majors or were studying one of the foreign languages offered at XISU and YMU. Off-topic essays and incomplete texts were excluded.

All texts were cleaned and formatted. The students' language was left as written: no corrections were made to grammatical tense or to the accuracy of phrasing.

Text and token counts by language are as follows:

Language	Texts	Tokens
Arabic	8	1,762
English	107	32,822
Filipino	10	1,371
French	129	39,944
German	78	10,941
Hindi	16	2,972
Indonesian	14	2,630
Korean	24	2,630
Malay	36	5,208
Persian	12	1,751
Russian	33	8,018
Swahili	10	1,840
Thai	12	1,661
Turkish	22	3,719
Urdu	15	3,645
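
As a quick consistency check, the per-language figures in the table can be totaled. The following is a minimal Python sketch; the `COUNTS` dictionary is an illustrative name, and its values were transcribed by hand from the table rather than read from the corpus itself.

```python
# Per-language (texts, tokens), transcribed by hand from the table above.
COUNTS = {
    "Arabic": (8, 1762),
    "English": (107, 32822),
    "Filipino": (10, 1371),
    "French": (129, 39944),
    "German": (78, 10941),
    "Hindi": (16, 2972),
    "Indonesian": (14, 2630),
    "Korean": (24, 2630),
    "Malay": (36, 5208),
    "Persian": (12, 1751),
    "Russian": (33, 8018),
    "Swahili": (10, 1840),
    "Thai": (12, 1661),
    "Turkish": (22, 3719),
    "Urdu": (15, 3645),
}

# Sum each column across all 15 languages.
total_texts = sum(texts for texts, _ in COUNTS.values())
total_tokens = sum(tokens for _, tokens in COUNTS.values())

print(f"{total_texts} texts, {total_tokens:,} tokens")
```

The text total comes to 526, which agrees with the number of essays stated in the introduction.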

LancsBox X 4.0 was used for counting Swahili, Persian, French, Urdu, and Hindi tokens. AntConc 4.2.4 was used for counting tokens in the other languages.

The essays and writing prompts are stored in UTF-8 encoded plain text files. Metadata is presented in .csv files.
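
Files in these formats can be read with the Python standard library alone. The sketch below is illustrative only: the function names are hypothetical, and it assumes that the metadata .csv files are also UTF-8 encoded and begin with a header row, neither of which is stated here. Consult the corpus documentation for the actual file names and column layout.

```python
import csv
from pathlib import Path


def read_essay(path):
    """Return the text of one essay or prompt file, decoded as UTF-8."""
    return Path(path).read_text(encoding="utf-8")


def read_metadata(path):
    """Return the rows of a metadata .csv file as a list of dicts.

    Assumption: the file is UTF-8 encoded and its first row is a header.
    """
    # "utf-8-sig" reads plain UTF-8 and also drops a leading byte-order mark.
    with open(path, newline="", encoding="utf-8-sig") as handle:
        return list(csv.DictReader(handle))
```

Passing the encoding explicitly avoids depending on the platform default, which matters for the non-Latin scripts in the corpus such as Arabic, Hindi, Korean, Persian, Russian, Thai, and Urdu.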
