The Xi’an Multi-Language Learner Corpus

Author(s): Xiao Zhang, Ling Zhang, Tian Dang, Yuanzhao Feng, Yujing Ji, Xiaohui Jiang, Zhewen Kang, Yan Lu, Wen Nie, Hanyu Ren, Canjun Wang, Jiayi Wang, Yu Wang, Chen Wu, Mei Wu, Tingting Xu, Ruhai Yang, Kai Zhao, Ran Zhao, Quanjie Zhou, Lei Zhu
DCMI Type(s): Text
Data Source(s): Essays
Language(s): Arabic, Filipino, English, French, German, Hindi, Indonesian, Korean, Malay, Persian, Russian, Swahili, Thai, Turkish, Urdu
Language ID(s): ARA, FIL, ENG, FRE, GER, HIN, IND, KOR, MAY, PER, RUS, SWA, THA, TUR, URD

Introduction

The Xi’an Multi-Language Learner Corpus was developed by the Team of Corpus of Multi-languages at Xi'an International Studies University. The corpus contains essays written by Chinese university students who are foreign language majors. It was developed to support learner corpus research. The release is intended to provide a resource for investigating L2 writing in multiple languages by L1 Chinese university students, and a database for cross-linguistic comparison of second languages.

Data

With official approval from Xi’an International Studies University (XISU) and Yunnan Minzu University (YMU) for data collection, we invited undergraduate students to participate in the project. Each participant signed a consent form and completed a survey form. The data collection procedure followed the BERA Ethical Guidelines for Educational Research (4th edition). Each participating student majors in one of the foreign languages offered at XISU and YMU. All of the data from XISU was collected in 2024 in class under controlled conditions. Essays written by undergraduates at YMU were collected from the final exams in 2023 and 2024; their consent forms and survey forms were collected in 2024.

The Xi’an Multi-Language Learner Corpus consists of: L2 argumentative essays written in Arabic (n=8), English (n=107), French (n=129), German (n=78), Hindi (n=16), Indonesian (n=14), Korean (n=24), Malay (n=36), Persian (n=12), Russian (n=33), Thai (n=10), and Turkish (n=22); L2 narrative essays written in Swahili (n=10) and Urdu (n=15); and L2 descriptive essays written in Filipino (n=10) and Thai (n=2). The writing prompts given to the participants were prepared by our team members. Prompts for the 15 languages were prepared according to the participants’ levels of language proficiency. All writing prompts are released with the corpus. As a specialized corpus of L2 writing by Chinese university students, the corpus consists of 526 essays totaling 120,914 tokens. Off-topic essays and incomplete texts were excluded. Data is stored in UTF-8 encoded plain text files. Details of the corpus are presented in Table 1.

Table 1. Composition of The Xi’an Multi-Language Learner Corpus

Language      # texts   Average essay length   Essay length range   Tokens
                        (word count)           (word count)
Arabic        8         218.8                  151-500              1,762
English       107       303.3                  184-441              32,822
Filipino      10        132.2                  77-203               1,371
French        129       310.2                  113-794              39,944
German        78        139.8                  77-230               10,941
Hindi         16        185.8                  141-230              2,972
Indonesian    14        184.9                  121-262              2,630
Korean        24        109.5                  57-207               2,630
Malay         36        142.5                  55-207               5,208
Persian       12        145.8                  117-196              1,751
Russian       33        243.0                  134-312              8,018
Swahili       10        184.5                  150-217              1,840
Thai          12        141.8                  107-239              1,661
Turkish       22        169.2                  115-221              3,719
Urdu          15        242.1                  149-297              3,645
Total         526       /                      /                    120,914

All texts underwent cleaning and formatting. No corrections were made to the texts with respect to the accuracy of grammatical tense or turns of phrase. #LancsBox X 4.0 was used for token counting in Swahili, Persian, French, Urdu, and Hindi. AntConc 4.2.4 was used for token counting in the other languages.

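The totals reported in Table 1 can be cross-checked with a few lines of Python. The sketch below is only an illustration: the per-language figures are transcribed by hand from Table 1, and the names `TABLE_1` and `corpus_totals` are hypothetical and not part of the corpus release.

```python
# Per-language figures transcribed from Table 1: (number of texts, tokens).
# TABLE_1 and corpus_totals are illustrative names, not part of the release.
TABLE_1 = {
    "Arabic": (8, 1762),
    "English": (107, 32822),
    "Filipino": (10, 1371),
    "French": (129, 39944),
    "German": (78, 10941),
    "Hindi": (16, 2972),
    "Indonesian": (14, 2630),
    "Korean": (24, 2630),
    "Malay": (36, 5208),
    "Persian": (12, 1751),
    "Russian": (33, 8018),
    "Swahili": (10, 1840),
    "Thai": (12, 1661),
    "Turkish": (22, 3719),
    "Urdu": (15, 3645),
}


def corpus_totals(table):
    """Return (total texts, total tokens) summed over all languages."""
    total_texts = sum(texts for texts, _ in table.values())
    total_tokens = sum(tokens for _, tokens in table.values())
    return total_texts, total_tokens


total_texts, total_tokens = corpus_totals(TABLE_1)
print(total_texts, total_tokens)  # 526 120914
```

The sums agree with the "Total" row of Table 1 (526 essays, 120,914 tokens).
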
Metadata annotation covers (1) text metadata: serial number of the text, genre, word count, tokens, types, number of sentences, date of writing, time spent on writing, place of writing, and serial number of the writing prompt; and (2) participant metadata: citizenship, ethnic group, age, gender, grade, major, first language, second language, starting age of learning the L2, length of learning the L2, whether the L2 was learned at school or self-taught, other foreign language(s), participation in L2 or other foreign language proficiency tests and the name of the test, level or total score on the test, and parents’ first language. L2s are represented by ISO codes in the text metadata. For ease of interpretation, the full name of each Chinese dialect and each language used by Chinese ethnic groups is given in the participant metadata. The metadata files are provided in UTF-8 encoded CSV format.

Acknowledgement

The publication of The Xi’an Multi-Language Learner Corpus is a joint effort of the Team of Corpus of Multi-languages at Xi'an International Studies University and two faculty members at Yunnan Minzu University.

Ownership Details

The Team of Corpus of Multi-languages at Xi'an International Studies University owns the data.