Arabic Learner Corpus

Item Name:	Arabic Learner Corpus
Author(s):	Abdullah Alfaifi, Eric Atwell
LDC Catalog No.:	LDC2015S10
ISBN:	1-58563-727-0
ISLRN:	568-308-670-444-7
DOI:	https://doi.org/10.35111/5312-x803
Release Date:	August 15, 2015
Member Year(s):	2015
DCMI Type(s):	Sound, Text
Sample Type:	mp3
Sample Rate:	44100
Data Source(s):	essays, microphone speech
Application(s):	handwriting recognition, language identification, language teaching, machine translation
Language(s):	Standard Arabic
Language ID(s):	arb
License(s):	Arabic Learner Corpus User License Agreement
Online Documentation:	LDC2015S10 Documents
Licensing Instructions:	Subscription & Standard Members, and Non-Members
Citation:	Alfaifi, Abdullah, and Eric Atwell. Arabic Learner Corpus LDC2015S10. Web Download. Philadelphia: Linguistic Data Consortium, 2015.
Related Works:	LDC2014T06 ETS Corpus of Non-Native Written English; LDC2017S16 LDC Spoken Language Sampler - Fourth Release; LDC2022T04 Qatari Corpus of Argumentative Writing; LDC2025T03 The Xi’an Multi-Language Learner Corpus

Introduction

Arabic Learner Corpus was developed at the University of Leeds and consists of written essays and spoken recordings by learners of Arabic, collected in Saudi Arabia in 2012 and 2013. The corpus contains 282,732 words in 1,585 materials produced by 942 students of 67 nationalities studying at pre-university and university levels. The average length of an essay is 178 words.
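
As a quick consistency check on the figures above, the stated average matches the total word count divided by the total number of materials. This is a minimal sketch; the variable names are illustrative and not part of the corpus documentation.

```python
# Figures quoted in the corpus description.
total_words = 282_732
total_materials = 1_585

# Mean number of words per material, rounded to the nearest whole word.
average_words = total_words / total_materials
rounded_average = round(average_words)

print(rounded_average)
```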

Data

Two tasks were used to collect the written data, and participants could complete one or both. Each task asked learners to write a narrative about a vacation trip and a discussion of their study interest. In the first task, participants wrote a 40-minute timed essay without using any language reference materials. In the second task, participants completed the writing as a take-home assignment over two days and were permitted to use language reference materials.

The audio recordings were collected by giving students a limited amount of time to talk about the same topics without using language reference materials.

The original handwritten essays were transcribed into an electronic text format. The corpus data consists of three types: (1) handwritten sheets scanned in PDF format; (2) audio recordings in MP3 format; and (3) Unicode text in plain text and XML formats (including transcripts of the audio recordings and of the handwritten essays). The audio files are MP3 files, either 44,100 Hz 2-channel or 16,000 Hz 1-channel.
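
The three data types described above can be told apart by file format. The following is a minimal sketch of grouping a list of file names by type. The extensions (`.pdf`, `.mp3`, `.txt`, `.xml`), the type labels, and the function name are assumptions inferred from the stated formats; the actual file naming and directory layout of the release are not described here.

```python
from collections import defaultdict
from pathlib import Path

# Assumed mapping from file extension to the data type named in the
# description. The extensions are inferred from the stated formats.
TYPE_BY_EXTENSION = {
    ".pdf": "handwritten_scan",
    ".mp3": "audio",
    ".txt": "text",
    ".xml": "text",
}


def group_by_data_type(paths):
    """Group file names by data type, using the file extension only."""
    groups = defaultdict(list)
    for raw_path in paths:
        path = Path(raw_path)
        # Anything with an unrecognized or missing extension goes to "other".
        kind = TYPE_BY_EXTENSION.get(path.suffix.lower(), "other")
        groups[kind].append(path.name)
    return dict(groups)
```

Calling `group_by_data_type` on a list of paths returns a dictionary keyed by type label, with unrecognized files collected under `"other"`.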
