===================================
Arabic Learner Corpus (ALC) 2.0
http://www.arabiclearnercorpus.com
===================================

Authors:
________

Abdullah Alfaifi
Al Imam Mohammad Ibn Saud Islamic University (IMSIU)
ayjfaifi@imamu.edu.sa
Leeds University
scayga@leeds.ac.uk
http://www.comp.leeds.ac.uk/scayga

Eric Atwell
University of Leeds
e.s.atwell@leeds.ac.uk
http://www.comp.leeds.ac.uk/eric

-----------------------------------

Data Type:
__________

Version 2.0 of ALC contains raw data in three parts:
1. Transcriptions of handwriting (76%)
2. Writing on computer (17%)
3. Transcriptions of audio recordings (7%)

ALC data is available in two formats:

Plain text:
1. Plain text with no metadata
2. Plain text with Arabic metadata
3. Plain text with English metadata

XML:
4. XML with Arabic metadata
5. XML with English metadata

The original hand-written sheets and audio recordings:
- The original hand-written sheets are available in PDF format.
- The speech recordings (3+ hours) are available in MP3 format.

Permission has been obtained from all contributors to ALC to publish and use their data.

-----------------------------------

ALC data source:
____________

ALC data was collected from learners of Arabic.

Number of learners who contributed to ALC: 942
Number of texts: 1585
Number of words: 282,732

ALC contributors:
_________________

- Age: from 16 to 42
- Gender: males and females
- First language: 66 mother tongues (listed below)
- Nationality: 67 nationalities (listed below)
- Nativeness: native speakers of Arabic and non-native speakers of Arabic
- Number of languages spoken: from 1 to 10
- Number of years learning Arabic: from 1 to 19
- Number of years spent in Arabic countries: from 1 to 21
- General level of education: pre-university and university
- Level of study: five levels: (1) general language course, (2) diploma language course, (3) secondary school, (4) BA and (5) MA
- Year/Semester: from the 1st to 3rd year or from the 1st to 8th semester
- Educational institutions with which the learners are affiliated: 25

Learners' first languages:
__________________________

Afar, Albanian, Amharic, Anko, Arabic, Azerbaijani, Bambara, Bengali,
Beninese, Bosnian, Cambodian, Chinese, Comorian, Dagomba, Dutch, English,
Filipino, French, Fulani, Hausa, Indian, Indonesian, Ingush, Italian, Jola,
Kalibugan, Kazakh, Korean, Kotokoli, Kurdish, Kyrgyz, Madurese, Maguindanao,
Malay, Malayalam, Mandinka, Manga, Maranao, Modnaka, Mongolian, Moore, Nepali,
Pashto, Persian, Polish, Portuguese, Russian, Sango, Serbian, Somali, Soninke,
Susu, Swahili, Tagalog, Tajik, Tamil, Tatar, Thai, Turkish, Ugandan, Urdu,
Uzbek, Wolof, Yakan, Yoruba, Zarma

Learners' nationalities:
_________________________

Afghan, Albanian, American, Azerbaijani, Belgian, Bengali, Beninese, Bosnian,
British, Burkina Faso, Burundi, Cambodian, Canadian, Central African, Chinese,
Comorian, Djibouti, Dutch, Egyptian, Ethiopian, Filipino, French, Gambian,
German, Ghanaian, Guinean, Indian, Indonesian, Italian, Ivorian, Jordanian,
Kenyan, Kosovar, Kyrgyz, Lebanese, Liberian, Macedonian, Malaysian, Malian,
Mongolian, Montenegro, Nepalese, Niger, Nigerian, Pakistani, Palestinian,
Polish, Russian, Saudi, Senegalese, Serbian, Sierra Leonean, Somali,
South Korean, Sri Lankan, Sudanese, Syrian, Tajik, Tanzanian, Tenge, Thai,
Togolese, Turkish, Ugandan, Ukrainian, Uzbek, Yemeni

ALC texts:
__________

- Text genre: 67% narrative and 33% discussion
- Where produced: 69% in class and 31% at home
- Year of production: 12% in 2012 and 88% in 2013
- Country of production: 100% in Saudi Arabia
- City of production: eight cities (87% Riyadh, 9% Alqatif, 4% Makkah, 3% Jeddah, 3% Alkharj, 2% Aljesh, 1% Hafr Albatin and 1% Mahayil Asir)
- Timed or not timed task: 69% timed and 31% not timed
- References use: used in 5% and not used in 95%
- Grammar book use: used in 2% and not used in 98%
- Monolingual dictionary use: used in 1% and not used in 95%
- Bilingual dictionary use: used in 2% and not used in 98%
- Other references use: used in 2% and not used in 98%
- Text mode: 93% written and 7% spoken
- Text medium: 76% written by hand, 17% written on computer and 7% recorded interviews
- Text length: the average length of a text is 178 words

-----------------------------------

ALC applications:
_________________

The potential uses of ALC include:
1. Computer-Aided Error Analysis (CEA)
2. Contrastive Interlanguage Analysis (CIA)
3. Learner dictionary making
4. Research in Second Language Acquisition (SLA)
5. Language Teaching (LT)
6. Designing pedagogical materials
7. Automatic error detection
8. Optical Character Recognition (OCR) (ALC contains hand-written texts in PDF format and their transcriptions in a computerised format)

-----------------------------------

ALC Language:
_____________

The language of ALC is Standard Arabic (arb), which is part of the macrolanguage Arabic (ara).

-----------------------------------

ALC description:
________________

Arabic Learner Corpus (ALC) contains a collection of written essays and spoken recordings produced on two topics, narrative (a vacation trip) and discussion (my study interest), by learners of Arabic in Saudi Arabia in 2012 and 2013.
The corpus includes 282,732 words in 1585 texts, produced by 942 students of 67 nationalities and 66 different L1 backgrounds, studying at pre-university and university levels. The average length of a text is 178 words.

The corpus provides an open source of data for use in different linguistic research areas such as Language Teaching and Learning, Applied Linguistics and Lexicography, as well as for other purposes such as Error Analysis, Learners’ Improvement Monitoring, Language Materials Design, Contrastive Interlanguage Analysis, and Building Learners’ Dictionaries and Common Errors Dictionaries.

ALC presents three types of data: (1) textual data in txt (Unicode) and XML formats, (2) scanned hand-written sheets in PDF format, and (3) audio recordings in MP3 format.

Data collection:

1. Written data
Two tasks were used to collect the data, and the participants could choose to do either one of them or both. Both tasks involved the same topics (narrative: a vacation trip, and discussion: my study interest):
- The first task was timed (40 minutes for each text), and the learners were not allowed to consult any language references, such as dictionaries or grammar books, while writing their essays.
- In the second task, students were asked to write their essays on the same topics, but as homework. They were allowed two days to complete the homework and could use any language references they wanted. This gave them enough time to complete the task and the chance to improve their writing before submitting it.

Two form types were used to collect the written data: (1) a paper form for schools and departments with no computer labs, which required a post-process to transcribe the texts into a computerised format, and (2) an equivalent online form for schools and departments that had labs, so that learners’ texts were included in the corpus without any post-processing.

2. Spoken data
The first task was also used to collect the oral data. Learners had a limited time to give a talk about their chosen topic without using any language references. All talks were recorded as MP3 files. Because of differences in recording conditions, some assistants were not able to use the corpus devices, which produce 44100Hz 2-channel files, so they used a different device that yielded 16000Hz 1-channel files.

Span of time:
ALC encompasses two years, 2012 and 2013.

-----------------------------------

File naming:
_____________

Names of ALC files indicate the basic characteristics of the text and its author (e.g. S038_T2_M_Pre_NNAS_W_C). The fields, separated by "_", appear in the following order:
- Student identifier number
- Text number for this student
- Author gender (M=male, F=female)
- Level of study (Pre=pre-university, Uni=university)
- Nativeness (NAS=native Arabic speaker, NNAS=non-native Arabic speaker)
- Text mode (W=written, S=spoken)
- Place of text production (C=in class, H=at home)

-----------------------------------
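
Example (illustrative only, not part of the ALC distribution): the naming scheme above could be decoded with a short Python sketch along the following lines. The names ALCName, NAME_RE, LABELS and parse_alc_name are hypothetical. The sketch assumes that student identifiers always look like "S" followed by digits and text numbers like "T" followed by digits, which is inferred from the single example S038_T2_M_Pre_NNAS_W_C, and that any file extension can simply be dropped.

```python
import re
from pathlib import PurePath
from typing import NamedTuple


class ALCName(NamedTuple):
    """Fields encoded in an ALC file name (hypothetical helper type)."""
    student_id: str
    text_number: int
    gender: str
    level: str
    nativeness: str
    mode: str
    place: str


# Assumption: "S" + digits and "T" + digits, as in S038_T2_M_Pre_NNAS_W_C.
NAME_RE = re.compile(
    r"(?P<student>S\d+)_T(?P<text>\d+)_(?P<gender>[MF])_(?P<level>Pre|Uni)"
    r"_(?P<nativeness>NAS|NNAS)_(?P<mode>[WS])_(?P<place>[CH])"
)

# Code-to-label tables taken from the field list in this README.
LABELS = {
    "gender": {"M": "male", "F": "female"},
    "level": {"Pre": "pre-university", "Uni": "university"},
    "nativeness": {
        "NAS": "native Arabic speaker",
        "NNAS": "non-native Arabic speaker",
    },
    "mode": {"W": "written", "S": "spoken"},
    "place": {"C": "in class", "H": "at home"},
}


def parse_alc_name(filename: str) -> ALCName:
    """Decode an ALC file name; raise ValueError if it does not fit the scheme."""
    stem = PurePath(filename).stem  # drops an extension such as ".xml", if any
    match = NAME_RE.fullmatch(stem)
    if match is None:
        raise ValueError(f"not an ALC-style file name: {filename!r}")
    return ALCName(
        student_id=match["student"],
        text_number=int(match["text"]),
        gender=LABELS["gender"][match["gender"]],
        level=LABELS["level"][match["level"]],
        nativeness=LABELS["nativeness"][match["nativeness"]],
        mode=LABELS["mode"][match["mode"]],
        place=LABELS["place"][match["place"]],
    )


if __name__ == "__main__":
    print(parse_alc_name("S038_T2_M_Pre_NNAS_W_C"))
```

For the example name, the sketch reports student S038, text 2, a male pre-university non-native Arabic speaker, and a written text produced in class. If the real corpus uses identifiers in a different shape, the regular expression would need adjusting.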