========================================================
1) What is it?
========================================================

Item Name: Classical Arabic Dictionary (words with occurrence)
Author(s): Abeer Alsheddi
Type(s): Text, Lexicon
Data Source(s): web collection, dictionaries, essay
Application(s): automatic content extraction, historical linguistics, language generation, language modeling, machine learning, morphology
Language(s): Arabic
Format: Plain Text
Text Size (words): 1138616
Character Encoding: UTF-8, CP1256
Script: Arab

Ownership Details
* The classical Arabic dictionary gathers Arabic texts dating from 431 to 1104. The main texts (books and essays) are available online, and anyone can download them. The resources (links) are listed in docs/index.

Description
The classical Arabic dictionary gathers Arabic texts dating from 431 to 1104. The dictionary, which pairs each word with its number of occurrences, is available in two formats (.sql and .txt).
The classical Arabic dictionary was developed as part of a master's thesis named "Edit Distance Adapted to Natural Language Words".
It consists of four parts: data/dictionary, docs/CP1256, docs/UTF and docs/index.

1. data/dictionary
------------
The corpus gathers Arabic texts dating from 431 to 1104 (in the Hijri calendar, between 130 BH and 498 H). It contains about 122 million (121,799,416) words in total, of which more than one million (1,138,616) are distinct, with an average length of about six (6.220) letters per word. These words consist only of Arabic letters, without diacritics, numbers or symbols.
The dictionary, which pairs each word with its number of occurrences, is available in two formats (.sql and .txt). The database file (.sql) sets CHARSET to cp1256 and COLLATE to cp1256_bin so that it can handle all forms of Arabic letters.

2. docs/CP1256 and docs/UTF
------------
The Arabic text files are in txt format, in two different encodings: UTF-8 and CP1256.
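
Because the same texts are provided in both encodings, a reader may want one routine that handles either. The following is a minimal Python sketch, not part of the distributed resource: the function names decode_arabic and read_arabic_text are hypothetical, and it assumes each file is entirely in one of the two encodings named above. UTF-8 is tried first because Python's cp1256 codec accepts any byte sequence, so it would never signal a mismatch.

```python
from pathlib import Path

# Encodings named in this README, tried in this order.
ENCODINGS = ("utf-8", "cp1256")


def decode_arabic(raw: bytes, encodings=ENCODINGS) -> str:
    """Decode raw bytes with the first encoding that accepts them."""
    for enc in encodings:
        try:
            return raw.decode(enc)
        except UnicodeDecodeError:
            continue  # not valid in this encoding, try the next one
    raise ValueError("bytes are not valid in any of: " + ", ".join(encodings))


def read_arabic_text(path) -> str:
    """Read a text file from docs/UTF or docs/CP1256 as a Python string."""
    return decode_arabic(Path(path).read_bytes())
```

For example, read_arabic_text("docs/CP1256/some_file.txt") would return the decoded text, where some_file.txt is a placeholder and not an actual file name from the corpus.
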
File names follow the pattern *.#.txt, where * stands for the author and # for that author's resource.

3. docs/index
------------
The index sheet is a table of historical information about the legacy texts. Each record in the index presents some or all of the following information, according to availability: {name of resource - author - date of death - place of (birth, death, living) - number of words - reference}. The index is written in Arabic. It orders the files by author. In total, around 364 authors wrote the 1,060 Arabic texts.

========================================================
2) Who can use it?
========================================================

It can be used in the following areas, among others:
- Segmenting words based on their occurrence.
- Searching and extraction.

========================================================
3) How can it be used?
========================================================

To use the .sql file:
1. Navigate to the directory that contains the .sql file.
2. Run the following command to import the corpus:

    mysql -u user_name -p -D database_name < dictionary.sql

========================================================
4) Contact
========================================================

Questions or feedback?
Email: abeer.alsheddi@gmail.com