In the last blog Nawar mentioned the issues we are having with the Arabic version of Wiktionary and its presentation of definitions and alternative words when selecting text on Arabic web sites. The Wiktionary pages do not appear to be as well organised in Arabic as they are in English. They are incomplete and often return incorrect results or no results.
Arabic wiktionary homepage
In a previous blog we showed a diagram that highlighted the importance of organising the stems related to words along with the definitions taken from Wiktionary. The way the words are presented with their changing meanings is important and Maraim and Nawar have been discussing the use of crowd sourcing to achieve a successful outcome as this is not something that can be done immediately if we want to make a useful dictionary that makes the most of open source software alongside content that is also open and accessible to all.
Maraim has written a blog about the subject in Arabic. She explains the concept of crowd sourcing and provides examples of three different dictionaries – Lingoz, Wordia and Collins that have all used this technique to gather data.
Maraim Masoud and I (I am Nawar Habib) have been aiming to improve the accuracy of the Arabic spell checker currently running on ATbar. We have done some research through previous work done in the area. The currently-running spell checker is an ASpell instance using a word list of common Arabic words. It produces good spelling suggestions for long Arabic words (longer than 4 letters) because of the high diffusion between long Arabic words (Which is probably true in any language). High diffusion means that it is not likely that a Typing error in one word would produce another correct word. Arabic roots on the other hand, are 3 or 4-letter words, so a typing error (changing on letter or omitting a letter) would very likely produce another correct root or even another Arabic language constructs like a connective or proposition, and even if the word produced by the error was not an Arabic word, the spelling suggestions might sometimes be confusing for short words because of many alternative possibilities.
Ayaspell is a project aimed at producing an Arabic word list mainly for spell-checking purposes. The creators of Ayaspell also provide a Hunspell based spell checker equipped with their word list. The main issue with their work is that they used traditional Arabic dictionaries as their word source which contain Arabic words that are no longer used. This would confuse the spell checker and decrease the diffusion talked about above in this post. This is the only documented word list we have found and we did a brief test on the Hunspell implementation which did not show good results.
Hence to improve our spell-checker we should:
1- Make sure popular words are added to our word list (the ability to do that exists).
2- Hunspell and ASpell use Phonetic codes to represent words as they sound spoken. This helps in giving suggestions that not just have close spelling but also close pronunciation. For Arabic it is completely different, Arabic words sound as written (With some exceptions like confusing ة with ه, or ي with ى, or ى with ا, or أ with ا), hence, spelling errors happen accidentally (Button Proximity). But still the phonetic code should be utilized in Arabic but new methods should be added to accurately calculate the distance between words (like Adding Grammar-checking).
We had a problem with Wiktionary’s service API. Wiktionary, when asked for a word definition, conducts an exact-match search on Arabic words, so, if the submitted word has prefixs or suffixes or a definitive article, the word would not be found. To solve this we are creating a light stemmer that operates as preprocessor before the word is looked-up in the dictionary. The light stemmer has a smll CPU footprint because it does not use a word list (only Grammer rules), unlike heavy stemmers which use word list to increase accuracy but decrease performance.