Arabic Spell Checker

Maraim Masoud and I (I am Nawar Habib) have been aiming to improve the accuracy of the Arabic spell checker currently running on ATbar.  We have done some research through previous work done in the area. The currently-running spell checker is an ASpell instance using a word list of common Arabic words. It produces good spelling suggestions for long Arabic words (longer than 4 letters) because of the high diffusion between long Arabic words (Which is probably true in any language). High diffusion means that it is not likely that a Typing error in one word would produce another correct word. Arabic roots on the other hand, are 3 or 4-letter words, so a typing error (changing on letter or omitting a letter) would very likely produce another correct root or even another Arabic language constructs like a connective or proposition, and even if the word produced by the error was not an Arabic word, the spelling suggestions might sometimes be confusing for short words because of many alternative possibilities.

Ayaspell is a project aimed at producing an Arabic word list mainly for spell-checking purposes. The creators of Ayaspell also provide a Hunspell based spell checker equipped with their word list. The main issue with their work is that they used traditional Arabic dictionaries as their word source which contain Arabic words that are no longer used. This would confuse the spell checker and decrease the diffusion talked about above in this post. This is the only documented word list we have found and we did a brief test on the Hunspell implementation which did not show good results.

Hence to improve our spell-checker we should:

1- Make sure popular words are added to our word list (the ability to do that exists).
2- Hunspell and ASpell use Phonetic codes to represent words as they sound spoken. This helps in giving suggestions that not just have close spelling but also close pronunciation. For Arabic it is completely different, Arabic words sound as written (With some exceptions like confusing ة with ه, or ي with ى, or ى with ا, or أ with ا), hence, spelling errors happen accidentally (Button Proximity). But still the phonetic code should be utilized in Arabic but new methods should be added to accurately calculate the distance between words (like Adding Grammar-checking).

We had a problem with Wiktionary’s service API. Wiktionary, when asked for a word definition, conducts an exact-match search on Arabic words, so, if the submitted word has prefixs or suffixes or a definitive article, the word would not be found. To solve this we are creating a light stemmer that operates as preprocessor before the word is looked-up in the dictionary. The light stemmer has a smll CPU footprint because it does not use a word list (only Grammer rules), unlike heavy stemmers which use word list to increase accuracy but decrease performance.