Category Archives: Arabic spell checker

Arabic Spell Checker

Maraim Masoud and I (Nawar Habib) have been working to improve the accuracy of the Arabic spell checker currently running on ATbar, starting with a review of previous work in the area. The current spell checker is an ASpell instance using a word list of common Arabic words. It produces good spelling suggestions for long Arabic words (longer than 4 letters) because of the high diffusion between long Arabic words (which is probably true in any language). High diffusion means that a typing error in one word is unlikely to produce another correct word. Arabic roots, on the other hand, are 3- or 4-letter words, so a typing error (changing one letter or omitting a letter) would very likely produce another correct root, or even another Arabic language construct such as a connective or preposition. Even if the word produced by the error is not an Arabic word, the spelling suggestions can be confusing for short words because there are many alternative possibilities.
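To make the point about short words concrete, here is a minimal Python sketch that counts how many entries in a word list sit exactly one edit away from a given word. The tiny word list is a hypothetical example chosen for illustration, not the ATbar list, and the distance used is plain Levenshtein distance rather than whatever ASpell computes internally.

```python
def edit_distance(a: str, b: str) -> int:
    """Classic Levenshtein distance: insert, delete and substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(
                prev[j] + 1,               # delete a character from a
                curr[j - 1] + 1,           # insert a character into a
                prev[j - 1] + (ca != cb),  # substitute (free if equal)
            ))
        prev = curr
    return prev[-1]


# Hypothetical word list: five 3-letter words and three longer words.
WORDS = ["كتب", "كتف", "كلب", "قلب", "كذب", "مكتبة", "مدرسة", "مستشفى"]


def neighbours(word: str, words=WORDS) -> list:
    """Return every other word in the list that is exactly one edit away."""
    return [w for w in words if w != word and edit_distance(word, w) == 1]


if __name__ == "__main__":
    for w in WORDS:
        print(w, len(neighbours(w)))
```

In this toy list the short word كتب has three one-edit neighbours, while the longer word مكتبة has none, which is the behaviour described above.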

Ayaspell is a project aimed at producing an Arabic word list, mainly for spell-checking purposes. The creators of Ayaspell also provide a Hunspell-based spell checker equipped with their word list. The main issue with their work is that they used traditional Arabic dictionaries as their word source, and these contain Arabic words that are no longer used. Such words would confuse the spell checker and decrease the diffusion described earlier in this post. This is the only documented word list we have found, and a brief test of the Hunspell implementation did not show good results.

Hence, to improve our spell checker we should:

1- Make sure popular words are added to our word list (the ability to do that exists).
2- Hunspell and ASpell use phonetic codes to represent words as they sound when spoken. This helps in giving suggestions that are close not just in spelling but also in pronunciation. Arabic is different: Arabic words sound as written (with some exceptions, such as confusing ة with ه, ي with ى, ى with ا, or أ with ا), so most spelling errors are accidental (key proximity). The phonetic code should still be used for Arabic, but new methods should be added to calculate the distance between words more accurately (such as adding grammar checking).
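One possible way to fold the letter confusions listed above into the distance calculation is a weighted edit distance in which swapping a commonly confused pair costs less than an arbitrary substitution. The sketch below is only an illustration: the function names and the cost of 0.5 are assumptions, and this is not how ASpell or Hunspell score suggestions.

```python
# Letter pairs named in the post as commonly confused by writers.
CONFUSABLE = {
    frozenset(pair)
    for pair in [("ة", "ه"), ("ي", "ى"), ("ى", "ا"), ("أ", "ا")]
}


def substitution_cost(a: str, b: str) -> float:
    """0 for identical letters, 0.5 (an assumed value) for a confusable pair, else 1."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return 0.5
    return 1.0


def weighted_distance(s: str, t: str) -> float:
    """Levenshtein distance with cheaper substitutions for confusable letters."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, cs in enumerate(s, start=1):
        curr = [float(i)]
        for j, ct in enumerate(t, start=1):
            curr.append(min(
                prev[j] + 1.0,                            # deletion
                curr[j - 1] + 1.0,                        # insertion
                prev[j - 1] + substitution_cost(cs, ct),  # substitution
            ))
        prev = curr
    return prev[-1]


if __name__ == "__main__":
    print(weighted_distance("مدرسة", "مدرسه"))  # confusable final letter
    print(weighted_distance("مدرسة", "مدرسب"))  # unrelated final letter
```

With these weights, a candidate that differs only by ة versus ه ranks ahead of one that differs by an unrelated letter. A keyboard-proximity cost could be added in the same way, given a table of adjacent keys.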

We had a problem with Wiktionary’s service API. When asked for a word definition, Wiktionary performs an exact-match search on Arabic words, so if the submitted word has prefixes, suffixes or a definite article, the word is not found. To solve this we are creating a light stemmer that operates as a preprocessor before the word is looked up in the dictionary. The light stemmer has a small CPU footprint because it does not use a word list (only grammar rules), unlike heavy stemmers, which use a word list to increase accuracy at the cost of performance.
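As a rough illustration of the idea, a rule-only light stemmer can be sketched in a few lines of Python. The prefix and suffix lists and the minimum stem length below are assumptions picked for the example; they are not the rules used in the ATbar stemmer.

```python
# Hypothetical affix lists. Longer prefixes come first so that "وال" is
# tried before "ال" and "و".
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي"]
MIN_STEM = 3  # never cut a word below the length of a 3-letter root


def light_stem(word: str) -> str:
    """Strip at most one prefix and one suffix using fixed rules only."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM:
            word = word[len(prefix):]
            break
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["الكتاب", "المعلمون", "ولد"]:
        print(w, "->", light_stem(w))
```

The length guard stops the stemmer from damaging short words: ولد begins with و, but stripping it would leave only two letters, so the word is returned unchanged. No word list is consulted at any point, which is what keeps the cost low.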

Spell Checking Plugin Update

The spell checking plugin has been updated to record more detail about spelling errors. It now also records the sentence containing the error to provide context for the spell checking service. However, in order to comply with the Data Protection Act 1998, we ask users if they would like to provide the data anonymously.

When spell checking is complete, the user is asked if they would like to submit anonymous usage data. This data is displayed to ensure they know what they are submitting.

End of Year update

Spell Checking Service

The spell checking service has been updated and analysed by Nawar. One of the conclusions is that error checking for long single words is relatively accurate without context. However, there are two problems with short words that are typed incorrectly. The first is that the mistake can turn the word into another word that is spelled correctly but is not appropriate for the context, so the mistake is not picked up. The second is that if one small error has been made in a short word, there are often too many options as to how the word could be spelt. The spell checker does not cope with grammatical errors and is unable to see the context of words.

Because the spell checker does not ‘use’ any of the words around the error, Magnus is having to develop a system that records the words typed before the error and then captures a few words after it. This is not as easy as it sounds! The service for correcting errors is in place, without the sentences at present.
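The basic idea of capturing a window of words around an error can be sketched as follows. This is a simplified Python illustration, not the plugin code: the function name and window sizes are assumptions, words are split on whitespace only, and only the first occurrence of the error word is used.

```python
from typing import Dict, Optional


def context_window(text: str, error_word: str,
                   before: int = 3, after: int = 3) -> Optional[Dict]:
    """Return the words around the first occurrence of error_word, or None."""
    words = text.split()
    if error_word not in words:
        return None
    i = words.index(error_word)
    return {
        "before": words[max(0, i - before):i],   # up to `before` words
        "error": error_word,
        "after": words[i + 1:i + 1 + after],     # up to `after` words
    }


if __name__ == "__main__":
    sentence = "the quick brwon fox jumps over the lazy dog"
    print(context_window(sentence, "brwon", before=2, after=2))
```

Part of what makes the real task harder is that text in a browser arrives keystroke by keystroke, so the words after the error do not exist yet at the moment the error is detected.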

Server Side Support

All aspects of the websites and toolbar that required the move to ‘https’ have now been moved. This may not appear to be important to users, but it has been done to allow the ATbar and its services to be used on any secure sites, such as banking services. The ‘https’ is a way of telling people that you are a trusted source. Magnus has obtained SSL certificates for the majority of our services; these will expire in 2015. The ATbar and its services now sit on a new virtual server. We are still looking into the possibility of having a redundant server in case the one we are using fails, but this is a costly exercise.

As part of this process all versions of ATbar are now automatically updated.

For the latest version of ATbar please find it here: https://core.atbar.org/atbar/en/latest/atbar.min.js

Documentation

Documentation is available on a wiki and on GitHub.

Instructions are available in Arabic and English.

Dictionary

We have looked into possible alternative dictionaries instead of using Wiktionary. Wiktionary has a very limited word list and poor definitions when used in Arabic. Of the freely available dictionaries, Word Reference looked promising as it has a comprehensive English to Arabic translation database which is also a dictionary. It has an API but sadly no Arabic > Arabic with definitions or even stems.

One of the problems we face is that true Arabic dictionaries are structured in a different way to western ones. Many of the dictionaries we have looked at include some stem information but lack the more comprehensive information required to help users (example).

We need to understand the use of the dictionary required on ATbar in order to be able to provide the correct service. So any comments would be very welcome.

Desktop ATbar

We have developed a Desktop ATbar with magnification, screen reading, colour overlay with screen ruler and an on-screen keyboard. It is still in beta and we are in the process of improving its accessibility, including the tab order and the icons. It is hoped a final release will be available next week. The toolbar has been tested on Windows 7/8 and should be backward compatible. It has not been developed for Mac OS.

The code for the toolbar is open source and available for download from GitHub. We have included concise but comprehensive inline documentation between code segments. Several free open source libraries have been used as part of the project and adjusted to suit our needs.

Now, we are making sure the toolbar is easier to install and there are several issues to consider:

Antivirus software blocking the toolbar.
Installing newer versions of the bar on top of old ones.
Making the bar easy to use with shortcuts while avoiding shortcut collisions.

Please do leave your comments on any items we have discussed.

All good wishes for the New Year. Till 2013

ATBar Services and wiki site now available – Spell Checker service developed for additional support.

ATbar services offers links to other parts of the ATkit such as the marketplace of plugins, news and statistics and a new area for services to improve the ATbar.

The Spell Checker service allows users to log in and adapt the spell checking feature on the toolbar by correcting words found in the spell checking dictionary and adding new options for the error correction list. This feature is available in Arabic and English and allows all those who log in to add suitable corrections for words that have been misspelled where no suitable correction has already been supplied. The alternative words provided by users will go into a moderated database. Once checked the words will appear in the spell checker.
The ATbar wiki has been set up to work in English and Arabic and will be where all the supporting information about the entire ATkit can be found from guides to the framework.

Work is ongoing to produce guides for all the plugins that have been developed. The guides for the standard toolbar plugins have been completed in English and are being translated into Arabic.

Arabic ATbar spell checking update

Magnus has added an extended Arabic dictionary to our spell checker, which has resulted in better error correction. The new dictionary is twenty times larger than the one used originally and builds on the original Aspell dictionary. We are also able to supplement the database with additional words.

Alaa has been testing the checker and noticed an error on our web page that we use for trying the toolbar. This time the words offered as alternatives made sense and could be used when she was making mistakes.

We now have a database that records the word that has been misspelled and saves the error alongside the word that has been chosen from the correction list, or notes the fact that the user has ignored the offered words. The database handles all languages, but words in Arabic appear incomprehensible to readers due to the UTF-8 encoding.
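One common reason for stored Arabic text to look incomprehensible is that its UTF-8 bytes are being displayed by a tool that assumes a different encoding. Whether that is what happens with our database viewer is an assumption; the Python sketch below only demonstrates the effect. The `SpellingRecord` class and its field names are hypothetical and do not reflect the real database schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SpellingRecord:
    """Hypothetical shape of one logged spelling event."""
    misspelled: str
    chosen: Optional[str]  # None when the user ignored the suggestions


def as_seen_in_wrong_encoding(text: str) -> str:
    """Show how UTF-8 bytes look when a viewer decodes them as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled: str) -> str:
    """Undo the mis-decoding: recover the bytes, then decode them as UTF-8."""
    return garbled.encode("latin-1").decode("utf-8")


if __name__ == "__main__":
    record = SpellingRecord(misspelled="كتاب", chosen=None)
    garbled = as_seen_in_wrong_encoding(record.misspelled)
    print(repr(garbled))    # unreadable Latin-1 characters
    print(repair(garbled))  # the original Arabic word
```

If this is the cause, the stored data itself is intact, and only the viewing tool needs to be told to read the column as UTF-8.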