Category Archives: spell checker

Arabic Spell Checker

Maraim Masoud and I (Nawar Habib) have been working to improve the accuracy of the Arabic spell checker currently running on ATbar, and we have reviewed previous work in the area. The current spell checker is an ASpell instance using a word list of common Arabic words. It produces good spelling suggestions for long Arabic words (longer than 4 letters) because of the high diffusion between long Arabic words (which is probably true in any language). High diffusion means that a typing error in one word is unlikely to produce another correct word. Arabic roots, on the other hand, are 3- or 4-letter words, so a typing error (changing one letter or omitting a letter) would very likely produce another correct root, or even another Arabic construct such as a connective or preposition. Even if the word produced by the error is not an Arabic word, the spelling suggestions for short words can be confusing because there are so many alternative possibilities.
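The effect of word length on diffusion can be illustrated with a small sketch. The word list below is a hypothetical toy lexicon, not the ATbar word list, and `within_one_edit` and `neighbours` are names made up for this example. It simply counts how many other words sit one typing error away from a given word.

```python
# Toy lexicon (hypothetical, for illustration only): five 3-letter words
# followed by two 5-letter words.
LEXICON = ["كتب", "كتم", "كتف", "كلب", "قلب", "مكتبة", "مدرسة"]


def within_one_edit(a, b):
    """True if b differs from a by exactly one substitution, insertion or deletion."""
    if a == b or abs(len(a) - len(b)) > 1:
        return False
    if len(a) == len(b):
        # Same length: exactly one position may differ (substitution).
        return sum(x != y for x, y in zip(a, b)) == 1
    # Lengths differ by one: deleting one letter of the longer word
    # must give the shorter word.
    shorter, longer = sorted((a, b), key=len)
    return any(longer[:i] + longer[i + 1:] == shorter for i in range(len(longer)))


def neighbours(word, lexicon):
    """All lexicon words that a single typing error could turn `word` into."""
    return [w for w in lexicon if within_one_edit(word, w)]


print(len(neighbours(LEXICON[0], LEXICON)))  # neighbours of a 3-letter word
print(len(neighbours(LEXICON[5], LEXICON)))  # neighbours of a 5-letter word
```

In this toy list the 3-letter word has several one-error neighbours while the 5-letter word has none, which is the pattern described above. A real measurement would need the full word list.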

Ayaspell is a project aimed at producing an Arabic word list, mainly for spell-checking purposes. The creators of Ayaspell also provide a Hunspell-based spell checker equipped with their word list. The main issue with their work is that they used traditional Arabic dictionaries as their word source, and these contain Arabic words that are no longer used. This would confuse the spell checker and decrease the diffusion discussed above. This is the only documented word list we have found; a brief test of the Hunspell implementation did not show good results.

Hence, to improve our spell checker we should:

1- Make sure popular words are added to our word list (the ability to do that exists).
2- Hunspell and ASpell use phonetic codes to represent words as they sound when spoken. This helps give suggestions that are close not just in spelling but also in pronunciation. Arabic is different: words are written as they sound (with some exceptions, such as confusing ة with ه, ي with ى, ى with ا, or أ with ا), so most spelling errors are accidental (button proximity). The phonetic code should still be used for Arabic, but new methods should be added to calculate the distance between words more accurately (such as adding grammar checking).
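One possible way to reflect the letter confusions listed in point 2 is a weighted edit distance in which swapping two commonly confused letters costs less than any other substitution. This is only a sketch: the cost of 0.5 is an arbitrary assumption, the function names are made up for illustration, and this is not how ASpell or Hunspell rank suggestions.

```python
# Letter pairs named in the post as commonly confused (written as escapes
# to avoid right-to-left display problems in the source code).
TEH_MARBUTA, HEH = "\u0629", "\u0647"
YEH, ALEF_MAKSURA = "\u064A", "\u0649"
ALEF, ALEF_HAMZA = "\u0627", "\u0623"

CONFUSABLE = {
    frozenset((TEH_MARBUTA, HEH)),
    frozenset((YEH, ALEF_MAKSURA)),
    frozenset((ALEF_MAKSURA, ALEF)),
    frozenset((ALEF_HAMZA, ALEF)),
}


def substitution_cost(a, b):
    """0 for identical letters, 0.5 (assumed value) for a confusable pair, else 1."""
    if a == b:
        return 0.0
    if frozenset((a, b)) in CONFUSABLE:
        return 0.5
    return 1.0


def weighted_distance(s, t):
    """Levenshtein distance with cheaper substitutions for confusable letters."""
    previous = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        current = [float(i)]
        for j, b in enumerate(t, 1):
            current.append(min(
                previous[j] + 1.0,                          # delete a
                current[j - 1] + 1.0,                       # insert b
                previous[j - 1] + substitution_cost(a, b),  # substitute
            ))
        previous = current
    return previous[-1]


school = "\u0645\u062F\u0631\u0633\u0629"         # ends in teh marbuta
school_with_heh = "\u0645\u062F\u0631\u0633\u0647"  # same word typed with heh
print(weighted_distance(school, school_with_heh))
```

With such a measure, a suggestion that differs only by a confusable letter would rank ahead of one that differs by an unrelated letter. A keyboard-proximity cost could be added to `substitution_cost` in the same way.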

We had a problem with Wiktionary’s service API. When asked for a word definition, Wiktionary conducts an exact-match search on Arabic words, so if the submitted word has prefixes, suffixes or a definite article, the word is not found. To solve this we are creating a light stemmer that operates as a preprocessor before the word is looked up in the dictionary. The light stemmer has a small CPU footprint because it does not use a word list (only grammar rules), unlike heavy stemmers, which use a word list to increase accuracy at the cost of performance.
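A rule-only preprocessor of this kind might look like the sketch below. The prefix and suffix lists, the minimum stem length and the function names are all assumptions chosen for illustration; they are not the rules used in the stemmer we are building. Rather than returning a single stem, the sketch returns an ordered list of forms to try against the dictionary, starting with the word as typed.

```python
# Hypothetical affix lists, longest first. A real stemmer would need a
# larger, carefully chosen set of rules.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "ية", "ه", "ة", "ي")
MIN_STEM_LENGTH = 3  # assumed: never cut a word below the length of a root


def strip_prefix(word):
    """Remove the first matching prefix if enough letters remain."""
    for prefix in PREFIXES:
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LENGTH:
            return word[len(prefix):]
    return word


def strip_suffix(word):
    """Remove the first matching suffix if enough letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LENGTH:
            return word[:-len(suffix)]
    return word


def lookup_candidates(word):
    """Forms to try in order: as typed, without prefix, without prefix and suffix."""
    without_prefix = strip_prefix(word)
    candidates = [word, without_prefix, strip_suffix(without_prefix)]
    return list(dict.fromkeys(candidates))  # drop duplicates, keep order


for candidate in lookup_candidates("والمدرسة"):
    print(candidate)
```

Trying the least-stripped form first keeps the look-up from discarding a suffix that is actually part of the dictionary entry.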

ATBar Services and wiki site now available – Spell Checker service developed for additional support.

ATbar services offers links to other parts of the ATkit such as the marketplace of plugins, news and statistics and a new area for services to improve the ATbar.

spell checker service in English

Arabic spell checker services

The Spell Checker service allows users to log in and adapt the spell checking feature on the toolbar by correcting words found in the spell checking dictionary and adding new options for the error correction list. This feature is available in Arabic and English and allows all those who log in to add suitable corrections for words that have been misspelled where no suitable correction has already been supplied. The alternative words provided by users will go into a moderated database.  Once checked the words will appear in the spell checker.
The ATbar wiki has been set up to work in English and Arabic and will be where all the supporting information about the entire ATkit can be found, from guides to the framework.

Work is ongoing to produce guides for all the plugins that have been developed. The guides for the standard toolbar plugins have been completed in English and are being translated into Arabic.

Arabic ATbar spell checking update

Magnus has added an extended Arabic dictionary to our spell checker, which has resulted in better error correction. This new dictionary is twenty times larger than the one used originally and builds on the original Aspell dictionary. We are also able to supplement the database with additional words.

Alaa has been testing the checker and noticed an error on our web page that we use for trying the toolbar.  This time the words offered as alternatives made sense and could be used when she was making mistakes.

We now have a database that records each misspelled word and saves the error alongside the word chosen from the correction list, or notes that the user ignored the offered words. The database handles all languages, but words in Arabic appear incomprehensible to readers because of the UTF-8 encoding.
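As a rough illustration of such a log, here is a sketch using Python's built-in `sqlite3` module. The table and column names are hypothetical and this is not the ATbar database schema. The last two functions show one common reason for Arabic appearing as unreadable characters: the stored UTF-8 bytes are intact but are being displayed as if they were Latin-1. We have not confirmed that this is the cause in our case.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spelling_log ("
    " misspelled TEXT NOT NULL,"   # the word as typed
    " chosen TEXT,"                # the suggestion picked, or NULL
    " ignored INTEGER NOT NULL)"   # 1 if the user ignored the suggestions
)


def log_correction(conn, misspelled, chosen=None):
    """Record one spelling event; chosen=None means the suggestions were ignored."""
    conn.execute(
        "INSERT INTO spelling_log (misspelled, chosen, ignored) VALUES (?, ?, ?)",
        (misspelled, chosen, int(chosen is None)),
    )


book = "\u0643\u062A\u0627\u0628"                # a correctly spelled word
log_correction(conn, "\u0643\u062A\u0628\u0627", book)  # typo, then corrected
log_correction(conn, "\u0643\u062A\u0627")              # typo, suggestions ignored

rows = conn.execute(
    "SELECT misspelled, chosen, ignored FROM spelling_log"
).fetchall()


def as_latin1(text):
    """What UTF-8 text looks like when its bytes are read as Latin-1."""
    return text.encode("utf-8").decode("latin-1")


def repair(garbled):
    """Undo the mis-decoding above."""
    return garbled.encode("latin-1").decode("utf-8")


print(rows[0][1] == book)               # the text comes back unchanged
print(repair(as_latin1(book)) == book)  # the garbling is reversible
```

If the words read correctly when decoded as UTF-8, the data itself is fine and only the viewing tool needs to be told which encoding to use.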

YouTube videos illustrating the ATbar features.

We have set up a series of YouTube videos that include:

Text resizing, font style changes and line spacing. This video has no audio but shows how a user can select the magnifier on the toolbar to enlarge text without resizing the graphics – this tends to allow for more readable text when compared to zooming using the browser Ctrl+ which also enlarges the graphics.  However, this feature does not work when Flash has been used within a webpage or fonts have fixed sizes or styles.  The same applies to increased line spacing which is also demonstrated.

YouTube link to the video

The second video demonstrates how the A.I.Type word prediction works as well as spell checking when writing a blog using WordPress.  Use the HTML mode when working in the edit box rather than the Visual mode and then you will also be able to use the text to speech to aid proof reading.


YouTube link to the video

The last video demonstrates the use of text to speech with the Acapela voice in both Arabic and English.


YouTube link to the video

Updates on the progress on Arabic spell checking, TTS, Word Prediction and the ATKit

The last few weeks since the Christmas break have flown by with a flurry of activity which, in retrospect, seems at times to have made us feel as if we have been going two steps forward only to have to go at least one if not more steps backward!  But there have been some breakthroughs in the areas of Spell checking, Text to Speech, Word Prediction and the ATKit website.

Spell Checking

Thanks to Mashael AlKadi we have a really clear evaluation of the spell checker titled Dyslexic Typing Errors in Arabic (PDF download) and also thank you to Mina Monta who commented that:

  • “Some of the words are correct in spell & in the meaning but AT spell checker detect that those are wrong words
  • In the suggested word list, there is no sorting according to the priority of the suggested word (according to the relativity between the suggested word & the original wrong word)
  • Some of the suggested words are wrong in spell
  • The number of the suggested words is to high comparing with MS Word spell checker.
  • MS Word is better in detecting the wrong words in grammar (the word has correct spell) “

Sadly, research into English spell checkers has revealed that they are not as accurate as we had hoped when it comes to flagging false errors and handling real-word errors or homophones, as can be seen from this presentation about online spell checking.

I asked Mashael whether adding a new corpus would help as Seb has succeeded in collecting a larger Arabic corpus and has put in some code to make it possible to add this extended vocabulary.   However, Mashael’s comment was:

“regarding adding new words, do you mean expanding the tool’s dictionary? I don’t think you should worry beacuse it was working very well expect for certain remarks that I’ve said such as the tool’s behavior with words attached to prepositions. In such case only some adjustments should be applied to the tool’s mechanism and I think it will work great.”

So with the support of Erik and Mina in our last meeting, it has been decided that we will work on particular improvements as a future aim with the help of our Arabic speaking colleagues.

Text to Speech

It has been a bit of a trial and error period, starting with the withdrawal of Google Translate. We were aware this might happen, but had rather hoped there could be a reprieve as this was a free option, although in the tests carried out with 5 Arabic speaking students the results were poor in comparison to the Acapela and Vocalizer voices. It is also sad given the time spent on this work, as it was something we had proved was possible to achieve: a free TTS on the toolbar.  The Microsoft Speak method was also tried and tested, but the TTS appeared to leave off initial sounds and the voice was unacceptable to our beta testers.

We also learnt that NVDA in Arabic was only going to work with the Arabic TTS offered by Microsoft and eSpeak and Festival with the Mbrola project was still an uphill struggle.

As a research project and definitely not for profit we also wondered if we could go back to Google Translate but the agreement  specifically says  “The program may be used only by registered researchers and their teams, and access may not be shared with others.”

Meanwhile, Fadwa Mohamad kindly visited King Abdulaziz City for Science and Technology (KACST) over the Christmas period and met Professor Ibrahim A. Almosallam, who has been in touch to say that they are developing an Arabic Text to Speech application, but it has yet to be released.  I am enquiring as to whether this is a desktop application or a VAAS (Voice as a Service) system such as that offered by Acapela in Arabic.

Seb then spent time working on the Acapela VAAS system and this was shown to work well in all the tests although there are issues when a whole page is read out.  It is felt that it might be more appropriate to restrict the call on the servers and just allow text to be highlighted and then spoken.  We now have to negotiate the way we can work with this system, as the final output needs to be free to the user.

There is also the option of building a new Arabic voice and this is being explored, although it would take time and effort to generate the corpus, normalise the output and beta test, even when there are engines available to achieve this aim. A newly built Arabic voice needs further discussion, but we have the connections in place.

WordPrediction

Seb has been able to show how this feature for the toolbar is possible in English, and the background architecture is in place for the Arabic version pending the language pack.

ATKit website

It has been agreed that the mock-up of the ATKit website that was available as a demonstrator should be taken forward and developed.  This has been completed with the ability to add plugins, both free and those that require payment (for instance where a TTS requires a fee). Users can register, build their own toolbar and save the results.  The next step is a completed Arabic translation and the ability to author plugins …

Arabic ATKit

Documentation and ATkit Plug-in Progress

ATKit plugins

Seb has recently been working on the documentation and the code behind the plugins for the ATKit making it possible to convert the ATBar into a modular system that allows users to choose which plug-ins they wish to have on the bar.

An example above shows how Readability has been added to the list of plug-ins and the code is available on the ATKit wiki.

The spell checking issues appear to have been solved but testing is now at an important stage where we see if it works with sentences other than those we have in our test paragraph!

The free-to-users Arabic text to speech plug-in has been causing more concern, as Acapela and Nuance still reign supreme. These voices can be licensed with the plug-in system, but the gauntlet has been thrown down to see if we can explore other options!

Spell checking and the Arabic script

The Arabic script is cursive and we have been exploring difficulties with accurate online spell checking. Fadwa Mohamad has kindly shared her knowledge about some of the issues that arise for those with dyslexia when it comes to the way Arabic characters are linked. Arabic has 28 letters to represent 34 phonemes and we have already discussed the issues of vowels and diacritics. Now we have learnt there is the thorny problem that only 22 of the 28 letters have two-way connectors. The 6 remaining letters can only be joined in one way, so an Arabic word can contain one or more spaces. This means a word using some of these 6 letters may be divided in several places.
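The visual gaps inside a word can be sketched in a few lines. The six letters below (alef, dal, thal, ra, zay and waw) are the ones that do not join to the letter that follows them. The function name is made up for this example, and the sketch ignores diacritics and variant letter forms such as alef with hamza.

```python
# The six letters that never connect to the following letter, written as
# escapes: alef, dal, thal, ra, zay, waw.
NON_JOINING_AFTER = set("\u0627\u062F\u0630\u0631\u0632\u0648")


def connected_segments(word):
    """Split a word into the runs of letters that are drawn joined together."""
    segments = []
    current = ""
    for letter in word:
        current += letter
        if letter in NON_JOINING_AFTER:
            # This letter cannot join the next one, so a visual gap follows.
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


school = "\u0645\u062F\u0631\u0633\u0629"  # a five-letter word containing dal and ra
print(len(connected_segments(school)))     # number of visually separate pieces
```

A word with more than one segment looks as though it contains spaces, even though it is a single word to the spell checker.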

The other problem of note is that capital letters are not used in Arabic, so once again it may not be easy to see or work out where word boundaries occur. This along with the odd spacing obviously causes concerns for some readers, but may also be one reason why a spell checker can appear to gobble letters when it tries to correct a word!

To add to these issues, the definite article (‘the’ in English) is joined to the following word in Arabic, so those who can read Arabic will recognise the letters ‘AL’ or “Arabic: الـ, also transliterated as ul- and in some cases il- and el-” according to Wikipedia. In some cases the reader also has to work out whether the ‘AL’ will be silent or voiced, which affects text to speech engines, and the lack of spacing can affect spell checking.

Finally Arabic letters may be formed in different ways depending on their position in the word.  So a shape may change from its isolated form to one that is different when seen as the initial letter in the word or the medial one or even the final one! This is how arabic-course.com describe the issue.

Arabic letter changes depending on the position in a word
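Unicode reflects these positional shapes in its Arabic presentation-form characters. The sketch below uses Python's standard `unicodedata` module and the letter beh as an example. The shaping is normally done by the font, with text stored as the plain letter. If text ever arrives containing the positional codepoints instead, NFKC normalisation maps them back to the plain letter before a dictionary look-up. Whether this happens on pages where the toolbar runs is an assumption we have not tested.

```python
import unicodedata

BEH = "\u0628"  # the plain letter beh, as normally stored in text

# The four presentation forms of beh defined by Unicode.
BEH_FORMS = {
    "isolated": "\uFE8F",
    "final": "\uFE90",
    "initial": "\uFE91",
    "medial": "\uFE92",
}


def to_plain_letters(text):
    """Map positional presentation forms back to the underlying letters."""
    return unicodedata.normalize("NFKC", text)


for position, form in BEH_FORMS.items():
    # Print the official character name and whether it maps back to beh.
    print(position, unicodedata.name(form), to_plain_letters(form) == BEH)
```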


The work to discover how we can overcome the letter gobbling spell checking and the mispronouncing speech synthesis continues!