Updates on the progress on Arabic spell checking, TTS, Word Prediction and the ATKit

The last few weeks since the Christmas break have flown by in a flurry of activity which, in retrospect, has at times felt like two steps forward followed by at least one step back!  But there have been some breakthroughs in the areas of spell checking, text to speech, word prediction and the ATKit website.

Spell Checking

Thanks to Mashael AlKadi we have a really clear evaluation of the spell checker titled Dyslexic Typing Errors in Arabic (PDF download) and also thank you to Mina Monta who commented that:

  • “Some of the words are correct in spell & in the meaning but AT spell checker detect that those are wrong words
  • In the suggested word list, there is no sorting according to the priority of the suggested word (according to the relativity between the suggested word & the original wrong word)
  • Some of the suggested words are wrong in spell
  • The number of the suggested words is to high comparing with MS Word spell checker.
  • MS Word is better in detecting the wrong words in grammar (the word has correct spell) “
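Mina's points about unsorted and over-long suggestion lists can be illustrated with a small sketch. One common way to order suggestions is by edit distance from the mistyped word, with a cap on how many are shown. This is an assumption on our part for illustration, not a description of how the toolbar spell checker currently works, and the names `edit_distance`, `rank_suggestions` and `limit` are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest single-character insertions,
    deletions or substitutions needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def rank_suggestions(misspelt: str, candidates: list[str], limit: int = 5) -> list[str]:
    """Closest candidates first (ties broken alphabetically), capped at `limit`."""
    ranked = sorted(candidates, key=lambda word: (edit_distance(misspelt, word), word))
    return ranked[:limit]


# Hypothetical candidate list for the common error ذالك (should be ذلك)
print(rank_suggestions("ذالك", ["كذلك", "هنالك", "ذلك"], limit=2))
```

A real ranking would probably also weight word frequency and typical dyslexic error patterns, which this sketch ignores.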

Sadly, research into English spell checkers has revealed that they are not as accurate as we had hoped: they report false errors and struggle with real-word errors and homophones, as can be seen from this presentation about online spell checking.

I asked Mashael whether adding a new corpus would help as Seb has succeeded in collecting a larger Arabic corpus and has put in some code to make it possible to add this extended vocabulary.   However, Mashael’s comment was:

“regarding adding new words, do you mean expanding the tool’s dictionary? I don’t think you should worry beacuse it was working very well expect for certain remarks that I’ve said such as the tool’s behavior with words attached to prepositions. In such case only some adjustments should be applied to the tool’s mechanism and I think it will work great.”

So with the support of Erik and Mina in our last meeting, it has been decided that we will work on particular improvements as a future aim with the help of our Arabic speaking colleagues.

Text to Speech

It has been a bit of a trial and error period, starting with the withdrawal of Google Translate. We were aware this might happen, but had rather hoped there could be a reprieve as this was a free option, although in the tests carried out with 5 Arabic speaking students the results were poor in comparison to the Acapela and Vocalizer voices. It is also sad because of the time spent on this work, as a free TTS on the toolbar was something we had proved was possible to achieve.  The Microsoft Speak Method was also tried and tested, but the TTS appeared to leave off initial sounds and the voice was unacceptable to our beta testers.

We also learnt that NVDA in Arabic was only going to work with the Arabic TTS offered by Microsoft and eSpeak and Festival with the Mbrola project was still an uphill struggle.

As a research project and definitely not for profit we also wondered if we could go back to Google Translate but the agreement  specifically says  “The program may be used only by registered researchers and their teams, and access may not be shared with others.”

Meanwhile Fadwa Mohamad kindly visited King Abdulaziz City for Science and Technology (KACST) over the Christmas period and met Professor Ibrahim A. Almosallam, who has been in touch to say that they are developing an Arabic Text to Speech application, but it has yet to be released.  I am enquiring as to whether this is a desktop application or a VAAS (Voice as a Service) system such as that offered by Acapela in Arabic.

Seb then spent time working on the Acapela VAAS system and this was shown to work well in all the tests although there are issues when a whole page is read out.  It is felt that it might be more appropriate to restrict the call on the servers and just allow text to be highlighted and then spoken.  We now have to negotiate the way we can work with this system, as the final output needs to be free to the user.

There is also the option of building a new Arabic voice and this is being explored, although it would take time and effort to generate the corpus, normalise the output and beta test, even when there are engines available to achieve this aim. Building a new Arabic voice needs further discussion, but we have the connections in place.

Word Prediction

Seb has been able to show how this feature for the toolbar is possible in English and the background architecture is in place for the Arabic version pending the language pack.

ATKit website

It has been agreed that the mock-up of the ATKit website that was available as a demonstrator should be taken forward and developed.  This has been completed, with the ability to add plugins, both free and those that require payment (for instance where a TTS requires a fee). Users can register, build their own toolbar and save the results.  The next step is a completed Arabic translation and the ability to author plugins …

Arabic ATKit

Testing for Arabic spelling errors

Once again thanks to the help of Areeb we have been discovering the issues around Arabic spell checking even in MSWord which has been our comparator for the toolbar spell checker.  Areeb constructed a Word document with Spelling Mistakes so that we could test it against the Microsoft Arabic spell checker, then with the present toolbar spell checker and finally with the new corpus when it is uploaded.

Areeb has already made several useful comments:
“There are several issues:

MS Word false positives: detected by Word as mistakes, but aren’t actually.

  • Some names rarely used, reasonable for MS Word to flag them
  • Words that should not be flagged like in the doc I sent بها was considered a mistake by Word 2007 although it is absolutely correct.

MS Word false negatives:

  • Mistakes undetected
  •  A mistake that would change a word into another correct word

This happens more often than in English I think, I realised that when I was trying to force mistakes sometimes I had to try several times to misspell the word, mimicking common spelling mistakes, and MS Word would still consider it correct, and it is correct but not in the context.

  • Words that should be flagged but aren’t like ذالك Should be ذلك

And this is a common error, I believe it should be flagged.”

 

Arabic ATkit 1st paragraph mistakes

It would be very helpful if other Arabic speakers could use the spell checker in MS Word to test the type of errors made using our Spelling Mistakes document. Then connect to the ATbar2 site, delete the present text in the edit box, select the spell checker on the launched toolbar and copy and paste in sections to see if the same results occur. Please make comments on the blog – then we will update the present version of the ATbar to review any changes that occur as a result of the new corpus.

Thank you for your help in this project – best wishes over this holiday period and for the New Year.

Additions to the ATbar in Arabic spell checker

This blog is to really thank Mashael Alkadi in Saudi Arabia for discovering another corpus of spelling errors in Arabic.  Then thanks go to Areeb Alowisheq, here in the lab, for helping us to understand the differences between the list we had and the new lists.

Seb has been able to access files provided by the Galtawi project.  This has allowed us to experiment with improvements to the spell checker.

The original 71,000 words with errors appear to result in a large range of words based on the nearest possible correction, whereas the 120,000 words with suffixes and prefixes, that will be added, all have exact matches to corrections.  It is hoped this will improve error correction but we need to test this with a series of paragraphs.
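The difference between the two lists can be sketched as a two-stage lookup: try an exact match in a table of known misspellings first, and only fall back to "nearest word" matching when that fails. This is a minimal illustration under our own assumptions, not Seb's implementation. The names `suggest`, `known_errors` and `dictionary` are hypothetical, the data is a toy English example, and the fallback uses `difflib.get_close_matches` from the Python standard library.

```python
import difflib


def suggest(word: str, known_errors: dict[str, str], dictionary: list[str],
            limit: int = 3) -> list[str]:
    """Return correction suggestions for `word`.

    Stage 1: exact match against a table mapping misspellings to corrections.
    Stage 2: fall back to similarity matching against the dictionary.
    """
    if word in known_errors:
        # An exact match gives a single, unambiguous correction
        return [known_errors[word]]
    # No exact match: offer the most similar dictionary words instead
    return difflib.get_close_matches(word, dictionary, n=limit, cutoff=0.6)


# Toy data for illustration only
known_errors = {"teh": "the"}
dictionary = ["the", "hello", "world"]

print(suggest("teh", known_errors, dictionary))   # exact-match path
print(suggest("helo", known_errors, dictionary))  # nearest-word fallback
```

The exact-match stage explains why the newer list should give shorter suggestion lists: a word either has one stored correction or it falls through to the broader search.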

The paragraphs will have around 100 common errors that will initially be tested against the Arabic Microsoft Word spell checker results, then against the present version of the ATbar spell checker and finally against the latest version of the spell checker to see if any improvements have been achieved.
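To compare the three checkers on the same paragraphs, the results could be scored along the lines Areeb describes: false positives (correct words flagged) and false negatives (mistakes missed). The sketch below is one hypothetical way to do the counting. It assumes each word is identified by its position in the test paragraph, and the function name `score_checker` is ours.

```python
def score_checker(true_errors: set[int], flagged: set[int]) -> dict:
    """Compare the word positions a checker flagged with the positions
    that really contain mistakes."""
    hits = true_errors & flagged
    return {
        "false_positives": sorted(flagged - true_errors),  # flagged but correct
        "false_negatives": sorted(true_errors - flagged),  # mistakes missed
        "precision": len(hits) / len(flagged) if flagged else 0.0,
        "recall": len(hits) / len(true_errors) if true_errors else 0.0,
    }


# Made-up example: mistakes at positions 2, 5, 9 and 14; the checker flags 2, 5 and 7
result = score_checker({2, 5, 9, 14}, {2, 5, 7})
print(result["false_positives"], result["false_negatives"])
print(round(result["precision"], 2), result["recall"])
```

Running the same scoring for MS Word, the present ATbar checker and the new version would give directly comparable figures.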

Watch this space for the outcome!

ATBar – ATKit version needs testers!

We really would like to receive feedback for the English ATKit version of the ATBar as we would like to begin to make this version available to all.

Please put any issues you find on the GitHub ATBar repository, as Seb is trying to get all the bugs ironed out before launching.  Please also be aware that some websites do not work with the toolbar due to Flash and other features that cannot be accessed.

Arabic TTS discussions and success with ATKit beta

As we have all suspected, the market for text to speech is now a choice between Nuance and Acapela, with eSpeak and Festival offering a very limited choice of languages.  The licences for using options offered by the operating systems such as Microsoft and Apple do not allow us to use these for a browser based toolkit.

So we have been trialling the voice with Google Translate, but that only works for 1000 characters and is liable to disappear as a service.  We discussed the issue with a Google employee who was not very hopeful that we would be able to pursue this idea further, although we would still like to keep this door open.
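A character limit like this means longer passages have to be sent in pieces. As a rough sketch of how that could be done (our own illustration, not the toolbar code), the hypothetical function `split_for_tts` below breaks text at word boundaries so that no piece exceeds the limit. The `limit` default of 1000 reflects the figure mentioned above.

```python
def split_for_tts(text: str, limit: int = 1000) -> list[str]:
    """Split text into chunks of at most `limit` characters, breaking at
    whitespace. Runs of whitespace are collapsed to single spaces."""
    chunks, current = [], ""
    for word in text.split():
        # A single word longer than the limit has to be cut mid-word
        while len(word) > limit:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(word[:limit])
            word = word[limit:]
        candidate = word if not current else current + " " + word
        if len(candidate) <= limit:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


print(split_for_tts("aa bb cc", limit=5))
```

Splitting at sentence boundaries would probably sound better when the chunks are spoken one after another, but that needs punctuation rules for Arabic that this sketch does not attempt.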

We also want to continue to see if we can discover any researchers still working on an open source free TTS for Arabic speech, but in the meantime we have been discussing the use of the Acapela Voice As A Service system that also works well as a plug-in for the new ATKit.

The web site for the English version of ATBar using the ATKit system of plug-ins is ready for testing. Final checks for the Arabic version will be set in place with plugins once agreements have been reached regarding the TTS, as all other sections are complete.  We are still looking for suitable dyslexic type errors to improve the present dictionary and have begun the research on both the word prediction and speech recognition.

Finally we have set up an ATKit plugin Google Group for further collaboration, in the hope that this can become a truly open innovative process and a case study for the REALISE market place. REALISE has just received sponsorship from Devices for Dignity, who are interested in seeing how case studies such as the ATKit develop in the future.

Comments please about proposed Plug-in site

The building of the plug-in website is underway in English and as with every page on the ATBar web site we would greatly appreciate corrections for the Arabic version as the pages are at present using the Google translate system with some help from our kind post graduates.

We would like to receive comments about the proposed design seen below for the main ATKit page and then for the plug-in page.

ATkit home page

Plugin list page

Plugin information page

Plugin sets page

Word Prediction in Arabic.

The next challenge other than our search for an open source option for an Arabic TTS is going to be word prediction.  I felt it was time to start to define the requirements and how realistic we can be when it comes to working with a language that does not have an easy way to see breaks in words as was discovered when working on the spell checker.

Word completion is available from Nuance as T9 Write in Arabic and I have seen this working on an Android phone.  It is not totally successful, but does at least try to offer the correct word at times when a few characters are entered. It is also possible to use the swipe technique on the Arabic onscreen keyboard, as illustrated in the above YouTube video – Continuous T9 in Bada 1.2 arabic.

So we are looking for an Arabic corpus that will allow us to offer alternative words once typing has begun as well as the next option.  This plug-in will help those using the toolbar to increase typing speeds and possibly ameliorate any word finding difficulties or severe spelling problems.   Word prediction can be helpful for those who tend to type at less than 25 words per minute and can jog the memory when a few words are listed.
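The two kinds of prediction mentioned here, completing the word being typed and offering the next word, can both be driven by simple counts from a corpus. The sketch below is a minimal illustration under our own assumptions. The class `WordPredictor` and its methods are hypothetical, the corpus is a toy English sentence, and a real Arabic version would need to handle attached prefixes and suffixes before counting.

```python
from collections import Counter, defaultdict


class WordPredictor:
    """Frequency-based word completion and next-word prediction."""

    def __init__(self, corpus_text: str):
        words = corpus_text.split()
        self.unigrams = Counter(words)        # how often each word occurs
        self.bigrams = defaultdict(Counter)   # which words follow which
        for first, second in zip(words, words[1:]):
            self.bigrams[first][second] += 1

    def complete(self, prefix: str, limit: int = 5) -> list[str]:
        """Most frequent corpus words starting with the typed prefix."""
        matches = [(w, n) for w, n in self.unigrams.items() if w.startswith(prefix)]
        matches.sort(key=lambda item: (-item[1], item[0]))
        return [w for w, _ in matches[:limit]]

    def next_words(self, previous: str, limit: int = 5) -> list[str]:
        """Words that most often followed `previous` in the corpus."""
        followers = self.bigrams.get(previous, Counter())
        ranked = sorted(followers.items(), key=lambda item: (-item[1], item[0]))
        return [w for w, _ in ranked[:limit]]


predictor = WordPredictor("the cat sat on the mat the cat ran")
print(predictor.complete("c"))       # completions for a typed "c"
print(predictor.next_words("the"))   # likely words after "the"
```

Keeping the list short (the `limit` parameter) matters for the users described above, since a long list of choices can slow typing rather than speed it up.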

Texthelp Read and Write Gold being used with the rich text editor in WordPress

The word chosen can be added to free text via a single keyboard entry – usually a function key, but number keys can also be used.  Word prediction is illustrated here using Texthelp Read and Write Gold, which supports this type of text entry.  ClaroRead, Easy Tutor, Soothsayer, Co:Writer, Penfriend and several other desktop applications offer similar functionality, as can be seen on Emptech.

The ATKit word prediction plug-in is for use in a browser when writing on the web – filling in forms, using Google docs or creating a blog or wiki.  The toolbar plug-in is not designed to replicate an app for text editing when writing messages on a phone or iPad, and it will not be able to offer all the options available with a desktop solution.

One issue already beginning to cause concern is the possible removal of Google Tashkeel, a very useful service supporting the diacritic symbols. We are watching to see what will happen, with several other Google services such as Google dictionary disappearing.  However, the Google transliteration service is still available, which at least allows us to practice typing words in one language and then, on pressing the space bar, see them converted to Arabic.

 

Documentation and ATkit Plug-in Progress

ATKit plugins

Seb has recently been working on the documentation and the code behind the plugins for the ATKit making it possible to convert the ATBar into a modular system that allows users to choose which plug-ins they wish to have on the bar.

The example above shows how Readability has been added to the list of plug-ins, and the code is available on the ATKit wiki.

The spell checking issues appear to have been solved but testing is now at an important stage where we see if it works with sentences other than those we have in our test paragraph!

The free to users Arabic text to speech plug-in has been causing more concern as Acapela and Nuance still reign supreme and these voices can be licensed with the plug-in system,  but the gauntlet has been thrown down to see if we can explore other options!

Spell checking and the Arabic script

The Arabic script is cursive and we have been exploring difficulties with accurate online spell checking. Fadwa Mohamad has kindly shared her knowledge about some of the issues that arise for those with dyslexia when it comes to the way Arabic characters are linked. Arabic has 28 letters to represent 34 phonemes and we have already discussed the issues of vowels and diacritics. Now we have learnt there is the thorny problem that only 22 of the 28 letters have two way connectors. The 6 remaining letters can only be joined in one way, so an Arabic word can contain one or more spaces. This means a word using some of these 6 letters may be divided in several places.
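To make the effect of the one-way letters concrete, the sketch below splits a word into the visually connected pieces a reader would see. It assumes the six letters are alef, dal, dhal, ra, zay and waw (the usual list, which the discussion above does not spell out), ignores hamza and the alef variants, and uses a hypothetical function name `connected_segments`.

```python
# The six letters that join to the previous letter but not to the next one:
# alef, dal, dhal, ra, zay, waw (assumed list, see the note above)
NON_CONNECTING = set("\u0627\u062f\u0630\u0631\u0632\u0648")


def connected_segments(word: str) -> list[str]:
    """Split a word into the runs of letters that are drawn joined together.
    A visual break follows every non-connecting letter."""
    segments, current = [], ""
    for letter in word:
        current += letter
        if letter in NON_CONNECTING:
            segments.append(current)
            current = ""
    if current:
        segments.append(current)
    return segments


print(connected_segments("كتب"))    # no one-way letters: one joined piece
print(connected_segments("دار"))    # every letter is one-way: three pieces
print(connected_segments("مدرسة"))  # breaks after dal and ra
```

A word such as دار is therefore drawn as three separate shapes even though it is a single word, which is the kind of gap that can confuse both readers and a tokeniser that relies on visible spacing.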

The other problem of note is that capital letters are not used in Arabic, so once again it may not be easy to see or work out where word boundaries occur. This along with the odd spacing obviously causes concerns for some readers, but may also be one reason why a spell checker can appear to gobble letters when it tries to correct a word!

To add to these issues, the articles ‘the’, ‘a’ or ‘an’ in English tend to be joined to the following word in Arabic, so those who can read Arabic will recognise the letters ‘AL’ or “Arabic: الـ‎, also transliterated as ul- and in some cases il- and el-” according to Wikipedia. The reader also has to work out whether the ‘AL’ will be silent or voiced in some cases, which impacts on text to speech engines, and the lack of spacing can affect spell checking.
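One well-known rule behind the silent or voiced ‘AL’ is the distinction between "sun" and "moon" letters: before a sun letter the L sound is absorbed into the following consonant. The sketch below is a deliberately naive illustration of how a spell checker or TTS front end might use this. It assumes undiacritised text and treats every word starting with ال as carrying the article, which is not always true. The function name `analyse_article` is hypothetical.

```python
# The 14 "sun" letters, before which the lam of the article is not pronounced
SUN_LETTERS = set(
    "\u062a\u062b\u062f\u0630\u0631\u0632\u0633"
    "\u0634\u0635\u0636\u0637\u0638\u0644\u0646"
)
ARTICLE = "\u0627\u0644"  # alef + lam


def analyse_article(word: str) -> dict:
    """Naively detect an attached article and report whether its lam is
    pronounced. Words whose root happens to start with alef-lam will be
    misidentified."""
    if len(word) > 2 and word.startswith(ARTICLE):
        stem = word[2:]
        return {
            "has_article": True,
            "stem": stem,  # the form to look up in a dictionary
            "lam_pronounced": stem[0] not in SUN_LETTERS,
        }
    return {"has_article": False, "stem": word, "lam_pronounced": None}


print(analyse_article("الشمس"))  # "the sun": sun letter, lam is silent
print(analyse_article("القمر"))  # "the moon": moon letter, lam is voiced
```

Stripping the article before the dictionary lookup is one possible way to avoid flagging correct words simply because they appear with ‘AL’ attached.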

Finally Arabic letters may be formed in different ways depending on their position in the word.  So a shape may change from its isolated form to one that is different when seen as the initial letter in the word or the medial one or even the final one! This is how arabic-course.com describe the issue.

Arabic letter changes depending on the position in a word
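The positional shapes are visible in Unicode itself, which keeps separate "presentation form" characters for the isolated, initial, medial and final shapes of each letter. As a small illustration (our own, using only Python's standard `unicodedata` module and a hypothetical function name `positional_forms`), the sketch below lists the shapes recorded for a given letter in the Arabic Presentation Forms-B block.

```python
import unicodedata


def positional_forms(letter: str) -> dict[str, str]:
    """Map each positional shape name (isolated, final, initial, medial)
    to its presentation-form character for a single Arabic letter."""
    target = f"{ord(letter):04X}"
    forms = {}
    # Arabic Presentation Forms-B occupies U+FE70 to U+FEFF
    for code in range(0xFE70, 0xFF00):
        parts = unicodedata.decomposition(chr(code)).split()
        # Entries look like "<initial> 0628"; ligatures have more parts
        if len(parts) == 2 and parts[1] == target:
            forms[parts[0].strip("<>")] = chr(code)
    return forms


print(positional_forms("ب"))  # beh has four shapes
print(positional_forms("ا"))  # alef, a one-way letter, has only two
```

In normal text only the base letter is stored and the shape is chosen when it is displayed. If presentation forms do turn up in copied text, `unicodedata.normalize("NFKC", text)` maps them back to base letters so that a spell checker sees one consistent spelling.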


The work to discover how we can overcome the letter gobbling spell checking and the mispronouncing speech synthesis continues!

 

Insight into the issues for open source TTS in Arabic.

Over the summer the team have been investigating the issues around TTS in Arabic and Edrees Abdu Alkinani has completed his MSc report, which has made interesting reading as it summarises many of the findings.   It was noted that Arabic TTS synthesis did not have the early successes of European languages due to the limitations in Natural Language Processing (NLP) and the complexities of using diacritics as substitutes for vowel combinations. However, with the advances in NLP and Digital Signal Processing (DSP), plus automatic diacritizers, progress has been made in the commercial world, where there are now several attractive Arabic synthesised voices, as will be seen in an evaluation to follow.

Issue No 1 – Lack of diacritics on web pages.

Arabic diacritics

The Learning Resource - Arabic language

English speakers may wonder why Arabic TTS is so difficult, but it does not take more than a cursory glance at the written language to understand that, with 14 different diacritic marks and 34 phonemes (28 of which are consonants and only six vowels), the combinations may cause TTS problems. As Edrees pointed out, “كُتُبْ” means books and “كَتَبَ” means wrote – the only difference you will notice is the type of marks used above the letters.
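The books/wrote example can be checked directly. The short sketch below removes the diacritic marks from both words and shows that they become the same string, which is what a TTS engine is faced with when web text omits the marks. It uses only Python's standard `unicodedata` module. The function name `strip_diacritics` is ours.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (which include the Arabic short-vowel
    diacritics), leaving only the base letters."""
    return "".join(ch for ch in text if not unicodedata.combining(ch))


books = "كُتُبْ"   # "books", with diacritics
wrote = "كَتَبَ"   # "wrote", with diacritics

print(books == wrote)                                      # different as written
print(strip_diacritics(books) == strip_diacritics(wrote))  # identical once stripped
```

Without the marks, the engine has to guess the intended word from context, much as a human reader does.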

English vowel sounds

TEFL world wiki - English vowel sounds

This compares with the 12 basic English vowel sounds, written with no accents or diacritics. We may complain about our odd pronunciation of some written words – rough, cough, though, thorough and through – but at least some of the letters are different and we cannot leave any out.   Yet this is what is happening with written Arabic on the web: the diacritics are being left out. This is the number one problem for a text to speech engine.

Issue No 2 – The differences between the way the TTS is developed and the resulting output.

Research has shown that although there are now a few text to speech engines, they are commercial and even these vary in quality.  The MBROLA project links to work carried out in the open source world, but at present it has been impossible to achieve success with the code offered in the various repositories for evaluation purposes.  However, Edrees has supplied the team with these comments based on the demonstrators offered by the various organisations and companies.

  1. MBROLA project
    MBROLA has two Arabic voices as a recorded audio file. The speed of speech is slow, and the quality poor. Moreover, the pronunciation is hard to understand – even for an Arabic speaker.  The stress pattern is often incorrect and the distinction between words unclear. The most difficult words to understand have letters like, “ أ” ‘A’, “ ض” ‘th’, “ ل” ‘L’.
  2. Acapela Group
    Acapela offers two good quality male and female voices.  The pronunciation for words with and without diacritic marks is understandable, with accurate stress patterns. There are three letters which appear to cause some difficulty: “ ج” ‘j’, “ ا” ‘a’, “ ك” ‘k’. The pronunciation of numbers in all situations is good.
  3. Nuance Vocalizer
    Nuance provide a very clear male voice with clear pronunciation. The only problem is that the system produces speech without taking into account diacritics. Words which have letters like “ ق” ‘q’, “ ش” ‘sh’, and “ ض” ‘th’ may cause problems but the speed of speech used in the online demo is good. Numbers are not clearly enunciated due to the lack of diacritics.
  4. Loquendo
    Loquendo offer a recording of a male and female voice on their site, as the Arabic voice has only been available since October 2010. The system has good sound quality and clear speech. The example on the website has diacritic marks, but as it is a small sample it is hard to judge the overall quality, although it appears to be good.


Issue No 3 – Further Development of eSpeak with Arabic.

The current version of MBROLA does not appear to run with the Arabic voice files and there seem to be very few people who have had success.  So this is work in progress…