Working on the linguistic aspects of the Arabic Symbol Dictionary

Dr Ouadie Sabia has joined the team as a consultant specialist in linguistics and has provided essential support on the accuracy of the Modern Standard Arabic lexical entries being added to our database. Initially he queried the way we were categorising the lexical entries, as they need to serve both spoken communication and literacy, which does not necessarily work when one is coping with a diglossic language. There is an insightful blog on the subject written by Michael Erdman titled “Is Arabic really a single language?”

tree with root word

An introduction to the root and pattern system in Arabic from Arabic Learning

However, Dr Sabia persevered with his support for our work and commented, “This is a common problem with languages such as Arabic, where words are derived from one root and might appear with incorrect diacritics or with none at all. It can be hard to determine their grammatical category. ‘ذهب’ could mean ‘to go’ or ‘he went’ (verb) but could also mean ‘gold’ (noun). Because the diacritics are missing, the grammatical category is unidentifiable. However, in many cases the context plays a crucial role in categorising words in Arabic. This has been proven when developing an Arabic TTS corpus. I have added the appropriate diacritics to make over a thousand Arabic sentences readable, understandable and grammatically accurate. I also monitored the recording carried out by a talent to make sure that all the diacritics were correctly used in order to preserve the grammatical accuracy. A word's function can be altered if the diacritics are incorrectly placed. Another issue is that by changing just one diacritic we can go from a subject function in a sentence to an object function, without even changing the word order.”
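The ambiguity Dr Sabia describes can be illustrated with a minimal Python sketch. It assumes that "diacritics" means the Arabic marks in the Unicode range U+064B to U+0652 (tanwin, short vowels, shadda, sukun); the function name `strip_diacritics` is hypothetical and is not part of the project's tooling.

```python
# Arabic diacritic marks: tanwin, short vowels, shadda and sukun (U+064B..U+0652).
ARABIC_DIACRITICS = {chr(cp) for cp in range(0x064B, 0x0652 + 1)}


def strip_diacritics(text: str) -> str:
    """Return text with the Arabic diacritic marks removed (hypothetical helper)."""
    return "".join(ch for ch in text if ch not in ARABIC_DIACRITICS)


# The same three consonants (dhal, ha, ba) with different diacritics.
verb_he_went = "\u0630\u064e\u0647\u064e\u0628\u064e"  # fatha on every letter: "he went"
noun_gold = "\u0630\u064e\u0647\u064e\u0628\u064c"     # final dammatan: "gold"

# Fully diacritised, the two words are distinct strings...
print(verb_he_went == noun_gold)
# ...but once the diacritics are dropped they collapse into one written form.
print(strip_diacritics(verb_he_went) == strip_diacritics(noun_gold))
```

Without the marks, only the surrounding sentence can tell a reader (or a TTS engine) which of the two words is meant.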

Another issue that has had to be dealt with over the last few months is the inaccuracy that develops when pairing English verbs, which tend to be presented in the present tense, with their Arabic equivalents, which are essentially always given in the past tense. Much discussion has resulted in the latter winning the day, with a recognition that if ARASAAC symbols for verbs come with a label including ‘to’, such as ‘to go’, the ‘to’ will be removed so that the verb can be declined in any tense and with or without a pronoun. All the verbs have now been checked by Dr Sabia and sentences added to further explain the meaning.
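The label change described above could be sketched as follows. This is only an illustration: the post does not say whether the labels were edited by hand or by script, and the function name `strip_infinitive_marker` is hypothetical.

```python
def strip_infinitive_marker(label: str) -> str:
    """Drop a leading 'to' from an English verb label such as 'to go'.

    A label consisting only of 'to' is kept, since 'to' is also a
    linking word in its own right.
    """
    words = label.strip().split()
    if len(words) > 1 and words[0].lower() == "to":
        return " ".join(words[1:])
    return " ".join(words)


for label in ["to go", "To eat", "go", "to"]:
    print(repr(label), "->", repr(strip_infinitive_marker(label)))
```

Only a leading ‘to’ is removed, so a label such as ‘go to’ would be left unchanged.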

Arabic verb analyser

As Dr Sabia explains, “Having spent a reasonable time studying the lists, I have reached a clear idea about the type of tense we should be using to translate the Arabic past tense 3rd person singular masculine as the ‘infinitive’ (‘to + verb’) in English. Arabic verbs have the form ‘he + past tense’ (merged) and this has to appear in the dictionary. The second point is that the symbol user who wishes to gain literacy skills will only have to learn the declined forms. In other words, if we take the verb ذَهَبَ (he went) as an example, it will be used to teach the action of ‘going’ in the past as a single male, then later, in order to teach the same action of ‘going’ (male single) in the present tense, a newly declined form يَذْهَبْ would be used. Infinitive does not exist in Arabic grammar.” As a result, a translation of a verb such as ذَهَبَ has become “go”. Verbs like “have” in English are prepositional groups in Arabic. However, for communication purposes, the team has decided to call them verbs too, but this needs further discussion.

Further work has included the correction of all the AAC lists collected by the team so that they could be uploaded to the symbol management system, along with 500 words that are now considered to be the most useful for AAC users and have become the core of the Arabic Symbol Dictionary. From an analysis of the frequency of use from a grammatical point of view, it has become clear that the lists vary widely in the Parts of Speech being used. Most top 100 core entries from Kelly, Beukelman, Buckwalter and Oweini-Hazoury have a very low frequency of nouns / verbs compared to the Supreme Education Council list taken from reading books. A more detailed description of the findings is given in a paper presented at the 6th Workshop on Speech and Language Processing for Assistive Technologies, which will be provided once the publication is available. There were also distinct differences between the types of words found in English AAC user lists and the Arabic AAC user lists, with more nouns in the latter, and it is worth remembering the earlier comments about a verb being combined with a pronoun in Arabic.
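A comparison of Parts of Speech across lists might be computed along these lines. The sketch assumes each list is a sequence of (word, part of speech) pairs; the function name `pos_share` and the four-word `toy_list` are placeholders for illustration only and are not taken from any of the lists named above.

```python
from collections import Counter


def pos_share(entries):
    """Return the proportion of each part of speech in a list of (word, pos) pairs."""
    counts = Counter(pos for _, pos in entries)
    total = sum(counts.values())
    return {pos: n / total for pos, n in counts.items()}


# Placeholder entries, not real data from the project's lists.
toy_list = [
    ("I", "pronoun"),
    ("want", "verb"),
    ("more", "adverb"),
    ("book", "noun"),
]

for pos, share in sorted(pos_share(toy_list).items()):
    print(f"{pos}: {share:.0%}")
```

Running the same function over each collected list would give proportions that can be compared side by side.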

Another task has been generating correctly spoken words for when the Text to Speech part of the project is included in the dictionary. This is where diacritisation is so important for correct pronunciation of the Arabic words, and much time has been spent on making sure over 1129 entries are correct. Dr Sabia has also added all the missing SUKUUN and SHADDA to the definite articles to allow for correct reading of Moon / Sun letters.

Sun letters
ت ث د ذ ر ز س ش ص ض ط ظ ل ن
t th d dh r z s sh ṣ ḍ ṭ ẓ l n
Moon letters
ء ب ج ح خ ع غ ف ق ك م و ي ه
ʼ b j ḥ kh ʻ gh f q k m w y h
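The Sun / Moon letter rule for the definite article can be sketched in Python as below. This illustrates the general rule only, not Dr Sabia's process, which the post describes as manual. It assumes the input noun is undiacritised and ignores special cases; the function name `add_definite_article` is hypothetical.

```python
ALEF, LAM = "\u0627", "\u0644"
SUKUN, SHADDA = "\u0652", "\u0651"

# The fourteen Sun letters: ta, tha, dal, dhal, ra, zay, sin, shin,
# sad, dad, tah, zah, lam, nun.
SUN_LETTERS = set(
    "\u062a\u062b\u062f\u0630\u0631\u0632\u0633\u0634"
    "\u0635\u0636\u0637\u0638\u0644\u0646"
)


def add_definite_article(noun: str) -> str:
    """Prefix an undiacritised noun with a diacritised definite article.

    Sun letter: the lam is assimilated, so the first letter takes a SHADDA.
    Moon letter: the lam is pronounced, so it takes a SUKUUN.
    """
    if not noun:
        raise ValueError("noun must not be empty")
    first, rest = noun[0], noun[1:]
    if first in SUN_LETTERS:
        return ALEF + LAM + first + SHADDA + rest
    return ALEF + LAM + SUKUN + first + rest


sun = "\u0634\u0645\u0633"   # shin-mim-sin, "sun" (starts with a Sun letter)
moon = "\u0642\u0645\u0631"  # qaf-mim-ra, "moon" (starts with a Moon letter)
print(add_definite_article(sun))
print(add_definite_article(moon))
```

The two example nouns are the ones that give the letter classes their names: the lam is silent in "the sun" and pronounced in "the moon".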

As communication boards using Tawasol symbols with Arabic entries have been developed, Dr Sabia has been checking their accuracy as part of the ongoing evaluation process, and these are being taken out into clinics for trials. ARASAAC symbols are also being used where the image is acceptable and the English is translated.

Work is also being undertaken to decide which words need to become symbols that are represented by the actual written word as well as by abstract images. Examples include linking words such as “and”, “to” and “until”, along with the need to make decisions around verbs such as “is”, “are” and “were”, which have no equivalent in Arabic because the verb “to be” does not exist. Nevertheless, the symbol manager has to have this rather important verb in English!

Finally, a monumental piece of work was completed by Dr Sabia: the manual inclusion of the five thousand Arabic words from the Buckwalter list, with their 5000 English equivalents, as an addition to our collection of lists, to give us an idea of where the differences in parts of speech may be occurring. This list has become an invaluable aid to our work, as it is the only list published as a frequency dictionary, namely “Frequency Dictionary of Arabic: Core Vocabulary for Learners (Routledge Frequency Dictionaries)” by Tim Buckwalter and Dilworth Parkinson.

All this work lays the foundation for the Tawasol website, which will be launched in the coming months, and once again Dr Sabia has helped us by translating the content into Arabic.