Designing a Rule Based Disambiguator for Afan Oromo Words

This paper presents the design of a rule-based word sense disambiguator for Afan Oromo. The aim of the work is to develop a model that identifies the senses of words. Because a word may have multiple senses, the problem is to determine which particular sense is appropriate in a given context. To this end, a rule-based approach was used, built on a manually designed set of rules. Ambiguous words that are frequently used in Oromo society were collected. Because the language is under-resourced, 15 naturally ambiguous words were used for testing. The results show that an ambiguous word in Afan Oromo has from 2 to n senses, where n is not fixed in advance and grows as the number of contexts increases.


Introduction
Disambiguation is one of the most challenging problems at all levels of natural language processing. The aim of this work was to develop a set of rules that identifies the senses of words. A common way of representing linguistic knowledge is through rules [1]. Rules underlie many linguistic theories, which can in turn be expressed as sets of rules [2]. Modifiers and contextual information form the basis of the linguistic properties of Afan Oromo word senses. In this work, the modifier of the ambiguous word is used to obtain semantic clues about the particular sense in which the ambiguous word is used. To achieve this, we analysed the structure of Afan Oromo sentence formation with respect to modifying patterns and used it to develop the rules. The constructed rules extract all modifiers that modify the ambiguous term. A modifier is a word or phrase that provides information about another word and describes the word it modifies in more detail. Modifiers, whether single words or phrases, are therefore used to understand the ambiguous words.
The motivation behind this work is to allow users to make ample use of the available technologies in Afan Oromo. Ambiguity in any language creates great difficulty for information technology, because a word that occurs in a particular context can be interpreted in more than one way depending on that context. We faced a significant challenge, however, because Afan Oromo lacks resources. This work therefore presents a rule-based approach that offers an alternative solution to this challenge by obtaining the necessary information from a manually developed set of rules.
This paper contributes to the development of natural language processing applications for Ethiopian languages that exhibit patterns similar to those of Afan Oromo. Specifically, it extends the scope of word sense disambiguation research by investigating its applicability to the Afan Oromo language. It also points out how Natural Language Processing (NLP) plays a significant role in enhancing a computer's capability to process word senses. In addition, information retrieval (IR) is one NLP application that benefits from word sense disambiguation, because most of the words used in queries to IR systems have more than one meaning [3].

Overview of Afan Oromo Language
Afan Oromo belongs to the Lowland East Cushitic group within the Cushitic branch of the Afro-Asiatic family. Like other African and Ethiopian languages, it has a very rich morphology [4]. It is one of the major African languages, widely spoken in most parts of Ethiopia and in parts of neighbouring countries such as Kenya, Tanzania, Djibouti, Sudan and Somalia. It is the most widely spoken Cushitic language and is often ranked among the most widely spoken Afro-Asiatic languages in Africa alongside Hausa, which is a Chadic rather than a Cushitic language. Currently, it is the official language of Oromia, the largest Regional State among the current Federal States in Ethiopia. It is used by the Oromo people, the largest ethnic group in Ethiopia, who according to [5] make up 50% of the total population. With regard to the writing system, Latin-based writing of the language dates back to 1842 [4], and Qubee (a Latin-based alphabet) was adopted as the official script of Afan Oromo in Ethiopia in 1991.
In Afan Oromo, the sense of a word depends fundamentally on the words that precede it, that is, on its modifiers [6,7]. Words are thus described (modified) by the Noun or Verb preceding them. The ambiguous word may appear at the beginning, in the middle or at the end of a sentence, but a modifier always comes before the word it modifies. Consider the example "Seenaan kaleessa daara bahe" [Seenaa got cloth yesterday]. In this sentence, the word "bahe" is modified by the Noun "daara". According to Afan Oromo structure, the Noun and Verb always appear before the word they modify [8].

Related Works
The rule-based approach is important when training data are lacking and the language is under-resourced. It has been used successfully in much NLP research. Research that uses rule-based approaches rests on a core of solid linguistic knowledge. The approach suits less-resourced and morphologically rich languages like Afan Oromo, which suffer from data sparseness even when corpora are available [1].
Rule-based approaches exploit hand-crafted rules for the word sense disambiguation task. They require extensive work by expert linguists and can therefore achieve near-human accuracy [9]. The rule-based Afan Oromo Grammar Checker showed promising results, which indicates that the rule-based approach is suitable for a morphologically rich language like Afan Oromo. For languages such as Afan Oromo, advanced tools have been lacking and are still at an early stage. The approach does, however, need an expert with linguistic knowledge to develop the set of rules, in this case the disambiguation rules. Its advantage is that domain knowledge is easy to incorporate into the linguistic knowledge, which yields highly accurate results. Furthermore, the linguistic knowledge acquired for one natural language processing system may be reused to build the knowledge required for a similar task in another system [10].
According to Ide et al. [8], the important characteristics of an ambiguous word are grammatical information about the word to be disambiguated, words that are syntactically related to it, and words that are topically related to it. Since the proposed method relies on semantic and syntactic information to disambiguate an ambiguous word, each sense entry consists of the ambiguous word and its related words, together with the senses of the ambiguous word and of the related words. This information can then be converted into understandable rules that describe the relationship between the ambiguous word and the related word.
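A sense entry of this kind, and its conversion into a readable rule, might be represented as in the following minimal sketch. The class name `SenseEntry`, its field names and the function `to_rule` are hypothetical names introduced here for illustration. The example values are taken from the sentence "Seenaan kaleessa daara bahe" and its English gloss given earlier.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SenseEntry:
    """One hand-written sense entry (hypothetical structure)."""
    ambiguous_word: str   # the word to be disambiguated
    related_word: str     # the modifier that signals the sense
    ambiguous_sense: str  # gloss of the ambiguous word in this context
    related_sense: str    # gloss of the related word


def to_rule(entry: SenseEntry) -> str:
    """Render a sense entry as a human-readable IF/THEN rule."""
    return (
        f"IF '{entry.related_word}' modifies '{entry.ambiguous_word}' "
        f"THEN sense = '{entry.ambiguous_sense}'"
    )


# Glosses follow the paper's own translation of the example sentence.
entry = SenseEntry("bahe", "daara", "got", "cloth")
print(to_rule(entry))
```

Keeping the entry immutable (`frozen=True`) is one possible design choice: it allows entries to be stored in sets or used as dictionary keys when the rule base grows.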
The rule-based method is one of the most widely used methods for disambiguation in natural language. When available resources are lacking or limited, the rule-based approach is used, which relies on hand-constructed linguistic rules and resources. The use of rule-based transformations rests on a core of solid linguistic knowledge [11]. The manual creation of rules is, of course, an expensive and time-consuming effort that must be repeated every time the disambiguation scenario changes. The linguistic knowledge used for word sense disambiguation is either lexical knowledge released to the public or world knowledge learned from a training corpus [12].

Afan Oromo Word Senses
Because resources such as digital and annotated corpora are of limited availability, the rule-based method is applied. Linguistic knowledge of the language plays an important role in creating the rules, and this knowledge can be obtained in different ways. In this work, the rules were created based on the inherent structure of Afan Oromo in forming the senses of words. An effort has been made to develop rules for the language, as discussed in detail above.

Proposed rules
In Afan Oromo, as in other languages, word senses follow rules; here the rules were developed manually by the researchers of this work. The correct sense cannot, however, be found only by choosing the one that is related to another word. Promising techniques also rely on linguistic knowledge for extracting semantic features; in our case, this means mining the context of the ambiguous term for the modifiers that specialize its sense [13].
Modifiers play a major part in deciding the sense of a word according to its role in the sentence. A modifier can appear before the target word, that is, the word it modifies or describes. As in English, sentences without modifiers would carry little descriptive detail. A modifier adds detail to, limits or changes the sense of another word or phrase.
In Afan Oromo, the words preceding a specific word are the most likely to influence the sense of that word.
For example, "Bilisumman sammuu dhahe lafa buuse" [Bilisumma hit the head and caused a fall to the ground].
In this example, the words "sammuu" and "lafa" are modifiers, which give extra information within the sentence. They are Noun modifiers, because they modify the ambiguous word "dhahe". In Afan Oromo, a modifier should be placed next to the word it describes. The roles of modifiers differ, however: preceding modifiers carry more weight than following ones.
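The difference in weight between preceding and following modifiers can be illustrated with a small sketch. The function name `rank_modifiers` and the numeric weights 2.0 and 1.0 are assumptions made purely for illustration; the paper states only that preceding modifiers count for more, not by how much.

```python
def rank_modifiers(tokens, target, preceding_weight=2.0, following_weight=1.0):
    """Return the words adjacent to `target`, strongest clue first.

    The weights are illustrative assumptions, not values from the paper.
    """
    if target not in tokens:
        return []
    i = tokens.index(target)
    scored = []
    if i > 0:
        # The word immediately before the target is treated as the stronger clue.
        scored.append((tokens[i - 1], preceding_weight))
    if i + 1 < len(tokens):
        # The word immediately after the target is treated as the weaker clue.
        scored.append((tokens[i + 1], following_weight))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


tokens = "Bilisumman sammuu dhahe lafa buuse".split()
ranked = rank_modifiers(tokens, "dhahe")
print(ranked)  # [('sammuu', 2.0), ('lafa', 1.0)]
```

With these assumed weights, "sammuu" is ranked ahead of "lafa" as a clue to the sense of "dhahe", which mirrors the ordering described above.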

Using modifiers to generate a set of rules
Word senses were disambiguated based on the surrounding context. Disambiguation is done by analysing the linguistic features of the word and of the word preceding it. The rule-based method disambiguates words automatically using rules, which can complement features learned from training data where such data exist. This information is coded in the form of rules. As discussed in the sections above, modifiers always precede the target word in Afan Oromo. Based on this notion, the researchers developed the following rules (Figure 1): (1) an ambiguous word that is a Noun is preceded by Verb modifiers; (2) an ambiguous word that is a Verb is preceded by Noun modifiers; (3) an ambiguous word is immediately preceded or followed by its modifiers. N and V denote Noun and Verb modifiers, respectively. Modifiers exclude stop words, numbers and punctuation.
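The three rules can be sketched in code as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the part-of-speech table `POS`, the stop list `STOP_WORDS`, the rule labels `R1` to `R3` and the function names are all hypothetical, and a real system would need lists curated by linguists.

```python
import string

# Hypothetical hand-built part-of-speech table; the tags follow the paper's
# description of its example sentences (N = Noun, V = Verb).
POS = {"daara": "N", "sammuu": "N", "lafa": "N",
       "bahe": "V", "dhahe": "V", "buuse": "V"}

# Illustrative placeholder only, not a vetted Afan Oromo stop list.
STOP_WORDS = {"fi", "kan"}


def is_modifier(token):
    """A candidate modifier is not a stop word, a number or punctuation."""
    if not token or token.lower() in STOP_WORDS:
        return False
    if token.isdigit():
        return False
    if all(ch in string.punctuation for ch in token):
        return False
    return True


def apply_rules(tokens, target):
    """Return (rule_label, modifier) pairs that fire for `target`."""
    if target not in tokens:
        return []
    i = tokens.index(target)
    prev_tok = tokens[i - 1] if i > 0 else None
    next_tok = tokens[i + 1] if i + 1 < len(tokens) else None
    fired = []
    if prev_tok and is_modifier(prev_tok):
        # R1: ambiguous Noun preceded by a Verb modifier.
        if POS.get(target) == "N" and POS.get(prev_tok) == "V":
            fired.append(("R1", prev_tok))
        # R2: ambiguous Verb preceded by a Noun modifier.
        if POS.get(target) == "V" and POS.get(prev_tok) == "N":
            fired.append(("R2", prev_tok))
    # R3: any modifier immediately before or after the ambiguous word.
    for neighbour in (prev_tok, next_tok):
        if neighbour and is_modifier(neighbour):
            fired.append(("R3", neighbour))
    return fired


fired = apply_rules("Seenaan kaleessa daara bahe".split(), "bahe")
print(fired)
```

For the example sentence, the Verb "bahe" is preceded by the Noun "daara", so the second and third rules fire with "daara" as the modifier. Mapping a fired rule and modifier to a sense label would require a separate sense table, which is not shown here.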

Result
This section presents the results and discussion of the work. The experiment indicates that the senses of words are closely connected to the words that modify them. The results obtained by the rule-based approach were good, as the semantic information was extracted from a distinct set of rules. The most likely reason is that the approach relies on the immediately preceding words, which are assigned automatically, are more reliable, and prove to be highly useful linguistic knowledge for word sense disambiguation.
Another issue examined here is the different behaviour of the disambiguator for words of different parts of speech (verbs and nouns). Out of the 3,656 examples in the complete dataset, 1,020 are verb cases and 2,546 are noun cases. The remaining 90 examples correspond to adjectives and adverbs. The 2,546 noun cases represent 538 occurrences of 231 different nouns. The experiment (Table 1) therefore suggests that linguistic knowledge is an effective basis for word sense disambiguation in Afan Oromo [14,15]. The overall system performance obtained so far is encouraging, since the rules are newly developed and still need to be reviewed by experts on the language.
The findings indicate that adding deep linguistic knowledge to a word sense disambiguation system brings a significant rise in disambiguation accuracy, in line with the results discussed so far [16,17]. It is especially interesting that using the preceding modifiers of the ambiguous word gives better results. We can conclude that modifiers contain many valuable clues for disambiguation [18,19]. The set of rules was also evaluated by comparing its output with the similar contexts of the test-set words as grouped manually by experts. The evaluation used the following two criteria: (1) how much of the word sense assignment is correct, i.e. whether all the similar contexts of an ambiguous word are placed in the same group; and (2) given the number of senses assumed for the words in the test set, how many senses the rules identify.
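The two evaluation criteria could be computed along the lines of the sketch below. The paper does not name a specific metric, so the pairwise agreement score used here (the Rand index) is a swapped-in choice made for illustration, and the function names and the toy groupings are hypothetical rather than data from the experiment.

```python
from itertools import combinations


def pairwise_agreement(expert, predicted):
    """Fraction of context pairs on which the rules and the experts agree
    about 'same group' versus 'different group' (the Rand index).

    Both arguments map a context id to a group label and share the same keys.
    """
    pairs = list(combinations(sorted(expert), 2))
    if not pairs:
        return 1.0
    agree = sum(
        (expert[a] == expert[b]) == (predicted[a] == predicted[b])
        for a, b in pairs
    )
    return agree / len(pairs)


def sense_counts(expert, predicted):
    """Return (senses identified by the rules, senses assumed by experts)."""
    return len(set(predicted.values())), len(set(expert.values()))


# Toy groupings, invented for illustration only.
expert = {"c1": "s1", "c2": "s1", "c3": "s2", "c4": "s2"}
predicted = {"c1": "a", "c2": "a", "c3": "b", "c4": "a"}

print(pairwise_agreement(expert, predicted))  # 0.5
print(sense_counts(expert, predicted))        # (2, 2)
```

A pairwise score has the convenient property that it does not depend on how the groups are labelled, only on which contexts are grouped together. Matching sense counts alone, as in the toy data, do not imply that the grouping itself is correct.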

Conclusion
In this work, the rule-based approach improved the performance of the disambiguation system. It is, however, expensive to develop for local languages that lack a training corpus. For an under-resourced Ethiopian language like Afan Oromo, the rule-based approach is recommended. Because there is no annotated corpus, the rule-based approach plays a major role in disambiguation. It relies on hand-constructed rules acquired from language specialists rather than rules trained automatically from data. The rules described in this work can serve as a base for further research and can support extended disambiguation rules covering most of the terms in Afan Oromo.