Thursday, November 10, 2011

Stemming Algorithm for Nepali Language

Although many stemming algorithms exist for English, as far as I know there is no stemming algorithm for Nepali yet. As part of the project work for my 7th-semester course, Information Retrieval Systems, I have tried to build my own stemming algorithm for Nepali. Stemming is necessary for search engines because it lets a query match the different inflected forms of a word. Since no previous work has been done for Nepali, I hope this initiative will inspire new researchers to make further improvements.

For those who want to use it in their own software, I have packaged the algorithm as a .NET library. You will be able to download the library from the link below (not available yet). Those who are researching in this field can contact me for the report at jangedoo [at] hotmail [dot] com. The research report is still in draft, so I have not released it publicly yet.

Algorithm (Short Version)
This algorithm is based on finding the longest suffix and removing it. Starting from the end of the word, the algorithm moves toward the front, recording the characters read so far whenever they form a valid suffix in the suffix database. The rules associated with the longest such suffix are then looked up and applied to the word. After the rules are applied, the result is checked against the dictionary. If it is found there, it is returned as the root; otherwise nothing is returned.
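The steps above can be sketched roughly as follows. This is only a minimal illustration in Python, not the code of the .NET library: the names SUFFIX_RULES, DICTIONARY and stem are made up for this example, the suffix table and dictionary hold just a handful of entries, and I assume each rule simply says what to append to the word once the suffix (written the way it appears at the end of the word) has been removed.

```python
# Hypothetical toy suffix database: suffix as it appears at the end of a word
# -> string appended to what is left after the suffix is removed (the "rule").
SUFFIX_RULES = {
    "ार": "्",    # लुटार -> लुट्
    "दार": "",    # कामदार -> काम
    "वाल": "",    # गाडीवाल -> गाडी
    "ालो": "ो",   # दियालो -> दियो
}

# Hypothetical toy dictionary of known root words.
DICTIONARY = {"लुट्", "काम", "गाडी", "दियो"}


def stem(word, suffix_rules=SUFFIX_RULES, dictionary=DICTIONARY):
    """Return (root, suffix) for the longest known suffix, or None."""
    longest = None
    # Walk from the last character toward the front. Every later match is
    # longer than the previous one, so the last match is the longest suffix.
    for start in range(len(word) - 1, 0, -1):
        candidate = word[start:]
        if candidate in suffix_rules:
            longest = candidate
    if longest is None:
        return None
    # Apply the rule: remove the suffix, then append the rule's replacement.
    root = word[: len(word) - len(longest)] + suffix_rules[longest]
    # Only accept the result if the dictionary knows the root.
    return (root, longest) if root in dictionary else None
```

With these toy tables, stem("कामदार") matches both ार and दार, keeps the longer one, and returns ("काम", "दार"). A word with no known suffix, or whose root is not in the dictionary, gives None.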

Accuracy and Generality
The accuracy of the algorithm is acceptable. However, it cannot correctly determine the root word and the suffix when the suffix comes from Sanskrit. Sanskrit suffixes are used frequently in Nepali, so they also need to be identified properly, but in most cases the word changes drastically once such a suffix is attached, and the general algorithm I have proposed cannot decode it.

Here are some words and the results of stemming them. An empty result, < + >, means that nothing was returned.

ऐतिहासिक <इतिहास + इक>  काल्पनिक <कल्पन + इक>  आकासे <आकास + ए>  जुम्ली < + >  गुनिलो <गुन + इलो>  कामदार <काम + दार>  गाडीवाल <गाडी + वाल>  ठिमीले <ठिमी + ले>  दियालो <दियो + आलो>  तेलिया <तेल + इया>  खेलौना <खेल् + औना>  जलन <जल् + अन>  डुलुवा <डुल् + उवा>  जिनारु < + >  लुटाहा <लुट् + आहा>  सोधनी <सोध् + अनी>  भनाइ <भन् + आइ>  उडान <उड + आन>  छापा <छाप् + आ>  लेखा <लेख् + आ>  लेखौट <लेख् + औट>  मसोट < + >  टिकाउ <टिक् + आउ>  जोतारो <जोत् + आरो>  बोलावट <बोल् + आवट>  पुछाउनी <पुछ् + आउनी>  बोलक्कड <बोल् + अक्कड>  पियक्कड < + >  लुटार <लुट् + आर>  बनावट <बन् + आवट>  हँसिलो < + >  चालेको <चाल् + एको>  सरुवा <सर् + उवा>  लेखौवा <लेख् + औवा>  हेर्ने < + >  खाने < + >  पोलाहा <पोल् + आहा>  खेली <खा + एली>  देखे <देख + ए>  खेलेर <खेल् + एर>  खेल्दा < + >  कस्ता < + >  कस्तै < + >  हाँस्न < + >  गुन्न < + >

The results show that the algorithm works well overall, but it fails on some simple words such as खेल्दा and हाँस्न. This is not a weakness of the algorithm itself but of the dictionary that is available. The only dictionary for Nepali (as far as I know) is the one made by the people at Madan Puraskar Pustakalaya, and it does not contain entries such as हाँस् and गुन्. So even though the algorithm produces गुन् as the root of गुन्न, it does not return it because that root is not in the dictionary. A better and more comprehensive dictionary is needed.
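To make the dictionary problem concrete, here is a small Python sketch (again only an illustration, not the library code). The function candidate_root and the two dictionaries are hypothetical, and I assume for the example that the suffix removed from गुन्न is न. The candidate root is the same in both cases; only the dictionary decides whether it is accepted.

```python
def candidate_root(word, surface_suffix):
    """Strip surface_suffix from the end of word, or return None if it does not apply."""
    if not word.endswith(surface_suffix) or len(word) == len(surface_suffix):
        return None
    return word[: -len(surface_suffix)]


# Hypothetical dictionaries: the small one lacks गुन् and हाँस्.
small_dictionary = {"खेल्"}
larger_dictionary = small_dictionary | {"गुन्", "हाँस्"}

root = candidate_root("गुन्न", "न")    # the algorithm's candidate root
print(root in small_dictionary)        # rejected: the stemmer returns nothing
print(root in larger_dictionary)       # accepted once the dictionary has the root
```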

Using the .NET Library
.NET 2.0 or above is required
Hunspell Library (Included)
