Please use this identifier to cite or link to this item:
https://ptsldigital.ukm.my/jspui/handle/123456789/772505
Title: | Enhanced Arabic stemmer through a combination of root-based and light-based approaches |
Authors: | Alshalabi, Hamood Hazeae Ali (P86012) |
Supervisor: | Sabrina Tuin, Dr.; Nazlia Omar, Assoc. Prof. Dr. |
Keywords: | Universiti Kebangsaan Malaysia -- Dissertations; Dissertations, Academic -- Malaysia; Natural Language Processing |
Issue Date: | 26-Oct-2021 |
Abstract: | The rise of Natural Language Processing (NLP) opens new possibilities for applications that were not feasible before. Arabic, as a morphologically rich language, offers features such as root extraction that can advance Arabic NLP, especially Arabic stemmers. Arabic stemming can be categorized into four major techniques: the root-based approach, the light stemming-based approach, the statistical and hybrid approach, and the artificial intelligence approach. This study focuses on improving the two best types of stemming. The first is light stemming, which removes affixes (prefixes and suffixes). The second is root-based stemming, which extracts the root of a word based on a list of patterns: if a matching pattern is found, the letters that the pattern marks as the root are extracted. However, both types of stemmer still have many weaknesses: they cannot handle Arabized words, removing suffixes and prefixes leads to more ambiguity, and light-based stemming is unable to obtain correct roots for lengthy Arabic words. Moreover, in root-based stemming, there are currently no standard morphological rules for determining the correct pattern rule for all words of the same length in order to extract the root, and no algorithm is available to extract the root of two-letter words. Additionally, broken plurals pose a challenge in Arabic stemming. This is due to the irregular patterns of these plural forms, which make root extraction difficult. This study makes four main contributions to improving Arabic stemmers. The first contribution is an algorithm to detect and extract Arabized words as a preprocessing task for Arabic stemming. This algorithm combines lexicon-based and rule-based approaches. The lexicon was developed from different sources of Arabic text.
The rule-based algorithm was designed to identify Arabized words by the specific article and by pattern matching on prefixes and suffixes. The second contribution is an improved light-based algorithm, called Dlight, built on an appropriate list of suffixes and prefixes and on stemming rules that depend on word length. The third contribution is a set of many new rules (main or sub), called DRule, developed according to the length of the patterns. This contribution also helps correct some misconceptions in previous studies, which confused verbs and roots because of Arabic diacritics; in the combination of Dlight and DRule, Dlight solves problems with verbs and DRule stems nouns efficiently. For the fourth contribution, several rules for extracting the roots of broken (irregular) plural words were constructed to improve Arabic stemmers even further. For evaluation, five benchmark Arabic stemmers were chosen: (i) the Larkey Light10 stemmer (Light10), (ii) the Condlight stemmer (Condlight), (iii) the Arlstem stemmer (Arlstem), (iv) the Arlstem V1.1 stemmer, and (v) the ISRI stemmer. They were tested on three large datasets: Al-Khaleej-2004, Al-Watan-2004, and the TREC2002 corpus. Based on the evaluation, the proposed Arabized-word preprocessing improved the precision of the benchmark Arabic stemmers (Light10, Condlight, and ARLS1) by 1%. In all other experiments, which used the standard TREC2002 corpus, the proposed rule-based Arabic stemmer (COMBINED+BPR) obtained the best performance, with an F-measure of 85%. To conclude, the performance of Arabic stemmers can be enhanced by an appropriate list of suffixes and prefixes, new rules for both the light-based and root-based approaches, and a regulated sequence of stemming steps. |
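The two stemming styles described in the abstract can be illustrated with a minimal Python sketch. It is not the thesis's Dlight or DRule: the affix lists, the three patterns, the `min_len` threshold, and the function names `light_stem` and `extract_root` are all assumptions chosen for illustration. Light stemming strips one prefix and one suffix subject to a word-length condition; root-based stemming matches a word against same-length patterns in which the letters ف, ع and ل mark the root positions.

```python
from typing import Optional, Sequence

# Hypothetical affix lists for illustration only; they are NOT the lists
# developed for Dlight in the thesis.
PREFIXES = ("وال", "بال", "كال", "فال", "ال", "لل", "و")
SUFFIXES = ("ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي")

# Hypothetical patterns: the letters ف, ع, ل mark root positions,
# every other letter must match the word literally.
PATTERNS = ("مفعول", "فاعل", "مفعل")
ROOT_SLOTS = frozenset("فعل")


def light_stem(word: str, min_len: int = 3) -> str:
    """Strip at most one prefix and one suffix (longest first), keeping
    at least `min_len` letters. `min_len` is an assumed threshold."""
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= min_len:
            word = word[len(prefix):]
            break
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= min_len:
            word = word[:-len(suffix)]
            break
    return word


def extract_root(word: str, patterns: Sequence[str] = PATTERNS) -> Optional[str]:
    """Return the letters found at the root positions of the first
    same-length pattern whose fixed letters match, else None."""
    for pattern in patterns:
        if len(pattern) != len(word):
            continue
        if all(p in ROOT_SLOTS or p == w for p, w in zip(pattern, word)):
            return "".join(w for p, w in zip(pattern, word) if p in ROOT_SLOTS)
    return None
```

Under these assumed lists, `extract_root(light_stem("المكتوب"))` strips the prefix `ال` and then matches the pattern `مفعول`, yielding the root `كتب`. Words that match no listed pattern return `None`, which is where additional rules of the kind the abstract describes would be needed.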
Description: | Full-text |
Pages: | 232 |
Publisher: | UKM, Bangi |
Appears in Collections: | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat |
Files in This Item:
File | Description | Size | Format | |
---|---|---|---|---|
ENHANCED ARABIC STEMMER THROUGH A COMBINATION OF ROOT BASED AND LIGHT-BASED APPROAC.pdf | Restricted Access | 3.51 MB | Adobe PDF | View/Open |
Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.