Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/513208
Title: Improved feature selection and genetic algorithm for Arabic text categorization
Authors: Ghareb Abdullah Saeed Ali (P50012)
Supervisor: Abdul Razak Hamdan, Prof. Dr.
Issue Date: 23-Feb-2015
Description: Arabic text categorization is a data mining technique that attempts to assign Arabic text documents to categories based on their content. The purpose is to produce accurate categorization of Arabic text with high-quality feature subsets. The Arabic language has a rich morphology and complex orthography that require a specific approach. Feature selection (FS) is an essential phase in text categorization that involves text dimensionality reduction, noise removal, text simplification and accuracy improvement. Recently, many FS and categorization techniques have been investigated for Arabic text. However, none of these techniques can ensure optimal FS and categorization. Therefore, three problems are addressed in this research: the lack of important-word identification, which affects the accuracy of text representation; the high dimensionality of Arabic text, which reduces categorization precision in the absence of efficient FS approaches; and the lack of useful categorization rules, which affects categorization performance. The aim of this thesis is to improve FS and the Genetic Algorithm (GA) for Arabic text categorization. Three main phases are involved: a text representation phase, an FS phase and a text categorization phase. In the first phase, an integration of Arabic noun extraction rules with FS methods is proposed to overcome the lack of important-word identification. The proposed approach is evaluated using Associative Classification (AC) and Naïve Bayes (NB) classifiers to measure the accuracy of categorization. A set of experiments is conducted on a collection of Arabic text documents. The results show that integrating noun extraction rules with FS reduces text dimensionality efficiently and performs better than FS alone. In the second phase, enhanced filter FS methods and hybrid FS approaches based on an Enhanced GA (EGA) are proposed to handle the high dimensionality of Arabic text.
First, an enhanced filter FS method named the Category Relevant Feature Measure is developed with two modifications using two available measures, the Class Discriminating Measure and Odds Ratio. The performance of these enhanced filter FS methods on three Arabic text datasets is evaluated using AC and NB. The experiments show that the methods achieve results better than or comparable to those of 12 state-of-the-art filtering methods. The hybrid FS approaches based on EGA are then introduced with seven FS methods. These hybrid approaches outperform the proposed filtering methods and two variations of GA (i.e. GA and EGA) in terms of categorization precision, dimensionality reduction rate and speed. In the third and final phase, a categorization method is proposed to overcome the lack of useful categorization rules, which affects accuracy. A GA rule-based classifier (named GARC) is developed to discover Arabic text categorization rules. The proposed classifier achieves competitive results compared to some previous categorization techniques. Overall, the methods proposed in this research preserve important content and useful knowledge in text datasets, and they contribute to simplifying and improving the Arabic text categorization process.
Degree: Ph.D
Pages: 233
Call Number: QA76.9.T48 G465 2015 3
Publisher: UKM, Bangi
Appears in Collections: Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:
File: ukmvital_82108+SOURCE1+SOURCE1.0.PDF (Restricted Access)
Size: 3.48 MB
Format: Adobe PDF

