Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476552
Title: Arabic text classification based on latent semantic indexing (LSI)
Authors: Aneesa Ali Ali Saeed (P56105)
Supervisor: Mohd. Juzaiddin Ab Aziz, Prof. Madya Dr.
Keywords: Automatic Text Classification (ATC); Machine Learning (ML); Arabic; Universiti Kebangsaan Malaysia -- Dissertations
Issue Date: 7-Nov-2012
Description: Automatic Text Classification (ATC) is a technique for grouping texts into predefined categories based on their content. ATC has recently become an active research field. Many Machine Learning (ML) techniques have been proposed for classifying English and other European texts, which have similar structures. However, only a few studies have investigated Arabic text classification, owing to the complex morphology of Arabic and the extensive analysis it requires. Building an effective Arabic text classifier for a high-dimensional dataset is a challenge; this research therefore addresses the problems of high dimensionality and long training time. It enhances Arabic ATC by using Latent Semantic Indexing (LSI) to extract the semantic associations between words. Global LSI (unsupervised) degrades classifier performance when it is applied to the whole training dataset. In this study we propose Local LSI (supervised), which improves LSI performance by using the class information effectively. Local LSI is improved by extracting an LSI space for each class, with only the relevant terms, chosen using CHI, used to construct the document-term matrix. We also propose RS Theory based on Global LSI for Arabic ATC. Experiments were conducted to evaluate the performance of BPNN, SVM and NB with the proposed Local LSI, CHI, MI and Global LSI. The dataset was collected from the internet and consists of 3176 documents: 1433 for testing and 1743 for training. The results obtained using CHI, MI, Global LSI and Local LSI were, respectively, 88.79%, 88.90%, 64.58% and 92.95% with SVM; 88.31%, 86.89%, 65.13% and 92.51% with NN; and 69.92%, 69.89%, 42.52% and 53.34% with NB. The comparison shows that Local LSI outperforms the other methods with SVM and NN but is ineffective with NB. Global LSI, in contrast, degrades the performance of the classifiers. The proposed RS Theory based on Global LSI achieves 72.61%, which is better than the results of the other classifiers with Global LSI. Furthermore, Local LSI can be improved to increase classifier performance on large multi-class datasets, and RS Theory can be further enhanced by increasing the size of the dimensional space.
"Certification of Master's/Doctoral Thesis" is not available.
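The Local LSI idea in the description (a separate LSI space per class, built only from CHI-selected terms) can be illustrated with a minimal sketch. It is not the thesis's implementation. It assumes that CHI means the chi-square statistic on term presence, that each class's local matrix contains only that class's training documents, and that LSI is a truncated SVD. The function names `chi_square_scores`, `local_lsi` and `project`, the parameters `n_terms` and `rank`, and the toy matrix are all hypothetical.

```python
import numpy as np

def chi_square_scores(X, in_class):
    """Chi-square score of every term for one class (2x2 presence table).

    X: (n_docs, n_terms) term-count matrix; in_class: boolean mask over docs.
    """
    present = X > 0
    n = X.shape[0]
    a = present[in_class].sum(axis=0).astype(float)   # in class, term present
    b = present[~in_class].sum(axis=0).astype(float)  # other classes, term present
    c = in_class.sum() - a                             # in class, term absent
    d = (~in_class).sum() - b                          # other classes, term absent
    num = n * (a * d - c * b) ** 2
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    # Terms with a zero denominator carry no information; score them 0.
    return np.divide(num, denom, out=np.zeros_like(num), where=denom > 0)

def local_lsi(X, labels, n_terms=2, rank=1):
    """Per class: keep the top-CHI terms, then fit a truncated SVD.

    Assumption (not stated in the source): the local matrix uses only the
    documents of that class.
    """
    spaces = {}
    for cls in np.unique(labels).tolist():
        mask = labels == cls
        top = np.argsort(chi_square_scores(X, mask))[::-1][:n_terms]
        local = X[mask][:, top].astype(float)  # class documents x selected terms
        _, s, vt = np.linalg.svd(local, full_matrices=False)
        k = min(rank, len(s))
        spaces[cls] = (top, vt[:k])            # selected term ids, LSI basis
    return spaces

def project(doc_vector, space):
    """Map a full term-count vector into one class's local LSI space."""
    terms, basis = space
    return basis @ doc_vector[terms]

# Toy data (hypothetical): 6 documents, 6 terms, 3 classes.
X = np.array([
    [2, 1, 0, 0, 0, 0],
    [1, 2, 0, 0, 0, 0],
    [0, 0, 3, 1, 0, 0],
    [0, 0, 1, 2, 0, 0],
    [0, 0, 0, 0, 2, 2],
    [0, 0, 0, 0, 1, 3],
])
labels = np.array([0, 0, 1, 1, 2, 2])

spaces = local_lsi(X, labels, n_terms=2, rank=1)
for cls, (terms, basis) in spaces.items():
    print(cls, sorted(terms.tolist()), project(X[0], (terms, basis)))
```

In this sketch a document is represented by its projections into the per-class spaces, and those features would then be passed to a classifier such as SVM. How the thesis combines the local spaces is not stated in the description, so that step is left out.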
Pages: 112
Call Number: QA278.6.S236 2012 3 tesis
Publisher: UKM, Bangi
Appears in Collections: Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:
File: ukmvital_119376+SOURCE1+SOURCE1.0.PDF
Description: Restricted Access
Size: 820.21 kB
Format: Adobe PDF

