A partitioning-based algorithms for Arabic Islamic document clustering

Majid Hameed Ahmed

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476414

Title:	A partitioning-based algorithms for Arabic Islamic document clustering
Authors:	Majid Hameed Ahmed
Supervisor:	Sabrina Tiun, Dr.
Keywords:	Text processing (Computer science).
Issue Date:	10-Oct-2013
Description:	Document clustering is an unsupervised learning task and a form of data analysis that aims to group a set of objects into subsets, or clusters. The goal of clustering is to group similar data items together: objects in the same cluster should be as similar as possible, whereas objects in different clusters should be as dissimilar as possible. In this thesis, the target domain of the clustered documents is the Islamic religious domain. Islamic document clustering is considered an important task for obtaining more effective results from traditional information retrieval (IR) systems, for organizing web text, and for text mining. Fast, high-quality document clustering can greatly help users navigate document collections, particularly on the Internet, where the number of available online documents is increasing rapidly every day. The religious domain has thus become an interesting and challenging area for Natural Language Processing (NLP). The Islamic documents in this research are written in Arabic, which is one of the most complex languages in both its spoken and written forms. Arabic is also the base language from which some other languages are derived. Despite the wide usage of Arabic, there is a lack of technology for clustering Arabic documents because of the complexity of the language's written structure. The aim of this thesis is to compare the efficiency and accuracy of Arabic Islamic document clustering based on two algorithms, the K-means algorithm and a graph partitioning algorithm, with three similarity/distance measures: Cosine similarity, Jaccard similarity, and Euclidean distance. Before the algorithms can be applied, the data (documents) must be pre-processed. The pre-processing step consists of (i) tokenization, (ii) normalization, and (iii) stop word removal.
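The three pre-processing stages listed above can be illustrated with a minimal Python sketch. It is not the thesis's implementation: the function names (`normalize`, `tokenize`, `preprocess`), the tiny stop-word list, and the specific normalization rules (stripping diacritics and tatweel, folding alef variants, alef maqsura, and ta marbuta) are assumptions chosen because they are common conventions in Arabic text processing.

```python
import re

# Diacritics (tashkeel, U+064B..U+0652) and tatweel (U+0640) are stripped.
_MARKS = re.compile("[\u064B-\u0652\u0640]")
# Alef with madda / hamza above / hamza below are folded to bare alef.
_ALEF_VARIANTS = re.compile("[\u0622\u0623\u0625]")


def normalize(token):
    """Fold common orthographic variants (an assumed convention)."""
    token = _MARKS.sub("", token)
    token = _ALEF_VARIANTS.sub("\u0627", token)
    token = token.replace("\u0649", "\u064A")  # alef maqsura -> ya
    token = token.replace("\u0629", "\u0647")  # ta marbuta -> ha
    return token


# Hypothetical, deliberately tiny stop-word list, stored in normalized form.
STOP_WORDS = {normalize(w) for w in ("في", "من", "على", "إلى", "عن", "و")}


def tokenize(text):
    """Split text into runs of Arabic letters and marks."""
    return re.findall("[\u0621-\u0652]+", text)


def preprocess(text):
    """Tokenize, normalize, then drop stop words and empty tokens."""
    tokens = (normalize(t) for t in tokenize(text))
    return [t for t in tokens if t and t not in STOP_WORDS]
```

Stop words are normalized before comparison so that a spelling variant of a stop word is still removed after normalization. A stemming stage, which the thesis also evaluates, would follow `preprocess` and is not shown here.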
The pre-processing steps are necessary to eliminate noise and keep only useful information, which improves the performance of document clustering. Additionally, this research investigates the effect of stemming versus no stemming on the accuracy of Arabic Islamic text clustering. Our experiments show that the graph partitioning algorithm performs better than K-means and that stemming gives better results than no stemming. We also found that Cosine similarity gives better results than Jaccard similarity and Euclidean distance.
Master / Sarjana
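The three similarity/distance measures compared in the abstract can be sketched as follows. This is an illustrative Python sketch under stated assumptions, not the thesis's code: documents are represented as token lists, Cosine similarity and Euclidean distance are computed over raw term-frequency vectors, and Jaccard similarity is computed over term sets. The thesis may use a different weighting (for example TF-IDF) or a weighted Jaccard variant.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two term-frequency vectors."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va.keys() & vb.keys())
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    norm = norm_a * norm_b
    return dot / norm if norm else 0.0


def jaccard_similarity(a, b):
    """Size of the shared vocabulary divided by the combined vocabulary."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def euclidean_distance(a, b):
    """Straight-line distance between two term-frequency vectors."""
    va, vb = Counter(a), Counter(b)
    return math.sqrt(sum((va[t] - vb[t]) ** 2 for t in va.keys() | vb.keys()))
```

Cosine and Jaccard are similarities (higher means more alike), while Euclidean is a distance (lower means more alike), so a clustering routine that uses them interchangeably has to account for the opposite orientation.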
Pages:	88
Call Number:	QA76.9.T48A375 2013
Publisher:	UKM, Bangi
URI:	https://ptsldigital.ukm.my/jspui/handle/123456789/476414
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:

File	Description	Size	Format
ukmvital_85090+SOURCE1+SOURCE1.0.PDF (Restricted Access)		2.04 MB	Adobe PDF
