Automatic Arabic text categorization using Bayesian learning

Kadhim Mahmood H. (P59480)

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476387

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Nazlia Omar, Prof. Madya Dr.
dc.contributor.author	Kadhim Mahmood H. (P59480)
dc.date.accessioned	2023-10-06T09:17:34Z	-
dc.date.available	2023-10-06T09:17:34Z	-
dc.date.issued	2013-01-14
dc.identifier.other	ukmvital:84859
dc.identifier.uri	https://ptsldigital.ukm.my/jspui/handle/123456789/476387	-
dc.description	Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem. Statistical learning is a family of ML techniques that estimates the probability that a given document belongs to each category. One common statistical learning technique is Bayesian learning, which is based on Bayes' theorem. At present, many researchers are interested in Arabic ATC, and most of the methods used in this area are based on Bayesian learning algorithms. However, some Bayesian learning techniques are still under investigation. This work addresses the Arabic ATC problem using probabilistic Bayesian learning. The Bayesian learning classifiers applied are Multivariate Gauss Naive Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naive Bayes (MBNB), and Multinomial Naive Bayes (MNB). The proposed method has three parts. The first is text pre-processing, which includes Bag-of-Words (BOW); the second is text representation, which includes word-level n-grams (1-gram, 2-gram, and 3-gram); and the third is feature selection, which includes the Chi-Square Statistic, Odds Ratio, Mutual Information, and GSS Coefficient. For Arabic stemming, a simple stemmer called the TREC-2002 Light Stemmer is used in the prototype. The Arabic corpus was collected from online newspapers and consists of 3172 documents of varying length that fall into four predefined categories: Art, Economy, Politics, and Sport. Of these, 1732 documents are allocated to the training set and 1440 to the test set. The results showed that FB outperforms MNB, MBNB, and MGNB. The experimental results also showed that using word-level n-grams for ATC based on Bayesian learning leads to acceptable results, although BOW (1-gram) gives the best performance overall.
dc.description	Master / Sarjana
dc.language.iso	eng
dc.publisher	UKM, Bangi
dc.relation	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
dc.rights	UKM
dc.subject	Automatic Text Categorization
dc.subject	Bayesian learning
dc.subject	Text processing (Computer science)
dc.title	Automatic Arabic text categorization using Bayesian learning
dc.type	theses
dc.format.pages	76
dc.identifier.callno	QA76.9.T48K335 2013
dc.identifier.barcode	002014
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
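
The abstract describes representing documents as word-level n-grams and classifying them with Bayesian learners such as Multinomial Naive Bayes (MNB). The following is a minimal sketch of that idea only, not the thesis's implementation: the function and class names are hypothetical, the toy English tokens stand in for stemmed Arabic text, Laplace smoothing is an assumed choice, and the stemming and feature selection steps are omitted.

```python
import math
from collections import Counter, defaultdict


def word_ngrams(text, n):
    """Return the word-level n-grams of a whitespace-tokenised text."""
    tokens = text.split()
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


class MultinomialNB:
    """Toy Multinomial Naive Bayes over word n-gram counts (hypothetical helper)."""

    def __init__(self, n=1, alpha=1.0):
        self.n = n          # n-gram order: 1, 2 or 3 in the abstract
        self.alpha = alpha  # Laplace smoothing constant (an assumption)

    def fit(self, docs, labels):
        self.class_counts = Counter(labels)
        self.feature_counts = defaultdict(Counter)
        self.vocab = set()
        for doc, label in zip(docs, labels):
            grams = word_ngrams(doc, self.n)
            self.feature_counts[label].update(grams)
            self.vocab.update(grams)
        self.total_docs = len(labels)
        return self

    def predict(self, doc):
        # n-grams never seen in training are ignored.
        grams = [g for g in word_ngrams(doc, self.n) if g in self.vocab]
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.feature_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            # log prior plus the sum of smoothed log likelihoods
            score = math.log(n_docs / self.total_docs)
            for g in grams:
                score += math.log((counts[g] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Made-up training data using two of the four category names from the abstract.
train_docs = [
    "team wins match goal",
    "player scores goal match",
    "bank raises interest rate",
    "market price rate falls",
]
train_labels = ["Sport", "Sport", "Economy", "Economy"]

unigram_model = MultinomialNB(n=1).fit(train_docs, train_labels)
bigram_model = MultinomialNB(n=2).fit(train_docs, train_labels)
```

Calling `unigram_model.predict("goal match team")` scores the document against each category and returns the most probable label. Switching `n` changes only the representation, which is how 1-gram, 2-gram and 3-gram features could be compared on the same classifier.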

Files in This Item:

File	Description	Size	Format
ukmvital_84859+SOURCE1+SOURCE1.0.PDF Restricted Access		2.01 MB	Adobe PDF
