Please use this identifier to cite or link to this item:
https://ptsldigital.ukm.my/jspui/handle/123456789/476387
Title: | Automatic Arabic text categorization using Bayesian learning |
Authors: | Kadhim Mahmood H. (P59480) |
Supervisor: | Nazlia Omar, Assoc. Prof. Dr.
Keywords: | Automatic Text Categorization; Bayesian learning; Text processing (Computer science)
Issue Date: | 14-Jan-2013 |
Description: | Automatic Text Categorization (ATC) is the task of automatically assigning an electronic document to a predefined category based on its content. Many supervised Machine Learning (ML) techniques have been used to solve the Text Categorization (TC) problem. Statistical learning is a family of ML techniques that estimates the probability that a given document belongs to each category; one of the most common statistical learning approaches is Bayesian learning, which is based on Bayes' theorem. At present, many researchers are interested in Arabic ATC, and most of the methods used in this area are based on Bayesian learning algorithms; however, some Bayesian learning techniques are still under investigation. This work addresses the Arabic ATC problem using probabilistic Bayesian learning. The Bayesian learning classifiers applied are Multivariate Gauss Naive Bayes (MGNB), Flexible Bayes (FB), Multivariate Bernoulli Naive Bayes (MBNB), and Multinomial Naive Bayes (MNB). The proposed method covers three parts: text pre-processing, which includes the Bag-of-Words (BOW) model; text representation, which uses word-level N-grams (1-gram, 2-gram, and 3-gram); and feature selection, which includes the Chi-Square statistic, Odds Ratio, Mutual Information, and the GSS coefficient. For Arabic stemming, a simple stemmer, the TREC-2002 Light Stemmer, is used in the prototype. The Arabic corpus is collected from online newspapers and consists of 3172 documents of varying length that fall into four predefined categories: Art, Economy, Politics, and Sport; 1732 documents are allocated to the training set and 1440 documents to the test set. The results show that FB outperforms MNB, MBNB, and MGNB. The experiments also show that word-level n-grams for ATC based on Bayesian learning give acceptable results, although BOW (1-gram) gives the best overall performance. (An illustrative decision rule and a pipeline sketch are given below the file listing.) Master / Sarjana |
Pages: | 76 |
Call Number: | QA76.9.T48K335 2013 |
Publisher: | UKM, Bangi |
Appears in Collections: | Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat |
Files in This Item:
File | Description | Size | Format |
---|---|---|---|
ukmvital_84859+SOURCE1+SOURCE1.0.PDF | Restricted Access | 2.01 MB | Adobe PDF |
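
For orientation, the following is the standard textbook form of Bayes' theorem and the Multinomial Naive Bayes decision rule behind the classifiers named in the abstract; it is not transcribed from the thesis, and the symbols (document d, its terms t_i, category set C) are introduced here purely for illustration.

```latex
% Bayes' theorem for assigning document d to category c, and the resulting
% Multinomial Naive Bayes (MNB) decision rule over the document's terms
% t_1, ..., t_{n_d} (word-level n-grams after pre-processing and feature selection).
P(c \mid d) = \frac{P(c)\,P(d \mid c)}{P(d)}
\qquad
\hat{c} = \arg\max_{c \in C} \; P(c) \prod_{i=1}^{n_d} P(t_i \mid c)
```

The Bernoulli (MBNB), Gaussian (MGNB), and Flexible Bayes (FB) variants compared in the work differ only in how the class-conditional likelihood P(d | c) is modelled: binary term presence, one Gaussian per feature, or a kernel-density mixture, respectively.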
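
A minimal sketch of a comparable pipeline, assuming scikit-learn rather than the prototype described in the thesis: the corpus loader, category labels, and parameter values below are placeholders, MultinomialNB and BernoulliNB stand in for MNB and MBNB, and the TREC-2002 Light Stemmer and the Flexible Bayes classifier have no drop-in equivalents here.

```python
# Illustrative only: approximates the described pipeline with scikit-learn.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report


def build_pipeline(ngram: int = 1, k_features: int = 1000) -> Pipeline:
    """Word-level n-gram BOW -> chi-square feature selection -> Multinomial NB."""
    return Pipeline([
        # Text representation: word-level 1-, 2- or 3-grams (BOW when ngram == 1).
        ("bow", CountVectorizer(analyzer="word", ngram_range=(ngram, ngram))),
        # Feature selection: keep the k terms with the highest chi-square score.
        ("chi2", SelectKBest(chi2, k=k_features)),
        # Classifier: Multinomial Naive Bayes, one of the four variants compared.
        ("nb", MultinomialNB()),
    ])


# Hypothetical usage; the thesis corpus has 1732 training and 1440 test
# documents over the categories Art, Economy, Politics and Sport.
# train_texts, train_labels, test_texts, test_labels = load_corpus(...)  # placeholder loader
# clf = build_pipeline(ngram=1).fit(train_texts, train_labels)
# print(classification_report(test_labels, clf.predict(test_texts)))
```

Swapping MultinomialNB for BernoulliNB (with binarised counts) reproduces the multivariate Bernoulli setting; GaussianNB would additionally require densifying the feature matrix.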