Automatic Malay text categorization of Gini index and Chi square using classifiers combination

Hamoud Hza'a Ali Abdulrahman

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476216

Title:	Automatic Malay text categorization of Gini index and Chi square using classifiers combination
Authors:	Hamoud Hza'a Ali Abdulrahman
Supervisor:	Sabrina Tuin, Dr.
Keywords:	Text categorization; Universiti Kebangsaan Malaysia -- Dissertations; Dissertations, Academic -- Malaysia
Issue Date:	21-Apr-2014
Description:	With the exponential growth of online information and the steadily increasing number of documents in digital form, documents need to be classified so that information sources can be accessed effectively. Many machine learning algorithms have been applied to text categorization, which is one of many information management tasks. There are two main classification approaches for organizing digital documents: (i) a supervised approach, in which pre-defined categories are set up and a document is assigned to a category based on content, and (ii) an unsupervised approach, which requires no human intervention or pre-categorized documents at any point in the process. Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most research in this field has been carried out on English text, whereas very little has been carried out on Malay text. In this thesis, we investigated three machine learning methods for Malay text categorization. Two feature selection methods, Gini index and Chi-square, were applied to reduce the dimensionality of the feature space, and three machine learning methods, K-Nearest Neighbour (k-NN), Naive Bayes (NB) and N-gram, were investigated with features ranked in decreasing order and with feature spaces of different sizes: 100, 200, 300, 400, 500 and 600. We then applied two classifier combination strategies, which combine the classifiers by choosing the best answer from the set of three answers they give.
The three supervised machine learning models were evaluated on a categorized Malay corpus. The experimental results showed that the NB classifier performed best with Gini index feature selection at a feature space of 300 (Macro-F1 = 94.66), and the k-NN classifier performed best with Gini index feature selection at a feature space of 400 (Macro-F1 = 90.94). The N-gram classifier achieved lower results; its best performance (Macro-F1 = 79.01) was obtained at a feature space of 300 with both Gini index and Chi-square feature selection. The two types of classifier combination, Voting Combination and Combination Stacking, used with the two feature selection methods, Chi-square and Gini index, achieved the best performance in terms of Macro-F1 for Malay text categorization, and the experimental results showed that Voting Combination performed better than Combination Stacking.
Degree:	Master of Information Technology
Pages:	92
Call Number:	QA76.9.T48A227 2014 3 tesis
Publisher:	UKM, Bangi
URI:	https://ptsldigital.ukm.my/jspui/handle/123456789/476216
Appears in Collections:	Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat
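
The abstract names Chi-square and Gini index as the feature selection methods but does not give their formulas. The following is a minimal sketch of how such term scoring might look, assuming the common document-frequency form of the chi-square statistic and one published Gini index variant, sum over categories of P(t|c)^2 * P(c|t)^2. The toy corpus, the category names and all function names are hypothetical and are not taken from the thesis.

```python
from collections import Counter

def term_category_counts(docs, labels):
    """Count, for each (term, category), the documents of that category containing the term."""
    df = Counter()
    cat_size = Counter(labels)  # number of documents per category
    for tokens, label in zip(docs, labels):
        for term in set(tokens):  # document frequency: count each term once per document
            df[(term, label)] += 1
    return df, cat_size

def chi_square(term, category, df, cat_size):
    """Chi-square statistic of a term for one category, from the 2x2 contingency table."""
    n = sum(cat_size.values())
    a = df[(term, category)]                                    # in category, has term
    b = sum(df[(term, c)] for c in cat_size if c != category)   # other categories, has term
    c_ = cat_size[category] - a                                 # in category, lacks term
    d = n - a - b - c_                                          # other categories, lacks term
    denom = (a + c_) * (b + d) * (a + b) * (c_ + d)
    return n * (a * d - c_ * b) ** 2 / denom if denom else 0.0

def gini_index(term, df, cat_size):
    """Assumed Gini index variant: sum over categories of P(t|c)^2 * P(c|t)^2."""
    term_df = sum(df[(term, c)] for c in cat_size)
    if term_df == 0:
        return 0.0
    return sum(
        (df[(term, c)] / cat_size[c]) ** 2 * (df[(term, c)] / term_df) ** 2
        for c in cat_size
    )

def top_k(scores, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]

# Hypothetical toy corpus: tokenized documents and their categories.
docs = [["bola", "gol", "pasukan"], ["bola", "pemain"],
        ["saham", "bank"], ["bank", "pasaran", "bola"]]
labels = ["sukan", "sukan", "ekonomi", "ekonomi"]

df, cat_size = term_category_counts(docs, labels)
vocab = {t for tokens in docs for t in tokens}
gini_scores = {t: gini_index(t, df, cat_size) for t in vocab}
selected = top_k(gini_scores, 2)  # keep the 2 best-ranked features
```

In the thesis the ranked list is cut at 100 to 600 features; the cut-off of 2 here only suits the toy vocabulary.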

Files in This Item:

File	Description	Size	Format
ukmvital_76409+SOURCE1+SOURCE1.0.PDF Restricted Access		2.5 MB	Adobe PDF

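
The abstract reports results as Macro-F1 and describes a Voting Combination that picks one answer from the three classifiers' answers. The sketch below illustrates both ideas under stated assumptions: plain majority voting with ties resolved in favour of the first-listed classifier (the thesis does not state its tie rule), and Macro-F1 as the unweighted mean of per-category F1 scores, returned as a fraction rather than a percentage. The prediction lists and function names are hypothetical.

```python
from collections import Counter

def majority_vote(predictions):
    """Pick the most frequent label; on a tie, prefer the earliest classifier's label (assumed rule)."""
    counts = Counter(predictions)
    best = max(counts.values())
    for label in predictions:
        if counts[label] == best:
            return label

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-category F1 over all labels seen in truth or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical predictions of three classifiers on four documents.
y_true = ["sukan", "ekonomi", "sukan", "ekonomi"]
nb_pred = ["sukan", "ekonomi", "sukan", "ekonomi"]
knn_pred = ["sukan", "sukan", "sukan", "ekonomi"]
ngram_pred = ["ekonomi", "ekonomi", "politik", "sukan"]

# Combine the three answers for each document by voting.
combined = [majority_vote(list(votes)) for votes in zip(nb_pred, knn_pred, ngram_pred)]

knn_score = macro_f1(y_true, knn_pred)
combined_score = macro_f1(y_true, combined)
```

Stacking, the other combination method named in the abstract, would instead train a further classifier on the three base classifiers' outputs; it is not shown here.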