Arabic text classification using k-nearest neighbour algorithm

AlHutaish Roiss Mohammed Salem (P47970)

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/463609

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Azlina Ahmad, Prof. Dr.
dc.contributor.author	AlHutaish Roiss Mohammed Salem (P47970)
dc.date.accessioned	2023-09-25T09:31:17Z	-
dc.date.available	2023-09-25T09:31:17Z	-
dc.date.issued	2011-08-17
dc.identifier.other	ukmvital:84850
dc.identifier.uri	https://ptsldigital.ukm.my/jspui/handle/123456789/463609	-
dc.description	With the exponential growth of online information and the continuously increasing number of documents in digital form, documents need to be classified so that their sources can be accessed. Many machine learning algorithms have been applied to text categorization, which is one of many information management tasks. There are two main classification approaches for organizing digital documents. The first is the supervised approach, which is commonly used: a pre-defined category label is assigned to each document based on its contents. The second is the unsupervised approach, which needs no human intervention or labeled documents at any point in the process. Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English text, whereas very little research has been carried out on Arabic text. K-Nearest Neighbour (KNN) is one of the best-known algorithms in text classification; it gives good accuracy and is easy to understand. In this thesis, we investigate the use of the KNN classifier with a new similarity method (Inew) and with traditional similarity methods (Cosine, Jaccard, and Dice) in order to enhance Arabic ATC. We represent the dataset both without stemming and with stemming, where we use TREC-2002 to remove prefixes and suffixes. For statistical text representation, two feature types are used to represent the text: Bag-Of-Words (BOW) and character-level 3-grams. To reduce the dimensionality of the feature space, we use several feature selection methods: Chi-Square (CHI), GSS-coefficient (GSS), Odds Ratio (OR), and Mutual Information (MI).
Experiments conducted on Arabic text showed that the KNN classifier with the new similarity method (Inew), at 92.6% Macro-F1, performed better than the KNN classifier with Cosine, Jaccard, or Dice similarity. Chi-Square feature selection with the Bag-Of-Words (BOW) representation gave better performance than the other feature selection methods with BOW or 3-grams. Master / Sarjana
dc.language.iso	eng
dc.publisher	UKM, Bangi
dc.relation	Faculty of Science and Technology / Fakulti Sains dan Teknologi
dc.rights	UKM
dc.subject	K-nearest
dc.subject	Digital Form
dc.subject	Computational linguistics
dc.title	Arabic text classification using k-nearest neighbour algorithm
dc.type	theses
dc.format.pages	83
dc.identifier.callno	QA76.9.T48H847 2011
dc.identifier.barcode	002009
Appears in Collections:	Faculty of Science and Technology / Fakulti Sains dan Teknologi
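
The abstract in dc.description names a KNN classifier compared under Cosine, Jaccard, and Dice similarity over Bag-Of-Words and character 3-gram features. The following is a minimal Python sketch of that general setup, not the thesis's implementation. All function names are hypothetical, the Jaccard and Dice measures are assumed to be the weighted vector forms (the record does not say which variant was used), tokenisation is assumed to be plain whitespace splitting with no stemming, and the Inew similarity is omitted because the record does not define it.

```python
import math
from collections import Counter


def bag_of_words(text):
    """Term counts from whitespace tokens (assumption: no stemming)."""
    return Counter(text.split())


def char_ngrams(text, n=3):
    """Character-level n-gram counts; n=3 gives 3-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def _dot(a, b):
    """Dot product of two sparse count vectors."""
    return sum(a[t] * b[t] for t in a.keys() & b.keys())


def cosine(a, b):
    denom = math.sqrt(_dot(a, a)) * math.sqrt(_dot(b, b))
    return _dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    # Weighted (Tanimoto) form: a.b / (|a|^2 + |b|^2 - a.b)
    denom = _dot(a, a) + _dot(b, b) - _dot(a, b)
    return _dot(a, b) / denom if denom else 0.0


def dice(a, b):
    # Weighted form: 2 a.b / (|a|^2 + |b|^2)
    denom = _dot(a, a) + _dot(b, b)
    return 2 * _dot(a, b) / denom if denom else 0.0


def knn_classify(query, training, k=3, similarity=cosine, features=bag_of_words):
    """Label `query` by majority vote among its k most similar training texts.

    `training` is a list of (text, label) pairs.
    """
    q = features(query)
    scored = sorted(
        ((similarity(q, features(text)), label) for text, label in training),
        reverse=True,
    )
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

With a toy training set such as [("goal match team", "sport"), ("team player goal", "sport"), ("bank market price", "economy"), ("market price stock", "economy")], calling knn_classify("team goal", training, k=3) returns "sport". Passing similarity=dice or features=char_ngrams swaps in the other measures and representation named in the abstract.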

Files in This Item:

File	Description	Size	Format
ukmvital_84850+SOURCE1+SOURCE1.0.PDF	Restricted Access	2.28 MB	Adobe PDF
