Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/463609
Title: Arabic text classification using k-nearest neighbour algorithm
Authors: AlHutaish Roiss Mohammed Salem (P47970)
Supervisor: Azlina Ahmad, Prof. Dr.
Keywords: K-nearest
Digital Form
Computational linguistics
Issue Date: 17-Aug-2011
Description: With the exponential growth of online information and the continuously increasing number of documents in digital form, documents need to be classified so that their sources can be accessed. Many machine learning algorithms have been applied to text categorization, which is one of many information management tasks. There are two main classification approaches for organizing digital documents. In the supervised approach, which is the more common, a pre-defined category label is assigned to each document based on its contents. In the unsupervised approach, no human intervention or labeled documents are needed at any point in the process. Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English text, while very little research has been carried out on Arabic text. K-Nearest Neighbour (KNN) is one of the best-known algorithms in text classification; it gives good accuracy and is easy to understand. In this thesis, we investigate the use of the KNN classifier with a new similarity method (Inew) and with traditional methods (Cosine, Jaccard, and Dice similarities) in order to enhance Arabic ATC. We represent the dataset both without stemming and with stemming, using TREC-2002 to remove prefixes and suffixes. For statistical text representation, two feature types are used: Bag-Of-Word (BOW) and character-level 3-grams. To reduce the dimensionality of the feature space, we use several feature selection methods: Chi-Square (CHI), GSS-coefficient (GSS), Odds Ratio (OR) and Mutual Information (MI).
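The two feature types and the Chi-Square score named in the description can be illustrated with a minimal sketch. This is not the thesis implementation: the function names, the whitespace tokenization, and the document-level 2x2 contingency formulation of Chi-Square are assumptions made for illustration only, and no stemming is applied.

```python
from collections import Counter


def bag_of_words(text):
    """Count whitespace-separated tokens (assumed tokenization, no stemming)."""
    return Counter(text.split())


def char_ngrams(text, n=3):
    """Count overlapping character n-grams; n=3 gives character 3-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def chi_square(term, category, docs):
    """Chi-Square score of a term for a category.

    docs is a list of (feature_set, label) pairs. The score is computed
    from the 2x2 table of term presence versus category membership.
    """
    n = len(docs)
    a = sum(1 for feats, label in docs if term in feats and label == category)
    b = sum(1 for feats, label in docs if term in feats and label != category)
    c = sum(1 for feats, label in docs if term not in feats and label == category)
    d = n - a - b - c
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0  # term or category is constant across the sample
    return n * (a * d - c * b) ** 2 / denom
```

Under these assumptions, feature selection would keep the terms with the highest scores per category and discard the rest before the classifier is run.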
Experiments on Arabic text showed that the KNN classifier with the new similarity method (Inew) achieved a Macro-F1 of 92.6%, outperforming the KNN classifier with Cosine, Jaccard, and Dice similarity. Chi-Square feature selection with the Bag-Of-Word (BOW) representation performed better than the other feature selection methods with either BOW or 3-grams.
Degree: Master / Sarjana
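The KNN classification and Macro-F1 evaluation mentioned in the description can be sketched as follows. This is a hedged illustration, not the thesis code: the record does not define the Inew similarity, so it is not implemented here; the Jaccard and Dice functions use the common extended (dot-product) forms for weighted vectors, which is an assumption; and all function names and the default k=3 are hypothetical.

```python
import math
from collections import Counter


def dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(value * b.get(key, 0) for key, value in a.items())


def cosine(a, b):
    denom = math.sqrt(dot(a, a) * dot(b, b))
    return dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    """Extended Jaccard (Tanimoto) similarity for weighted vectors."""
    ab = dot(a, b)
    denom = dot(a, a) + dot(b, b) - ab
    return ab / denom if denom else 0.0


def dice(a, b):
    """Extended Dice similarity for weighted vectors."""
    denom = dot(a, a) + dot(b, b)
    return 2 * dot(a, b) / denom if denom else 0.0


def knn_predict(query, training, k=3, similarity=cosine):
    """Majority vote among the k training vectors most similar to query.

    training is a list of (vector, label) pairs.
    """
    ranked = sorted(training, key=lambda item: similarity(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


def macro_f1(true_labels, predicted_labels):
    """Unweighted mean of per-class F1 over the classes in true_labels."""
    pairs = list(zip(true_labels, predicted_labels))
    scores = []
    for cls in set(true_labels):
        tp = sum(1 for t, p in pairs if t == cls and p == cls)
        fp = sum(1 for t, p in pairs if t != cls and p == cls)
        fn = sum(1 for t, p in pairs if t == cls and p != cls)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Passing `similarity=jaccard` or `similarity=dice` to `knn_predict` swaps the measure without changing the voting step, which is the kind of comparison the description reports.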
Pages: 83
Call Number: QA76.9.T48H847 2011
Publisher: UKM, Bangi
Appears in Collections: Faculty of Science and Technology / Fakulti Sains dan Teknologi

Files in This Item:
File: ukmvital_84850+SOURCE1+SOURCE1.0.PDF
Access: Restricted Access
Size: 2.28 MB
Format: Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.