Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/463609
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Azlina Ahmad, Prof. Dr. |
dc.contributor.author | AlHutaish Roiss Mohammed Salem (P47970) |
dc.date.accessioned | 2023-09-25T09:31:17Z | -
dc.date.available | 2023-09-25T09:31:17Z | -
dc.date.issued | 2011-08-17 |
dc.identifier.other | ukmvital:84850 |
dc.identifier.uri | https://ptsldigital.ukm.my/jspui/handle/123456789/463609 | -
dc.description | With the exponential growth of online information and the continuously increasing number of documents in digital form, there is a need to classify documents so that their sources can be accessed. Many machine learning algorithms have been applied to text categorization, which is one of many information management tasks. There are two main classification approaches for organizing digital documents. First, the supervised approach is commonly used, where a pre-defined category label is assigned to a document based on its contents. Second, the unsupervised approach is also applied, where no human intervention or labeled documents are needed at any point in the process. Many algorithms have been applied to the problem of Automatic Text Categorization (ATC). Most of the work in this area has been carried out on English text, whereas very little research has been carried out on Arabic text. K-Nearest Neighbour (KNN) is one of the best-known algorithms in text classification; it gives good accuracy and is easy to understand. In this thesis, we investigate the use of the KNN classifier with a new similarity method (Inew) and with traditional methods (Cosine, Jaccard, and Dice similarities) in order to enhance Arabic ATC. We represent the dataset both without stemming and with stemming, where we use TREC-2002 to remove prefixes and suffixes. For statistical text representation, two feature types represent the text: Bag-Of-Word (BOW) and character-level 3-grams. To reduce the dimensionality of the feature space, we use several feature selection methods: Chi-Square (CHI), GSS-coefficient (GSS), Odds Ratio (OR), and Mutual Information (MI). Experiments on Arabic text showed that the KNN classifier with the new similarity method (Inew), at 92.6% Macro-F1, outperformed the KNN classifier with Cosine, Jaccard, and Dice similarity. Chi-Square feature selection with a Bag-Of-Word (BOW) representation performed better than the other feature selection methods with either BOW or 3-grams. Master / Sarjana |
dc.language.iso | eng |
dc.publisher | UKM, Bangi |
dc.relation | Faculty of Science and Technology / Fakulti Sains dan Teknologi |
dc.rights | UKM |
dc.subject | K-nearest |
dc.subject | Digital Form |
dc.subject | Computational linguistics |
dc.title | Arabic text classification using k-nearest neighbour algorithm |
dc.type | theses |
dc.format.pages | 83 |
dc.identifier.callno | QA76.9.T48H847 2011 |
dc.identifier.barcode | 002009 |
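
The abstract in dc.description names a KNN classifier compared under Cosine, Jaccard, and Dice similarity, over Bag-Of-Word and character 3-gram features. The following is a minimal Python sketch of that setup, not the thesis implementation. It assumes raw term counts, whitespace tokenization with no stemming, and the common vector forms of the three similarities; the thesis may define them differently. The function names and the toy corpus are hypothetical, and the Inew similarity is not shown because the record does not define it.

```python
import math
from collections import Counter

def bow(text):
    """Bag-of-words term counts (whitespace tokens, no stemming)."""
    return Counter(text.split())

def char_ngrams(text, n=3):
    """Character-level n-gram counts; n=3 gives 3-grams."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

def _dot(a, b):
    # Inner product over the terms the two count vectors share.
    return sum(v * b[k] for k, v in a.items() if k in b)

def _sq(a):
    # Squared Euclidean norm of a count vector.
    return sum(v * v for v in a.values())

def cosine(a, b):
    denom = math.sqrt(_sq(a)) * math.sqrt(_sq(b))
    return _dot(a, b) / denom if denom else 0.0

def jaccard(a, b):
    # Vector (Tanimoto) form, assumed here: a.b / (|a|^2 + |b|^2 - a.b)
    d = _dot(a, b)
    denom = _sq(a) + _sq(b) - d
    return d / denom if denom else 0.0

def dice(a, b):
    # Vector form, assumed here: 2 a.b / (|a|^2 + |b|^2)
    denom = _sq(a) + _sq(b)
    return 2 * _dot(a, b) / denom if denom else 0.0

def knn_predict(query, train, k=3, sim=cosine, features=bow):
    """Majority vote among the k training documents most similar to query."""
    q = features(query)
    scored = sorted(((sim(q, features(t)), label) for t, label in train),
                    reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]

# Hypothetical toy corpus, for illustration only (not the thesis dataset).
TOY_TRAIN = [
    ("مباراة كرة القدم", "sport"),
    ("فريق كرة السلة", "sport"),
    ("سوق المال والبنوك", "economy"),
    ("أسعار النفط في السوق", "economy"),
]
```

Calling `knn_predict("مباراة كرة", TOY_TRAIN, k=3, sim=dice, features=char_ngrams)` swaps both the similarity and the feature type without touching the classifier, which is the kind of comparison the abstract describes.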
Appears in Collections:Faculty of Science and Technology / Fakulti Sains dan Teknologi
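
The abstract also names Chi-Square feature selection and reports results as Macro-F1. Both can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the thesis code: the chi-square statistic uses the standard 2x2 document-count form, a term's score is taken as its maximum over categories (one common aggregation; the record does not say which was used), and all function names and sample data are hypothetical.

```python
def chi_square(term, category, docs):
    """Chi-square of a term/category pair; docs is a list of (term_set, label)."""
    a = b = c = d = 0
    for terms, label in docs:
        has_term, in_cat = term in terms, label == category
        if has_term and in_cat:
            a += 1      # in category, contains term
        elif has_term:
            b += 1      # other category, contains term
        elif in_cat:
            c += 1      # in category, lacks term
        else:
            d += 1      # other category, lacks term
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

def select_features(docs, top_n):
    """Keep the top_n terms by maximum chi-square over categories (assumed)."""
    labels = {label for _, label in docs}
    vocab = set().union(*(terms for terms, _ in docs))
    score = {t: max(chi_square(t, c, docs) for c in labels) for t in vocab}
    # Sort by descending score, then alphabetically so ties are deterministic.
    return sorted(vocab, key=lambda t: (-score[t], t))[:top_n]

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / len(f1s)

# Hypothetical toy data, for illustration only.
SAMPLE_DOCS = [
    ({"goal", "team"}, "sport"),
    ({"goal", "match"}, "sport"),
    ({"bank", "market"}, "economy"),
    ({"bank", "team"}, "economy"),
]
```

Because Macro-F1 averages per-class scores without weighting by class size, a small category counts as much as a large one.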

Files in This Item:
File | Description | Size | Format
ukmvital_84850+SOURCE1+SOURCE1.0.PDF | Restricted Access | 2.28 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.