Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/476311
Title: Sorani Kurdish text classification using K-Nearest Neighbours
Authors: Falah Salih Mohammad (P59353)
Supervisor: Lailatulqadri Zakaria, Dr.
Keywords: Kurdish language; Character sets (Data processing)
Issue Date: 2013
Description: Due to the massive increase in digital documents and the ensuing need to organize them, automatic text classification (ATC) has received a great deal of attention as a means of automatically assigning texts to categories based on their content. Supervised machine learning (SML) has become the dominant approach to this task, learning from a set of previously classified documents. It has been widely and successfully applied to text classification for morphologically different languages such as English, Arabic, Persian, and Chinese. Regrettably, few studies have applied this approach to the Sorani Kurdish language. Hence, this research adapts a supervised classification system based on K-Nearest Neighbours (KNN) to improve Sorani Kurdish text classification performance. Two groups of term weighting metrics are used. The first group consists of traditional unsupervised models: normalized Term Frequency (nTF), Inverse Document Frequency (IDF), and normalized Term Frequency-Inverse Document Frequency (nTF-IDF). The second group consists of supervised models built by combining nTF with feature selection metrics: Chi-square (CHI), Mutual Information (MI), Information Gain (IG), Odds Ratio (OR), Correlation Coefficient (CC), GSS Coefficient (GSS), and Category Based Term Weight (CBTW). Four variants of the KNN algorithm (Cosine, Dice, Jaccard, and Inew) are employed to classify the documents. The texts are represented using three methods: bag-of-words (BOW), character-level n-grams with n = 5, and stemmed words produced by a basic Sorani Kurdish stemmer developed in this research. The Sorani Kurdish dataset was collected in-house from websites and consists of 4094 text files covering four categories: Art, Economy, Politics, and Sport. The best experimental result was a Macro-average F1 of 0.962. 
This research demonstrates that text classification performance is mainly influenced by how the texts are represented and by the quality of the term weighting metrics.
Abstract (translated from the Malay): Owing to the growing number of electronic documents online and the pressing need to manage them, automatic text categorization (ATC) has received much attention for the task of automatically assigning texts to document classes based on their content. Supervised machine learning (SML) has become the dominant approach to this task, learning from a set of previously classified documents. It has been widely and successfully used for text categorization in morphologically different languages such as English, Arabic, Persian, and Chinese. However, its use for the Sorani Kurdish language remains limited. This study therefore adapts a supervised machine learning technique based on K-Nearest Neighbours (KNN) classification to improve the effectiveness of text categorization for Sorani Kurdish. Two groups of weighting metrics are used. The first comprises unsupervised models: normalized term frequency (nTF), inverse document frequency (IDF), and normalized term frequency-inverse document frequency (nTF-IDF). The second comprises supervised models that combine nTF with feature selection metrics, including Chi-Square (CHI), Mutual Information (MI), Information Gain (IG), Odds Ratio (OR), Correlation Coefficient (CC), GSS Coefficient (GSS), and Category Based Term Weight (CBTW). Four different experiments with the KNN algorithm (Cosine, Dice, Jaccard, and Inew) are used to classify the documents. Documents are represented in three forms: character-level n-grams with n = 5, bag-of-words, and stemmed words produced by a Sorani Kurdish stemmer developed in this research. The Sorani Kurdish corpus dataset was built from online newspapers and government websites and consists of 4094 text files covering four categories: Art, Economy, Politics, and Sport. The experiments yielded a Macro-F1 of 0.962 as the best result. This study confirms that text classification performance depends on how documents are represented and on the quality of the weighting metrics used.
Master/Sarjana
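Illustrative sketch (not taken from the thesis): the pipeline the abstract describes, character 5-gram representation, nTF-IDF weighting, and KNN with Cosine, Dice, or Jaccard similarity, might look roughly like the Python below. Several details are assumptions made only for illustration: term frequency is normalized by the most frequent term in each document, IDF is log(N / df), Dice and Jaccard are the usual weighted-vector generalizations, neighbours vote by simple majority, and the training texts are toy English strings, not the Sorani Kurdish dataset. The Inew measure is omitted because the record does not define it.

```python
import math
from collections import Counter


def char_ngrams(text, n=5):
    """Overlapping character n-grams; n=5 mirrors the abstract."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(docs_tokens):
    """IDF as log(N / df). Assumed definition, not confirmed by the record."""
    n_docs = len(docs_tokens)
    df = Counter(t for tokens in docs_tokens for t in set(tokens))
    return {t: math.log(n_docs / d) for t, d in df.items()}


def ntf_idf(tokens, idf):
    """Normalized TF (count / max count) times IDF; unseen terms are dropped."""
    tf = Counter(t for t in tokens if t in idf)
    max_tf = max(tf.values()) if tf else 1
    return {t: (c / max_tf) * idf[t] for t, c in tf.items()}


def _dot(a, b):
    """Dot product of two sparse vectors stored as dicts."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def cosine(a, b):
    denom = math.sqrt(_dot(a, a) * _dot(b, b))
    return _dot(a, b) / denom if denom else 0.0


def dice(a, b):
    denom = _dot(a, a) + _dot(b, b)
    return 2 * _dot(a, b) / denom if denom else 0.0


def jaccard(a, b):
    ab = _dot(a, b)
    denom = _dot(a, a) + _dot(b, b) - ab
    return ab / denom if denom else 0.0


def knn_classify(query, train_vectors, train_labels, k=3, sim=cosine):
    """Label the query by majority vote among its k most similar neighbours."""
    ranked = sorted(zip(train_vectors, train_labels),
                    key=lambda pair: sim(query, pair[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy data (hypothetical, English stand-ins for the real corpus).
train_texts = ["football match goal", "football team league",
               "market price trade", "bank market economy"]
train_labels = ["Sport", "Sport", "Economy", "Economy"]

train_tokens = [char_ngrams(t) for t in train_texts]
idf = fit_idf(train_tokens)
train_vectors = [ntf_idf(tokens, idf) for tokens in train_tokens]


def predict(text, k=3, sim=cosine):
    """Classify a raw string with the toy model built above."""
    return knn_classify(ntf_idf(char_ngrams(text), idf),
                        train_vectors, train_labels, k=k, sim=sim)
```

Passing `sim=dice` or `sim=jaccard` to `predict` swaps the similarity measure without changing anything else, which is one simple way to compare KNN variants on the same weighted vectors.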
Pages: 83
Call Number: QA76.9.T48M846 2013 3 tesis
Publisher: UKM, Bangi
URI: https://ptsldigital.ukm.my/jspui/handle/123456789/476311
Appears in Collections: Faculty of Information Science and Technology / Fakulti Teknologi dan Sains Maklumat

Files in This Item:
File: ukmvital_81833+SOURCE1+SOURCE1.0.PDF
Description: Restricted Access
Size: 2.39 MB
Format: Adobe PDF

