Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/463391
Full metadata record
DC Field | Value | Language
dc.contributor.advisor | Dr. Masnizah Mohd
dc.contributor.author | Bashar Hamad Aubaidan (P62439)
dc.date.accessioned | 2023-09-25T09:24:02Z | -
dc.date.available | 2023-09-25T09:24:02Z | -
dc.date.issued | 2014-02-13
dc.identifier.other | ukmvital:81161
dc.identifier.uri | https://ptsldigital.ukm.my/jspui/handle/123456789/463391 | -
dc.description | A vast number of documents and reports on crime are generated from different sources by both organizations and individuals. Such data is usually unstructured and is stored not in relational or transactional database systems but on web servers, file servers, or even personal workstations. With crime rates rising and a large number of crime reports and news items published daily, clustering crime data has become both difficult and essential. A drawback of K-means is that the user must define the initial centroid points. This is more critical in document clustering, because each center point is represented by a word and calculating the distance between words is not trivial. K-means++ was introduced to overcome this problem by finding good initial center points. Because K-means++ has not previously been applied to crime document clustering, this study compares K-means and K-means++ to investigate whether the initialization process in K-means++ yields better results than K-means. In this context, the study proposes the K-means++ clustering algorithm to identify the best seeds for the initial cluster centers when clustering crime documents. The methodology includes a preprocessing phase comprising tokenization, stop-word removal, and stemming. The study also evaluates the impact of two similarity/distance measures (cosine similarity and the Jaccard coefficient) on the results of the two clustering algorithms. The crime dataset consists of 247 documents collected from the Bernama news website (http://www.blis.bernama.com).
Experimental results on several settings of the crime dataset showed that, by identifying the best seeds for the initial cluster centers, K-means++ performs significantly better than K-means (at the 95% confidence level). These results demonstrate the accuracy of the K-means++ clustering algorithm in clustering crime documents.
Malay abstract (in English translation): A large number of documents and reports on crime are generated from different sources by organizations and individuals. Such data is typically poorly structured and is stored not in databases but on web servers or personal workstations. Rising crime rates and the large number of news reports on crime make clustering crime data difficult yet important. One weakness of the K-means clustering algorithm is that the user must specify the cluster centroid points. This becomes critical when the center point of a document cluster is represented by words and computing the distance between terms is difficult. To overcome this problem, K-means++ was introduced to find good initial center points. Because K-means++ has not previously been used for crime document clustering, this motivated a comparative study of K-means and K-means++ to investigate whether the initial centroid selection process in K-means++ helps produce better results than K-means. The study therefore proposes the K-means++ clustering algorithm to identify the best choice of initial cluster centroids for clustering crime documents. The study also compares the two clustering algorithms, K-means and K-means++. The methodology includes a preprocessing phase with tokenization, stop-word removal, and stemming. The effect of the similarity/distance measures (cosine similarity and the Jaccard coefficient) on the results of both clustering algorithms is evaluated. The crime dataset used in this study consists of 247 documents taken from Bernama (http://www.blis.bernama.com). The experimental results show that, by identifying the best choice of initial cluster centroids, K-means++ performs markedly better than K-means (with the significance interval at the 95% level). These results demonstrate the accuracy of the K-means++ clustering algorithm in clustering crime documents.
Master
dc.language.iso | eng
dc.publisher | UKM, Bangi
dc.relation | Faculty of Science and Technology / Fakulti Sains dan Teknologi
dc.rights | UKM
dc.subject | K-means
dc.subject | Clustering crime documents
dc.subject | Clustering algorithms
dc.subject | Cluster analysis - data processing
dc.title | Comparative study of K-means and K-means++ clustering algorithms on crime domain
dc.type | theses
dc.format.pages | 62
dc.identifier.callno | QA278 .A934 2014 3
dc.identifier.barcode | 001621
Appears in Collections: Faculty of Science and Technology / Fakulti Sains dan Teknologi
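
The abstract in this record names three ingredients: a preprocessing phase, two similarity/distance measures (cosine similarity and the Jaccard coefficient), and K-means++ seeding of the initial cluster centers. The thesis file is restricted, so the following is only a minimal Python sketch of how these pieces could fit together, not the author's implementation. It uses the standard K-means++ rule (first center chosen uniformly at random, each later center chosen with probability proportional to its squared distance from the nearest center already chosen). The stop-word list, the function names, and the sample headlines are all hypothetical, and stemming is left out to avoid external dependencies.

```python
import math
import random
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption, not the thesis's list).
STOP_WORDS = {"a", "an", "the", "in", "on", "of", "and", "was", "were", "by", "at", "to"}


def preprocess(text):
    """Tokenize, lowercase, and drop stop words; returns term counts.

    Stemming is omitted here to keep the sketch dependency-free.
    """
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in STOP_WORDS)


def cosine_distance(a, b):
    """1 - cosine similarity of two term-count vectors (Counters)."""
    dot = sum(count * b[term] for term, count in a.items())
    norm = math.sqrt(sum(c * c for c in a.values()) * sum(c * c for c in b.values()))
    return 1.0 - dot / norm if norm else 1.0


def jaccard_distance(a, b):
    """1 - Jaccard coefficient of the two documents' term sets."""
    terms_a, terms_b = set(a), set(b)
    union = terms_a | terms_b
    return 1.0 - len(terms_a & terms_b) / len(union) if union else 0.0


def kmeanspp_seeds(vectors, k, distance, rng):
    """Choose k initial centers from `vectors` using K-means++ seeding."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of documents")
    centers = [rng.choice(vectors)]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from each document to its nearest chosen center.
        weights = [min(distance(v, c) for c in centers) ** 2 for v in vectors]
        if sum(weights) == 0:
            # Every document coincides with a chosen center; fall back to uniform.
            centers.append(rng.choice(vectors))
        else:
            centers.append(rng.choices(vectors, weights=weights, k=1)[0])
    return centers


# Invented headlines for illustration only (not taken from the Bernama dataset).
docs = [
    "Police arrest suspect in bank robbery",
    "Bank robbery suspect held by police",
    "Court hears drug trafficking case",
    "Drug trafficking ring busted at port",
    "Fire destroys warehouse",
]
vectors = [preprocess(d) for d in docs]
seeds = kmeanspp_seeds(vectors, k=3, distance=cosine_distance, rng=random.Random(0))
for seed in seeds:
    print(sorted(seed))
```

Passing `jaccard_distance` instead of `cosine_distance` switches the measure without touching the seeding code, which mirrors the two-measure comparison described in the abstract. Plain K-means would replace `kmeanspp_seeds` with a uniform random choice of `k` documents.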

Files in This Item:
File | Description | Size | Format
ukmvital_81161+SOURCE1+SOURCE1.0.PDF (Restricted Access) | | 1.9 MB | Adobe PDF


Items in DSpace are protected by copyright, with all rights reserved, unless otherwise indicated.