Improving the quality of dataset for diabetes prediction using a machine learning approach

Bashar Hamad Aubaidan

Please use this identifier to cite or link to this item: https://ptsldigital.ukm.my/jspui/handle/123456789/783262

Full metadata record

DC Field	Value	Language
dc.contributor.advisor	Rabiah Abdul Kadir, Assoc. Prof. Dr.	en_US
dc.contributor.advisor	Mohamad Taha Ijab, Dr.	en_US
dc.contributor.author	Bashar Hamad Aubaidan	en_US
dc.contributor.other	P103708	-
dc.date.accessioned	2026-05-07T07:23:32Z	-
dc.date.available	2026-05-07T07:23:32Z	-
dc.date.issued	2026-02-24	-
dc.identifier.other	P103708	-
dc.identifier.uri	https://ptsldigital.ukm.my/jspui/handle/123456789/783262	-
dc.description.abstract	Diabetes is a chronic metabolic condition marked by persistently elevated blood glucose levels and influenced by multiple physiological and lifestyle factors, making early prediction challenging. Reliable predictive modelling requires high-quality datasets; however, issues such as missing values, class imbalance, and redundant or high-dimensional attributes often reduce model accuracy and generalisability. To address these limitations, this study introduces an integrated data-quality enhancement framework combining Bidirectional Neighbour Graph (BNG) imputation for missing data, Clustering Selection Synthesis Filtering (CSSF) for class imbalance, and Rough Set Theory (RST) for dimensionality reduction and feature selection. The proposed BNG–CSSF–RST framework was applied to three independent datasets: the Pima Indians Diabetes Dataset from the UCI Repository, the Diabetes Clinical Dataset (Kaggle, 2024), and the Kaggle Diabetes Prediction Dataset (2023). After preprocessing with the proposed framework, both Artificial Neural Network (ANN) and Support Vector Machine (SVM) models were trained and evaluated. Across all datasets, the framework yielded notable improvements in predictive performance. Using the Pima dataset (768 records, 9 features), ANN achieved 93.51% accuracy, while SVM reached 90.26%. For the Diabetes Clinical Dataset (100,000 records, 17 features), ANN obtained 96.95% accuracy and SVM achieved 96.32%. On the Kaggle Diabetes Prediction Dataset (100,000 records, 9 features), ANN attained 91.49% accuracy, and SVM achieved 89.12%. Overall, the results indicate that systematically addressing missing data, class imbalance, and irrelevant or redundant features substantially improves classification performance. The enhanced accuracies observed across all three datasets exceed those typically reported in earlier studies, confirming the robustness of the proposed BNG–CSSF–RST framework.
Finally, the approach provides a diabetes-prediction methodology designed specifically to handle missing data, class imbalance, and high dimensionality, thereby enhancing overall data quality and enabling more robust analysis of complex datasets.	en_US
dc.language.iso	en	en_US
dc.publisher	UKM, Bangi	en_US
dc.relation	Institute of IR4.0 / Institut IR4.0 (IIR4.0)	en_US
dc.rights	Access Terbuka/Open Access	-
dc.subject	Diabetes -- Diagnosis -- Data processing	en_US
dc.subject	Machine learning	en_US
dc.subject	Data mining	en_US
dc.subject	Universiti Kebangsaan Malaysia -- Dissertations	en_US
dc.subject	Dissertations, Academic -- Malaysia	en_US
dc.title	Improving the quality of dataset for diabetes prediction using a machine learning approach	en_US
dc.type	Theses	en_US
dc.rights.holder	UKM	-
dc.description.notes	e-tesis	en_US
dc.format.pages	225	en_US
dc.format.degree	Ph.D	en_US
Appears in Collections:	Institute of Visual Informatics/ Institut Informatik Visual (IVI)

Files in This Item:

There are no files associated with this item.
